I've blogged before about the need to use the right tools to hold and manipulate data as data quantity increases (Data Handling the Right Tool for the Job). But, I really want to get to some value-enhancing analytics and as data grows it becomes increasingly hard to apply analytical tools.
Let’s assume that we have a few Terabytes of data and that it's sat in an industrial-strength database (Oracle, SQL*Server, MySQL, DB2, …) - one that can handle the data volume without choking. Each of these databases has its own dialect of the querying language (SQL) and while you can do a lot of sophisticated data manipulation, even a simple analytical routine like calculating correlations is a chore.
Here's an example:
(COUNT(*)*SUM(x.Sales*y.Sales)-SUM(x.Sales)*SUM(y.Sales))/( SQRT(COUNT(*)*SUM(SQUARE(x.Sales))-SQUARE(SUM(x.Sales)))* SQRT(COUNT(*)*SUM(SQUARE(y.Sales))-SQUARE(SUM(y.Sales))))
correlationFROM BulbSales x JOIN BulbSales y ON x.month=y.monthWHERE x.Year=1997 AND y.Year=1998
extracted from the O'Reilly Transact SQL Cookbook
This calculates just one correlation coefficient between 2 years of sales. If you want to calculate a correlogram showing correlation coefficients across all pairs of fields in a table this could take some time to code as you are re-coding the math every time you use it with the distinct possibility of human error. It can be done, but it’s neither pretty nor simple. Something slightly more complex like regression analysis is seriously beyond the capability of SQL.
Currently, we would pull the data we need into an analytic package (like SAS or R) to run analysis with the help of a statistician. As the data gets bigger the overhead/delay in moving it across into another package becomes a more significant part of your project, particularly if you do not want to do that much with it when it gets there. It also limits what you can do on-demand with your end user reporting tools.
So, how can you bring better analytics to bear on your data in-situ? This is the developing area of in-database analytics: Extending the analytical capability of the SQL language so that analytics can be executed, quickly, within the database. I think it fair to say that it’s still early days but with some exciting opportunities:
- SAS, the gold standard for analytical software, has developed some capability but, so far, only for databases I'm not using (Teradata, Neteeza, Greenplum, DB2) SAS in-database processing
- Oracle recently announced new capability to embed R (an open source tool with a broad range of statistical capability) which sounds interesting but I have yet to see it. Oracle in database announcement
- It’s possible to build some capability into Microsoft’s SQL Server using .NET/CLR and I have had some direct (and quite successful) experience doing this for simpler analytics. Some companies seem to be pushing it further still and I look forward to testing out their offerings. (Fuzzy Logix, XLeratorDB).
No doubt there are other options that I have not yet encountered, let me know in the feedback section below.
For complex modeling tasks, I am certain we will need dedicated, offline analytic tools for a very long time. For big data that will mean similarly large application servers for your statistical tools and fast connections to your data mart.
For simpler analysis, in-database analytics appears to be a great step forward, but I’m wondering what this means in terms of the skills you need in your analysts: when the analysis is done in a sophisticated statistics package, it tends to get done by trained statisticians who should know what they are doing and make good choices around which tools to deploy and how.
Making it easier to apply analytical tools to your data is very definitely a good thing. Applying these tools badly because you do not have the skills or knowledge to apply them effectively could be a developing problem.