My most recent post on Demand Signal Repositories bemoaned their general lack of ability to rapidly ingest new and interesting data sources (e.g., promotions, Twitter feeds, sentiment analysis, Google Trends, shipment history, master data, geographic features, proximity to competitor stores, demographic profiles, economic time series, exchange rates, etc.).
As a result, analysts spend far too much time collecting/copying data into ad-hoc data marts to enable useful modeling work. At the extreme, you can move terabytes of low-level data from a perfectly good database into another one (probably on a lower-powered machine) just to manually merge it with a few hundred records of new data you need for analysis. This is slow (actually, very slow), error prone, and leaves very little time for value-added work.
Based on questions from blog readers via email, I think that I failed to call out how big the gap is between where we are now and where we should be. Let me spell it out. If I go to my (or your) IS department now and ask "how long would it take to integrate XXX data into the DSR so it is loaded, cleaned, gap-filled, matched to appropriate dimensions, and ready for some interesting analytic work?", I would expect to hear back "between 6 and 12 months" - and that's assuming they have both some developer availability and the necessary access to add/modify data structures; some DSRs are locked down tight. If I went to the DSR vendor, it might be a little faster, depending on just how tightly the data structure is tied into their product release schedule. But here's the thing - I want to do this myself, in real time, and certainly in less than a day.
Tools such as Alteryx are designed to do data blending. Alteryx in particular seems to do especially well handling geo/demographic data, some of which comes with it as standard. They also have a number of pre-defined macros to help you get at standard data sources like Google Trends and Twitter. If I understand it correctly, it does this by loading all data sources into memory. Perhaps it constructs its own data repository on the fly, but it certainly does not touch the source database's data structures at all.
This would work well for relatively small quantities of data. Say you are examining annual sales for a product group by store: you aggregate that down to a few thousand records of POS data in the DSR, load it into Alteryx, geocode the store locations, match up the geo/demographic data you are interested in, and you are ready to run some math. I doubt that would take more than a couple of hours. There is also some analytic power in the platform, and at least some level of R integration if you wish to extend it further. For ad-hoc, small (sub-10-million-record?) data analytics, this looks really good.
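To make that workflow concrete, here is a minimal sketch in pandas. Everything in it is hypothetical: the store IDs, zip codes, and census columns are made-up stand-ins for a real DSR extract, a geocoding service, and real demographic data, but the shape of the job (aggregate POS, geocode, match) is the same.

```python
import pandas as pd

# Hypothetical: annual POS sales already aggregated by store in the DSR
pos = pd.DataFrame({
    "store_id": [101, 102, 103],
    "annual_sales": [1_250_000.0, 980_000.0, 1_410_000.0],
})

# Hypothetical geocoded store locations (in practice, from a geocoding step)
locations = pd.DataFrame({
    "store_id": [101, 102, 103],
    "zip": ["30301", "60601", "94105"],
})

# Hypothetical census/demographic data keyed by zip code
census = pd.DataFrame({
    "zip": ["30301", "60601", "94105"],
    "median_income": [55000, 68000, 91000],
    "population": [42000, 31000, 27000],
})

# Match everything up: POS -> store location -> demographics
model_input = (
    pos.merge(locations, on="store_id", how="left")
       .merge(census, on="zip", how="left")
)
```

At a few thousand rows this is trivially fast; the point of the post is what happens when the left-hand table is billions of rows instead.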
What if you want to do your modeling at a much lower level of detail, though? Do you have the capacity to match across billions of records outside the DSR? Perhaps, but it's going to cost you - and why move it all into another database on another expensive server when you've already paid for those in your DSR? What if you want to run analytics repeatedly - do you really want to do geocoding and ad-hoc matching every time you want to use census data in an analysis? Chances are the stores haven't moved :-) and the most frequently updated census data, I think, isn't refreshed any more often than annually.
Better to do it once, load the results into new data structures in the DSR, and enable them for ongoing reporting/analytics. Or... did you want to force even basic reporting through the data-blending platform because that's the only place you can match up multiple data sources? I didn't think so.
I would definitely look at something like Alteryx for ad-hoc work. If you can also use it to source, transform, handle dimensional matching, deal with missing data, etc., and load the results back into your DSR (where you have just defined new data structures to receive them), I think you might have something.
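The "load it back and do it once" step could be as simple as writing the blended table into a new table the DSR can join against. A minimal sketch, using an in-memory SQLite database as a stand-in for the DSR and hypothetical column names:

```python
import sqlite3
import pandas as pd

# Blended output from the ad-hoc tool (hypothetical columns)
blended = pd.DataFrame({
    "store_id": [101, 102],
    "zip": ["30301", "60601"],
    "median_income": [55000, 68000],
})

# In-memory SQLite standing in for the DSR database
conn = sqlite3.connect(":memory:")

# "Define new data structures to receive it": persist the blend as a
# dimension-style table inside the DSR, once
blended.to_sql("store_demographics", conn, index=False, if_exists="replace")

# Ongoing reporting can now join against it in-database, with no
# re-geocoding or re-matching on every run
row = conn.execute(
    "SELECT median_income FROM store_demographics WHERE store_id = 101"
).fetchone()
```

The design point is simply that the expensive matching happens once at load time, and every subsequent report or model reads the persisted table where the rest of the data already lives.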