Data Warehousing and Analytics are Changing Fast. You Need to Keep Up…
Relational database management systems (RDBMSs) have had a dominant role in data warehousing for the last 20+ years. Many software architectures, modelling techniques, organizations’ hardware and software investments, RDBMS and Business Intelligence (BI) vendors’ wellbeing and peoples’ skills and careers have been based on that fact. Now things are changing. The world of Data Warehousing and Analytics is getting more complex – due to the demands of Big Data and the emergence of new technologies, such as the Hadoop framework. However, it is important to say that these new technologies should be seen as complementary to existing and future RDBMS investments.
In his Harvard Business Review article “Preparing for Analytics 3.0” Tom Davenport talks about Analytics 3.0. He characterizes traditional BI and reporting applications (on RDBMSs) as Analytics 1.0, the analysis of large, fast moving, external, and unstructured data from new data sources (on Hadoop and NoSQL) as Analytics 2.0 and the combination of these environments as Analytics 3.0. My ex colleagues at Gartner talk about the Logical Data Warehouse – “Mark Beyer, Father of the Logical Data Warehouse, Guest Post” – to describe the new multi database/ file system data warehousing architecture. At the Teradata Universe conference that I went to in Prague earlier this week Teradata promoted its version of this – the Teradata Universal Data Architecture, comprising the Teradata RDBMS, the Teradata Aster Discovery Platform and Hadoop frameworks from partners like Hortonworks.
Essentially, it is no longer about having to choose just one set of technologies or the other. It is about creating an architecture that embraces both – in a way that they complement each other. Why you might ask? Can’t we do everything with the RDBMS that we know and love? To illustrate the kind of new business questions that need answering Teradata gave the following Customer Relationship Management (CRM) example relating to customer data. By looking in the RDBMS you can see the value of the customer, but the sentiment that the customer has regarding your company and its products sits in the Hadoop system (probably in a relatively unstructured format). And if you want to determine the influencer effect of that customer you would need graph database functionality to model their social network. That means two or three different databases or file systems to manage and to interrogate. But by getting them to work together you can build a much better picture of your customer and determine the optimum next best action to take.
That makes sense. Although, it might not be quite so easy. Obviously there is going to be a skills challenge. People who have spent their lives working with structured data and using a set of mature technologies that are based on a “schema on write” philosophy, will need to adapt to also working with “unstructured” data and using a set of so far relatively immature tools based on a “schema on read” philosophy. See “Schema-on-Read vs Schema-on-Write” from Cloudera CTO, Amr Awadallah.
Another challenge is the issue of how do you create a query that crosses these multiple databases in your logical data warehouse and brings back the data that you need in a sensible timeframe without killing the network. Enter Data Virtualization technology like Teradata’s new QueryGrid.
So that hopefully solves the multi database / file system access issues, but then how do you know that the data in the multiple databases can or should be combined? Do these different structured, semi structured and unstructured data sets really refer to the same identifiable customer and does the definition of customer have the same semantic meaning across the multiple data sources? This is starting to get trickier… In an RDBMS data mart world people would follow Ralph Kimball’s advice on having conformed dimensions – “Dimensions in a Data Warehouse.” How will you do that in this new world?
And seeing as I have a background in Master Data Management (MDM), how can master data and MDM technologies help in resolving identities and attributes in these unstructured data sources? Also, how can they help in “informing” the management of dimensions and hierarchies in the analytical world to achieve semantic consistency across both 1) the new analytical world and also 2) the operational world and the analytical world?
Much to ponder on…
By John Radcliffe of Radcliffe Advisory Services
This blog post was originally published on www.radcliffeadvisory.com.