"Big data" and the impact of analytics on large quantities of data is a persistent meme in today’s information technology market. But what exactly is big data? The most common definition was coined by Forrester Research defining big data in terms of “The 4 V’s” -- volume, velocity, variety, variability.
The first V is volume and that is the obvious one, right? In order for “data” to be “big” you have to have a lot of it. And most of us do in some form or another. A recent survey published by IDC claims that the volume of data under management by the year 2020 will be 44 times greater than what was managed in 2009. But volume is only the first dimension of the big data challenge; the others are velocity, variety, and variability. Velocity refers to the increased speed of data arriving in our systems along with the growing number and frequency of business transactions being conducted. Variety refers to the increasing growth in both structured and unstructured data being managed. And the fourth V, variability, refers to the increasing variety of data formats (as opposed to just relational data). Others have tried to add more V’s to the Big Data definition, as well. I’ve seen and heard people add verification, value, and veracity to this discussion.
Management and Administration
One of the big questions looming in IT departments about big data is what, exactly, it means in terms of management and administration. Will traditional data management concepts such as data modeling, database administration, data quality, data governance, and data stewardship apply in the new age of big data?
Well, according to analysts at Wikibon, big data refers to datasets whose size, type and speed of creation make it impractical to process and analyze with traditional tools (see http://t.co/awsPyuqXjZ). So, given that definition, it would seem that traditional concepts are at the very least “impractical,” right?
But of course traditional data management concepts should apply in the age of big data management. Failing to apply these concepts will result in poor data quality, and analytics performed on poor-quality data will produce bad results. And the whole purpose of big data is to glean intelligence from the large amounts of data we accumulate.
Issues and Adaptations
Yet there are issues to address, and adaptations will be required, as we apply data quality, data governance and data stewardship to big data management. For example, with traditional data quality, some amount of cleansing can occur as humans eyeball the data. But most raw big data is not eyeballed because there is simply too much of it.
In some applications big data is generated from automated machinery. In those cases (e.g., medical devices, automated metering, etc.) only rudimentary cleansing (if any) may be needed, at least as long as the meters are calibrated and maintained properly.
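As a rough illustration of what "rudimentary cleansing" of machine-generated data might look like, the sketch below drops meter readings that are missing or fall outside a plausible range. The field names (meter_id, kwh), the range limits, and the clean_readings function are all assumptions made for this example; they are not taken from any particular metering system.

```python
from typing import Iterable, Iterator, Optional, TypedDict


class Reading(TypedDict):
    """Hypothetical shape of one automated meter reading."""
    meter_id: str
    kwh: Optional[float]


def clean_readings(
    readings: Iterable[Reading],
    min_kwh: float = 0.0,      # assumed lower bound for a valid reading
    max_kwh: float = 1000.0,   # assumed upper bound for a valid reading
) -> Iterator[Reading]:
    """Yield only readings whose value is present and within range."""
    for reading in readings:
        value = reading["kwh"]
        if value is None:
            continue  # the device reported no value
        if not (min_kwh <= value <= max_kwh):
            continue  # outside the plausible range (this also rejects NaN)
        yield reading


sample = [
    {"meter_id": "m-1", "kwh": 12.5},
    {"meter_id": "m-2", "kwh": None},
    {"meter_id": "m-3", "kwh": -4.0},
    {"meter_id": "m-4", "kwh": 987.0},
]
cleaned = list(clean_readings(sample))
```

Because the function is a generator, it can be applied to a stream of readings without holding the full data set in memory, which matters when no human will ever inspect the raw data.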
The bottom line with big data management is that the speed of data accumulation and the overall data volume can make traditional data management techniques challenging. Policies, procedures, automation and education are needed to ensure that big data makes its way to the right systems and people.
But let’s not burden big data management with things we have yet to master and incorporate into all of our traditional data systems. Sometimes we forget that, in practice, many organizations do not follow a traditional data lifecycle, practice data governance, ensure data quality, and so on. So yes, big data should do these things, but it is not necessarily failing if it does not.
Data Stores for Big Data Processing
Another consideration is the data stores used for big data processing and how they are to be managed. Frequently, big data is coupled with NoSQL database systems. The biggest difference between a NoSQL DBMS and a relational DBMS is that NoSQL does not rely on SQL for accessing data. Additionally, a NoSQL DBMS typically does not require a fixed table schema, does not provide ACID properties (instead delivering “eventually consistent” data), and is highly scalable. There are no hard-and-fast rules as to how NoSQL databases store data. Many are Hadoop-based. Some of the more popular NoSQL storage mechanisms include key-value stores, graph databases, and document stores.
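To make the three storage models a little more concrete, here is a minimal sketch that uses plain Python data structures as stand-ins. This is not the API of any real NoSQL product; every name and record below is invented for illustration.

```python
from typing import Any, Dict, List, Set

# Key-value store: an opaque value is looked up by a unique key.
kv_store: Dict[str, bytes] = {}
kv_store["session:42"] = b'{"user": "alice"}'

# Document store: each record describes itself, so two documents in the
# same collection need not share the same fields (no fixed table schema).
documents: List[Dict[str, Any]] = [
    {"_id": 1, "name": "alice", "email": "alice@example.com"},
    {"_id": 2, "name": "bob", "tags": ["admin", "ops"]},
]

# Graph database: nodes plus directed edges, here as an adjacency map.
edges: Dict[str, Set[str]] = {
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": set(),
}


def find_by_field(docs: List[Dict[str, Any]], field: str, value: Any) -> List[Dict[str, Any]]:
    """Return documents whose `field` equals `value`; absent fields never match."""
    return [doc for doc in docs if field in doc and doc[field] == value]


def reachable(graph: Dict[str, Set[str]], start: str, max_hops: int) -> Set[str]:
    """Return nodes reachable from `start` by following at most `max_hops` edges."""
    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop, skipping nodes that were already visited.
        frontier = {n for node in frontier for n in graph.get(node, set())} - seen
        seen |= frontier
    return seen - {start}
```

The point of the contrast is the access pattern: a key-value store answers "give me the value for this key," a document store can query inside loosely structured records, and a graph store is built around traversing relationships.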
Hadoop-based products and NoSQL database systems will need to be augmented with mission-critical DBMS capabilities before they become a trusted component of the IT infrastructure. Many of these systems do not have the robust ACID qualities and data protection capabilities of the traditional RDBMS. But I think DB2 and other RDBMS products are likely to be extended with big data capabilities before that happens. Witness the work that IBM is doing with DB2 for z/OS and Netezza.
With big data we may be shifting into a new paradigm, and we need to take advantage of that shift to implement the data management practices that will ensure success. In other words, we need to ensure that we treat (big) data like the corporate asset it is.