The Coming Datapocalypse


With the rise of object-oriented programming, the world was transformed forever, and indeed largely for the better. Previously, application creation moved forward in a plodding fashion: analysis, design, code, test, and deploy. Things took time. The time and resources required were valuable, so things had to be done surely and correctly, measure twice, cut once. But then the world changed; people could create complex processing more quickly than ever before. Things once considered mind-bogglingly complicated to do in software started becoming much easier. Almost overnight, the world evolved from green-screen applications, tabbing along and entering text, to high-resolution graphics, touch screens, gesture controls, and voice recognition, all while streaming movies. Development was so quick that it was often easier to do it over until it came out “right.” Systems grew more complicated, and a much larger allowance for errors became the norm. Things could freeze up, and unexpected results might arise here or there, yet where those problems came from, or how to resolve them, was often unknown. Ever since, software vendors have been keen to point fingers at any component but their own across the many layers now comprising each solution, and sadly they can rarely be proven wrong.

In a fashion similar to the arrival of object-oriented programming, the current wave of NoSQL/big data/alt-DBMS tools, packages, and distributions holds the potential to have an equally world-shaking impact across the persistence-of-data landscape. Each of these tools encourages users to slap together more files, and users are emboldened to capriciously add in anything under the sun. Many of these tools are so “flexible” that an unintended typo from a developer simply creates a new attribute, file, or array without so much as a warning. Everyone is a DBA; specialized skills are downplayed as being of little importance. What matters is that everyone can create new data stores without hindrance. The result is more people creating more heaps of data stores more quickly than ever before, proliferating more files, tables, and JSON documents.
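The typo hazard is easy to demonstrate. Below is a minimal sketch using an in-memory “collection” modeled as a list of Python dicts; the collection name, field names, and the misspelling are all hypothetical, though schemaless document databases behave analogously by default.

```python
orders = []  # a schemaless "collection": any document shape is accepted

def insert(document):
    """Store a document with no schema validation whatsoever."""
    orders.append(document)

insert({"order_id": 1, "customer_name": "Acme", "ship_date": "2024-01-15"})

# A developer misspells "customer_name"; the store raises no warning and
# quietly creates a brand-new attribute instead.
insert({"order_id": 2, "custmer_name": "Globex", "ship_date": "2024-01-16"})

# Queries against the intended attribute now silently miss the second record.
matches = [d for d in orders if "customer_name" in d]
print(len(matches))  # 1 — only one of the two orders is found
```

No error ever surfaces; the data is simply wrong, which is precisely why such mistakes go unnoticed until someone reconciles the counts.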

Without some functional level of governance we are headed toward a “datapocalypse”: having lots of content but little of value. Petabytes, and more, of data are meaningless without guides to what it all is. Future data users need to know where it came from, how it was filtered, and how it was transformed. In an environment lacking proper metadata and documentation, the few individuals who understand the data, those cloaked, mystical “data whisperers” off on the fringes, are the ones with power. Everyone must wait in the knowledge-bearers’ queue before embarking on any new project. To prevent this future, we all need to practice a functional level of data hygiene. It must no longer be acceptable for analysts to create data stores with meaningless names, such as MyData, containing attributes named with keywords such as KEY, NAME, and DATE.
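Even a lightweight automated check can enforce this kind of hygiene. The sketch below rejects generic or keyword-style attribute names before a new data store is created; the banned list and the length rule are illustrative assumptions, not any published standard.

```python
# Generic names that describe nothing about the business meaning of a column.
BANNED_NAMES = {"key", "name", "date", "value", "data", "mydata"}

def check_attribute_names(attributes):
    """Return a list of naming problems; an empty list means the names pass."""
    problems = []
    for attr in attributes:
        if attr.lower() in BANNED_NAMES:
            problems.append(f"'{attr}' is too generic to be meaningful")
        elif len(attr) < 3:
            problems.append(f"'{attr}' is too short to describe anything")
    return problems

print(check_attribute_names(["KEY", "NAME", "DATE"]))        # three problems
print(check_attribute_names(["customer_key", "ship_date"]))  # []
```

A check like this could run in a review pipeline so that MyData-style stores never reach production in the first place.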

Tomorrow, when the data store originators no longer work there, what they have done will be ambiguous. Resources will be wasted attempting to reverse-engineer leftover logic, assuming that the initial logic creating a particular store is identifiable and still available.

Functional data management is not burdensome. Having simple descriptions of files and good standards for naming objects really does not slow anything down. It may only be a developer’s fear of writing English rather than Python that prevents more adequate documentation from existing across many organizations. A hybrid mix of many tools and approaches may be a very good choice, but it most certainly is not a healthy choice if everything involved is a jumbled muck and mire that can barely be identified or explained. Take care of the present, and the future will take care of itself. Consistently practice good data hygiene across everything you do, and you may find yourself the one among your peers with the true competitive edge.

Todd Schraml has more than 20 years of IT management, project development, business analysis, and database design experience across many industries from telecommunications to healthcare. He can be reached at