Needed? Unnecessary? Yes!

Data modeling has always had a push-me/pull-you relationship in the IT world. The idea of complete logical and physical data models, insightful definitions, coherent data flow diagrams, and other related informational caches is always desirable.

However, far too often, the expectation is that these activities take virtually no time to create. When the expenditure of resources goes beyond a certain level, the activities are jettisoned or otherwise given short shrift. Actually getting code in place is viewed as the only evidence of true progress. Under the right conditions, data modeling-related activities can be completed quickly. Those conditions require that knowledge of the data exists and is accessible. Often, data knowledge is absent, and leaders expect data modelers or data architects to simply intuit everything in some magical fashion, like a fortune teller reading tarot cards.

The unwanted delays in creating logical or physical data models arise from the need to profile existing data stores, evaluate the values found, interview any number of people to learn the true relationships across the data, and make reasonable assumptions about the meaning and use of every data item.

Data modeling becomes more tenuous as the tools we use advance. More and more tools allow the data pipeline builder or engineer to become the de facto data modeler. Often the pipeline builder has far less concern for the data modeling function than a dedicated data modeler would.

These pipeline builders may not even have the necessary skills or licenses to generate a data model diagram in a formal tool. Physical models as generated by an ETL tool may be the only documentation one has. Logical data modeling and advance planning are cast aside. These circumstances lead to data structures remaining largely unchanged from source to target. Maybe a few new items are bolted on, an item or two are thrown away, but no effort is expended in tracking down meaning and thinking through normalization or dimensional rules. The data sits there, people can execute whatever queries they want, and users can get answers, therefore all is well. Right?

In the dim past, there were performance reasons why data needed to be well-structured. Semantically disharmonious data structures offered poor query performance. Semantically valid structures aligned so well with how queries were formed that queries would execute more smoothly. But modern databases and similar tools often provide good to great performance even on the poorest of data structures. Under these new conditions, it is easy to understand why many in management see a diminished need for worrying about the data modeling function.

But more than ever, understanding our data has become critical to business success. Just as AI applications “hallucinate” when fed bad data, a lack of thorough data understanding prevents data scientists from providing useful insights. Enterprise perspectives have value; consistent standards in naming and encoding enhance data usefulness. And suboptimal data structures do cause excessive redundant code to be executed, code that must clean up issues, work around bad data, and enforce standardization.

While these extra bits of code may execute quickly, why waste the cycles? Why replicate junky code when a single transformation and cleanup, done while building a proper dimension, fact, or normalized table, can make everything simpler and much more direct?
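A small sketch can make the point concrete. In the hypothetical Python example below (the field names, value fixes, and codes are all invented for illustration), the same standardization logic that would otherwise be pasted into every downstream query is applied once, while loading a clean dimension-style table; every query afterward works against consistent values.

```python
# Hypothetical raw feed: inconsistent state spellings and status codes.
raw_customers = [
    {"id": "001", "state": "ny",   "status": "A"},
    {"id": "002", "state": "N.Y.", "status": "active"},
    {"id": "003", "state": "CA",   "status": "ACT"},
]

# Codes treated as "active" in this made-up example.
ACTIVE_CODES = {"a", "act", "active"}

def clean(row):
    """Standardize encodings once. Without a modeled target table,
    this same logic tends to be copied into every downstream query."""
    state = row["state"].replace(".", "").upper()
    status = "ACTIVE" if row["status"].lower() in ACTIVE_CODES else "INACTIVE"
    return {"id": row["id"], "state": state, "status": status}

# Build the cleaned dimension-style table one time.
dim_customer = [clean(r) for r in raw_customers]

# Queries now stay simple: no per-query cleanup or workaround code.
ny_active = [r["id"] for r in dim_customer
             if r["state"] == "NY" and r["status"] == "ACTIVE"]
```

The design choice is the same one the paragraph above argues for: pay the transformation cost once, at load time, rather than re-executing (and re-maintaining) the cleanup in every query that touches the data.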

While modeling and formatting our data well may not matter much for query performance, it is a great convenience that makes our access easy and understandable. Easing the coding burden also frees the mind of the engineer or scientist to explore deeper meanings and options. The understanding gained by spending the necessary time with one’s data is among the most valuable assets an organization can acquire. Data modeling is still of value to the organization because it helps everyone understand the data, so that they can work faster, better, deeper.