Dreams of Mining Data Science Gold Begin With Knowing Your Data


Data lakes, data warehouses, data marts, relational database management systems, Hadoop, Cassandra, MongoDB, Neo4j, and so on, and so forth: there is no shortage of approaches to discuss. Arguments arise over third normal form, XML, complex data values, simple data values, multidimensional structures, denormalization, and a thousand other things. New terms to argue over emerge every day: Have you heard mention of data munging, data wrangling, or data ingestion? It appears that the ground for deliberation remains ever fertile.

But long before one needs to worry over any of those terms and ideas, there is a first step that must be taken. The organization must decide that, “Yes, data is important.” And from that decision, steps must follow that express that importance. How does an organization acknowledge that data is important? It does so by enabling and supporting efforts to gather and persist information about the organization’s data resources. In other words, in some way, shape, or form, it must commit to establishing metadata. And alongside the growing pool of metadata, it must initiate some basic data governance. If an organization has little or no commitment to a coherent, functional, and useful metadata policy, the rest is moot. The result will be a hollow solution that is a success in name only, i.e., the data lake turns out to be a swamp or the data warehouse becomes a data junkyard.

Tools are fairly unimportant. Whether one purchases a metadata repository or simply publishes a bunch of spreadsheets out to a SharePoint site does not matter. Certainly, one tool or another can help optimize workloads or administration activities, but it is more important to gather the actual information than it is to find the shiniest tool you can afford. It is also important that the gathered information be available across the organization. Not everyone can open up the source code of every application in an organization to find out what each file, table, or field was intended for, or how to decode its values. Therefore, the organization is responsible for providing the needed transparency into the data and the business rules surrounding it.

Regardless of what any vendor may say, metadata that carries business meaning cannot be created via automation. Spooling together the lines of code that have moved a single field from point A to point B, even when done correctly, does not express any business rule in a coherent description. Tools that provide such versions of data lineage expose pieces of code, but that is all they expose: raw code that is cryptic, potentially misleading, or repetitive, and offers no real insight. What is needed are precise details of what each field means, how it changes over time, and how it relates to other fields and their changes over time. Whether one is coding in SQL or in MapReduce, one needs to know how to recognize the elements one is seeking. Once clear metadata exists, the business is better able to ask the right questions of the right data, as opposed to playing a guessing game that wastes everyone’s time. As the right questions become more obvious, it becomes easier to establish the data marts that support the usual and expected issues, and easier to establish reporting and analysis tools that give users more capabilities for self-service. This frees the more technical resources to explore the more complex issues and to help uncover the next generation of right data and right questions, continuing the cycle.
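To make the idea concrete, here is one minimal sketch of what a hand-written metadata entry might capture: what a field means, how that meaning has changed over time, how to decode its values, and which other fields it relates to. Everything in it is hypothetical and chosen only for illustration, including the class names, the `orders.status_cd` field, its codes, and its dates. It is not a standard and not any product's format; a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Revision:
    """One dated change to what a field means (hypothetical structure)."""
    effective: date
    meaning: str


@dataclass
class FieldMetadata:
    """A human-authored description of a single field (hypothetical structure)."""
    table: str
    name: str
    revisions: list[Revision]                              # how the meaning changed over time
    codes: dict[str, str] = field(default_factory=dict)    # how to decode stored values
    related: list[str] = field(default_factory=list)       # other fields it relates to

    def meaning_on(self, day: date) -> Optional[str]:
        """Return the meaning in force on `day`, or None if none was recorded yet."""
        current = None
        for rev in sorted(self.revisions, key=lambda r: r.effective):
            if rev.effective <= day:
                current = rev.meaning
        return current


# Illustrative entry; the field, codes, and dates are invented for this example.
status = FieldMetadata(
    table="orders",
    name="status_cd",
    revisions=[
        Revision(date(2015, 1, 1), "Fulfilment state of the order."),
        Revision(date(2018, 7, 1), "Fulfilment state; 'H' (on hold) added for credit review."),
    ],
    codes={"O": "open", "S": "shipped", "H": "on hold"},
    related=["orders.ship_date"],
)

print(status.meaning_on(date(2016, 3, 1)))
```

The point of the sketch is that every value in it had to be written by someone who knows the business. An analyst querying 2016 data can see that the "on hold" code did not yet exist, which is the kind of insight that raw lineage code does not provide.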

If you do not know your data thoroughly, how can you possibly decide how best to put it to use? Data is a resource, not a burden. Knowledge is power, so power up your data workers, whether they are data scientists, data analysts, report builders, or anyone else. Give them data knowledge. Once the information is exposed and understood, all sorts of good things can start.