How to Keep Your Data Lake From Becoming a Swamp

In today’s big data world, issues abound. People debate structured versus unstructured data; graph versus JSON versus columnar data stores; even batch processing versus streaming. The differences among these options matter, because how data will be used should direct how it is best stored. A deep understanding of usage is therefore critical in choosing the flavor of data persistence.

These details must be explored every time new data is acquired and every time a new use for data arises. But a more important discussion is often avoided: documented versus undocumented data.

For our context here, “documented data” means that enough metadata exists for the stored content to be understood and used by data scientists and others within the organization. At the opposite pole, “undocumented data” is content with no known metadata explaining what it is or how it is used.

Technologists often prefer to avoid metadata discussions and instead dive into other issues, such as deciding whether to use data frames or Hadoop partitions. Focusing on the purely technical can be far more interesting to IT staff than dealing with the sticky ooze that may arise in working through the process of obtaining proper metadata. Yet on a very practical level, metadata is extremely important. JSON files are often considered “self-documenting,” but an organization should consider what any new person faces when he or she wishes to process such a file. That person has to crawl through record after record, looking at the existing tags and guessing what each one means, then guessing at value sets after seeing only a subset of them, noticing that a tag is no longer used only once it disappears from the content, and discovering new tags only after they magically appear. These tasks can quickly become the lion’s share of each data scientist’s job. It is hard to focus on finding new and useful interrelationships inside the data when one’s time is consumed figuring out the basics for the 20th time.
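To make that crawl concrete, here is a minimal Python sketch of the guesswork an undocumented JSON feed forces on each new reader. It assumes newline-delimited JSON with flat, top-level tags; the function name profile_tags and the sample records are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def profile_tags(lines):
    """Summarize the top-level tags seen in newline-delimited JSON records.

    Hypothetical helper: it only reports what happens to be present in the
    records it is given, which is exactly the limitation of guessing.
    """
    summary = defaultdict(
        lambda: {"count": 0, "values": set(), "first_seen": None, "last_seen": None}
    )
    total = 0
    for index, line in enumerate(lines):
        record = json.loads(line)
        total += 1
        for tag, value in record.items():
            entry = summary[tag]
            entry["count"] += 1
            entry["values"].add(repr(value))  # observed values, not the full value set
            if entry["first_seen"] is None:
                entry["first_seen"] = index   # where a "new" tag first appears
            entry["last_seen"] = index        # where a tag was last observed
    return total, dict(summary)


# Made-up sample: "region" stops appearing and "channel" shows up later.
sample = [
    '{"id": 1, "status": "open", "region": "EU"}',
    '{"id": 2, "status": "closed", "region": "US"}',
    '{"id": 3, "status": "open", "channel": "web"}',
]
total, tags = profile_tags(sample)
```

Even with this summary in hand, nothing says what "status" means, whether "open" and "closed" are the only legal values, or why "region" stopped appearing. Those answers can only come from metadata.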

Once metadata is acknowledged as important, the next question becomes: just what metadata does the organization wish to make mandatory? Further questions follow. Who owns the incoming data? What items exist within the incoming data and what do they mean (data dictionary)? Who needs to approve internal access to the data? Where is the data (server names, table names, file names, folder names, etc.)?
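One way to picture the answers to these questions is as a single record per dataset. The Python sketch below is only an illustration under assumed names: the class DatasetMetadata, its fields, and the sample values are hypothetical and do not follow any particular catalog product’s schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry answering the questions above."""
    name: str
    owner: str                       # who owns the incoming data
    access_approver: str             # who approves internal access
    locations: List[str]             # server, table, file, or folder names
    data_dictionary: Dict[str, str] = field(default_factory=dict)  # item -> meaning

    def describe(self, item: str) -> str:
        """Return the documented meaning of an item, or flag it as undocumented."""
        return self.data_dictionary.get(item, "undocumented")


# Illustrative values only.
orders = DatasetMetadata(
    name="orders",
    owner="sales-operations",
    access_approver="data-governance-board",
    locations=["warehouse01/sales/orders"],
    data_dictionary={"order_id": "Unique identifier for one order"},
)
```

With such a record, orders.describe("order_id") returns the documented meaning, while an item missing from the dictionary comes back as "undocumented" instead of being silently guessed at.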

Just as the content of the data lake may be structured or unstructured, metadata may also have an unstructured component. A folder name or table name and a subject matter expert name can each have its own bucket in a spreadsheet or table, while a link to a profiling report on the items within the dataset can provide useful information for a data scientist to read through. Regardless of what information is flagged as mandatory, how does the organization wish to respond when a mandatory item is not available? Does everything stop? Issues such as missing requirements are where the rubber meets the road on how committed an organization is to data governance.
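The question of what happens when a mandatory item is missing can be written down as an explicit policy. The following Python sketch assumes two hypothetical policies, "block" (everything stops) and "warn" (the dataset is admitted and the gaps are reported); the field names, function names, and exception are likewise assumptions made for illustration.

```python
# Assumed set of mandatory items; each organization would choose its own.
MANDATORY_FIELDS = ("owner", "access_approver", "location", "data_dictionary")


class MissingMetadataError(Exception):
    """Raised when the policy is to stop on missing mandatory metadata."""


def missing_fields(metadata, mandatory=MANDATORY_FIELDS):
    """List the mandatory fields that are absent or empty."""
    return [name for name in mandatory if not metadata.get(name)]


def admit_dataset(metadata, policy="block"):
    """Apply the chosen policy and return the list of missing fields.

    "block": raise and stop ingestion if anything mandatory is missing.
    "warn":  admit the dataset anyway and hand back the gaps for follow-up.
    """
    if policy not in ("block", "warn"):
        raise ValueError(f"unknown policy: {policy}")
    missing = missing_fields(metadata)
    if missing and policy == "block":
        raise MissingMetadataError("missing mandatory metadata: " + ", ".join(missing))
    return missing


# Illustrative entry with two gaps.
incomplete = {"owner": "sales-operations", "location": "warehouse01/sales/orders"}
gaps = admit_dataset(incomplete, policy="warn")
```

Choosing between the two policies is a governance decision rather than a coding one. The sketch only shows that whichever answer the organization gives can be enforced consistently instead of case by case.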

Metadata issues are critical because, without minimal information, development cycles are wasted simply tracking down whether a data element exists consistently among the transactions about to be processed. A coherent ontology for storing data profiling results can be a treasure trove that saves people many hours of work. A lack of coherent and useful metadata is the simplest way to turn one’s data lake into a muddy data swamp where things just go to die.