How the Term “Unstructured Data” Confuses People

Unstructured data is a misleading term; it does not mean the data is truly without structure. Some people are swayed by the siren song of the unstructured-data sales pitch: no need to understand the data, no need for data modeling, and faster speed-to-storage. In extreme cases, individuals may falsely believe that once they go "unstructured," they can discard data modeling and related functions entirely. If a dataset were truly unstructured, its content would be incoherent gibberish. Since real datasets are not gibberish, there is always a structure even within unstructured data. That structure is loose, and possibly very non-repetitive, as opposed to the more formal fixed-record structures in which each data element starts and stops at explicit, known positions.

Unstructured data's structure is subtler and takes more effort to uncover. A given data element may start only after a specific constant string is found, or an item may be present only occasionally. Some unstructured data may require a specialized tool to interpret its meaning, as when the content is a video clip.
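To make this concrete, here is a minimal sketch of pulling such loose structure out of a text feed. All field names and marker strings (`TEMP=`, `ALERT:`) are hypothetical, chosen only to illustrate a value that begins after a constant marker and an item that appears only occasionally:

```python
import re

# Hypothetical example: in this loose feed, a temperature reading
# appears only after the marker "TEMP=", and the "alert" field may
# be absent from a record entirely.
def parse_record(line: str) -> dict:
    record = {}
    temp = re.search(r"TEMP=(-?\d+(?:\.\d+)?)", line)
    if temp:
        record["temp"] = float(temp.group(1))
    alert = re.search(r"ALERT:(\w+)", line)
    if alert:  # present only occasionally
        record["alert"] = alert.group(1)
    return record

print(parse_record("2024-01-07 sensor9 TEMP=21.5 ok"))
print(parse_record("2024-01-07 sensor9 TEMP=-3 ALERT:frost"))
```

The structure is there, but it lives in the extraction rules rather than in fixed record positions.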

When a tool ingests unstructured data, "unstructured" means that the ingestion process and the initial storage do not require a detailed understanding of the data's internal structure in order to save it. This "saving without comprehension" is very useful when little is known about an incoming data stream, or when an incoming feed is notorious for changing its content drastically without informing downstream users. Because no understanding of the internal structure is needed, the data can be saved and potential content loss prevented.
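A minimal sketch of "saving without comprehension" follows; the store layout and names here are assumptions for illustration, not any particular tool's design. The ingest step persists each raw payload verbatim, deferring all interpretation:

```python
import pathlib

# Hypothetical sketch: write each raw payload to disk unchanged, with
# only a sequence number for later retrieval. No parsing, no schema --
# the data is preserved even if its internal structure is unknown.
def ingest(payloads, out_dir="raw_store"):
    store = pathlib.Path(out_dir)
    store.mkdir(exist_ok=True)
    for i, payload in enumerate(payloads):
        (store / f"{i:06d}.raw").write_bytes(payload)
    return len(list(store.iterdir()))
```

Even a feed that changes format overnight is captured intact; making sense of it becomes a later, separate job.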

The post-ingestion use of this unstructured data has completely different requirements. All the previously missing knowledge must now exist. If a user cannot find the detailed items they are looking for, the data is useless and there was no point in saving it in the first place. However complex the rules for determining the existence and placement of a critical piece of data, every querier of unstructured data must code those rules anew each time a new circumstance calls for those data items. In other words, each user must know and understand the data at a detailed level in order to work successfully through any exploration.

Organizationally, it is in everyone's best interest for such data knowledge to be documented and shared. And when true enterprise value is found in uses of such data, then more likely than not, those items will need to be extracted, modeled, and placed into shared corporate structures that help inform the organization. In fact, those items might end up in a relational database supporting enterprise reporting and analytics.

Don’t let the term “unstructured data” confuse you. Structure exists, somehow and somewhere, within unstructured source data. It is that subtle, possibly even encrypted, structure that contains the gems of knowledge that an enterprise seeks. Sometimes those gems only shine when extracted and combined with other little gems from other data sources.

Unstructured data ingestion only slightly delays the need to understand the data, and with it the need for data modeling. Bringing the data into the organization is the only step that does not require complete data knowledge; even then, one must know enough to be confident that the data store contains data that will likely be of value and use to the enterprise. Immediately upon arrival, either the details of the data content must already be known, or the data science team's first job will be to profile and evaluate the data to build up that detailed data knowledge.
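Such profiling can start very simply. The sketch below (with hypothetical records and field names) measures how often each field appears across a feed, a first step toward the detailed data knowledge the enterprise needs:

```python
from collections import Counter

# Hypothetical sketch: report the fraction of records in which each
# field appears, revealing which items are reliable and which are
# only occasionally present.
def profile(records):
    seen = Counter()
    for rec in records:
        seen.update(rec.keys())
    total = len(records)
    return {key: count / total for key, count in seen.items()}

feed = [
    {"temp": 21.5, "sensor": "s9"},
    {"temp": -3.0, "sensor": "s9", "alert": "frost"},
    {"sensor": "s4"},
]
print(profile(feed))  # fraction of records containing each field
```

Results like these tell the team which fields are dependable enough to extract, model, and promote into shared corporate structures.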