Data is often sorted into two very high-level groupings: structured and unstructured. Structured data is generally understood to be data that conforms to an easily identifiable pattern and, because it conforms, can be loaded into a relational database table “as is.” Examples include fixed-format files and comma-separated files in which every record follows an agreed-upon layout. Unstructured data supposedly cannot be loaded “as is” into a relational table. Unstructured data, as the name suggests, lacks any identifiable structure for making sense of it, right? Not exactly.
Unstructured data is data that contains more complex patterns, patterns that may have to be evaluated closely to extract the meaningful bits. Consider JSON and XML; both kinds of files fall under the unstructured umbrella. JSON and XML data sources do have patterns, specifically the keys and tags that identify the data items within the file. By parsing through these keys and tags, one may pick up the individual pieces of data that are of immediate use and value. Email and other largely textual documents, also counted as unstructured data, can be understood clearly by an individual reading their content. Language, while certainly subject to misunderstanding, does convey meaning. Even photographic files can be viewed and understood by someone looking at them with the necessary tool.
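As a rough illustration of picking data items out by their keys and tags, here is a minimal Python sketch using only the standard library. The order record, its field names, and the helper functions `order_from_json` and `order_from_xml` are all hypothetical, invented for this example; real files would call for their own field names and error handling.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical order record, expressed once as JSON and once as XML.
json_text = '{"order": {"id": 1001, "customer": "Acme", "total": 249.50}}'
xml_text = (
    "<order><id>1001</id><customer>Acme</customer>"
    "<total>249.50</total></order>"
)


def order_from_json(text):
    """Pick out the fields of interest by walking the JSON keys."""
    order = json.loads(text)["order"]
    return {
        "id": int(order["id"]),
        "customer": order["customer"],
        "total": float(order["total"]),
    }


def order_from_xml(text):
    """Pick out the same fields by walking the XML tags."""
    root = ET.fromstring(text)  # root is the <order> element
    return {
        "id": int(root.findtext("id")),
        "customer": root.findtext("customer"),
        "total": float(root.findtext("total")),
    }


json_row = order_from_json(json_text)
xml_row = order_from_xml(xml_text)
print(json_row)
print(xml_row)
```

Both functions arrive at the same flat record, one that could sit in a relational table. The pattern was in the file all along; the code only had to know which keys or tags to look for.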
Extracting Meaning From Data
Tools that persist unstructured data simply store that content without fuss. These same tools do not provide users with the data’s meaning. It is not that the meaning isn’t there, but that the meaning must be identified by the users, either in how they view the content or in how they write algorithms to parse through it. While these unstructured data storage tools allow data to be loaded with little definition of the content, each person who retrieves the data must attach his or her own meaning to it. With a document-management tool, for example, the meaning lives in the minds of the users as they view the content.
Under other circumstances, data scientists may be analyzing any number of issues, and that analysis may involve simple or complex rules. Parsing through the pixels of an image is only worthwhile if one is looking for specific patterns and, on finding them, applying an interpretation, an understanding: “Look, I found this…” One can only imagine the intricate enormity of digital facial-recognition software. Just as much logic may be needed to convert encrypted files into something legible. Does that mean that encrypted fixed-format data is automatically unstructured, and only becomes structured once it is decrypted?
Obvious and Non-Obvious Data?
And if identifiable data items can be programmed for, uncovered, extracted from the unstructured content, and used, is the data really unstructured? A genuine lack of structure cannot be what “unstructured data” means, because if there were no identifiable structure, how could anyone write code to parse it intelligibly? Truly unstructured content is not data; it is only garbage and noise to be tossed away. Perhaps, in a more rational view, instead of using the terms structured and unstructured, we might use the terms obvious and non-obvious data structures?
While behavioral economists and other clever folks may write their own extractions and statistical models to determine whether some kernel of useful information exists within a non-obvious structure, most business personnel require all their data to be “obvious.” By and large, the really useful bits, once found, must become new pieces of information within the structured world so that more operational decision-makers can use them in reporting and analysis. Unstructured data simply means that someone must apply additional logic to determine what is within, or that many people must apply many alternative algorithms to extract the different items that matter to each of them.
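To make that "additional logic" concrete, the following Python sketch pulls a few items out of free text and lands them in a relational table. Everything specific here is an assumption made for illustration: the email wording, the `INVOICE_PATTERN` regular expression, the `extract_invoice` helper, and the `invoices` table layout. A different reader with different interests would write a different pattern against the very same text.

```python
import re
import sqlite3

# Hypothetical email body; the wording and layout are invented for this sketch.
email_body = """Hi team,
Please note that invoice INV-2041 for $1,250.00 is due on 2024-03-15.
Thanks,
Dana
"""

# One reader's "additional logic": a pattern for the items this reader cares about.
INVOICE_PATTERN = re.compile(
    r"invoice (?P<invoice_id>INV-\d+) "
    r"for \$(?P<amount>[\d,]+\.\d{2}) "
    r"is due on (?P<due_date>\d{4}-\d{2}-\d{2})"
)


def extract_invoice(text):
    """Return (invoice_id, amount, due_date), or None if the pattern is absent."""
    match = INVOICE_PATTERN.search(text)
    if match is None:
        return None
    return (
        match.group("invoice_id"),
        float(match.group("amount").replace(",", "")),
        match.group("due_date"),
    )


# Move the extracted bits into the "structured world": an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_id TEXT, amount REAL, due_date TEXT)")

row = extract_invoice(email_body)
if row is not None:
    conn.execute("INSERT INTO invoices VALUES (?, ?, ?)", row)

stored = conn.execute(
    "SELECT invoice_id, amount, due_date FROM invoices"
).fetchall()
print(stored)
```

Once the row is in the table, it is "obvious" data that any reporting tool can query. The email itself is unchanged, and it remains open to other extractions by other people.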