The Big Unstructured Data Lie

As a data professional reading this publication, I am sure you have heard the term “unstructured data,” and you probably know what is meant by it. For those who do not, unstructured data is a general term for data that is not numbers, letters, and dates stored or viewed as rows and columns.

But it is a horrible term. In fact, unstructured data is a lie. Let me tell you why.

In order for any type of data to be read or processed by a computer, it must have some form and structure. If there is no structure, that is, if the data is truly unstructured, then it cannot be understood. But let’s back up a moment and examine what is going on with unstructured data these days.

Ever since they were first designed, computerized systems have been used to store, manage, review and analyze traditional data, meaning:

  • character text such as names, addresses, notations, etc.
  • numeric data such as price, salary, counters, etc.
  • dates and times (originally stored as character text, only later as their own data type in relational database systems)

This type of traditional data is also referred to as structured data. This is an accurate description in that there are specific structures and instructions for storing and accessing each type of data.

Data that falls outside of this scope has been referred to as unstructured data. Common examples of unstructured data include word processing documents and spreadsheets. But think about what unstructured really means. Have you ever tried to read a spreadsheet document using a text browser? Or tried to open a file in Microsoft Word that is not in a format Word understands? In both cases, the attempt either fails or the “data” is displayed as a bunch of seemingly random characters.

So, what is the point? Well, the point is that there is a structure to this data. If the structure is not correct, then it cannot be read. So calling it unstructured data is not an accurate description. It is structured, albeit in a different way than traditional, structured data.

Today, more and more types of data are being ingested and processed by our computer and database systems. It is a common practice for multimedia data—such as video, audio, and image files—to be processed and analyzed. And people refer to this as unstructured data. But it is anything but unstructured! For images, there are various structures such as JPG, TIF, GIF, and PNG. The same is true for audio with formats such as WMA, AAC, FLAC, and MP3. For video, we have MPG, MOV, WMV, and RM formats. It is not possible to access any of this multimedia data using software that does not understand the structure. So how can it be called unstructured?
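To make that concrete, here is a minimal Python sketch, assuming all you have is the first few bytes of a file. Many of these formats begin with a fixed signature (sometimes called a magic number), and software uses it to recognize the structure that follows. The function name sniff_format and the SIGNATURES table are my own illustrative choices, and the table covers only a handful of the formats mentioned above.

```python
# Leading signature bytes ("magic numbers") for a few common formats.
# This table is illustrative and far from complete.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
    b"fLaC": "FLAC audio",
}


def sniff_format(header: bytes) -> str:
    """Guess a file's format from its leading bytes (hypothetical helper)."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    # A real PNG file begins with these eight bytes.
    print(sniff_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
    # Plain text carries none of the signatures in the table.
    print(sniff_format(b"just some text"))
```

If these files truly had no structure, a check like this would have nothing to look for.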

And yes, there are other types of data that are commonly referred to as unstructured, including log files, social media data, and more. But this data is not unstructured either. It has a structure, and if you don’t know that structure, you cannot make sense of the data. If you don’t believe me, just try to read the log files of your favorite DBMS without referring to the manual and see how much of it you understand!
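As a rough illustration, consider a made-up log layout of timestamp, bracketed severity, and message. This layout is an assumption for the sake of the example and is not taken from any particular DBMS, and the names LOG_PATTERN and parse_log_line are hypothetical. Once you know the structure, a few lines of Python turn each line into named fields; without it, you are left guessing.

```python
import re

# Assumed (hypothetical) layout: "YYYY-MM-DD HH:MM:SS [LEVEL] message"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str):
    """Return the line's fields as a dict, or None if it does not fit the layout."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_log_line("2024-01-15 08:30:00 [ERROR] connection lost"))
    print(parse_log_line("this line follows some other structure"))
```

A line that follows a different layout comes back as None, which is the same thing that happens to a reader who does not have the manual.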

The Bottom Line

While there is a benefit to having a term that everybody understands, such as unstructured data, there is also a detriment when the term is inaccurate. It is easy to summarily dismiss unstructured data because, well, it doesn’t have a structure, so how can it be of any value? A better, though more unwieldy, term would be “differently structured data.” But that doesn’t have much chance of catching on. So, whenever you hear the term unstructured data, try to translate it in your head to “differently structured data” and work to figure out how you can use that data to the benefit of your organization.
