Data lakes have helped organizations deal with the massive amounts of data generated daily. They are intended to serve as a central repository for raw data, a treasure trove for data scientists to analyze and gain actionable insight. They also serve as the foundation for many “self-service” analytics initiatives. The promise of a data lake includes a pristine environment that is easy to use and gives organizations the opportunity to use machine learning to mine the true value of their data. The insight gleaned will help increase competitiveness and improve decision making.
One of the main attractions of data lakes is flexibility. Getting data into a lake is simple. Getting insight and value from all of that data, however, has proven to be challenging. A recent Forrester report found that 60%–73% of all enterprise data goes unused for analytics. This statistic exposes some of the harsh realities of data lakes.
The biggest hurdle with a data lake arises at its inception. It’s quite easy, and common, for stakeholders to be misaligned on the purpose and content of the lake. In their eagerness to aggregate data, users have often overlooked establishing the processes and controls that address key questions about the data: What is it? Who owns it? How should it be defined? Does it really belong in the lake? Instead, every bit and byte of data has found its way into the lake, and this lack of oversight has muddied the waters. As a result, it’s difficult for users to know what’s in the lake, or whether the data they do find can be trusted.
Bigger Is Not Always Better
Many of us who manage a lake know that more data does not always mean greater insight. A recent MIT Sloan Management Review article highlighted research showing that the more data we hold onto, the more difficult it becomes to gain insight from that data. We are tapping only a fraction of the potential value of our data.
If you’ve inherited a swamp, or are just starting out, implementing the right level of governance is critical to extracting value. Data governance creates a framework and provides transparency across the organization regarding what data it has, who owns it, and, of course, whether it belongs in the lake at all.
A good data governance framework should be combined with a data catalog. A data catalog offers a single source of intelligence where data users can discover and consume data. Users can collaborate to understand the data’s meaning and use, and to determine which data is fit for purpose and which is unusable, incomplete, or irrelevant. A data catalog should cover every category of data in the lake, and it should identify the most valuable datasets. For example, if 80% of users consume the same 10% of the data, the catalog needs to be able to detect this and label that 10% as the most valuable data.
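As a rough illustration of how a catalog might surface its most valuable datasets, the sketch below labels a dataset as high value when a large share of users have touched it. The access log format, the function name `label_high_value`, and the 80% threshold are all assumptions made for this example; a real catalog product would collect usage in its own way.

```python
from collections import defaultdict

def label_high_value(access_log, all_datasets, user_share_threshold=0.8):
    """Label each dataset by the share of distinct users who accessed it.

    access_log: iterable of (user, dataset) pairs (hypothetical format).
    all_datasets: every dataset name registered in the catalog.
    """
    users_by_dataset = defaultdict(set)
    all_users = set()
    for user, dataset in access_log:
        users_by_dataset[dataset].add(user)
        all_users.add(user)

    labels = {}
    for dataset in all_datasets:
        # Share of all known users who accessed this dataset at least once.
        share = len(users_by_dataset[dataset]) / len(all_users) if all_users else 0.0
        labels[dataset] = {
            "user_share": share,
            "high_value": share >= user_share_threshold,
        }
    return labels

# Example: five users, ten datasets, and everyone relies on "sales".
datasets = ["sales"] + [f"archive_{i}" for i in range(9)]
log = [(f"user{i}", "sales") for i in range(5)] + [("user0", "archive_3")]
labels = label_high_value(log, datasets)
high_value = [name for name, info in labels.items() if info["high_value"]]
unused = [name for name, info in labels.items() if info["user_share"] == 0.0]
```

In this toy log, one dataset out of ten is consumed by every user, so it is the only one flagged. The same pass also reveals datasets that nobody has accessed, which are natural candidates for a "does it really belong in the lake?" review.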
Data catalogs are helping organizations come of age in the big data era. One example is Cox Automotive, a subsidiary of Cox Enterprises. Cox Automotive was formed in 2014 to consolidate Cox’s global automotive businesses. The consolidation involved many brands and divisions and, as expected, a massive amount of data. After the consolidation, there was growing confusion among Cox Automotive teams and departments regarding the veracity of reports that were critical to the business. Multiple data sources with different definitions of the same metrics raised more questions than they delivered value.
Cox Automotive faced a question familiar to many of us: How do you effectively gain insight from a vast pool (or pools) of raw data and get that information into the hands of the groups that need it? The company took a centralized approach in creating a comprehensive data catalog to discover what data it had. The company then categorized and segmented items into searchable buckets for end users, similar to the Yellow Pages.
For Cox Automotive and any other company, the data catalog helps organizations know:
- What data is available
- What the data means
- The origin of the data
- The data’s consistency
- How to access the data
- How complete and accurate the data is
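To make the list above concrete, here is a minimal sketch of a catalog record with one field per question, plus a simple keyword search in the spirit of a directory lookup. The `CatalogEntry` class, its field names, and the sample entries are hypothetical and are not drawn from Cox Automotive's system or any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str                      # what data is available
    description: str               # what the data means
    origin: str                    # where the data came from
    access: str                    # how to access the data
    shared_definitions: List[str] = field(default_factory=list)  # consistency: agreed metric definitions
    completeness: float = 0.0      # share of required fields populated, 0.0 to 1.0
    accuracy_note: str = ""        # known accuracy caveats, if any

def search(catalog, term):
    """Return entries whose name or description mentions the term."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry.name.lower() or term in entry.description.lower()
    ]

# Hypothetical sample entries for illustration only.
catalog = [
    CatalogEntry(
        name="vehicle_listings",
        description="Active vehicle listings by dealer and day",
        origin="listings platform export",
        access="request read access from the data owner",
        shared_definitions=["active listing"],
        completeness=0.97,
    ),
    CatalogEntry(
        name="auction_results",
        description="Final sale price per auctioned vehicle",
        origin="auction system nightly batch",
        access="request read access from the data owner",
        completeness=0.88,
        accuracy_note="prices before fees",
    ),
]

matches = [entry.name for entry in search(catalog, "vehicle")]
```

Both sample entries match the search for "vehicle": one by name, the other by description. Keeping ownership, origin, and quality fields on the same record is what lets a user judge fitness for purpose without leaving the catalog.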
Business units within Cox Automotive can easily control the catalog, helping users understand what data they have, access that data, and even conduct gap analyses to understand which data is still needed. With strong data governance and catalog functionality now in place, Cox Automotive business users spend less time worrying about the accuracy of the data and can focus on analyzing it to gain new insights and better meet customer needs.
Data lakes have presented us with a conundrum: We are simultaneously drowning in data and dying of thirst (for information). But it’s not too late to unlock the value hidden in our data lakes. Governance, combined with a data catalog, can help us curate a pristine lake filled with the most valuable data.