Data Lakes: Pluses, Pitfalls, and the Path Forward

New data sources such as sensors, social media, and telematics, along with new forms of analytics such as text and graph analysis, have necessitated a new design pattern, the data lake, to augment traditional design patterns such as the data warehouse. The data warehouse is an approach based on structuring and packaging data for the sake of quality, consistency, reuse, ease of use, and performance. The data lake goes in the other direction: it stores raw data, which lowers data acquisition costs and provides a new form of analytical agility.

Companies are beginning to realize value from data lakes in four areas

The first two areas lead to better corporate effectiveness, while the last two areas improve IT efficiencies:

New Insights from Data of Unknown or Under-Appreciated Value

Prior to the big data trend, there was a single approach to data integration: data was normalized, or made to look the same, in some form of persistent store such as a database, and only then could value be created. The idea is that by absorbing the costs of data integration up front, the costs of extracting insights decrease. While still true and valuable, this is no longer sufficient as the sole approach to managing all data in the enterprise. Attempting to structure all of this data undermines its value, particularly for newer data whose value and extent of reuse are unknown.

For this reason, dark data is often never captured. Usually this is because the data is unstructured and arrives in large quantities, and no one has sifted through it to find out whether it holds gold nuggets of insight. Since no one has ever explored it, there is no demand for that data, and business users cannot construct a credible justification for investing in its management. Yet many times, a data scientist digs through dark data and finds a few facts worth repeating every week. That is, they find gold nuggets.

New Forms of Analytics

One of the least discussed aspects of big data innovation is how technologies like Hadoop, Spark, and Aster enable the parallelization of procedural programming languages, and how that has enabled an entirely new breed of analytics. SQL is a declarative, set-based language, and MPP RDBMSs parallelize it fantastically, but they cannot efficiently parallelize procedural code. That is to say, procedural analytics loops and branches in its logic, whereas a SQL query makes a single round trip to get an answer. Parallelizing procedural code allows new forms of analytics to be processed efficiently at scale, such as graph, text, and machine learning algorithms that get an answer, then compare that answer to the next piece of data, and so on until a final output is reached.
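To make the loop-and-branch style of analytics concrete, here is a minimal single-machine sketch of an iterative graph algorithm (connected components by label propagation). The function name and the sample graph are hypothetical and chosen only for illustration. Engines such as Spark distribute this kind of loop across a cluster, which this sketch does not attempt to show.

```python
def connected_components(nodes, edges):
    """Assign every node the smallest node id reachable from it.

    Hypothetical helper for illustration: each pass compares the current
    answer for two neighbors and keeps the smaller label, repeating until
    a full pass changes nothing.
    """
    labels = {n: n for n in nodes}      # start: every node is its own component
    changed = True
    while changed:                      # loop until the answer stops changing
        changed = False
        for a, b in edges:
            low = min(labels[a], labels[b])
            for n in (a, b):
                if labels[n] != low:    # branch: update only when needed
                    labels[n] = low
                    changed = True
    return labels


# Made-up sample graph: 1-2-3 form one component, 4-5 another.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(nodes, edges)
```

The number of passes depends on the data itself, which is what makes this kind of logic awkward to express as a single SQL statement.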

Corporate Memory Retention

Archiving data that has not been used in a long time can save storage space in the data warehouse. Many companies have compliance requirements that dictate their retention policies, and many instinctively retain data in order to do longitudinal analysis, restate KPIs with new, more effective calculations, or create more effective machine learning programs by having more observations on which to train the algorithms. Until the data lake design pattern came along, there was no place to put colder data for occasional access other than the high-performing data warehouse or offline tape backup. Moving it to a data lake is a cost-effective alternative, and with virtual query tools like Teradata QueryGrid, users can easily access it together with the data in the data warehouse through a single query.
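As a rough illustration of how colder data might be identified for a move to the data lake, the sketch below splits tables into "hot" and "cold" groups by last access date. The table names, the dates, and the 365-day threshold are all assumptions made up for this example; a real policy would be driven by the organization's own retention and compliance rules.

```python
from datetime import date, timedelta


def split_by_temperature(last_access, today, cold_after_days=365):
    """Return (hot, cold) table names based on last access date.

    `cold_after_days` is a hypothetical threshold, not a recommendation.
    """
    cutoff = today - timedelta(days=cold_after_days)
    hot = sorted(t for t, d in last_access.items() if d >= cutoff)
    cold = sorted(t for t, d in last_access.items() if d < cutoff)
    return hot, cold


# Made-up catalog of tables and the date each was last queried.
last_access = {
    "orders_2024": date(2024, 5, 1),
    "orders_2019": date(2021, 2, 10),
    "clickstream_2018": date(2019, 11, 3),
}
hot, cold = split_by_temperature(last_access, today=date(2024, 6, 1))
```

Tables in `cold` would be candidates for the data lake, while those in `hot` stay in the data warehouse.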

Data Integration Optimization

The industry has come full circle on how best to squeeze data transformation costs. Twenty-five years ago, the standard architecture for data integration was ETL, whereby data was extracted from source systems, transformed on a dedicated server, and loaded into the data warehouse. This architecture used less expensive hardware for ETL processing and separated the processing drudgery of ETL from the more strategic workloads on the data warehouse. Later, organizations faced with growing data volumes and more complex data transformations pushed the data integration function into the more capable data warehouse, in what was called ELT (extract, load, and then transform). This actually reduced costs by minimizing data movement and by using available processing windows in the data warehouse, usually overnight in the part of the world where the majority of users resided. Now the data lake offers greater scalability than traditional ETL servers at a lower cost, so once again organizations are rethinking their data integration architecture.
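The difference between the two patterns can be sketched in a few lines. The example below uses Python's built-in sqlite3 module as a stand-in for a data warehouse, which is purely an assumption for illustration; the table names and sample rows are hypothetical. In the ETL path the cleanup runs outside the database before loading, and in the ELT path the raw rows are loaded first and the cleanup runs as SQL inside the database.

```python
import sqlite3

# Hypothetical source extract: amounts arrive as untrimmed text.
raw_rows = [("2024-01-05", " 19.99 "), ("2024-01-06", "5.00")]


def etl(conn, rows):
    """Extract, Transform outside the database, then Load clean rows."""
    cleaned = [(day, float(amount.strip())) for day, amount in rows]
    conn.execute("CREATE TABLE sales_etl (day TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """Extract, Load raw rows, then Transform with SQL in the database."""
    conn.execute("CREATE TABLE sales_raw (day TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT day, CAST(TRIM(amount) AS REAL) AS amount FROM sales_raw"
    )


conn = sqlite3.connect(":memory:")   # in-memory stand-in for the warehouse
etl(conn, raw_rows)
elt(conn, raw_rows)
etl_result = conn.execute("SELECT day, amount FROM sales_etl ORDER BY day").fetchall()
elt_result = conn.execute("SELECT day, amount FROM sales_elt ORDER BY day").fetchall()
```

Both paths end with the same cleaned table. What differs is where the transformation work runs, and therefore which platform's capacity and economics it consumes.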

There is a misconception that organizations are doing wholesale ETL migrations from the data warehouse to the data lake. We have seen only one customer offload more than 50% of its data warehouse ELT (and that customer was loading 150 billion records per day). In many scenarios, the data integration rules benefit from the complex joins, reference data, and data type safety provided by the relational data warehouse.

Smart organizations are rebalancing their hundreds of data integration jobs across the data lake, the data warehouse, and plain old ETL servers, as each has its own set of capabilities and economics. Rebalancing the data integration portfolio periodically provides the best price/performance and SLA optimization. Any CIO who doesn’t rebalance some number of workloads every year is probably losing money.

While the aforementioned benefits are real, the unfortunate truth is that most data lake initiatives fail to meet their original expectations or have yet to get out of the proof-of-concept phase. This is backed up by a variety of analyst surveys, and it was on display recently at a Hadoop conference, where only an estimated 15% of attendees in the general session raised their hands when asked who had a production data lake that was returning significant value. In fact, one of the most common requests we’ve received from clients over the last year is, “Can you fix my data lake?”

Common pitfalls include:

  • A proliferation of data silos, stemming from the low barriers to entry for spinning up new data lakes. Ultimately, such data silos impair cross-functional analytics and drive up costs through duplication of hardware, software, and people.
  • Conflicting objectives between providers and consumers on topics such as security, governance, and self-service.
  • Project delays resulting from limited commercial-grade tools. Most capabilities, despite having Apache projects around them, are still largely do-it-yourself, bespoke solutions.
  • Most importantly, a glaring lack of end-user adoption. Users accustomed to neatly packaged, easy-to-navigate, and high-performing data warehouses often reject the data lake, even with its new data sets and new kinds of analytics, if it is too hard to use. Data lakes won’t deliver their full value unless the user base expands beyond those with programming skills.

These pitfalls abound in the absence of a large body of well-understood best practices. Drawing upon many sources as well as on-site experience with leading data-driven customers, Teradata defines the data lake as a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low-cost technologies, from which multiple downstream facilities may draw.

A design pattern is an architecture and set of corresponding requirements that have evolved to the point where there is agreement on best practices for implementation. How you implement it varies from workload to workload and from organization to organization. While technologies are critical to the outcome, a successful data lake needs a plan. A data lake design pattern is that plan.

The data lake definition does not prescribe a technology, only requirements. While data lakes are often treated as synonymous with Hadoop (which is an excellent choice for many data lake workloads), a data lake can be built on multiple technologies such as Hadoop, NoSQL, S3, RDBMS, or combinations thereof.

The Starting Place for Data Lake Initiatives

The starting place for any data lake initiative is a data lake architecture deliverable. The data lake architecture is a document that defines what the organization is seeking to achieve, where the value will come from, and the policies and principles that drive decisions and keep the initiative on course. The technology recommendations logically follow.

If this sounds onerous, it does not have to be. This approach of creating a design pattern before building a data management environment has served the data warehousing market for decades. The key is to find a partner with experience, templates, and reusable IP to be your Sherpa on your journey to the data lake.

