The New Data Lakehouse: An Overdue Paradigm Shift for Data


Fundamental changes in the way we work with data come along very rarely. For example, the database model that has remained the industry standard for decades—the relational database—was first conceived in 1970. While there have been many database-related innovations over the years, the same old data paradigm has been shoehorned into a modern context that looks very different from yesteryear. Changes to data storage and computation have expanded what data teams can accomplish, but without a paradigm shift, the data world is left with the same core challenges.

The different arms of data teams are accustomed to working separately in their own domains, with their own data and their own tools. But this creates inefficiencies that ultimately lead to information gaps within an organization. Businesses can no longer afford to operate in these silos. They now realize the critical role data can play, and obtaining and leveraging the data generated across the entire business requires these information gaps to be minimal.

In order to realize the full business value of data and unlock its potential, data management needs to become a collaborative environment. A complete culture change driven by new technology can transform businesses—bringing together data engineers, data scientists, business analysts, and anyone else who depends on quality data, to work together to reduce costs, drive innovation, and shorten time to market. This transformation requires breaking down barriers between data teams and centering the data challenges and tools of today. And this paradigm shift is already underway. 

Enter the data lakehouse. There’s been a lot of hype recently around the concept of a data lakehouse, and for good reason. Essentially, it is a new data management paradigm that combines the capabilities of data warehouses and data lakes, changing the way data teams operate together. This new architecture represents a fundamental shift in the way we work with data.

The lakehouse has huge potential for the enterprise, with the power and flexibility to handle modern analytics and enable businesses to be descriptive, predictive, and prescriptive with their insights. This new paradigm will move organizations into the future by solving some of the core challenges that remain from holding too tightly to the status quo. What’s needed is the ability to prepare data, derive insights, and make transformative decisions in a timely manner; to equip engineers, data scientists, and business users with quality data that can be easily accessed; and to bring different types of data workers together in a truly collaborative environment in which data culture can flourish.

Bridging the Gap Between Data Engineers and Data Scientists

One of the most pervasive challenges not yet solved by older paradigms is eliminating silos and bringing different types of data workers together in a collaborative environment to build a thriving data culture. This pain point may get less attention than processes such as data quality and preparation, but it is perhaps the most important to the foundation of modern analytics.

For successful analytics in a modern context, it’s critical for data engineers and data scientists to be on the same page. But until the recent introduction of the lakehouse, data teams worked in their own domains, with their own data. Data engineers played mostly in data warehouses, where their structured data lived and could be used for reporting, analytics, and business intelligence. Data scientists preferred the data lake for its ability to combine both structured and unstructured data in its raw form, where it could be used to find new opportunities through deep insights, predictive analytics, and machine learning and AI pattern recognition.

This lack of collaboration between data engineers and data scientists is a critical barrier to business productivity and innovation. The division of labor needlessly duplicates effort and creates extra steps that slow the ability to find value in that data. As just one example, data scientists often create experimental data products that then have to be rebuilt by data engineers before they can be used in production.

The data lakehouse brings these two worlds together in a dynamic way. Equipped with both the data structure and management features of a data warehouse and the ability to store data directly on the kind of low-cost storage used in traditional data lakes, the lakehouse unifies data engineers and data scientists into one true data team in the same system, using the same tools. When data teams no longer operate in cloud silos, they can work together much faster while reducing risks to data fidelity. Plus, with a single, consolidated location for data, teams always have the most complete and up-to-date data available for all of their data science, machine learning, and business analytics projects.
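To make that unification concrete, the sketch below shows the idea in Python using Apache Spark with the open source Delta Lake table format, one common way a lakehouse is implemented. It is a minimal illustration only: the bucket paths, table location, and column names are hypothetical, and it assumes a Spark session that has already been configured with the delta-spark package.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

    # Data engineers land raw events once, on low-cost object storage, but with
    # warehouse-style schema enforcement and ACID transactions on the table.
    raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path
    raw.write.format("delta").mode("append").save("s3://example-bucket/lakehouse/events")

    # Analysts query the very same table with SQL...
    events = spark.read.format("delta").load("s3://example-bucket/lakehouse/events")
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, count(*) AS n FROM events GROUP BY event_type").show()

    # ...while data scientists explore the identical, up-to-date copy as a DataFrame.
    exploration = events.select("user_id", "event_type", "event_time")  # illustrative columns

Because both groups read and write one copy of the data through the same table format, there is no separate warehouse load to keep in sync with the lake.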

Improving Data Management

In addition to the need to evolve data team structures, the type of data collected is evolving. The rise of IoT sensors and devices, as well as video and audio tools, requires data teams to work with structured, semi-structured, and unstructured data. Even existing datasets change from one moment to the next with constant schema changes. Dealing with all of these different data types isn’t just time-consuming—it’s costly. It requires paying for and managing multiple data infrastructures and the operational costs associated with each.

Because a lakehouse enables teams to manage both structured and unstructured data, it creates greater resiliency when responding to new trends in data. The lakehouse moves with data types and schema changes, blurring the line between structured and unstructured and allowing all raw data to be stored in one central location, with a management and metadata layer on top of that low-cost storage. Data diversity is no longer a concern because businesses can manage all data formats and keep costs down in the process.
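As a hedged illustration of that flexibility, the short sketch below continues the earlier example and uses Delta Lake’s schema evolution option so that a new field arriving in the source data is added to the table rather than breaking the write; the paths and the new column are again hypothetical assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakehouse-schema-sketch").getOrCreate()

    # A newer batch of events now carries an extra field (say, device_type).
    new_batch = spark.read.json("s3://example-bucket/raw/events_v2/")  # hypothetical path

    # mergeSchema asks the table to absorb the new column instead of rejecting the
    # write, so existing readers keep working while the schema evolves underneath them.
    (new_batch.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("s3://example-bucket/lakehouse/events"))

Other open table formats offer comparable schema evolution controls; the point is that the change is handled in one place rather than in a separate warehouse pipeline and a separate lake copy.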

Combining structured and unstructured data also decreases susceptibility to data loss. Data recovery and high availability are simpler when all data is managed in a unified solution. These days, a strong data posture has become imperative for improving overall organizational readiness and resilience. By adopting a lakehouse architecture, organizations are future-proofing their data needs.

Deriving Value From Data—Fast

The lakehouse paradigm not only addresses how data is stored and collaborated on, but also what types of insights and actions can come as a result. Modern data teams want to move beyond describing the present state through descriptive reporting, and even past predictive reporting, which forecasts the future. Prescriptive reporting, which advises businesses on possible outcomes and what to do next, is becoming the ultimate goal.

In a lakehouse, where data and data practices can be shared across teams, it’s possible to develop both the data quality and the data science agility that are essential to prescriptive analytics. As both data engineers and data scientists gain faster access to shared, secure, and connected data, enterprises can better align with modern analytics and see faster time to insights.

Arriving at insights more quickly also means faster time from data science experiments to production, a must-have for businesses to stay agile. The need for speed in development and productization is especially urgent for businesses wanting to get value from their data scientists. Today’s data scientists spend the majority of their time prepping data, rather than doing what they’re paid to do: model the data and derive insight from it. Speed and collaboration are the vital ingredients for organizations on the data journey who wish to mature their business reporting and analytics practices.

Older data paradigms are also ill-suited to machine learning and AI operations, functions that promise big returns but were mostly seen as science fiction the last time the foundation of how we work with data shifted. That technology is here and has become a reality for many organizations as data has grown in both volume and diversity.

With both the volume and diversity of data rapidly increasing, it’s no longer possible for humans to analyze all of it themselves. Organizations are turning to machine learning and AI to keep up with the unprecedented volume of data and make sense of it. To help data scientists meet the growing demand and the speed at which data must be analyzed, the lakehouse provides a “data playground,” empowering them to access large quantities of structured and unstructured data and build advanced analytics models.
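As one last hedged sketch, the snippet below shows what that playground might look like in practice: a data scientist pulls a sample of the same shared table into pandas and fits a simple model with scikit-learn. The table path, feature columns, and churn label are purely illustrative assumptions, not a prescribed workflow.

    from pyspark.sql import SparkSession
    from sklearn.linear_model import LogisticRegression

    spark = SparkSession.builder.appName("lakehouse-ml-sketch").getOrCreate()

    # Sample the shared lakehouse table; no copy into a separate science environment needed.
    sample = (spark.read.format("delta")
              .load("s3://example-bucket/lakehouse/events")  # hypothetical table
              .select("sessions_last_30d", "support_tickets", "churned")
              .limit(100_000)
              .toPandas())

    # Fit a simple illustrative model on the sampled features.
    model = LogisticRegression().fit(
        sample[["sessions_last_30d", "support_tickets"]],
        sample["churned"],
    )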

The business world is moving faster than ever, and if organizations are to have any chance of keeping pace, they’ll need to ditch the old ways of thinking about data that are slowing them down. The fewer barriers that data teams have to face when seeking value from the massive amounts of data they work with, the quicker and more agile they can be in making an impact on the bottom line. It is high time for a paradigm shift to accommodate these modern data needs, and the data lakehouse offers that new vision. The path to mass adoption and success with the lakehouse—and the path to real innovation—lies in the cultivation of a true data culture for those who work with and benefit from data daily. The lakehouse provides the foundation of a unified environment where the whole organization can more effectively use data and unleash its true business value.


