Building a Data Lakehouse at Data Summit 2022

May 18, 2022

By Joyce Wells

As data storage and analytics in the cloud continue to grow, organizations are evaluating the essentials for success, such as scalability, efficiency, affordability, and security.

Data consumers need data for BI and analytics to make business decisions. But for most organizations, their current data infrastructure isn’t keeping up with demand. In a presentation at Data Summit 2022, titled “Building the Open Data Lakehouse,” Mark Lyons, senior director, product management, Dremio, explained why more organizations are moving their analytics and BI to an open data lakehouse and how you can build a successful lakehouse strategy.

Today, organizations have many of the same data goals they have always had, but achieving them has become harder. Those goals include data democratization, support for new initiatives and projects, faster time to value, and security and governance. The challenges come from the fast growth of data, rapidly increasing access requests, flat budgets, and scarce talent.

Looking at the current state of data management, Lyons said, customer needs haven’t changed but the tools have. The explosion of data, including far more unstructured data, has prompted a lot of new thinking about how to process it.

Lyons presented a deep dive into the data lakehouse, the newest tool in the box, which he said he believes is here to stay. It represents a convergence of a warehouse and a lake and an attempt to get the best of both worlds. The lakehouse is now available from multiple vendors and the field is growing. It is a new open data management architecture that combines the flexibility, cost efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling both BI and ML.

Lyons outlined Data Lake 1.0 and how data lakes started. The data tier of the data lake was very basic, requiring significant engineering work from customers. Newer versions have added table formats, which offer additional functionality above the file level, said Lyons. The next layer that matters in the lakehouse world, he added, is the metastore layer, and more innovation can be expected there from multiple vendors to add functionality that people would expect from a data warehouse, as well as capabilities such as data branching and data version control.

Lyons concluded with a high-level comparison of Apache Iceberg and Databricks Delta Lake, covering their features and relative benefits.

Many Data Summit 2022 presentations are available for review at www.dbta.com/DataSummit/2022/Presentations.aspx.