“It’s the best of the data lake and the best of the data warehouse,” says nearly everyone who describes a data lakehouse. But where exactly do the lake and the warehouse intersect? How can you use both effectively, and as a single entity? And how do you even get started?
DBTA hosted a webinar, “Building a Modern Lakehouse,” featuring speakers Tom Nats, director of customer solutions at Starburst; Ryan Blue, CEO of Tabular, co-creator of Apache Iceberg; and Prabhu Kapaleeswaran (PK), creator of SEMOSS Analytics Platform and managing director at Deloitte Consulting, to venture into the context and best practices of modern data lakehouse management and architecture.
With shifting mindsets and an unending plethora of new tools and systems, a modern lakehouse embraces openness, though openness alone invites bad practices. Combining the best of a warehouse (its guarantees and performance) with the best of a data lake (its ability to blend use cases and support varied projects) calls for an open architecture that avoids lock-in with any one vendor. Varied data types, new technologies, a crowded vendor landscape, and a shift from proprietary toward standards-based enterprises all favor the freedom that data lakehouses offer. Yet, open as it is, the modern data lakehouse is only as proficient as the decisions made on behalf of its implementation.
The speakers addressed several tenets that dictate how enterprises can best prepare for and manage lakehouse utilization. Blue remarked that table format is crucial: the format dictates the capabilities of the table, such as enabling atomic transactions; it may require maintenance over the long term; and it must be compatible with the processing technology to ensure proper sync and copy creation. Nats emphasized compute engines and processing patterns as a significant aspect of lakehouse technology, since a common stressor is data sitting in lakes served by different compute engines. A modern data lakehouse can pick and choose which compute engines to direct at which data, aiding shareability.
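Modern table formats such as Apache Iceberg achieve atomic transactions by pointing readers at a single current-metadata file and swapping that pointer atomically on commit. The sketch below is a minimal, simplified illustration of that pointer-swap idea using local files (the function names and on-disk layout are invented for illustration, not Iceberg's actual spec):

```python
import json
import os
import tempfile

def commit_snapshot(table_dir: str, snapshot: dict) -> None:
    """Write a new metadata file, then atomically swap the table's
    current-metadata pointer to it. Readers see either the old
    snapshot or the new one, never a partial state."""
    os.makedirs(table_dir, exist_ok=True)
    # 1. Write the new snapshot metadata under a unique name.
    fd, meta_path = tempfile.mkstemp(dir=table_dir, suffix=".metadata.json")
    with os.fdopen(fd, "w") as f:
        json.dump(snapshot, f)
    # 2. Atomically replace the pointer file; os.replace is atomic
    #    on POSIX within a single filesystem.
    fd2, tmp_ptr = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd2, "w") as f:
        f.write(os.path.basename(meta_path))
    os.replace(tmp_ptr, os.path.join(table_dir, "current"))

def read_current(table_dir: str) -> dict:
    """Follow the pointer to the current snapshot's metadata."""
    with open(os.path.join(table_dir, "current")) as f:
        name = f.read().strip()
    with open(os.path.join(table_dir, name)) as f:
        return json.load(f)
```

Because the only mutation readers depend on is the single pointer swap, a concurrent reader never observes a half-written commit, which is the core guarantee a warehouse-grade table format brings to lake storage.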
Lakehouses introduce a particular strategy toward data management: variety and velocity first, followed by volume. That is, when employing a lakehouse, users should first focus on bringing data of every shape, unstructured through structured, into the same architecture (variety), then on selecting the tools and processing patterns that are right for the job from the get-go, thereby accelerating outcomes (velocity). Only once those preliminary attributes are sorted should users begin focusing on the volume of data migrating to the lakehouse.
The speakers urged enterprises to stop pouring endless data into lakehouses without taking measures to ensure the data remains actionable, discoverable, and shareable once there. Cataloging is essential: it makes particular data locatable later so it can become useful as a data product, further enabling discoverability, shareability, and optimized value once moved to the lakehouse. Ultimately, strategies such as data mesh principles and cataloging help break data down into digestible, shareable segments, which accelerates data product generation and reduces strain on the data lake.
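The cataloging idea above can be sketched in a few lines: a data product is registered once with an owner, a location, and tags, and consumers discover it by searching rather than by knowing the storage path. This is a toy in-memory model, not any particular catalog product's API; all class and field names are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    location: str              # e.g. an object-store path (hypothetical)
    owner: str
    tags: set = field(default_factory=set)

class Catalog:
    """Toy catalog: registering a dataset is what makes it
    discoverable; searching by tag is how consumers find it."""
    def __init__(self):
        self._entries = {}

    def register(self, product: DataProduct) -> None:
        self._entries[product.name] = product

    def search(self, tag: str) -> list:
        return sorted(p.name for p in self._entries.values() if tag in p.tags)
```

The point is that discoverability is a property of the registration step, not of the data itself: data dumped into the lake without an entry here is effectively invisible to downstream consumers.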
There are numerous other factors to consider when adopting a lakehouse, such as governance, access, security, and the choice of engine and storage. Data should be made accessible everywhere based on format, and that format needs to be secured. Blue implored listeners not to "hope for the best" when it comes to format; instead, think in advance about how applications will access data and how access will be restricted. Ultimately, Blue argued that following best practices in governance is the surest route to optimization.
Access and security, PK offered, should be designed to propagate throughout the lifecycle of the data product. It's important to recognize that data lakehouse access and security are still a work in progress and may lag in performance. Nats emphasized that open source engine components are optimal for data lakehouses, helping avoid heavy training and maintenance burdens and the tech debt of unreusable code.
Storage for data lakehouses must be fundamentally integrated with governance, Blue explained: if you tie security and governance to the engine, they apply only to the people using that particular engine or component. Security should instead be integral to storage and planned ahead of time. As the lakehouse grows, its success will depend on these early-stage storage and governance choices.
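Blue's distinction can be made concrete: when the access policy lives with the stored table rather than inside one engine, every engine that reads through the storage layer enforces the same rule. A minimal sketch, assuming a hypothetical table-to-roles policy map and an invented `read_table` entry point:

```python
# Storage-level policy: table name -> roles allowed to read it.
# (Table names, roles, and paths here are illustrative assumptions.)
ACL = {"orders": {"analytics", "finance"}}

def read_table(table: str, role: str) -> str:
    """Consult the storage-level policy before handing back data
    locations. Any engine routed through this layer gets the same
    enforcement, unlike a check embedded in one engine."""
    if role not in ACL.get(table, set()):
        raise PermissionError(f"role {role!r} may not read {table!r}")
    return f"s3://lake/{table}/"   # hypothetical object-store prefix
```

By contrast, an engine-level check would have to be re-implemented (and kept consistent) in every Trino, Spark, or Flink deployment touching the same files, which is exactly the fragmentation Blue warns against.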
To learn more about modern lakehouse strategies and implementation, you can view an archived version of this webinar here.