There are so many new buzzwords lately, including the data lakehouse, data mesh, and data fabric, just to name a few. But what do all these terms mean, and how do they compare to a data warehouse? This presentation covers all of them in detail and explains the pros and cons of each, with suggested use cases so attendees can see what approach will really work best for their big data needs.
James Serra, Data & AI solution architect at Microsoft, presented his session, “Data Lakehouse, Data Mesh, & Data Fabric (the Alphabet Soup of Data Architectures),” at Data Summit 2022.
The annual Data Summit conference returned in-person to Boston, May 17-18, 2022, with pre-conference workshops on May 16.
No matter what architecture a company uses, there will always be a copy of the data, Serra explained. And a data lake is no more than a glorified file holder within the computer. It can be a huge dumping ground for data. Because of this proliferation, there is a plethora of solutions to choose from when storing, securing, and unveiling data within the organization.
The data fabric adds onto the modern data warehouse. It is additional technology to source more data, secure it, and make it available.
“We’re going back to the idea of going back to a data lake,” Serra said. “What’s new about this is the Delta Lake which adds more features like a relational database.”
Data mesh is decentralized, and it is a concept, not a product. Data mesh is an intentionally designed distributed data architecture, under centralized governance and standardization for interoperability, enabled by a shared and harmonized self-serve data infrastructure, Serra explained.
Data mesh tries to solve four challenges:
- Lack of ownership: who owns the data – the data source team or the infrastructure team?
- Lack of quality: the infrastructure team is responsible for quality but does not know the data well
- Organizational scaling: the central team becomes the bottleneck, such as with an enterprise data lake/warehouse
- Technical scaling: current big data solutions can’t keep up with additional data requirements
“In the end, I predict data mesh will become an extension to a centralized data solution for a small percentage of solutions,” Serra said. “There will be a very small percentage of solutions that are 100% true to the pure data mesh concept (assuming mesh type 1 and 2 are true to the data mesh concept).”
Matt Fuller, VP of product at Starburst, went on to discuss the importance of data mesh during his session, “Accelerate Data Mesh With First-Class Data Products.”
Data mesh prescribes that the ownership of data products should live in business domains, ensuring that data is treated as a first-class product across the organization. This is a drastic strategic shift, wherein data is no longer treated as a by-product of activities in which the business engages, but as a key value-driver that should direct business decisions.
“Things are getting more messy, not less,” Fuller said. “Data sprawl is real. That’s why we believe data mesh to the rescue.”
Data mesh is built on four concepts:
- Domain-driven data ownership architecture
- Data as a product
- Self-service infrastructure as a platform
- Federated computational governance
“This creates cross organizational and cross company collaboration,” Fuller said. “Data mesh is a journey and each step builds on the last one.”
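For readers who think in code, the four concepts Fuller listed can be sketched roughly as follows. This is only an illustrative toy, not anything shown in the session: the `DataProduct`, `Catalog`, and `governance_violations` names, the example rules, and the sample email address are all hypothetical assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical descriptor: data published as a product by a domain."""
    name: str
    domain: str   # domain-driven ownership: the business domain that owns it
    owner: str    # accountable contact inside that domain
    schema: dict  # column name -> type, the product's published contract
    tags: frozenset = frozenset()


def governance_violations(product):
    """Federated computational governance: shared rules run for every domain.

    The three rules here are made-up examples, not a standard.
    """
    problems = []
    if not product.owner:
        problems.append("missing owner")
    if not product.schema:
        problems.append("missing schema")
    if "pii" in product.tags and "masked" not in product.tags:
        problems.append("PII must be masked")
    return problems


class Catalog:
    """Stand-in for a self-service platform: domains register products themselves."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        problems = governance_violations(product)
        if problems:
            raise ValueError(f"{product.name}: {', '.join(problems)}")
        self._products[(product.domain, product.name)] = product

    def find(self, domain):
        """Return the names of all products owned by a domain."""
        return sorted(p.name for (d, _), p in self._products.items() if d == domain)


catalog = Catalog()
catalog.register(
    DataProduct("orders", "sales", "sales-team@example.com", {"order_id": "int"})
)
```

In this sketch each domain publishes and owns its own products, while the shared checks in `governance_violations` are applied automatically at registration time rather than by a central team reviewing each dataset.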
Many Data Summit 2022 presentations are available for review at https://www.dbta.com/DataSummit/2022/Presentations.aspx.