Architecting Data Lakes for the Modern Enterprise at Data Summit Connect Fall 2020

Hudi (Hadoop Upserts Deletes and Incrementals) is a storage abstraction library that improves data ingestion.

Nishith Agarwal, engineering manager, Uber, explained what Hudi offers and why it is needed during his Data Summit Connect Fall 2020 session titled, “Building Large-Scale, Transactional Data Lakes using Apache Hudi.”

Videos of presentations from Data Summit Connect Fall 2020, a free series of data management and analytics webinars presented by DBTA and Big Data Quarterly, are available for viewing on the DBTA YouTube channel.

Agarwal also discussed more advanced primitives, such as restore, delta-pull, compaction, and file sizing, which are required for reliability, efficient storage management, and building incremental ETL pipelines.

A data lake is a centralized repository that allows users to store all their structured and unstructured data at any scale, he explained.

When building a data lake platform, organizations need peripheral tooling around data collection, he said. Companies should monitor performance, ingestion patterns, and correctness. The platform also needs to integrate with many different solutions.

Requirements include:

  • Managed ingestion
  • Unify stream and batch
  • Write once (ish), read many?

“You want to adapt your data lake,” he said. “Recognize your data based on your query needs.”

Apache Hudi is a system that helps meet these requirements. Users can ingest data, create incremental streams, store data on HDFS or cloud storage, and more.
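To make the idea of upserts and incremental streams concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Hudi's API: the `ToyTable` class, its method names, and the `id` and `fare` fields are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyTable:
    """Toy in-memory stand-in for an upsert-capable table (NOT Hudi's API)."""

    # record key -> (commit number of the last write, row)
    rows: Dict[str, Tuple[int, Dict[str, Any]]] = field(default_factory=dict)
    last_commit: int = 0

    def upsert(self, batch: List[Dict[str, Any]], key: str = "id") -> int:
        """Insert new keys and overwrite existing ones as a single commit."""
        self.last_commit += 1
        for row in batch:
            self.rows[row[key]] = (self.last_commit, row)
        return self.last_commit

    def snapshot(self) -> List[Dict[str, Any]]:
        """Latest version of every record, ordered by record key."""
        return [row for _, (_, row) in sorted(self.rows.items())]

    def incremental(self, since_commit: int) -> List[Dict[str, Any]]:
        """Only the records written after `since_commit`, ordered by key."""
        return [
            row
            for _, (commit, row) in sorted(self.rows.items())
            if commit > since_commit
        ]


table = ToyTable()
first = table.upsert([{"id": "a", "fare": 10}, {"id": "b", "fare": 20}])
table.upsert([{"id": "b", "fare": 25}, {"id": "c", "fare": 30}])

# A downstream job that already processed the first commit reads only the changes.
changed = table.incremental(since_commit=first)
```

In this sketch, a downstream consumer tracks the last commit it processed and asks only for newer records, rather than rescanning the whole table on every run.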

A Hudi-based data lake can manage aspects such as file count, file sizing, and data layout. The platform can expose different kinds of queries on data tailored for use cases. It can give full access to the organization’s data across a variety of needs.

By using Hudi, Uber was able to power its global network analytics in near real-time using HoodieDeltaStreamer, according to Agarwal.

Hudi can help organizations recover from data corruption, use incremental processing from Spark or Hive, take query snapshots to “time travel,” and convert existing tables into Hudi tables.
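The relationship between "time travel" and recovery from corruption can be sketched with a simple commit log. This is a conceptual illustration only, assuming a table that retains every commit; the `ToyTimeline` class and its methods are hypothetical and do not reflect how Hudi is implemented or called.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class ToyTimeline:
    """Toy commit log that retains every commit (NOT Hudi's API)."""

    def __init__(self) -> None:
        # One {record key: row} map per commit, oldest first.
        self.commits: List[Dict[str, Row]] = []

    def commit(self, batch: List[Row], key: str = "id") -> int:
        """Append a batch as a new commit; commit numbers are 1, 2, 3, ..."""
        self.commits.append({row[key]: row for row in batch})
        return len(self.commits)

    def snapshot(self, as_of: Optional[int] = None) -> Dict[str, Row]:
        """Replay commits up to `as_of` (or all of them) to rebuild the table."""
        upto = len(self.commits) if as_of is None else as_of
        state: Dict[str, Row] = {}
        for changes in self.commits[:upto]:
            state.update(changes)
        return state

    def restore(self, to_commit: int) -> None:
        """Discard every commit made after `to_commit`."""
        del self.commits[to_commit:]


timeline = ToyTimeline()
good = timeline.commit([{"id": "a", "city": "SF"}])
timeline.commit([{"id": "a", "city": "??"}])  # pretend this commit is corrupt

before = timeline.snapshot(as_of=good)  # "time travel" to the good commit
timeline.restore(to_commit=good)        # roll the table back for everyone
after = timeline.snapshot()
```

Here a time-travel query reads the table as of an earlier commit without changing it, while a restore removes the later commits so that every subsequent reader sees the earlier state.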

Following Agarwal, Inessa Gerber, director of product management, Denodo, discussed how data virtualization ensures consistency, quality, and governance in her session titled, “Data Virtualization: Modernizing Data Access in Hybrid Environments.”

As data lakes grow, the traditional and cloud sources in the enterprise have not disappeared. Most companies have a hybrid environment where data resides across data lakes, traditional on-prem sources, and the cloud. It is also common for data lakes to be used for gathering all types of data; as a result, the quality and consistency of the data is at times questionable.

“Throughout history as we look at data we have multiple storage places,” she said. “What about connecting all of the data? What about creating the fabric that gives you access to all of the data?”

It may not be feasible to move all data into a centralized data lake. However, it is possible to thread the data together through virtualization.

Hybrid systems and data quality concerns make data virtualization a critical component for providing and managing data access services for consumers, analytics, and presentation layers. Data virtualization ensures trusted and governed data access.

Data virtualization gives organizations an overall view of enterprise information, combining web, cloud, streaming, and unstructured data. Businesses can realize ROI within six months, with the flexibility to adjust to unforeseen changes and more.