Achieving Data Warehouse Query Performance on the Data Lake with CelerData

Apr 30, 2024

By Sydney Blanchard

The move to the data lakehouse came with many promises: speed, agility, and cost-effective query performance, to name a few. Yet many enterprises find it difficult to realize all these benefits at once; as a result, efforts to retire the data warehouse and supercharge the lakehouse have become more prevalent. How exactly can businesses ensure their lakehouse investments pay off?

Sida Shen, product marketing manager at CelerData, joined DBTA’s webinar, Why It's Time for Lakehouse Users To Ditch Their Data Warehouse, to guide viewers through the latest advancements in the lakehouse space that can alter the way these lakehouses perform—warehouse-free.

The idea behind the lakehouse, Shen explained, is to provide many of the features of a data warehouse in an open and standardized format. This format allows lakehouses to unify all workloads on a single source of truth. Though it may seem like “reinventing the wheel” of data warehouses, data lakehouses are better suited to the massive, complex workloads that enterprises must manage, and they also offer easy governance, simple architectures, and cost-effectiveness.

This is not the reality that most enterprises are experiencing, however, as users are still copying their data out of the lakehouse to accelerate queries. According to Shen, existing data lake query engines are either not optimized for high-concurrency, low-latency workloads or still rely on older technologies, forcing businesses to search for other means of query acceleration. In the end, these data lake query engines, which are fundamentally unable to support intensive analytics workloads, cause enterprises to:

Overengineer or overspend on an existing query engine to get barely passable performance, which is neither sustainable nor future-proof
Move workloads to a high-performance data warehouse purely for query acceleration, as a workaround

When these workloads are moved to high-performance data warehouses, organizations are forced to bear the cost of maintaining infrastructure-level software and data ingestion pipelines for terabyte-level data. They also face an assortment of challenges created by data duplication, including matching schemas and data types and keeping data governance consistent.

Why is there disharmony between data warehouses, data lakehouses, and query engines?

Despite advancements and updates to the table formats of data lakehouses, users are still relying on older query engines that were not built for data warehouse workloads. These engines are optimized for long-running batch workloads, not low-latency, high-concurrency queries, and they are built to connect to all possible data sources rather than being optimized for performance.

With StarRocks, a high-performance analytics database with an open source, purpose-built query engine that can function as a warehouse or as a lakehouse, enterprises can fulfill the variety of promises that the data lakehouse makes. StarRocks enables enterprises to get data warehouse performance on the data lakehouse, leveraging techniques such as a hierarchical caching framework, MPP in-memory data shuffling, and system-level optimizations to accelerate queries on the lakehouse.

While StarRocks as a data warehouse offers 12% faster query performance, StarRocks as a data lakehouse offers enterprises the chance to have their data function as a single source of truth from a single location, with very little impact on performance.

For the full discussion about adopting modern data lakehouse strategies in favor of the data warehouse, featuring examples, demos, and more, you can view an archived version of the webinar here.
