Adopting a ‘Pipeline-Free’ Architecture for Real-Time, Large-Scale Analytics

Complex data pipelines are seemingly a necessary evil for any data-driven business striving for real-time analytics—until now. “Pipeline-free” architectures have broken out on the data scene, offering ways not only to boost overall performance but also to reduce maintenance burden.

Sida Shen, product marketing manager at CelerData, joined DBTA’s webinar, Go Pipeline-Free With Real-Time Analytics, to explore how to build pipeline-free platforms with open source software, regardless of experience level, while still delivering real-time analytics.

Shen explained how, despite the benefits of multi-table JOINs, data is commonly denormalized into flat tables through pre-join and pre-aggregation processes. While this may be suitable for batch analytics, it cannot meet the latency constraints that real-time analytics requires. Data practitioners are then forced to build complex preprocessing pipelines, which create a large financial drain on the organization.
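To make the pattern concrete, here is a minimal sketch of the kind of pre-join and pre-aggregation step such pipelines perform, using pandas and hypothetical `orders`/`customers` tables (the table and column names are illustrative, not from the webinar):

```python
import pandas as pd

# Hypothetical normalized source tables (illustrative data only).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "amount": [50.0, 30.0, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["NA", "EU"],
})

# Pre-join: denormalize into one flat, wide table ahead of query time.
flat = orders.merge(customers, on="customer_id", how="left")

# Pre-aggregation: roll the flat table up so dashboards never run a JOIN.
revenue_by_region = flat.groupby("region", as_index=False)["amount"].sum()
print(revenue_by_region)
```

Every change to the source tables forces this job to rerun, which is exactly the maintenance and cost burden the pipeline-free approach aims to remove.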

How can this be solved? Shen pointed to the StarRocks Project, CelerData’s high-performance, real-time OLAP database that can conduct high-speed data analytics in a multitude of scenarios without complicated data preprocessing. Query speed—especially multi-table JOIN queries—is accelerated through StarRocks’ streamlined architecture, full vectorized engine, cost-based optimizer (CBO), and modern materialized views.

Diving deeper into what StarRocks can provide, Shen began to discuss how to go pipeline-free in JOIN environments.

Unlike scatter/gather and MapReduce compute architectures, MPP (the architecture of StarRocks) streamlines pipeline execution with no wait time, offering in-memory shuffle for scalability and performance—a great option for sub-second JOINs and aggregations at scale, according to Shen.
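The in-memory shuffle idea can be sketched as follows—this is a toy illustration of hash-partitioning rows on the join key so each worker joins only its own partition, not StarRocks code:

```python
from collections import defaultdict

NUM_WORKERS = 2  # assumed worker count for the sketch

def shuffle(rows, key):
    """Route each row to a worker by hashing its join key (all in memory)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % NUM_WORKERS].append(row)
    return partitions

# Hypothetical tables to be joined on "cust".
orders = [{"cust": 10, "amount": 50}, {"cust": 20, "amount": 20}]
custs = [{"cust": 10, "region": "NA"}, {"cust": 20, "region": "EU"}]

left, right = shuffle(orders, "cust"), shuffle(custs, "cust")

# Each worker joins its own partitions independently; results are gathered.
joined = []
for w in range(NUM_WORKERS):
    index = {r["cust"]: r for r in right[w]}
    for row in left[w]:
        if row["cust"] in index:
            joined.append({**row, **index[row["cust"]]})
print(joined)
```

Because matching keys always land on the same worker, no worker ever needs another worker's data mid-join, which is what lets the shuffle stay in memory and scale out.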

Query planning is another crucial component of going pipeline-free, as the search space of possible ways to execute a query is enormous. To address this, StarRocks’ CBO collects statistics on the underlying data to estimate the cost of candidate plans and choose the cheapest one. Natively built and deeply integrated with the execution and storage layer, the CBO’s open source, community-built technology is a large reason that StarRocks can conduct on-the-go JOINs, said Shen.
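The core idea of cost-based planning can be shown in a few lines. This is a deliberately simplified sketch (not StarRocks' CBO): table row counts stand in for collected statistics, an assumed selectivity models join matching, and the "cost" of a join order is the total estimated size of its intermediate results:

```python
import itertools

# Assumed statistics: row counts for three hypothetical tables.
stats = {"orders": 1_000_000, "customers": 10_000, "regions": 100}
SELECTIVITY = 0.001  # assumed fraction of row pairs that match

def plan_cost(order):
    """Estimate a join order's cost as the sum of intermediate result sizes."""
    rows = stats[order[0]]
    cost = 0
    for table in order[1:]:
        rows = rows * stats[table] * SELECTIVITY  # estimated join output
        cost += rows
    return cost

# Enumerate candidate join orders and pick the cheapest.
best = min(itertools.permutations(stats), key=plan_cost)
print(best, plan_cost(best))
```

Even this toy version shows why statistics matter: joining the small tables first keeps intermediate results tiny, and only the optimizer's cost estimates can discover that ordering automatically.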

StarRocks’ columnar storage and fully vectorized operators offer enhanced speed not only for JOINs but also for large-scale aggregations. Efficient query execution is critical for real-time, pipeline-free architectures, and in-memory columnar processing is innately suited to OLAP workloads. The result is the ability to process larger batches with SIMD-optimized operators, which allows StarRocks to process multiple data points with a single instruction, reducing per-value overhead.
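The difference between row-at-a-time and columnar, vectorized execution can be illustrated with NumPy, whose array operations dispatch to SIMD instructions under the hood (an assumed analogy for exposition, not StarRocks internals):

```python
import numpy as np

# One column's values stored contiguously, as in columnar storage.
amounts = np.array([50.0, 30.0, 20.0, 75.0], dtype=np.float64)
regions = np.array([0, 0, 1, 1])  # dictionary-encoded group keys

# Row-at-a-time style (what vectorization avoids): one value per iteration.
total_slow = 0.0
for v in amounts:
    total_slow += v

# Columnar, vectorized style: one call over the whole batch,
# letting the runtime apply SIMD to many values per instruction.
total_fast = amounts.sum()

# A vectorized grouped aggregation, akin to SUM(amount) GROUP BY region.
per_region = np.bincount(regions, weights=amounts)
print(total_fast, per_region)
```

The two totals agree, but the batched form touches memory sequentially and amortizes interpretation overhead across the whole column, which is the property that makes columnar engines fast for aggregation-heavy OLAP queries.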

Shen acknowledged that while StarRocks can achieve on-the-fly JOINs, JOINs remain very expensive for most organizations, regardless of how optimized the query layer is. Fortunately, StarRocks has a solution: partial updates for high-concurrency scenarios. Instead of doing denormalization in stream preprocessing, StarRocks can support a partial update operation directly on the columnar storage. This means that two streams of data can write to the same table at the same time, requiring less computation and fewer external processing tools.
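A minimal sketch of the partial-update idea, assuming a primary-key table where each incoming stream supplies only the columns it knows about (illustrative only; column and key names are made up):

```python
# Toy primary-key table: primary key -> row of columns.
table = {}

def partial_update(pk, columns):
    """Merge only the supplied columns into the row, creating it if needed."""
    table.setdefault(pk, {})
    table[pk].update(columns)

# Stream A carries order status; stream B independently carries payment info.
# Neither stream needs to be pre-joined with the other before loading.
partial_update(1001, {"status": "shipped"})
partial_update(1001, {"paid_amount": 49.99})

print(table[1001])
```

Because each stream touches only its own columns, the denormalized row is assembled inside the storage layer rather than in an upstream stream-processing job.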

For an in-depth discussion of going pipeline-free, featuring use cases, demos, and more, you can view an archived version of the webinar here.