New Hadoop Frameworks Feed the Need for Speed

In many ways, Hadoop is the most concrete technology underlying today’s big data revolution, but it certainly does not satisfy those who want quick answers from their big data.  Hadoop – at least Hadoop 1.0 – is a batch-oriented framework that allows for the economical execution of massively parallel workloads, but provides no capabilities for interactive or real-time execution. 

Hadoop 1.0 is based on the MapReduce pattern – pioneered at Google and battle-tested within Hadoop at Yahoo! and Facebook.  MapReduce partitions – or “maps” – processing across many chunks of data, and then combines – or “reduces” – the results into a single answer.  Complex programs chain multiple MapReduce steps to achieve their end result.
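The map-then-reduce flow described above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (map_chunk, shuffle, reduce_counts, word_count) are made up for illustration, and the grouping ("shuffle") step between map and reduce is shown explicitly even though the text above does not name it.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then group and reduce the combined output."""
    # In a real cluster each chunk would be mapped on a different node;
    # here the chunks are simply processed one after another.
    mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
    return reduce_counts(shuffle(mapped))
```

Calling word_count(["big data", "big answers"]) maps each string independently and then merges the per-word counts. A program that needed a further step, such as ranking the words by frequency, would feed this output into a second map and reduce pass, which is the chaining mentioned above.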

Today’s Hadoop ecosystem is mostly built on top of this MapReduce foundation.  For instance, Hive – Hadoop’s SQL engine – compiles SQL statements to MapReduce code.

It’s long been acknowledged that while MapReduce is a broadly applicable model that can support a wide range of job types, it is not the best model for all workloads.  In particular, MapReduce has a very large start-up cost, so even the simplest “Hello World” MapReduce jobs typically take minutes rather than seconds.  This alone makes MapReduce a very poor choice for interactive workloads and low-latency operations.

Hadoop 2.0 introduced the YARN framework, which promises to allow Hadoop to run workloads based on other processing patterns – of which MapReduce is just one (we looked at YARN last December).

The Apache Tez project (Tez is Hindi for “speed”) is one of a number of YARN-based initiatives that aim to provide Hadoop with a processing framework that can support both the massive batch processing that characterizes current Hadoop workloads and the low-latency operations that would allow Hadoop to support a wider variety of solutions.

Tez is based on a flexible processing paradigm known as a Directed Acyclic Graph (DAG).  This intimidating term actually describes a familiar processing model.  Anyone who has examined a SQL execution plan will have encountered a DAG.  These graphs describe how a complex request is decomposed into multiple operations that are executed in a specific order, and whose outputs can feed into one another in arbitrary ways.  MapReduce itself is a DAG, but the MapReduce paradigm severely limits the types of graphs that can be constructed.  Furthermore, MapReduce requires that each step in the graph be executed by a distinct set of processes, while Tez allows multiple steps in the graph to be executed by a single process, potentially on a specific node of the Hadoop cluster.
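To make the DAG idea concrete, here is a minimal sketch of a four-step plan, loosely resembling a SQL join followed by an aggregate, executed in dependency order inside a single process. It does not use the Tez API: the task names, the sample data, and the run_dag helper are all hypothetical, and the ordering relies on graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def scan_orders():
    """Leaf step: pretend to read (customer, amount) rows from a table."""
    return [("alice", 30), ("bob", 20), ("alice", 5)]


def scan_customers():
    """Leaf step: pretend to read a customer -> country lookup table."""
    return {"alice": "AU", "bob": "US"}


def join(orders, customers):
    """Combine the outputs of both scans into (country, amount) rows."""
    return [(customers[name], amount) for name, amount in orders]


def aggregate(joined):
    """Sum the joined amounts per country."""
    totals = {}
    for country, amount in joined:
        totals[country] = totals.get(country, 0) + amount
    return totals


TASKS = {
    "scan_orders": scan_orders,
    "scan_customers": scan_customers,
    "join": join,
    "aggregate": aggregate,
}

# Each step maps to the steps whose output it consumes, in argument order.
DEPENDENCIES = {
    "join": ["scan_orders", "scan_customers"],
    "aggregate": ["join"],
}


def run_dag(tasks, dependencies):
    """Run every task after its upstream tasks, passing their results in."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = tasks[name](*inputs)
    return results, order
```

Calling run_dag(TASKS, DEPENDENCIES) runs both scans before the join and the join before the aggregate. Note that the join has two inputs and that all four steps run in one process, which are exactly the freedoms that a strict map-then-reduce pipeline does not offer.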

The Apache Stinger initiative builds on Tez to provide a more responsive Hive.  The MapReduce-based SQL executions of Hive in Hadoop 1.0 resemble successions of massively parallelized full table scans and sort merge joins.  This isn’t the best plan for every SQL statement, but since the next-generation “Stinger” version of Hive can utilize the Tez framework, it can employ a wider variety of execution plans, including some that will lead to much faster SQL executions.  Indeed, the aim of Stinger is to provide Hive with 100x performance improvements.  Tez developers hope to similarly improve the performance of Pig, Cascading and Java-based workloads.

While Tez and Stinger are Apache projects, they are primarily sponsored by Yahoo! spinoff Hortonworks.  Cloudera – the other major Hadoop vendor – sponsors some alternative technologies, most notably Apache Impala.  Like Stinger, Impala aims to provide vastly improved SQL-like execution.  It may be disturbing to see such obvious fragmentation emerge in the Hadoop open source community, but it’s in the nature of open source to encourage innovation and experimentation.  Whichever of these frameworks prevails, it’s clear that we can expect a faster, more interactive experience from Hadoop in the future.