Hadoop's Next-Generation YARN

Bookmark and Share

As the undisputed pioneer of big data, Google established most of the key technologies underlying Hadoop and many of the NoSQL databases. The Google File System (GFS) allowed clusters of commodity servers to present their internal disk storage as a unified file system and inspired the Hadoop Distributed File System (HDFS). Google’s column-oriented key value store BigTable influenced many NoSQL systems such as Apache HBase, Cassandra and HyperTable. And, of course, the Google Map-Reduce algorithm became the foundation computing model for Hadoop and was widely implemented in other NoSQL systems such as MongoDB.

The simplicity and flexibility of MapReduce has provided massively parallel solutions for many big data problems, and many of the tools in the Hadoop ecosystem, such as Pig, Hive, and Mahout, work by generating MapReduce code.

However, MapReduce is not well suited for all tasks. Some algorithms translate poorly to Map-Reduce—the partitioning of data and computation to individual nodes makes some computations (graph processing for instance) difficult. And, the implementation of MapReduce in Hadoop is not infinitely scalable—a limit of around 4,000 machines has been observed.

In response to these concerns, a group at Yahoo! designed a next-generation framework, sometimes called MapReduce2, but more often YARN (Yet Another Resource Negotiator or, recursively, YARN Application Resource Negotiator). 

YARN lifts the scalability ceiling in Hadoop by splitting the roles of the Hadoop Job Tracker into two processes. A resource manager controls access to the clusters resources (memory, CPU, etc.), and the application manager (one per job) controls task execution.  

YARN provides much more than just improved scalability, however; it treats traditional MapReduce as just one of the possible frameworks that can run on the cluster. YARN, therefore, will allow Hadoop clusters to execute non-MapReduce workloads.   Work is underway on a number of such applications, including Spark, Storm, Giraph, and Hama.

Spark is a parallel processing framework that aims to provide low latency analytics by allowing data to be loaded into memory, facilitating iterative or interactive processing. Routines that repeat many times across a data set—such as machine learning algorithms or interactive real time Business Intelligence applications—might find Spark a more suitable vehicle than disk-based MapReduce.

Storm is a framework for building massively parallel real-time systems on large clusters. Traditional Hadoop makes no claims for real-time capabilities, since even the shortest MapReduce job will take many seconds to initiate and distribute. Storm is suited to processing large data streams where immediate computation is required. For instance, a webpage that needs to provide a real-time count of Twitter hashtags could be implemented in Storm.

Graph databases—such as the popular Neo4J database—can store and process data in which relationships between objects are as important as the objects themselves. These sorts of systems are ideal for handling social network data. MapReduce struggles with graph data because each map process runs on a single node, which may not have access to all the necessary links in the graph. Apache Giraph builds on attempts to define a parallel processing model for graph processing that can work on Hadoop-style clusters, and is being implemented on YARN. Apache Hama is a somewhat similar initiative that also is being implemented on YARN.

Only the very largest Hadoop users are likely to benefit from YARN’s scalability improvements. For the rest of us, the ability to extend the range of Hadoop analytics is far more significant. At the end of the day, a Hadoop cluster is only as valuable as the analytic insights it can provide. By extending the range of possible analytic models, YARN should contribute to the long-term success of Hadoop.