Is the Berkeley Data Analytics Stack the Future of Hadoop?

Back in the early 1990s, you would sometimes hear this gag: “Two major products that came out of Berkeley: LSD and UNIX. We don’t believe this to be a coincidence.” Although wildly inaccurate, this joke does reflect Berkeley’s reputation for innovation and revolution. Now from U.C. Berkeley comes what might be the most significant new big data technology since Hadoop: The Berkeley Data Analytics Stack (BDAS).

Hadoop became the de facto foundation of today’s big data stack by providing a flexible, scalable, and economical framework for processing massive amounts of structured, unstructured, and semistructured data. The MapReduce programming model in Hadoop 1.0 is a relatively straightforward but powerful approach to parallel processing. MapReduce is not the most elegant or sophisticated approach for every workload, but it can be adapted to almost any problem and can usually scale through the brute-force application of many servers.
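To make the MapReduce model concrete, here is a minimal single-process sketch of the classic word count, written in plain Python. It does not use Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, and a real cluster would spread the map and reduce calls across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document could be mapped on a different server;
    # here everything runs in one process purely for illustration.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big promise", "data at scale"]
    print(word_count(docs))
```

Because each map call sees only one record and each reduce call sees only one key, the work can be split across machines without coordination, which is where the brute-force scalability comes from.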

However, it has long been recognized that MapReduce alone is not sufficient for emerging big data analytic challenges. MapReduce excels at batch processing but falls short in real-time and near-real-time scenarios. Even the simplest MapReduce job incurs significant startup time, and, for some machine learning algorithms, execution is simply too slow.

About three years ago, the AMP (Algorithms, Machines, People) Lab was established at Berkeley to attack the emerging challenges of advanced analytics and machine learning on big data. The resulting Berkeley Data Analytics Stack, and particularly the Spark processing engine, has shown rapid uptake and tremendous promise.

BDAS consists of a few core components:

  1. Mesos, a cluster management layer somewhat analogous to Hadoop’s YARN. However, Mesos is specifically intended to allow multiple frameworks—including BDAS and Hadoop—to share a cluster.
  2. Spark, an in-memory, distributed, fault-tolerant processing framework. Implemented in Scala, it provides somewhat higher-level abstractions than MapReduce, improving developer productivity. As an in-memory solution, it shines on tasks that bottleneck on disk I/O in MapReduce. In particular, tasks that iterate repeatedly over a dataset, as many machine learning workloads do, show very significant improvements.
  3. Tachyon, a fault-tolerant, Hadoop-compatible, memory-centric distributed file system. The file system allows for disk storage of large datasets, but promotes aggressive caching to provide memory-level response times for frequently accessed data.
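The benefit of keeping data in memory for iterative workloads can be illustrated with a small plain-Python sketch. This is not Spark or Tachyon code: `SlowStore`, `fit_mean`, and `cached` are hypothetical names invented for the example, and the "disk" is just an object that counts how often it is read.

```python
class SlowStore:
    """Stand-in for disk-backed storage; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def fit_mean(load_data, iterations=20, learning_rate=0.5):
    """Estimate the mean by gradient descent, calling load_data every pass."""
    estimate = 0.0
    for _ in range(iterations):
        data = load_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= learning_rate * gradient
    return estimate


def cached(load_data):
    """Wrap a loader so the dataset is read once and then kept in memory."""
    holder = []

    def load_once():
        if not holder:
            holder.append(load_data())
        return holder[0]

    return load_once


if __name__ == "__main__":
    disk = SlowStore([2, 4, 6, 8])
    fit_mean(disk.load)
    print("reads without caching:", disk.reads)

    disk = SlowStore([2, 4, 6, 8])
    fit_mean(cached(disk.load))
    print("reads with caching:", disk.reads)
```

The algorithm and its result are identical in both runs; only the number of trips to slow storage changes, which is the kind of saving an iterative job stands to gain when each pass would otherwise reread its input from disk.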

Other BDAS components build on top of this core. Shark is a Hive-compatible engine built on Spark that accelerates SQL queries running on data either in Hadoop or in Spark. BlinkDB is an approximate query engine that provides real-time queries on massive datasets through dynamic sampling: you decide how to balance query time against accuracy, and the sampling adjusts accordingly. Spark Streaming provides a stream-oriented processing paradigm on the Spark foundation, and, similarly, GraphX provides graph computation on Spark.
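The trade-off between query time and accuracy that sampling offers can be sketched in a few lines of plain Python. This is only an illustration of the general idea, using simple uniform random sampling and a normal-approximation error margin; it is not BlinkDB's interface or its actual sampling strategy, and the names `approximate_mean` and `smallest_sample_for` are assumptions made for the example.

```python
import math
import random


def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample, with a roughly 95% margin."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = 1.96 * math.sqrt(variance / sample_size)
    return mean, margin


def smallest_sample_for(values, target_margin, sizes=(100, 1_000, 10_000)):
    """Pick the smallest candidate sample size that meets the target margin."""
    for size in sizes:
        mean, margin = approximate_mean(values, size)
        if margin <= target_margin:
            return size, mean, margin
    return None  # No candidate size was accurate enough.


if __name__ == "__main__":
    population = [i % 100 for i in range(200_000)]
    for size in (100, 1_000, 10_000):
        mean, margin = approximate_mean(population, size)
        print(f"n={size}: mean is about {mean:.2f}, margin {margin:.2f}")
```

Reading fewer rows gives a faster but looser answer; asking for a tighter margin forces a larger sample. A caller who states the acceptable error can let the system choose the cheapest sample that satisfies it.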

Many of these BDAS features correspond fairly directly to key components of Hadoop, and, to some extent, represent memory-based optimizations (Hive versus Shark, MapReduce versus Spark, etc.). However, BDAS goes beyond what is offered in Hadoop by providing integrated machine learning algorithms within the stack. The MLbase component contains both low-level machine learning implementations and higher-level abstractions that improve programmer productivity. Particularly interesting is the ML Optimizer component, which attempts to determine the appropriate machine learning algorithm for a given learning outcome. This layer would allow users to concentrate on the business or logical problem, rather than on specific algorithms.

The Berkeley Data Analytics Stack is relatively young. Yet, it is an impressively coherent framework, and Spark, in particular, is showing very rapid uptake. The Berkeley Data Analytics Stack doesn’t attempt to replace Hadoop as the foundation for big data—rather, it shows the probable shape of the next generations of solutions built on that core.