Real-Time Big Data With the Lambda Architecture

One feature of the big data revolution is the acknowledgement that no single database management system architecture can meet all needs. For instance, if you need a system that can reliably process ACID (Atomic, Consistent, Isolated, Durable) transactions, then the relational database is the way to go. If, however, you require a system that can maintain availability even when it’s split in two by a network outage, then you probably need a non-relational, eventually consistent system such as Cassandra. And, of course, for economical storage and processing of large amounts of raw unstructured data, Hadoop is the preferred solution.

As the tradeoffs and sweet spots of each technology become increasingly well understood, it becomes apparent that, for most enterprises, multiple technologies need to be used in conjunction with each other to satisfy enterprise requirements.

The Lambda Architecture, first proposed by Nathan Marz, combines technologies that together provide the characteristics of a web-scale system and satisfy requirements for availability, maintainability, and fault tolerance.

The Lambda Architecture consists of three layers: the batch layer, the serving layer, and the speed layer.

As new data is introduced to the system, it is processed simultaneously by both the batch layer and the speed layer. The batch layer is an append-only repository containing unprocessed raw data. It periodically or continuously runs jobs that create views of the batch data: aggregations, or representations of the most up-to-date versions of that data. These batch views are sent to the serving layer, where they are available for analytic queries.
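
The batch layer's two responsibilities can be illustrated with a minimal Python sketch. It is not how Hadoop is programmed; the page-view events and the names `master_dataset`, `append_event`, and `build_batch_view` are hypothetical, chosen only to show an append-only store and a view that is recomputed from scratch.

```python
from collections import Counter

# Hypothetical append-only master dataset: each raw event is (hour, page).
master_dataset = []

def append_event(hour, page):
    """Record a raw event. Existing events are never updated or deleted."""
    master_dataset.append((hour, page))

def build_batch_view(events):
    """Recompute the entire view from the raw events, as a batch job would."""
    return Counter(page for _, page in events)

append_event(8, "/home")
append_event(8, "/pricing")
append_event(9, "/home")

# The resulting view is what would be handed to the serving layer.
batch_view = build_batch_view(master_dataset)
print(batch_view["/home"])  # 2
```

Because the view is always derived from the full set of raw events, a bug in the job can be fixed simply by rerunning it.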

While data is being appended to the batch layer, it is also streaming into the speed layer. The speed layer is designed to allow queries to reflect the most up-to-date information, which is necessary because the serving layer’s views can only be created by relatively long-running batch jobs. The speed layer computes only the data needed to bring the serving layer’s views to real time: for instance, calculating totals for the past few hours that are missing from the serving layer’s view.
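
As a rough sketch of that idea, the Python class below keeps incremental counts per hour and discards any hour that a published batch view already covers. The class `SpeedLayer`, its methods, and the hourly bucketing are assumptions made for illustration; a real speed layer would typically be built on a stream processor and a database rather than in memory.

```python
from collections import Counter, defaultdict

class SpeedLayer:
    """Hypothetical in-memory real-time view, bucketed by hour."""

    def __init__(self):
        self.buckets = defaultdict(Counter)  # hour -> page -> count

    def on_event(self, hour, page):
        """Update the real-time view incrementally as each event streams in."""
        self.buckets[hour][page] += 1

    def expire_through(self, last_batch_hour):
        """Drop hours that a newly published batch view already covers."""
        for hour in [h for h in self.buckets if h <= last_batch_hour]:
            del self.buckets[hour]

    def view(self):
        """Return totals for only the hours the batch view is still missing."""
        total = Counter()
        for counts in self.buckets.values():
            total.update(counts)
        return total

speed = SpeedLayer()
speed.on_event(9, "/home")
speed.on_event(10, "/home")
speed.on_event(10, "/pricing")

# Assume a batch view covering every hour up to and including 9 was just published.
speed.expire_through(9)
realtime_view = speed.view()
print(realtime_view["/home"])  # 1
```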

By merging data from the speed and serving layers, low-latency queries can draw on computationally expensive batch processing and still include real-time data.
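
For a simple additive metric such as a count, the merge can be as small as the sketch below. The function `query_page_views` and the sample figures are hypothetical, and the sketch assumes the two views cover non-overlapping time ranges so that nothing is counted twice.

```python
def query_page_views(page, batch_view, realtime_view):
    """Combine the precomputed batch total with the recent real-time total."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

# Hypothetical views: the batch view covers history, the real-time view
# covers only the hours since the last batch job completed.
batch_view = {"/home": 1200, "/pricing": 310}
realtime_view = {"/home": 17, "/signup": 4}

print(query_page_views("/home", batch_view, realtime_view))    # 1217
print(query_page_views("/signup", batch_view, realtime_view))  # 4
```

Metrics that are not additive, such as unique visitor counts, need a more careful merge than simple addition.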

As you might imagine, the batch layer is usually implemented in Hadoop, which excels at append-only data ingest and batch processing. Serving and speed layers are usually implemented with a non-relational transactional database such as HBase, Cassandra, or ElephantDB (a specialized key-value database for Hadoop results). Kafka, the popular open source messaging middleware, is often used as a common staging layer for new data, while the open source Storm technology is frequently used to process streaming data flowing into the speed layer.

In the Lambda Architecture, the raw source data is always available, so redefinition and re-computation of the batch and speed views can be performed on demand. The batch layer provides a big data repository for machine learning and advanced analytics, while the speed and serving layers provide a platform for real-time analytics.
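
The value of keeping the raw data can be seen in a small Python sketch: when the definition of a view changes, the new view is derived from the same untouched events. The sample events and the functions `page_view_counts` and `unique_visitors` are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events, stored exactly as they arrived: (user, page).
raw_events = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/home"),
    ("alice", "/pricing"),
]

def page_view_counts(events):
    """Original view definition: total hits per page."""
    counts = defaultdict(int)
    for _, page in events:
        counts[page] += 1
    return dict(counts)

def unique_visitors(events):
    """Redefined view: distinct users per page, from the same raw data."""
    visitors = defaultdict(set)
    for user, page in events:
        visitors[page].add(user)
    return {page: len(users) for page, users in visitors.items()}

views_v1 = page_view_counts(raw_events)
views_v2 = unique_visitors(raw_events)
print(views_v1["/home"], views_v2["/home"])  # 3 2
```

Had only the aggregated counts been stored, the second view could not have been computed after the fact.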

The Lambda Architecture is not the only proposal for a big data architecture, and it has its critics: Jay Kreps of LinkedIn praises many aspects of the architecture, but feels the duplicate processing of input data leads to an inherent inefficiency that eventually must be overcome.

The Lambda Architecture provides a useful pattern for combining multiple big data technologies to achieve multiple enterprise objectives. But it definitely isn’t the final destination for big data architectures. We can expect to see more of these types of architectural patterns evolve as big data crosses the chasm from early adoption to mainstream.