Kafka Is the New Standard in Big Data Messaging

It’s become almost a standard career path in Silicon Valley: A talented engineer creates a valuable open source software commodity inside of a larger organization, then leaves that company to create a new startup to commercialize the open source product. Indeed, this is virtually the plot line for the hilarious HBO comedy series, Silicon Valley.

Jay Krepes, a well-known engineer at LinkedIn and creator of the NoSQL database system, Voldemort, has such a story. While at LinkedIn, he developed the Kafka software—which was open sourced and became a top-level Apache project—and he is now the co-founder of Confluent, a company focused on Kafka.

At first glance, Kafka appears to implement a well-established software pattern—that of message-oriented middleware. Message-oriented middleware systems follow a publish/subscribe model in which data can be added to “topics” or “queues” by publishers and consumed from those queues by subscribers. Message-oriented middleware allows for asynchronous communication between elements in a software architecture, thus achieving the often desirable “loose coupling” that allows for a greater level of availability, scalability, and independence in complex multi-tier systems.

The best known message-oriented middleware products are probably MQSeries—an IBM mainframe product developed more than 20 years ago—along with products based on the Advanced Message Queuing Protocol (AMQP) standard and those based on the Java Messaging Service (JMS) specification.

Superficially, Kafka provides services similar to JMS or AMQP products. However, Kafka is more specifically oriented toward scenarios that typify big data systems and are ubiquitous in modern services-oriented software architectures. In particular, the Kafka architecture is explicitly designed to be distributed across a large number of nodes, capable of dealing with large data backlogs, and oriented toward streaming real-time data feeds.

Since the earliest days of formal big data—in other words, since the earliest days of Hadoop—machine-generated data has been one of the most fruitful sources of data. The “data exhaust” generated by many applications can be found in a variety of log files, and these can be loaded into Hadoop using the Flume tool. However, Flume requires that the data be loaded into Hadoop before analysis. Kafka, conversely, can accept streaming input data and split the stream toward both a real-time processing application as well as to Hadoop for offline analytics. Kafka also integrates well with “post-Hadoop” technologies such as Spark and Storm.

Modern web applications—particularly those following the 12–factor application guidelines (—often rely on streaming log file output as a primary source of application state. Modern web applications, particularly those employing a service-oriented architecture, may be generating dozens or more log file streams. These streams need to be collated and processed to manage application state; Kafka is ideal for this task.

Kafka has rapidly become the messaging system of choice for companies building large systems based on modern open source software stacks. As well as continuing use within parent LinkedIn, it is used at Twitter, Netflix, Spotify, Pinterest, AirBnB, Cisco, and elsewhere. Kafka also has a strong affinity with big data technologies such as Hadoop, Spark, and Storm.

Kafka seems ideal as a framework for handling the massive streams of data that are increasingly generated by Internet of Things applications, and its orientation toward massive volume and distributed processing are well-suited to the IoT age. Kafka appears likely to become as important to the modern data architecture as other key Apache technologies such as Hadoop and Spark.