Page 3 of 3

The Legacy of Hadoop

All along the way, the goal has been real-time solutions: applications that can operate as events happen. These applications require low latency and predictable, persistent storage. NoSQL databases and stream processing engines are on the rise, and stream processing engines in particular have proliferated over the last 3 years. Apache Storm, Flink, Apex, Samza, Gearpump, and Spark Streaming have all appeared to compete for mindshare, and all have done a fairly decent job of attaining it. Adding to this influx of stream processing engines is a growing number of alternative NoSQL databases.

When combining many different data sources and destinations, it is inevitable that people will want higher-level constructs for working with them. Workflow, or pipeline, management has a few options in Cask, Streamsets, and Nifi. Nifi tends to gain the most attention due to the marketing budget behind it, but it is truly the most concerning of the choices in this category. It makes many claims of “full back pressure support,” implying that there is no need to worry about what happens when more events arrive than can be processed in a reasonable timeframe. In practice, when events cannot move through a workflow fast enough, the entire workflow can back up to the entry point and data will be lost. While Streamsets may not have all of the adapters that Nifi has, it does have more enterprise integrations and a simpler workflow management and distribution model, which puts it above Nifi. Cask can handle the same workloads as the others, but it also has full back pressure support and moves into real-time processing with the ability to add microservices into the mix.
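The back pressure idea above can be made concrete with a minimal sketch, not tied to any of the tools named here: a bounded buffer between pipeline stages forces a fast producer to slow down when the consumer falls behind, rather than silently dropping events.

```python
import queue
import threading
import time

# Illustrative sketch of back pressure (the names here are hypothetical,
# not any tool's actual API): a bounded queue sits between two pipeline
# stages. When the consumer lags, the producer's put() blocks until space
# frees up, so the producer is throttled instead of events being lost.

def run_pipeline(num_events, queue_size):
    buf = queue.Queue(maxsize=queue_size)  # bounded: the back-pressure point
    processed = []

    def consumer():
        while True:
            item = buf.get()
            if item is None:            # sentinel: no more events
                break
            time.sleep(0.001)           # simulate a slow downstream stage
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in range(num_events):
        buf.put(event)                  # blocks when buf is full
    buf.put(None)
    worker.join()
    return processed

events = run_pipeline(num_events=100, queue_size=5)
```

The producer here generates events far faster than the consumer can drain them, yet every event arrives in order; with an unbounded or overflowing buffer, the same mismatch is exactly where data loss occurs.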

What’s Ahead

Hadoop absolutely pushed the envelope and caused the industry to rethink how workloads of different types could scale out. The projects surrounding Hadoop are what give us hope in this technology revolution. What is remarkable is the realization that an API can cause a revolution. It is all too easy to forget that interfaces such as ODBC and JDBC helped simplify interacting with relational databases. The first implementation of an API shows people what can be done, but innovations under the API are what drive industry change. NFS and POSIX, introduced long ago, are standard APIs that have driven the data center for more than 30 years. The Kafka API has caused a major shift of its own, creating flexible infrastructure that enables those stream processing engines and a move away from traditional message queuing systems that cannot scale at low cost.
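The shift the Kafka API represents can be sketched in a few lines. This is a toy model, not the real Kafka client: it shows the core idea of an append-only log that consumers read at their own offsets, in contrast to a traditional queue that deletes each message on delivery.

```python
# Toy sketch (hypothetical classes, not the actual Kafka API) of the
# log-centric model Kafka popularized: records are appended once and
# retained; each consumer tracks its own read offset independently.

class Log:
    def __init__(self):
        self.records = []                       # append-only storage

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1            # offset of the new record

    def read(self, offset, max_records=10):
        # Reading does not remove records, unlike a classic message queue.
        return self.records[offset:offset + max_records]

log = Log()
for event in ["clickstream", "sensor", "audit"]:
    log.append(event)

# Consumer A reads from the head and advances its own offset.
offset_a = 0
batch_a = log.read(offset_a)
offset_a += len(batch_a)

# Consumer B joins later and can replay the full history from offset 0,
# because consumer A's reads did not consume anything.
batch_b = log.read(0)
```

Because delivery never destroys data, many independent stream processors can share one log, which is what makes the model cheap to scale compared with per-consumer queues.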

The HDFS API is the legacy of Hadoop. It will remain standing because, after a series of iterations, it became a well-thought-out abstraction for accessing data in a distributed fashion. It abstracted the underlying implementation and enabled innovation to occur. The projects in this space do not depend on HDFS itself; they depend on the API. These projects were adopted quickly precisely because they were enabled by that API.
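The point about depending on the API rather than the implementation can be sketched as follows. The interface and class names here are hypothetical, chosen only to illustrate the pattern; they are not the actual HDFS API.

```python
from abc import ABC, abstractmethod

# Hypothetical filesystem abstraction (illustrative, not the HDFS API):
# application code is written against the interface, so any conforming
# backend can be swapped in underneath without changing the application.

class FileSystem(ABC):
    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def write(self, path, data): ...

class InMemoryFS(FileSystem):
    """One possible backend; a distributed store could implement the
    same interface and the application code below would not change."""
    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data

def word_count(fs, path):
    # Depends only on the FileSystem interface, never on a backend.
    return len(fs.read(path).split())

fs = InMemoryFS()
fs.write("/logs/events.txt", "one two three")
count = word_count(fs, "/logs/events.txt")
```

Swapping `InMemoryFS` for any other implementation of the interface leaves `word_count` untouched, which is the property that let projects target the HDFS API without depending on HDFS itself.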
