As applications that were not Hadoop began showing up on the scene, it became clear that the resource management capabilities built into Hadoop should be separated into a more flexible implementation. Resource management was a key element that had to be ripped out: it was too tightly coupled to the HDFS storage model and the MapReduce processing model, and it had to be teased out of both. The changes required to accomplish this led to a branch in the file system. Users who wanted the benefits of MapReduce v2 and its loosely coupled resource management had to reinstall HDFS and throw away MapReduce v1 and its tightly coupled resource management model.
YARN was born out of the need to separate resource management from the HDFS and MapReduce models. While YARN succeeds at managing resources for Hadoop-based workloads, it is at odds with the rest of the industry, which sees an entire data center, or even multiple data centers, as the resource pool to be managed in solving the business problems at hand. Hadoop acts as an island unto itself, erecting a wall between itself and the rest of the data center's resources. This is exactly the opposite of the problem Hadoop was meant to solve in the first place: taking down the walls between data silos. Data is just one type of resource; compute and memory management are critical to ensuring maximum utilization of a business's capital investment.
YARN took a myopic, self-serving approach to solving Hadoop's resource management issues: it looked only inward when it should have looked outward first. Early adopters of Hadoop had clamored for real-time application support and for the rest of the business to run in unison with Hadoop. Along the way, YARN continued to duct-tape on capabilities for which it was never designed, and it has now created another wall between Hadoop's resources and the rest of the data center's. The web servers and databases that run the business are better suited to resource managers such as Kubernetes and Mesos, which were built to solve data center management issues, not just Hadoop resource management issues.
Now consider the previous couple of years and look at Apache Spark. Spark is a general-purpose computation engine and a viable competitor to the MapReduce data processing model. It brought the ability to run many of the same algorithms that were built within Mahout, and in a way that let them execute far more quickly than MapReduce ever could. This flipped the industry on its side and raised many new questions: What exactly is Hadoop? Does Spark require Hadoop? Is Hadoop dead? How are these other platforms growing so quickly when it took Hadoop so long to gain a foothold?
The success in this space has had more to do with the surrounding and enabling technology projects than with Hadoop itself. These surrounding technologies were the key pieces enabling scalability, and the same concepts that allowed Hadoop to operate in a linearly scalable, distributed fashion have been leveraged by other projects. Apache ZooKeeper, for example, is arguably one of the most important open source projects. It not only helped the Hadoop ecosystem flourish, but it has been a major supporting force behind nearly every other scalable system built since its creation, yet it is not Hadoop.
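To make ZooKeeper's role concrete, here is a minimal sketch of the coordination primitive many of those systems rely on, leader election via ephemeral sequential znodes, using the standard ZooKeeper Java client. The connection string and the pre-existing /election parent znode are assumptions made for illustration, not part of any particular deployment.

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) three-node ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // Each candidate creates an EPHEMERAL_SEQUENTIAL znode under /election
        // (assumed to already exist). ZooKeeper appends a monotonically
        // increasing suffix, and the znode disappears if this session dies,
        // which is what makes failover automatic.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate holding the lowest sequence number is the leader.
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean leader = me.endsWith(children.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");

        // The session (and therefore the candidacy) lives until zk.close().
    }
}
```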
The Hadoop API
Arguably, the most important contribution to come from Hadoop has been its API. A standard API is critical to the success of any piece of software: it defines how other applications interface with it, and when everyone uses a different API, there is no agreement on a standard. Because the HDFS API showed people how to interact with data in a distributed fashion, it has been the linchpin holding up what has been called the Hadoop ecosystem.
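To show what that API looks like in practice, here is a minimal sketch of writing and reading a file through org.apache.hadoop.fs.FileSystem, the abstract class at the center of the HDFS API. The file path is hypothetical, and the sketch assumes a configured Hadoop client (core-site.xml on the classpath).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsApiSketch {
    public static void main(String[] args) throws Exception {
        // Load the default configuration from the classpath.
        Configuration conf = new Configuration();

        // FileSystem is the abstraction: this returns whichever
        // implementation fs.defaultFS points at.
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt"); // hypothetical path

        // Write a small file (second argument: overwrite if it exists).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, distributed world\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```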
The API was abstracted from the data storage in the very early days of Hadoop, which enabled HDFS alternatives to be created. The first implementation of the HDFS API was of course HDFS itself, but the second was created by Amazon to distribute computation over data stored in S3. This opened the doors for those who wanted to run the tools built for Hadoop in the cloud on Amazon. Shortly thereafter, the MapR Converged Data Platform was created, which supports the HDFS API in both on-premises and cloud installations.
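That pluggability is visible in the client code itself: the URI scheme selects the FileSystem implementation, so the same calls can target HDFS, S3, or another backend without changing the program. A brief sketch, assuming the hadoop-aws module and its s3a:// connector are on the classpath with credentials configured; the bucket name is hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The scheme chooses the implementation: hdfs:// resolves to HDFS
        // itself, while s3a:// (from the hadoop-aws module) talks to S3.
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

        // Identical FileSystem calls, entirely different storage underneath.
        for (FileStatus status : s3.listStatus(new Path("s3a://my-bucket/data/"))) {
            System.out.println(status.getPath());
        }
    }
}
```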