
The Legacy of Hadoop


In a world where new technologies are often presented to the industry as rainbows and unicorns, there is always someone in a cubicle trying to figure out how to solve business problems and make these great new technologies work together. The truth is that each of these technologies takes time to learn, and it also takes time to identify the problems each of them can actually solve. Recently, in a discussion with a Gartner analyst, I was told that a survey found only 17% of all Hadoop installations were successful and in production. Let's break this down and try to understand how the industry arrived at its current state when it comes to linearly scalable distributed systems.

In the early days of Hadoop, people were excited because what sat in front of them was a distributed computational technology that scaled linearly. Hadoop paired a data processing model, MapReduce, with a distributed data storage platform.

Hadoop Data Storage

The data storage part, the Hadoop Distributed File System, otherwise known as HDFS, was built with the knowledge that hardware fails, and fails often. It was, however, built for storing copies of webpages and logs, and it was created as a write-once file system, meaning that files couldn't be edited once written, much akin to a CD-ROM. Disaster recovery was a complete afterthought because the belief was that all of the accumulated data could simply be re-accumulated. HDFS was never really focused on meeting the complete set of enterprise needs that other enterprise systems are expected to satisfy, and there have been a variety of issues with it over its life, ranging from data corruption to major security concerns.
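To make the write-once model concrete, here is a minimal sketch using the Hadoop Java FileSystem API. The namenode address and file path are assumptions for illustration; the point is that a file can be streamed out and closed, but never edited in place afterward, only appended to (on later HDFS releases) or deleted and rewritten.

// Minimal sketch of HDFS's write-once model using the Java FileSystem API.
// The cluster URI and path below are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path logFile = new Path("/logs/web-access.log");

            // A file is created and written as a single stream...
            try (FSDataOutputStream out = fs.create(logFile)) {
                out.writeBytes("GET /index.html 200\n");
            }

            // ...but once closed it cannot be edited in place. The only options
            // are appending (on newer HDFS releases) or deleting and rewriting:
            // fs.append(logFile);
            fs.delete(logFile, false); // "updating" means replacing the file
        }
    }
}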

For enterprises to adopt HDFS as a central storage facility for all their data, they need to ensure that the data is secure. Recently it came out in the news that insecure Hadoop clusters had exposed more than 5,000 terabytes of data to malicious individuals and organizations. This has opened the door to malware infecting the data stored in Hadoop clusters and to nefarious actors ransoming the data on the clusters that were infiltrated. This type of technology needs to be foolproof when it comes to security and data resiliency.

The data processing model, on the other hand, MapReduce, has emerged as a pretty successful compute engine for processing large volumes of data in batches. It is exceedingly fault-tolerant, but it is historically known to be slow when built into processing pipelines because data must be written to disk between stages only to be read back from disk in the next step of the pipeline.
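The classic illustration of this model is word count, sketched below with the MapReduce Java API: mappers emit (word, 1) pairs, the framework shuffles and sorts them by key, and reducers sum the counts. The input and output paths are placeholders, and the closing comment points at the batch-to-batch disk round trip described above.

// A minimal word-count sketch of the MapReduce model: map, shuffle by key, reduce.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));       // assumed paths
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount"));
        // Results land on HDFS; a follow-on job must read them back from disk,
        // which is the batch-to-batch latency described above.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}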

Evolution of Projects Surrounding Hadoop

After testing Hadoop with some batch use cases, people were buzzing with possibilities for applying this technology stack to real-time and machine learning problems. The limitations of the technology took time to work out, but in the meantime HBase, a key-value database, was created to provide real-time access to data stored on HDFS. The design of HBase had to work around a write-once file system, which creates a rather significant hurdle when implementing updates and deletes. The HBase design has been shown to be suboptimal due to the inherent limitations of HDFS. Numerous other projects sprang up on top of HBase, all suffering the same limitations and workarounds, but this was the closest thing that emerged for real-time application development on Hadoop.
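As a rough sketch of how those workarounds look to an application developer, the snippet below uses the HBase Java client (1.x-style API) against a hypothetical "users" table with a "profile" column family. An update is really a new timestamped version of a cell and a delete is really a tombstone marker; neither edits the immutable files underneath, and the old data is only reclaimed later during compaction.

// Sketch of HBase updates and deletes over a write-once file system.
// Table and column family names are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUpdateSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            byte[] row = Bytes.toBytes("user-42");
            byte[] family = Bytes.toBytes("profile");

            // "Update": a new version of the cell is written; the old version
            // stays in the immutable store files until compaction rewrites them.
            Put put = new Put(row);
            put.addColumn(family, Bytes.toBytes("email"), Bytes.toBytes("new@example.com"));
            table.put(put);

            // "Delete": a tombstone marker is written; the data is masked on reads
            // and physically removed only when the store files are compacted.
            Delete delete = new Delete(row);
            delete.addColumn(family, Bytes.toBytes("email"));
            table.delete(delete);
        }
    }
}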

Machine learning took a while to really mature on Hadoop, with the standout being Apache Mahout. It was effective with certain algorithms, but not so much with others. This opened the door to other projects that wanted to play in the space.

Users wanted to apply the technology stack to a diverse set of analytics use cases, often with data stored in physically disparate locations (e.g., different data centers). On top of that, they wanted to build a wide variety of business applications on a linearly scaling technology. Hadoop's lack of complete dominance and success comes down to a relatively narrow foundation for how data is physically stored and accessed, and that narrowness left the door open to competing platforms.
