Reflections on 10 Years of Hadoop and Big Data

It is hard to think of a technology that is more identified with the rise of big data than Hadoop. Since its creation, the framework for distributed processing of massive datasets on commodity hardware has had a transformative effect on the way data is collected, managed, and analyzed – and has also grown well beyond its initial scope through a related ecosystem of open source projects. With 2016 recognized as the 10-year anniversary for Hadoop, Big Data Quarterly took the opportunity to ask technologists, consultants, and researchers to reflect on what has been achieved in the last decade, and what’s ahead on the horizon.

Hadoop at the center of a reinvention of enterprise computing: “After 10 years, Hadoop could hardly be more significant,” said Edd Dumbill, vice president, marketing & strategy at Silicon Valley Data Science. According to Dumbill, two major changes have taken place in data processing and Hadoop has embodied both. The first is a move to distributed, scale-out computing, where the cost of processing data scales linearly with volume, and the second is the ability to process and reprocess data without being bound to a single schema. “We’re seeing nothing less than a reinvention of enterprise computing, where it becomes more agile and ready to scale and flex with business needs. Most importantly, all this has happened under the aegis of open source, which has accelerated the change and provides users with the assurance of stability and longevity.”

Beyond Hadoop HDFS: “The big data movement was sparked by a unique combination of business and technology drivers. Google, Amazon, and others demonstrated decisive competitive advantages that could be achieved through the smart analysis of large amounts of fine-grained data,” notes Guy Harrison, an executive director of R&D at Quest and the architect of Dell's Spotlight family of diagnostic products. “Hadoop represented the technology enabler that allowed companies of smaller scale to economically store and process such data at volumes that previously would have been unthinkable. Hadoop was never the end of the journey though; technologies such as Spark represent enhancements on the theme of economical and timely processing of mass unstructured and semi-structured data. Big data revolution technologies continue to evolve from Hadoop’s foundations.”

Hadoop grows up: “In 2009, I predicted companies that consider data insights to be their differentiator would fully adopt Hadoop as their data platform.  Unfortunately, the initial immaturity of Hadoop and its inability to provide interactive queries delayed immediate adoption by the enterprise,” observed Joe Caserta, president and CEO of Caserta Concepts, a consultancy specializing in data warehouse architecture and design, BI, and master data management.  “Luckily, there have been many advances in recent years and Hadoop now supports prevailing corporate requirements. I’m bullish on Hadoop as it has found its place within the corporate data ecosystem; not to replace the data warehouse, but to fit as a complementary component called the data lake.” According to Caserta, Hadoop-based data lakes will help make data management and data governance more efficient, allowing businesses to defer expensive, time-consuming data preparation work by letting trusted users immediately explore data in a semi-governed, semi-structured environment until a business value is identified. “Hadoop allows data to be more agile and enables the business to draw more meaningful conclusions faster,” he noted.

The move toward convergence: “Hadoop 10 years ago was ground zero and Hadoop’s anniversary is significant, especially when you realize the 10-year market adoption of Hadoop has far outpaced that of Linux/Unix and relational databases,” said Jack Norris, chief marketing officer, MapR Technologies. “The robust and dynamic Hadoop ecosystem has transformed how companies store, analyze, and process data, but more importantly, it has transformed their businesses. While no one has a crystal ball to predict where Hadoop will be 10 years in the future, what we see now is the move towards convergence – combining Hadoop with Spark, storage, NoSQL, and streaming capabilities on one unified cluster – to create global real-time data applications.”

Hadoop ushered in the era of analytics for all: “As we look back over these past 10 years, we see a vastly changed landscape from 2006 and the pre-Hadoop days,” said Joe McKendrick, Unisphere Research lead analyst. “There has always been some form of ‘big data’ in existence – after all, ‘big’ is a relative term – and we have long had networks of sensors, devices, embedded applications, remote systems, and log files providing gobs of streaming data.” Prior to Hadoop,  said McKendrick, capturing and doing analysis on this data required proprietary tools, and was an expensive and resource-intensive undertaking. “Hadoop and the open source ecosystem that accompanied it has made big data analytics an extremely cost-effective option within reach of everyone.”

Beyond all expectations: “When I first saw Hadoop, my initial thought was ‘Who writes system software in Java?!’” reflects Arun C. Murthy, cofounder of Hortonworks, an Apache Hadoop PMC (project management committee) member and full-time contributor to the project since its inception in 2006. “Hadoop needed a lot of work back in those early days. When we started on Hadoop at Yahoo, it worked on barely 2-3 machines. We invested a lot in improving Hadoop, so much so that, at one point, we had 100-plus people working on it.”

In those early days, said Murthy, “We thought Hadoop could be similar to the Apache HTTP Server Project in some ways, but Hadoop has become much larger. Many enterprises – financial institutions, retailers, insurers – are basing their data strategy on Hadoop. The pace of innovation in the community has been mind-blowing. Projects like Spark are gaining so much momentum, and the sheer number of projects has been fun to watch. Hadoop has redefined what is possible with data. As you look ahead, it is getting cheaper and cheaper to store and process data. With sensors and LTE, it is cheaper and easier to move data around the world. This combination is leading to a lot of new business use cases for the enterprise, and Hadoop is part of it. Forrester recently estimated that 100% of all large enterprises will adopt Hadoop for big data analytics within the next 2 years. That’s a pretty big shift in the IT world.”

New applications on Hadoop: With its heritage in batch processing, most of the emphasis in Hadoop has been on analytics. SQL-on-Hadoop projects like Hive, Impala and Drill focused on analytical-only workloads by improving ease-of-use with SQL and analytical performance on read-only or append-only data, commented Monte Zweben, CEO of Splice Machine, a transactional RDBMS powered by Hadoop and Spark. However, as Hadoop becomes a key data management platform in many companies, it needs to support operational applications and mixed workloads with simultaneous operational and analytical queries, he noted. Using HBase, projects like Phoenix, Trafodion and Splice Machine are leading the way to support operational applications on real-time data. “I expect that a majority of the new applications in 10 years will be built on Hadoop,” said Zweben.

The next 10 years of the Hadoop ecosystem:  “Inspired by Google, Hadoop was born out of a need for everyone to manage and gain insights from vast amounts of data at a scale that no one had tackled before. Welcome to the Information Age where, going forward, we will see an even broader data technology convergence where analytics systems will be equally good at transactions with the same data,” said John O’Brien, principal analyst and CEO of Radiant Advisors, a research and advisory firm. It is still early days for this hybrid transaction model in Hadoop, he notes, but the long-term goal is a single data management layer to reduce duplication with an application layer to support operational transactions, streaming data (IoT), and advanced analytics. 

“Hopefully, in 10 years,” said O’Brien, “the Hadoop ecosystem will equally represent both operational and analytic needs at scale for the business with a single managed set of all the data. There will undoubtedly be the next hyped technology, unforeseen shiny object down the road, but it will be within this paradigm.”


