How Businesses Are Driving Big Data Transformation

Jan 12, 2015

By John O’Brien


While HDFS will continue to evolve as the caretaker of data in the distributed file system architecture, with improved NameNode high availability and performance, YARN, introduced in Hadoop 2, completely changes the paradigm of data engines and access. The primary role of YARN is still that of a resource negotiator for the Hadoop cluster, focused on managing the resource needs of tens of thousands of jobs, but it has now also established a new framework.

The YARN framework serves as a pluggable layer of YARN-certified engines designed to work with the data in different ways. Previously, MapReduce was the primary programming framework for developers to create applications that leveraged the parallelism of the data nodes. Because other projects and data engines could work with HDFS directly without MapReduce, a centralized resource manager was needed that would also enable innovation for new data engines. MapReduce became its own YARN engine for existing Hadoop 1 legacy code, and Hive was decoupled to work with the new Tez engine. Long recognized as ahead of the curve, Google caused quite a furor when it announced that “MapReduce was dead” and that it would no longer develop in it. YARN was positioned for the future of next-generation engines.


Sometimes in 2014, we felt that the booming big data drum was starting to die down. At other times, we wondered if it only seemed that way because everyone was chanting “Storm” just a bit louder. Another major driver in Hadoop implementations was the realization that “big” data didn’t mean “fast” data. The industry wanted both big and fast: Spark is where early adopters were writing new applications, and the development community was quickly developing Spark into a high-level project to meet those needs. The Spark community touts it as “lightning-fast cluster computing,” primarily leveraging the in-memory capabilities of the data nodes, but also offering a newer, faster framework than MapReduce on disk. While Spark was in its infancy in 2013, we saw this need for big data speed being tackled by two-tier distributed in-memory architectures. Today, Spark is a framework for Spark SQL, Spark Streaming, machine learning, and GraphX running on Hadoop 2’s YARN architecture. In 2014, this was very exciting for the industry, but many mainstream adopters are patiently waiting for the early adopters to do their magic.

Two Camps: Early Adopters and Mainstream Adopters

For years, overwhelming data volumes, complexity, or data science endeavors were the primary drivers behind early big data adopters. Many of these early adopters were in internet-related industries, such as search, e-commerce, social networking, or mobile applications that were dealing with the explosion of internet usage and adoption.

In 2014, we saw mainstream adopters become the next wave of big data implementations, a wave that is expected to be multiple times larger than that of the early adopters. We define mainstream adopters as those businesses that seek to modernize their data platforms and analytics capabilities for competitive opportunities and to remain relevant in a fast-changing world, but that temper this with time to research, analyze, and adopt while maintaining current business operations. Mainstream adopters have run pilots and proofs of concept for the past year or two with one or two Hadoop distributors and are now deciding how this fits within their overall enterprise data strategy.

Leading the way for mainstream adopters means, as a consequence, meeting enterprise and IT requirements for data management, security, data governance, and compliance across a new, more complicated set of data that includes public social data, private customer data, and third-party data enrichment, stored both in the cloud and on-premises. Over the past year, it has often felt like the fast-driving big data vehicle hit some pretty thick mud to plow through, and some in the industry argued that forcing Hadoop to meet the requirements of enterprise data management was missing the point of big data and data science. For now, we have seen most companies agree that risk and compliance are things they must take seriously moving forward.

Mainstream Adopters Redefining Commodity Hardware

As mainstream adopters worked through data management and governance hurdles for enterprise IT, next up was the startled exclamation: “I thought you said that was cheap commodity hardware?!” This has become an interesting reminder of the roots of big data and of how commodity hardware differs from enterprise-class IT hardware.
