Hadoop and the Big Data Revolution

Oct 10, 2012

By Guy Harrison

It’s in the nature of hype bubbles to obscure important new paradigms behind a cloud of excitement and exaggerated claims. For example, the phrase “big data” has been so widely and poorly applied that the term has become almost meaningless. Nevertheless, beneath the hype of big data there is a real revolution in progress, and more than anything else it revolves around Apache Hadoop.

Let’s look at why Hadoop is creating such a stir in database management circles, and identify the obstacles that must be overcome before Hadoop can become part of mainstream enterprise architecture. To do that, it helps to look at the factors that drove the last revolution in databases – the relational revolution of the 1980s.

The Last Data Revolution

The database revolution of the 1980s led to the triumph of the relational database (RDBMS). The relational model was based on well-formulated theoretical models of data representation and storage. But it was not the theoretical advantages of the RDBMS that drove its rapid adoption. Rather, the RDBMS gave business easy access to the data held in production systems for the first time, enabling the birth of modern business intelligence systems.

Prior to the relational database and the relatively easy and flexible SQL language, virtually every request for information required the attention of a professional programmer, who would typically satisfy the request by writing hundreds or thousands of lines of COBOL. The ability of the relational database to accept ad hoc SQL queries opened up the database to those outside the IT elite. The logjam of report requests in MIS departments was eliminated, and relatively unskilled staff could quickly extract data for decision-making purposes.

Hadoop is revolutionizing database management for a similar reason – it is unlocking the value of the masses of enterprise data that are not stored in an RDBMS. And it’s this non-relational data – data generated by weblogs, point-of-sale devices, social networks, and mobile devices – that offers the most potential competitive differentiation today.

In the 1980s, the relational paradigm became so powerful that almost every vendor described their database as relational, regardless of the underlying architecture. In a similar way, almost every data technology on the market today is described as a “big data” solution. But the reality is that of all the open source and commercial technologies that can claim to offer a “big data” solution, Hadoop is far ahead in terms of adoption, fitness for purpose, and pace of innovation.

Why The Fuss Over Big Data?

The volumes of data in database systems have been growing exponentially since the earliest days of digital storage. A lot of this growth is driven simply by Moore’s law: The density of digital storage doubles every year or two, and the size of the largest economically practical database increases correspondingly. “Because we can” explains much of the growth in digital data over the last generation.

But something different is behind today’s “big data” revolution. Prior to the internet revolution, virtually all enterprise data was produced “by hand”: employees using online systems to enter orders, record customer details and so on. Entire job categories once existed for people whose duty was to enter data into computer systems – occupations such as Key Punch Operator and Data Entry Operator.

Today, only a small fraction of a company’s data assets are manually created by employees. Instead, the majority of data is either created by customers or generated as a by-product of business operations. For instance, customers generate “click streams” of web navigation as a by-product of their interactions with the online business. Supply chain systems generate tracking information as orders are fulfilled. And often customers and potential customers post their reviews, opinions and desires to the internet through systems like Twitter, Yelp, Facebook, and so on.

This “data exhaust” and social network noise would be of only minor interest if it weren’t for the simultaneous – and not entirely coincidental – development of new techniques for extracting value from masses of raw data, techniques such as machine learning and collective intelligence.

Collective Intelligence Beats Artificial Intelligence

Prior to the big data revolution, software developers attempted to create intelligent and adaptive systems largely as rule-based expert systems. These expert systems attempted to capture and emulate the wisdom of a human expert. Expert systems had limited success in fields such as medical diagnosis and performance optimization but failed dismally when applied to tasks such as language parsing, recommendation systems, and targeted marketing.

The success of Google illustrated the way forward: Google provided better search results not simply by hard coding better rules, but by using its increasingly massive database of past searches to refine its results. During the same period, Amazon demonstrated the power of using recommendation engines to personalize the online shopping experience. Both Google and Amazon benefited from a virtuous cycle in which increasing data volumes improved the user experience, leading to greater adoption and even more data. In short, we discovered that the “wisdom of crowds” beats the traditional rule-based expert system.

The Amazon and Google solutions are examples of collective intelligence and machine learning techniques. Machine learning programs modify their algorithms based on experience, while collective intelligence uses big data sets to deliver seemingly intelligent application behavior.
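As a rough illustration of the collective intelligence idea, the sketch below recommends items purely from what other customers bought together, with no hand-written rules. It is a minimal, single-machine sketch and not a description of how Amazon or Google actually work; the purchase data, item names and the function names build_cooccurrence and recommend are all invented for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() guards against the same item being listed twice in one basket.
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, item, top_n=2):
    """Return the items most often bought together with `item`."""
    neighbours = counts.get(item, Counter())
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase history; every item name is made up for illustration.
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "fuel"],
]

counts = build_cooccurrence(baskets)
print(recommend(counts, "tent"))  # ['sleeping bag', 'lantern']
```

The recommendations improve only because more baskets are observed, which is the virtuous cycle described above: more data produces better suggestions without anyone encoding new rules.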

Web 2.0 companies such as Google and Amazon benefited the most from these big data techniques. Today we can see competitive advantage in almost every industry segment. Big data and collective intelligence techniques can increase sales in almost any industry by more accurately matching potential consumers to products. They can also be used to identify customers who are at risk of “churn” to a competitor, who might default on payments, or who otherwise warrant some personalized attention. They are also critical in creating the personalized experience that consumers demand in modern ecommerce.

Why Hadoop Works

Hadoop is essentially an open source implementation of the key building blocks pioneered by Google to meet the challenge of indexing and storing the contents of the web. From its beginning, Google encountered the three challenges that typify big data – sometimes called the “three Vs”:

Massive and exponentially growing quantities of data (Volume)
Unpredictable, diverse and weakly structured content (Variety)
Rapid rate of data generation (Velocity)

Google’s solution was to employ enormous clusters of commodity servers with internal disk storage. The Google File System (GFS) allowed all the disks across these servers to be treated as a single file system. The MapReduce algorithm was created to allow workloads to be parallelized across all the members of the cluster. By using disks in cheap commodity servers rather than disks in expensive SANs, Google achieved a far more economical and scalable data storage architecture than would otherwise have been possible.
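The shape of a MapReduce job can be sketched in a few lines. The example below is a single-process simulation of the classic word count, not Hadoop's or Google's actual API: the names map_phase, shuffle, reduce_phase and word_count are chosen for illustration only, and in a real cluster the map and reduce calls would run in parallel on different servers, with the framework performing the shuffle between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data big hype", "big revolution"]
print(word_count(docs))  # {'big': 3, 'data': 1, 'hype': 1, 'revolution': 1}
```

Because each map call sees only one document and each reduce call sees only one key, the work can be split across as many servers as there are documents or keys, which is what makes the approach suit clusters of cheap machines.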

By duplicating this classic Google architecture, Hadoop provides a practical, economical and mature platform for the storage of masses of unstructured data. Compared to the alternatives – particularly to RDBMS – Hadoop is:

Economical: Per GB, Hadoop costs an order of magnitude less than the high-end SAN storage that would typically support a serious RDBMS implementation.
Mature: The key algorithms of Hadoop have been field tested at Google, and massive Hadoop implementations have been proven at Facebook and Yahoo!. The Hadoop community is vibrant and backed by several commercial vendors with deep pockets – especially now that IBM, Microsoft and Oracle have all embraced Hadoop.
Convenient: RDBMS requires that data be analysed, modelled and transformed before being loaded. These Extract, Transform and Load (ETL) projects are expensive, risk-prone and time consuming. In contrast, the Hadoop “schema on read” approach allows data to be captured in its original form, deferring the schema definition until the data needs to be accessed.
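The “schema on read” idea in the last point can be illustrated with a small sketch. It assumes the raw events are stored as JSON text lines, which is only one possible format; the sample events and the function name read_with_schema are hypothetical. The data is kept exactly as it arrived, and each consumer supplies its own schema (field names and types) only at the moment of reading.

```python
import json

# Raw events are stored as they arrived, with no upfront modelling.
# These sample lines are invented for illustration.
RAW_EVENTS = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "alice", "page": "/cart"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields, cast types, None if missing."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two consumers read the same raw data through two different schemas.
page_views = list(read_with_schema(RAW_EVENTS, {"user": str, "page": str}))
timings = list(read_with_schema(RAW_EVENTS, {"page": str, "ms": float}))

print(page_views[1])  # {'user': 'bob', 'page': '/cart'}
print(timings[2])     # {'page': '/cart', 'ms': None}
```

Nothing had to be decided about the "referrer" or "ms" fields when the events were captured; a later analysis can start using them simply by asking for them in its schema.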

Hadoop is not the only possible or existing technology that could potentially deliver these benefits. But almost all significant vendors have abandoned development of alternative technologies in favor of embracing and embedding Hadoop. Most significantly, Microsoft, IBM, and Oracle now all deliver Hadoop integrated within their standard architectures.

Delivering on the Promises of Big Data

A technology like Hadoop alone doesn’t deliver the business benefits promised by big data. For big data to become more than just a promise, we’ll need advances in the skill sets of IT professionals, software frameworks that are able to unlock the data held inside Hadoop clusters, and a discipline of big data best practice.

It’s standard practice in a Hadoop project today to rely on highly skilled Java programmers with experience in statistical analysis and machine learning. These developers are in short supply and – given the complexity inherent in collective intelligence solutions – probably always will be. Nevertheless, universities should be constructing syllabuses less focused on the relatively routine world of web-based development in favor of course structures that include an emphasis on parallel data algorithms such as MapReduce together with statistical analysis and machine-learning techniques.

Given the shortage of skilled programmers, higher-level abstractions on top of the native Hadoop MapReduce framework are extremely important. Hive – a SQL-like access layer for Hadoop – and Pig – a scripting data flow language – both open Hadoop up to a wider range of users and to a wider range of software tools. Hive in particular is a key to enterprise Hadoop adoption because it potentially allows traditional BI and query tools to talk to Hadoop. Unfortunately, the Hive SQL dialect (HQL) is a long way from being ANSI-SQL-compliant. The Hadoop community should not underestimate how much enterprise adoption depends on increasing Hive maturity.

Hive can open up Hadoop systems to traditional data analysts and traditional Business Intelligence tools. But as a revolutionary technology, Hadoop and big data are more about breaking with existing traditions. Their unique business benefits lie in more sophisticated solutions that can’t be delivered by Business Intelligence tools.

Statistical analysis becomes particularly important as data granularity and volumes exceed what can be understood through simple aggregations and pivots. Long-time commercial statistical analysis vendors are rushing Hadoop connectivity to market, but by and large it’s been the open source R package that has been used most successfully with Hadoop. R lacks some of the elegant graphics and easy interfaces of the commercial alternatives, but its open source licensing and extensibility make it a good fit in the Hadoop-based big data stack.

Beyond statistical analysis lies the realm of machine learning and collective intelligence that powered much of Google’s and Amazon’s original success. Open source frameworks such as Apache Mahout provide building blocks for machine learning – low-level techniques such as clustering, categorization and recommenders. But it takes a very skilled team of developers to build a business solution from these building blocks.

Frameworks that can be used to deliver packaged big data solutions for the enterprise are only just emerging. A big opportunity could await the software vendor who can deliver a packaged application that brings collective intelligence and machine learning within the reach of mainstream IT.

Vive la Révolution!

The relational database is a triumph of software engineering. It defined database management for more than 25 years and it will continue as the dominant model for real-time OLTP and BI systems for decades to come.

However, it’s apparent that the volume and nature of today’s digital data demand a complementary but non-relational storage technology – and Hadoop is the leading contender. For many organizations, the majority of digital data assets will soon be held not in an RDBMS, but in Hadoop or Hadoop-like systems.

Hadoop provides an economically viable storage layer without which the big data revolution would be impossible. The revolution will not be complete until we have practical techniques for turning this big data into business value. Completing the big data revolution is going to generate demand for a new breed of IT professional and hopefully foster a new wave of software innovation.

About the author:

Guy Harrison can be found on the internet at www.guyharrison.net, on email at guy.harrison@quest.com and is @guyharrison on Twitter.