Hadoop Sets its Sights on Enterprise Computing

Google's first "secret sauce" for web search was the innovative PageRank link analysis algorithm, which successfully identifies the most relevant pages matching a search term. Google's superior search results were a huge factor in the company's early success. However, Google could never have achieved its current market dominance without the ability to return those results reliably and quickly.

From the beginning, Google needed to handle volumes of data that exceeded the capabilities of existing commercial technologies. Google therefore built clusters of inexpensive commodity hardware and created its own software frameworks to sift and index the data. Over time, these techniques evolved into the MapReduce approach. MapReduce allows data stored on a distributed file system - such as the Google File System (GFS) - to be processed in parallel by hundreds of thousands of inexpensive computers. Using MapReduce, Google is able to process more than a petabyte (one million GB) of new web data every hour.
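To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using word counting as the task. It is an illustration only, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are invented for this example, and a thread pool stands in for the cluster of machines that would each process one chunk of the data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: gather the values emitted by all mappers under their key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    """Run the three steps; each chunk is mapped independently of the others.

    The thread pool is only a stand-in (an assumption of this sketch) for
    the separate machines a real cluster would use.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Made-up sample input: each string plays the role of one file block.
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

Because each call to map_phase sees only its own chunk, and each call to reduce_phase sees only one key, both steps can be spread across as many machines as are available. That independence is what lets the approach scale on commodity hardware.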

Google published details of GFS in 2003 and of the MapReduce approach in 2004. Around that time, the Apache Nutch project, which was attempting to build an open source web search engine, decided to build an open source implementation of Google's techniques. The resulting technology was spun off into a separate project - Apache Hadoop - in 2006.

The Hadoop project gained significant early assistance from Yahoo! and, subsequently, from Web 2.0 companies such as Facebook. Today, the largest Hadoop clusters can apply thousands of computer cores and terabytes of RAM to workloads involving petabytes of data on disk.

Hadoop seems poised for wider adoption; many financial institutions, government bodies and other large organizations are piloting the use of Hadoop for large-scale data processing. For example, at the recent Hadoop World conference, Visa revealed details of a pilot project to apply Hadoop to the analysis of roughly 200 million Visa transactions per day.

Of course, to become enterprise-ready, Hadoop needs to be more than just an open source stack: an enterprise considering Hadoop wants manageability, security, monitoring, and interoperability with existing data stores. Cloudera is a relatively young company that provides software and services around the Hadoop core with the aim of commercializing the software stack. Cloudera offers certified distributions of Hadoop along with utility software, such as a desktop management console.

For certain workloads, MapReduce and Hadoop can outperform even the most expensive commercial RDBMS software and associated hardware - and can do so using much cheaper commodity hardware, and without expensive software licenses.  However, writing MapReduce routines in Java or another programming language is certainly not as easy as crafting a SQL statement that performs the same task. Providing a more user-accessible interface to Hadoop has, therefore, been a high priority for the Hadoop community.

Today, there are two popular ways to exploit Hadoop without having to code up a formal MapReduce program.  Pig - originally developed by Yahoo! - allows Hadoop jobs to be expressed in a scripting language that simplifies common MapReduce operations such as filtering, joins and aggregation. Hive, developed at Facebook, provides a SQL-like interface to Hadoop. If you are familiar with SQL, you can submit a SQL statement to Hive that will then be compiled to MapReduce jobs executed by Hadoop.
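The gap between a SQL statement and hand-written MapReduce code can be seen in a small sketch. The SQL in the comment below is a hypothetical query over a made-up "transactions" table; the Python underneath is a single-machine imitation of the map and reduce functions a programmer would otherwise have to write for the same filter-and-aggregate task. It is not the code Hive or Pig actually generates, and all names and data here are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical SQL-style statement expressing the task in one line:
#   SELECT merchant, SUM(amount_cents)
#   FROM transactions
#   WHERE amount_cents > 0
#   GROUP BY merchant


def mapper(row):
    """Filter and project: emit (merchant, amount) for qualifying rows."""
    if row["amount_cents"] > 0:
        yield row["merchant"], row["amount_cents"]


def reducer(merchant, amounts):
    """Aggregate: sum every amount emitted for one merchant."""
    return merchant, sum(amounts)


def run_job(rows):
    """Imitate the framework: map every row, group by key, then reduce."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Made-up sample rows, purely for illustration.
    transactions = [
        {"merchant": "A", "amount_cents": 1250},
        {"merchant": "B", "amount_cents": 400},
        {"merchant": "A", "amount_cents": 750},
        {"merchant": "B", "amount_cents": -400},  # refund, filtered out
    ]
    print(run_job(transactions))
```

Even in this toy form, the programmer must write the filter, the grouping key and the aggregation as separate pieces of code, whereas the SQL version states all three in a single declarative statement.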

Just as MapReduce has been pivotal at Google, Hadoop seems poised to become a critical technology in the wider world of data processing. MapReduce and Hadoop will never replace the relational database for all workloads, but there is little doubt that they represent an economically compelling alternative for many "big data" applications.