Does the World Really Need Another Hadoop Distribution?

It’s easy to become fatigued by the sheer number of vendors offering their own distribution or packaging of Hadoop. You can buy Hadoop from Cloudera, Hortonworks, Greenplum, MapR, Oracle, Dell, and Microsoft, just to name a few.

Not all of these offer a unique distribution of the Hadoop core, but all attempt to offer a unique value proposition through additional software utilities, hardware, or cloud packaging. Against that backdrop, Intel’s distribution of Hadoop appears, on the surface, to be an odd duck. Intel is not in the habit of offering software frameworks, and the brand, while ubiquitous, is not associated specifically with Hadoop, databases or big data software.

Intel’s fundamental argument is that it’s in a unique position to provide a Hadoop distribution that is optimized “from the silicon up.” This is a somewhat debatable claim, since Hadoop workloads, which run in a Java Virtual Machine (JVM), do not run particularly close to the metal. Intel does offer a number of highly optimized configurations for Hadoop, incorporating the latest Xeon processors, SSD-accelerated storage, and 10 Gigabit networking. However, it’s not clear that a tailored distribution of Hadoop is required to take advantage of these commendable hardware improvements: these reference architectures would accelerate any distribution.

On the software side, Intel has assembled most of the usual suspects from the Hadoop ecosystem: Mahout and R for data mining, and Lucene for full text search. They also have integrated GraphLab and GraphBuilder into their distribution to support processing of graph datasets. A graph dataset is one in which the relationships between objects or entities are the primary concern. GraphLab and GraphBuilder provide tools to create graphs from HDFS data, and run graph processing workloads using Hadoop.
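
To make the idea of a graph dataset concrete, here is a minimal sketch in plain Python. It does not use the GraphLab or GraphBuilder APIs; the edge records, entity names, and the helper functions `build_adjacency` and `in_degree` are hypothetical and exist only to show relationships being treated as the primary data.

```python
from collections import defaultdict

# Hypothetical edge records: (source entity, relationship, target entity).
# In practice these might be parsed from files stored in HDFS.
edges = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]


def build_adjacency(edge_records):
    """Group outgoing relationships by source entity."""
    adjacency = defaultdict(list)
    for source, relation, target in edge_records:
        adjacency[source].append((relation, target))
    return dict(adjacency)


def in_degree(edge_records):
    """Count how many relationships point at each entity."""
    counts = defaultdict(int)
    for _source, _relation, target in edge_records:
        counts[target] += 1
    return dict(counts)


graph = build_adjacency(edges)
popularity = in_degree(edges)
```

Even in this toy form, the questions being asked (who is connected to whom, and how often) are about the edges rather than the individual records, which is the kind of workload a graph processing tool is meant to run at scale.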

Project Gryphon is currently unique to the Intel distribution, and is yet another option for running SQL queries on top of Hadoop. While Apache Hive is fully integrated with Hadoop and offers SQL-like queries, it does not provide full ANSI SQL compliance and cannot easily satisfy low-latency queries. Gryphon is an attempt to provide ANSI SQL-92 compliant syntax, and uses caching, optimizer enhancements, and HBase integration to deliver faster execution time for queries demanding a quick response. Gryphon shares objectives with many of the other SQL-on-Hadoop projects, such as Cloudera’s Impala, Hadapt, and Greenplum’s HAWQ. Meanwhile, Hortonworks and others are working hard to bring similar capabilities to Hive via the Stinger initiative.

Probably the most significant contributions Intel has made in its distribution are in the realm of security. Project Rhino—an open source contribution sponsored by Intel and included within its distribution—aims to provide enterprise-level security around Hadoop. Rhino adds a common token-based authentication layer that allows for uniform authentication to all components of the distribution and single sign-on. Rhino also provides transparent hardware-enhanced encryption of data in Hadoop, and an auditing service. And, like the Accumulo project, Rhino provides for granular access control in HBase down to the individual data cell.

The objectives of the Rhino initiative are well thought out, and directly address one of the key obstacles to enterprise adoption of Hadoop. As an open source project, it could become a standard part of other distributions, although I wouldn’t count on instant adoption by the other Hadoop vendors.

Not surprisingly, given its excellent partnerships across the computer industry, Intel has support from a variety of vendors, including Oracle and SAP, and many of the innovations in its distribution show real promise. While it seems unlikely that the Intel distribution will become dominant, it does constitute a serious contribution to the Hadoop ecosystem.

About the author:

Guy Harrison is an executive director of R&D at Dell and author of the Oracle Performance Survival Guide (Prentice Hall, 2009). Contact him at