At Data Summit 2015 in New York, James Casaletto of MapR and David Teplow of Integra provided deep dives into the world of Hadoop, past, present, and future.
In his presentation, titled “The Hadoop Ecosystem,” Casaletto recounted how Hadoop, now the predominant big data platform for storing and analyzing data and the center of a broad ecosystem, was born in 2004 with the release of white papers on the Google File System and MapReduce. Casaletto went through the steps that led to its place in the data world today: he said that in 2006 it was a top-level Apache project, that 2008 brought the HBase and Mahout projects, and that the most recent releases, in 2015, are Drill and Myriad.
“Drill is all things data analysis through an ANSI SQL interface to analyze multiple data sources, including Hive, RDBMSs, HBase, so you can get all the data in one query. That is one of the promises of big data, joining with a single query or set of queries, multiple data sources at the same time,” he said. “And then Myriad is a manager of managers. While YARN is a single cluster, Myriad allows you to manage the resources of multiple clusters.”
Why Spark is Hot
Commenting on Spark, Casaletto said the reason for its name, and the reason it is so hot, is that it does everything in memory. The first read is from disk, every successive transformation can be done in memory, and after the last transformation you can persist the data to disk if you want to. The more transformations you have in Spark, the faster it is, and the more it shines.
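The pattern Casaletto describes can be sketched in plain Python. This is only an illustration of the read-once, transform-in-memory, write-once idea, not Spark's API. The sample data, the file names, and the functions `read_lines`, `write_lines`, and `run_pipeline` are all hypothetical.

```python
import os
import tempfile


def read_lines(path):
    """The single read from disk: load every line into memory."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_lines(path, lines):
    """The optional final step: persist the result back to disk."""
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(line + "\n" for line in lines)


def run_pipeline(in_path, out_path=None):
    """Read once, chain transformations in memory, write at most once."""
    data = read_lines(in_path)                  # disk -> memory, once
    data = [s.strip().lower() for s in data]    # transformation 1, in memory
    data = [s for s in data if s]               # transformation 2, in memory
    data = sorted(set(data))                    # transformation 3, in memory
    if out_path is not None:
        write_lines(out_path, data)             # memory -> disk, only at the end
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "input.txt")
        dst = os.path.join(tmp, "output.txt")
        write_lines(src, ["Hive", "  hbase ", "", "Drill", "hive"])
        print(run_pipeline(src, dst))
```

In this sketch no intermediate result touches the disk between the first read and the final write, so each added transformation costs only in-memory work.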
The problem the ecosystem addresses is how usable and how manageable Hadoop is, and its tools aim to make Hadoop easier to use. “The MapReduce framework is kind of tricky to use,” he said. “It is not immediately intuitive, and it requires programming. Analysts come from the SQL world largely and that is why there is a movement to make SQL available on any different format so that more people can analyze this data.”
Casaletto also went through a number of the trends in the Hadoop ecosystem and use cases beyond traditional BI and data warehousing. “The use cases are growing,” he said. Overall, Hadoop adoption is growing as well, he said, as organizations move from thinking about it and kicking the tires to actually deploying it in production.
Playing the Dating Game with Hadoop – Choosing Hadoop Vendor Number 1, 2, or 3
Teplow followed with a comparison of the three major Hadoop distributions in terms of the history and benefits of each, including what they provide, their management tools, and the performance benchmarks they have achieved.
In his presentation, titled “Whose to Choose,” Teplow looked at the distributions from Hortonworks, Cloudera, and MapR. “There are four significant differences in my view between the three,” he said. They come from different companies, they offer different management and administration tools, each promotes a different SQL-on-Hadoop offering as its primary approach to running SQL on Hadoop, and they perform differently in some benchmarks.
Both Casaletto’s and Teplow’s slide decks from their Data Summit presentations are available at www.dbta.com/DataSummit/2015/presentations.aspx.