Learning How to Get Started with Big Data Analytics Using the Hadoop Ecosystem

At the center of the big data movement is the Hadoop framework, which provides an efficient distributed file system and a related ecosystem of tools for storing and analyzing very large datasets. The Hadoop ecosystem was addressed from two points of view in a session at Data Summit 2016.

James Casaletto, principal solutions architect, Professional Services at MapR, presented a talk titled “Harnessing the Hadoop Ecosystem,” and Tassos Sarbanes, mathematician/data scientist, Investment Banking at Credit Suisse, covered the advantages of HBase in a talk titled “HBase Data Model – The Ultimate Model on Hadoop.”

According to Casaletto, people who are new to big data often lack a comprehensive view of how end-to-end solutions are actually constructed. As a result, they have an incomplete understanding of how Hadoop may be used to solve their problems. The session addressed that knowledge and experience gap.

According to Casaletto, the easiest way to get started with Hadoop is to download your own sandbox and start playing with it. “In general, log analytics is a sweet spot for Hadoop, and Hadoop is really good at that and it doesn’t require a lot of experience or money,” said Casaletto, who walked attendees through a configuration with free open source tools that users can build on their own.

Casaletto walked through a representative big data use case, analyzing web traffic, and showed how data flows from Apache web servers and HAProxy, through tools such as rsyslog, Flume, Kafka, and Spark Streaming, and ultimately into Kibana, enabling visualization, transformation, and analysis of traffic patterns and important pages on the website.
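To make the kind of work such a pipeline does concrete, here is a minimal sketch (in Python, my illustration rather than anything shown in the talk) of one early step: parsing an Apache combined-format access-log line into structured fields, the sort of record a streaming job might extract before the data reaches a dashboard. The field names are assumptions chosen for readability.

```python
import re

# Regex for Apache's "combined" access-log format (a common default).
# Group names are illustrative, not taken from the session.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Parse one access-log line into a dict, or return None on no match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; normalize it to 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('203.0.113.7 - - [10/May/2016:13:55:36 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326')
rec = parse_log_line(sample)
# rec["path"] == "/index.html", rec["status"] == 200
```

In a real deployment this logic would live inside the stream processor (e.g., a Spark Streaming job consuming from Kafka), with the parsed records flowing onward to the analysis layer.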

“It is a lot like plumbing,” said Casaletto. “A lot of data pipelines and architecting these things is just plumbing. You are getting the data and you are plumbing the data from the source to a sink where you can finally let users do their analysis.”

Commenting on another aspect of the Apache Hadoop ecosystem, Sarbanes covered the advantages of HBase and the role it plays as a database for Hadoop HDFS.

According to Sarbanes, HDFS has significant limitations, and HBase, Hadoop's NoSQL database, helps overcome them. Modeled after Google's BigTable, HBase is a column-oriented database management system that runs on top of HDFS and is well suited to hosting very large tables of semi-structured data. There are two things in particular that HDFS cannot do, said Sarbanes: it cannot perform random reads and writes quickly, and it cannot change a file without completely rewriting it. HDFS is not a database; HBase fills that gap, he explained.
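To illustrate the contrast, here is a toy in-memory sketch (pure Python, my illustration, not HBase's actual API) of the BigTable-style data model HBase uses: each cell is addressed by row key, column family, and column qualifier, and versioned by timestamp. Addressing individual cells this way is what enables the fast random reads and in-place updates that a plain HDFS file cannot offer.

```python
import time
from collections import defaultdict

class ToyColumnStore:
    """A tiny in-memory model of HBase's BigTable-style layout.

    Cells are addressed by (row key, column family, qualifier) and
    versioned by timestamp. Illustrative only; not the HBase client API.
    """

    def __init__(self):
        # row -> family -> qualifier -> list of (timestamp, value)
        self.rows = defaultdict(
            lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row, family, qualifier, value, ts=None):
        """Random write: append a new version; nothing else is rewritten."""
        ts = ts if ts is not None else time.time()
        self.rows[row][family][qualifier].append((ts, value))

    def get(self, row, family, qualifier):
        """Random read: return the newest version of a single cell."""
        versions = self.rows[row][family][qualifier]
        return max(versions)[1] if versions else None

store = ToyColumnStore()
store.put("user1", "info", "email", "old@example.com", ts=1)
store.put("user1", "info", "email", "new@example.com", ts=2)  # update in place
store.get("user1", "info", "email")  # -> "new@example.com"
```

The real HBase persists these versioned cells to files on HDFS and compacts them in the background, which is how it layers a mutable, randomly accessible table on top of an append-oriented file system.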

Data Summit is an annual 2-day conference, preceded by a day of workshops. Data Summit offers a comprehensive educational experience designed to guide you through all of the key issues in data management and analysis today. The event brings together IT managers, data architects, application developers, data analysts, project managers, and business managers for an intense immersion into the key technologies and strategies for becoming a data-informed business.

Many presentations from Data Summit 2016 have been made available for review at