Cloud and Hadoop - Keys to a Perfect Marriage Explored in New DBTA Webcast On-Demand

DBTA recently presented the third in a series of educational webcasts focused on managing and leveraging big data. The webcast, "Using SQL to Explore Any Data on Hadoop in the Cloud," showed how Amazon Elastic MapReduce, a hosted Hadoop web service, combined with Karmasphere Analyst, provides a rapid onramp to big data using SQL.

Presented by Adam Gray, product manager, Amazon Elastic MapReduce, and Martin Hall, co-founder, president and CEO, Karmasphere, the webcast was hosted by Tom Wilson, president, DBTA and Unisphere Research.

Setting the context for Gray's and Hall's presentations, Wilson observed that, according to the findings of a recent study of data growth conducted by Unisphere Research among members of the Independent Oracle Users Group (IOUG), data is growing at 9 out of 10 respondents' organizations, and at rates of more than 50% a year at 16% of the respondents' organizations. But it is not simply the growth of data that is the problem; it is the sheer size of the resident data that is thwarting organizations' ability not only to manage, but to extract business value from, these vast and potentially rich repositories of information, Wilson said. For example, nearly two-thirds of respondents to the IOUG survey reported nearly 5 terabytes of data online (disk resident), and 20% reported over 100 terabytes of online (disk resident) data. "In these new big environments, users and advisers have begun to search for new ways of integrating this information to provide a single view of the business and deliver actionable information for business decision makers," Wilson noted.

Gray agreed, pointing out that, in addition to increasing data volumes, the multiple sources and formats, varying levels of structure, and timeliness requirements add to the challenges of managing and leveraging data. Organizations, he noted, want to take clickstream data and log data, join that with conventional customer data, and get insights from it in a timely and responsive manner. "The problem is that traditional systems don't scale and they really weren't built or architected to scale," he said. In addition, it can take a long time to provision more infrastructure, and specialized database expertise is required.

The presentations highlighted the benefits of Apache Hadoop, a purpose-built tool that helps organizations deal with the challenges of big data; Amazon Elastic MapReduce, a hosted service that allows users to easily leverage the power of Hadoop; Apache Hive, a SQL engine built on top of Hadoop that allows users to run SQL queries directly; and the suite of tools that Karmasphere provides to make it easier to pull all those pieces together.

According to Gray, dealing with big data requires two things - distributed, scalable storage, along with inexpensive, flexible analytics -  and Apache Hadoop, an open source software platform, addresses both of those needs.

Apache Hadoop includes a fault-tolerant, distributed storage system (HDFS) developed for commodity servers, and uses a technique called MapReduce to carry out analysis over huge distributed data sets. The key benefits are that it is affordable, with a cost per terabyte that is a fraction of traditional options, and proven at scale, with numerous petabyte implementations in production that have seen linear scalability, said Gray. It is also flexible, allowing data to be stored with or without a schema, and schema can be added after the fact. Hadoop use cases include clickstream data analysis for targeted advertising; data warehousing; bio-informatics; financial simulations; file processing tasks such as image resizing; and data mining and BI, said Gray. "Anytime you have a large data set where you need to match or find similarities across factors in that data, Hadoop and MapReduce are going to be strong candidates as a solution."
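The MapReduce technique Gray describes can be sketched in miniature with plain Python. This is only an illustration of the programming model, not Hadoop itself: on a real cluster the map, shuffle, and reduce phases run in parallel across many machines, with HDFS providing the underlying storage.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, value) pair for each word in each record.
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this
    # across the cluster between the map and reduce phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}

records = ["big data on Hadoop", "big data in the cloud"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Because each map call looks at one record and each reduce call looks at one key, neither phase depends on where the data physically lives, which is what lets Hadoop scale the same logic from one machine to thousands.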

Hadoop and the cloud really are "a perfect marriage" particularly with Amazon Web Services, for several reasons, said Gray, citing Amazon's Simple Storage Service (S3), which allows users to store as much data as they want at low cost; elastic servers provided by the Elastic Compute Cloud (EC2) so users can add the capacity they need the moment they need it; and a purpose-built managed Hadoop tool called Elastic MapReduce (EMR) which "handles a lot of the challenges around tuning and configuring and monitoring your Hadoop cluster."

Amazon Web Services is an organization Karmasphere has worked with for almost two years as it has developed software specifically focused on giving developers and analysts the power they need to mine and explore big data on Hadoop, said Karmasphere's Hall. Karmasphere works very closely with Amazon Web Services' in-the-cloud Hadoop and storage services.

Hall explained the benefits of the Apache Hive project and how the Karmasphere Analyst product "wraps that up in a very easy to use piece of software that you can install on your desktop and access your big data in Hadoop from that software application."

Hive is an open source project that was built specifically for Hadoop and was started at Facebook, said Hall, explaining that it is a compiler, optimizer, and executor "that allows SQL to be thrown at a Hadoop cluster." According to Hall, "what the Facebook guys started was a project to deliver SQL access to Hadoop that they put under the Apache umbrella and have called Hive. Hive essentially is a project that takes any type and any size of data sitting on Hadoop and turns it into a virtually limitless data warehouse." Perhaps most importantly, it gives SQL-like access not just to structured data but also to unstructured data, he noted.
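The "SQL over unstructured data" idea Hall describes rests on schema-on-read: the raw files sit in Hadoop as-is, and columns are imposed only when a query runs. The Python sketch below illustrates the concept (it is not Hive's implementation, and the log format and column names are invented for illustration); it is the conceptual equivalent of a HiveQL query like SELECT page, COUNT(*) FROM logs GROUP BY page.

```python
from collections import Counter

# Raw, schema-less log lines as they might sit in HDFS.
raw_logs = [
    "2011-06-01 /home user42",
    "2011-06-01 /products user17",
    "2011-06-02 /home user42",
]

def apply_schema(line):
    # Schema-on-read: impose columns at query time, the way Hive
    # maps a table definition onto raw files when a query runs.
    date, page, user = line.split()
    return {"date": date, "page": page, "user": user}

# Conceptual equivalent of: SELECT page, COUNT(*) FROM logs GROUP BY page
hits_per_page = Counter(apply_schema(line)["page"] for line in raw_logs)
print(hits_per_page["/home"])  # 2
```

In Hive the same GROUP BY would be compiled into MapReduce jobs and executed across the cluster, which is why the schema can be added, or changed, after the data has already been loaded.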

"Karmasphere Analyst makes extensive use of the open source Hive project and combines it with two other things. Number one, based on the work that we have done building a Hadoop-specific technology platform at Karmasphere, we provide access to any Hadoop cluster as well as Amazon Elastic MapReduce, and allow users to prototype their SQL; to monitor what is going on when the SQL commands are running on the cluster in a distributed fashion; to profile what is going on with those SQL queries; to visually view query plans; to integrate result sets both within the enterprise data infrastructure and within desktop applications; and to get access to the file system on the Hadoop cluster. And, number two, we combine all of that with a graphical user interface. Our philosophy at Karmasphere is very much to provide software applications for developers and analysts that are as familiar as possible to the tools they have used in the past and allow them to harness skills they have already got."

To listen to the webcast, access the results of quick polls taken among the webcast attendees, and download the slides used in the webcast presentations, go here.