Big Data Is Transforming the Practice of Data Integration


Enter the category of the “NoSQL” data store, which includes a raft of open source software (OSS) projects, such as the Apache Cassandra distributed database, MongoDB, CouchDB, and—last but not least—the Hadoop stack. Increasingly, Hadoop and its Hadoop Distributed File System (HDFS) are being touted as an all-purpose “landing zone” or staging area for multistructured information.

ETL Processing and Hadoop

Hadoop is a schema-optional platform; it can function as a virtual warehouse—i.e., as a general-purpose storage area—for information of any kind. In this respect, Hadoop can be used to land, to stage, to prepare, and—in many cases—to permanently store data. This approach makes sense because Hadoop comes with its own baked-in data processing engine: MapReduce.
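
The "schema-optional" landing-zone idea can be sketched in a few lines of Python. This is a toy illustration, not Hadoop code: a plain dict stands in for an HDFS staging directory, and the file names and formats are hypothetical. The point is that data of any shape is stored as-is, and a schema is applied only when the data is read.

```python
import csv
import io
import json

# A landing zone is schema-optional: files of any shape are stored as-is,
# and a schema is applied only at read time ("schema on read").
# A plain dict stands in here for an HDFS staging directory.
landing_zone = {
    "clicks.json": '{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/buy"}',
    "orders.csv": "order_id,amount\n100,19.99\n101,5.00",
}

def read_with_schema(name, raw):
    """Interpret a raw file at read time, based on its format."""
    if name.endswith(".json"):
        return [json.loads(line) for line in raw.splitlines()]
    if name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw)))
    return raw  # unknown formats stay opaque until someone writes a reader

records = {name: read_with_schema(name, raw) for name, raw in landing_zone.items()}
print(records["orders.csv"][0]["amount"])  # the CSV schema was applied only on read
```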

For this reason, many data integration vendors now market ETL products for Hadoop. Some use MapReduce itself to perform ETL operations; others substitute their own, ETL-optimized libraries for the MapReduce engine. Traditionally, programming for MapReduce has been a nontrivial task: MapReduce jobs can be coded in Java, Pig Latin (the high-level language used by Pig, a platform designed to abstract the complexity of the MapReduce engine), Perl, Python, and (using open source libraries) C, C++, Ruby, and other languages. Moreover, using MapReduce as an ETL technology presupposes a detailed knowledge of data management structures and concepts. Consequently, ETL tools that support Hadoop usually generate MapReduce jobs in the form of Java code, which can be fed into Hadoop. In this scheme, users design Hadoop MapReduce jobs just as they’d design any other ETL job or workflow—in a GUI-based design studio.
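
To make the ETL-as-MapReduce idea concrete, here is a minimal hand-coded sketch in the style of Hadoop Streaming, where the mapper and reducer are ordinary functions over lines of text. The input format (comma-separated "timestamp,country,amount" sales records) is invented for illustration; a real job would run these phases across a cluster rather than in one process.

```python
# A minimal ETL job in the MapReduce style: the mapper extracts and
# cleanses records, the reducer aggregates them. The input format
# ("timestamp,country,amount") is hypothetical.

def mapper(lines):
    """Extract and transform: parse each record, emit (country, amount)."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # cleansing step: drop malformed records
        _, country, amount = fields
        yield country, float(amount)

def reducer(pairs):
    """Aggregate: sum amounts per country."""
    totals = {}
    for country, amount in pairs:
        totals[country] = totals.get(country, 0.0) + amount
    return totals

raw = ["2014-02-01,US,10.50", "2014-02-01,DE,7.25", "bad record", "2014-02-02,US,4.50"]
print(reducer(mapper(raw)))  # {'US': 15.0, 'DE': 7.25}
```

A GUI-based ETL tool would generate the equivalent of these two functions (typically as Java) from a visual workflow, rather than requiring users to write them by hand.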

The benefits of doing ETL processing in Hadoop are manifold. For starters, Hadoop is a massively parallel processing (MPP) environment: an ETL workload scheduled as a MapReduce job can be efficiently distributed—i.e., parallelized—across a Hadoop cluster. This makes MapReduce ideal for crunching massive datasets, and while the datasets used in many decision support workloads aren’t all that big, those used in advanced analytic workloads typically are. From a data integration perspective, they’re also considerably more complicated, inasmuch as they involve a mix of analytic methods and traditional data preparation techniques.
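
The parallelization benefit can be sketched as follows. In this toy Python example, threads stand in for cluster nodes and small lists stand in for HDFS blocks; each partition is transformed independently, and the partial results are then combined. The specific transform (dropping negatives and converting cents to dollars) is an invented placeholder for a real ETL step.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of how an ETL workload parallelizes: the input is split into
# partitions, each partition is transformed independently (the "map"
# phase), and partial results are combined. On a real cluster, each
# partition would be an HDFS block processed on a different node;
# here, threads stand in for nodes.

def transform_partition(partition):
    """Per-partition ETL step: drop negatives, convert cents to dollars."""
    return [cents / 100.0 for cents in partition if cents >= 0]

partitions = [[1999, -5, 500], [250, 1099], [-1, 4200]]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform_partition, partitions))

flattened = [x for part in results for x in part]
print(flattened)  # [19.99, 5.0, 2.5, 10.99, 42.0]
```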

Let’s consider the steps involved in an “analysis” of several hundred terabytes of image or audio files sitting in HDFS. Before this data can be analyzed, it must be profiled; this means using MapReduce (or custom-coded analytic libraries) to run a series of statistical and numerical analyses, the results of which will contain information about the working dataset. From there, a series of traditional ETL operations—performed via MapReduce—can be used to prepare the data for additional analysis.
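
The profiling pass described above amounts to running summary statistics over the raw data before any transformation. Here is a minimal sketch in Python; the profiled attribute (the byte size of each record) is an assumption chosen for simplicity, and in practice this pass would itself run as a MapReduce job over the files in HDFS.

```python
import math

# A profiling pass of the kind described above: before any ETL, compute
# descriptive statistics to learn the shape of the raw data. The profiled
# attribute (per-record byte size) is hypothetical.

def profile(values):
    """Compute basic descriptive statistics in one pass over the data."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "stddev": math.sqrt(variance),
    }

record_sizes = [512, 2048, 1024, 4096, 1024]
stats = profile(record_sizes)
print(stats["count"], stats["min"], stats["max"])
```

The results of a pass like this tell the downstream ETL steps what they are working with: how many records there are, how large they run, and how much they vary.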

There’s still another benefit to doing ETL processing in Hadoop: The information is already there. Hadoop also has an adequate—though by no means spectacular—data management toolset. For example, Hive, an interpreter that compiles its own language (HiveQL) into Hadoop MapReduce jobs, exposes a SQL-like query facility; HBase is a column-oriented data store for Hadoop that supports high user concurrency levels as well as basic insert and update operations. Finally, HCatalog is a primitive metadata catalog for Hadoop.
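
Conceptually, what Hive does is translate a declarative query into map and reduce steps rather than hand it to a database engine. The toy Python sketch below illustrates that idea for one hard-coded query shape (project one column, filter on another); all names are invented, and real HiveQL planning is far more involved than this.

```python
# Toy illustration of the Hive idea: a SQL-like query is "compiled" into
# map/reduce steps rather than executed by a database engine. This handles
# only one query shape (SELECT <col> FROM t WHERE <col2> = <value>) and is
# purely illustrative of the concept, not of Hive's actual planner.

def compile_query(select_col, where_col, where_val):
    """Return (map_fn, reduce_fn) implementing the query over row dicts."""
    def map_fn(rows):
        # Map phase: apply the WHERE predicate and project the SELECT column.
        for row in rows:
            if row.get(where_col) == where_val:
                yield row[select_col]

    def reduce_fn(values):
        # Reduce phase: collect the mapper's output into a result set.
        return list(values)

    return map_fn, reduce_fn

table = [
    {"user": "u1", "country": "US"},
    {"user": "u2", "country": "DE"},
    {"user": "u3", "country": "US"},
]

map_fn, reduce_fn = compile_query("user", "country", "US")
print(reduce_fn(map_fn(table)))  # ['u1', 'u3']
```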

Data Integration Use Cases

Right now, most data integration use cases involve getting information out of Hadoop. This is chiefly because Hadoop’s data management feature set is primitive compared to those of more established platforms. Hadoop, for example, isn’t ACID-compliant. In the advanced analytic example cited above, a SQL platform—not Hadoop—would be the most likely destination for the resultant dataset. Almost all database vendors and a growing number of analytic applications boast connectivity of some kind into Hadoop. Others promote the use of Hadoop as a kind of queryable archive. This use case could involve using Hadoop to persist historical data—e.g., “cold” or infrequently accessed data that (by virtue of its sheer volume) could impact the performance or cost of a data warehouse.
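
The queryable-archive pattern comes down to a tiering decision: records that are still "hot" stay in the warehouse, while cold records move to cheaper Hadoop storage but remain queryable. The Python sketch below illustrates that split; the 90-day cutoff, the tier names, and the record layout are all assumptions made for the example.

```python
from datetime import date, timedelta

# Sketch of the "queryable archive" pattern: data not accessed within a
# cutoff window moves out of the (expensive) warehouse tier into a
# (cheap) Hadoop tier, but stays queryable. The 90-day cutoff and the
# record layout are assumptions for illustration.

CUTOFF = timedelta(days=90)
today = date(2014, 2, 26)

records = [
    {"id": 1, "last_access": date(2014, 2, 20)},
    {"id": 2, "last_access": date(2013, 6, 1)},
    {"id": 3, "last_access": date(2014, 1, 15)},
]

warehouse, archive = [], []
for rec in records:
    age = today - rec["last_access"]
    (warehouse if age <= CUTOFF else archive).append(rec)

print([r["id"] for r in warehouse])  # [1, 3] stay "hot"
print([r["id"] for r in archive])    # [2] moves to cold storage
```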



Posted February 26, 2014