Data Integration Evolves to Support a Bigger Analytic Vision


Two years ago, for example, the most efficient ways to get data out of Hadoop included:

  1. Writing MapReduce jobs in Java to translate the simple dependency, linear chain, or directed acyclic graph (DAG) operations involved in data engineering into map and reduce operations;
  2. Writing jobs in Pig Latin for Hadoop’s Pig framework to achieve basically the same thing;
  3. Writing SQL-like queries in Hive Query Language (HiveQL) to achieve basically the same thing; or
  4. Exploiting bleeding-edge technologies (such as Cascading, an API layered on top of Hadoop that’s intended to make Hadoop easier to program and manage) to achieve basically the same thing.
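To see why option 1 was considered inefficient, it helps to look at the shape of the programming model itself. The sketch below is plain Python, not Hadoop: it only illustrates how even a trivial aggregation must be decomposed into a map phase, a shuffle/sort, and a reduce phase. The record layout and field names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # "Map": emit a (key, 1) pair for every input record.
    for record in records:
        yield (record["dept"], 1)

def reduce_phase(pairs):
    # Hadoop's shuffle/sort step, approximated by an in-memory sort.
    pairs = sorted(pairs, key=itemgetter(0))
    # "Reduce": group by key and sum the emitted counts.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

records = [{"dept": "sales"}, {"dept": "hr"}, {"dept": "sales"}]
print(dict(reduce_phase(map_phase(records))))  # {'hr': 1, 'sales': 2}
```

A multi-step data engineering flow would require chaining several such jobs, materializing intermediate results between each pair, which is exactly the overhead the alternatives in the list try to hide.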

Today, there’s no shortage of mechanisms for getting data out of Hadoop. Take Hive, an interpreter that compiles HiveQL queries into MapReduce jobs. As of Hadoop 2.x, Hive can run on either Hadoop’s MapReduce engine or the new Apache Tez framework. Tez is just one of several designs that exploit Hadoop’s new resource manager, YARN, which makes it easier to manage and allocate resources for multiple compute engines beyond MapReduce. Because Tez is optimized for the DAG-structured operations characteristic of data transformation workloads, it brings features such as pipelining and interactivity to ETL-on-Hadoop.

There’s also Apache Spark, a cluster computing framework that can run in the context of Hadoop. It’s touted as a high-performance complement and/or alternative to Hadoop’s built-in MapReduce compute engine; as of version 1.0.0, Spark is paired with “Spark SQL,” a new, comparatively immature SQL interpreter. (Spark SQL replaces a predecessor project, dubbed “Shark,” which was conceived as a Hive-oriented SQL interpreter.) Over the last year especially, Spark has become one of the most hyped Hadoop-oriented technologies; many DI and analytic vendors now support it to one degree or another in their products. Generally speaking, most vendors now offer SQL-on-Hadoop options of one kind or another, and some also offer native (optimized) ETL-on-Hadoop offerings.
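The advantage of a DAG-aware engine such as Tez is easiest to see in miniature. The sketch below uses Python’s standard-library `graphlib` to model a hypothetical five-step ETL flow as a DAG and derive a valid execution order in one pass; the step names are invented for illustration, and this is not the Tez API itself.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL steps; each node lists the steps it depends on.
steps = {
    "extract": [],
    "clean": ["extract"],
    "enrich": ["extract"],
    "join": ["clean", "enrich"],
    "load": ["join"],
}

# A DAG engine can schedule the whole flow at once -- no forced
# materialization to disk between stages, which is the overhead
# incurred when the same flow is expressed as chained MapReduce jobs.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

Note that "clean" and "enrich" have no mutual dependency, so a DAG scheduler is free to run them concurrently, another optimization a linear chain of MapReduce jobs cannot express.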

What’s Ahead in Data Integration for Big Data

Cloud is a critical context for data integration. One reason is that most providers offer export facilities or publish APIs that facilitate access to cloud data. Another reason, as I wrote last year, is that doing DI in the cloud doesn’t invalidate existing best practices, completely or even in large part: if you want to run advanced analytics on SaaS data, you’ve either got to load it into an existing, on-premises repository or expose it to a cloud analytics provider. What you do in the former scenario winds up looking a lot like traditional DI. The good news is that you can do a lot more with traditional DI tools and platforms than used to be the case. Most data integration offerings can parse, shred, and transform the JSON and XML used for data exchange; some can do the same with formats such as RDF, YAML, or Atom. Several prominent database providers support JSON in the database itself, whether by parsing and shredding documents into name-value pairs or by landing and storing them intact as variable-character text; others offer some other kind of in-database storage (and querying) of JSON data. Data virtualization (DV) vendors are typically no less accomplished than the big DI platforms in their capacity to accommodate a wide variety of data exchange formats, from JSON and XML to flat files.
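"Shredding" a JSON document into name-value pairs is simpler than the terminology suggests. The following is a minimal sketch in plain Python, standing in for what a DI tool or in-database function does; the sample document and the dotted-path naming convention are assumptions for illustration.

```python
import json

def shred(obj, prefix=""):
    # Flatten nested JSON into dotted name/value pairs -- roughly the
    # relational-friendly form that "shredding" a document produces.
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from shred(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from shred(value, f"{prefix}{i}.")
    else:
        yield (prefix.rstrip("."), obj)

doc = json.loads('{"order": {"id": 7, "items": [{"sku": "a1"}]}}')
print(dict(shred(doc)))  # {'order.id': 7, 'order.items.0.sku': 'a1'}
```

The alternative approach mentioned above, landing the document intact as variable-character text, simply stores the raw JSON string and defers any such flattening until query time.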

Any account of data integration and big data is bound to be insufficient simply because there is so much happening. As noted, the Hadoop platform is by no means the only, nor, for that matter, the most exciting, game in town. Apache Spark, which runs in the context of Hadoop and can both persist data (to HDFS, the Hadoop Distributed File System) and run in memory (using Tachyon), last year emerged as a bona fide big data superstar. Spark is touted as a compelling platform for both analytics and data integration, and several DI vendors already claim to support it to some extent. Spark, like almost everything else in the space, will bear watching. And so it goes.

