
The New, Newly Democratized Data Integration


This makes sense. In spite of the best efforts of RDBMS vendors, the relational data warehouse isn’t an ideal context in which to ingest and persist nonrelational data as nonrelational data: RDBMSs have problems with streaming data, with web data, with text, and with files of any kind. (An RDBMS can efficiently ingest JSON files, as well as the log files generated by sensors or machines, when they consist mostly of name-value pairs, e.g., by shredding them into relational rows and columns.) But—and this is the material point—Hadoop, MongoDB, and other NoSQL platforms can cost-effectively ingest content of any type, including relational data.

Moreover, NoSQL platforms such as Hadoop combine a scalable distributed storage layer with a scalable, general-purpose parallel processing engine and an increasingly versatile set of resource management capabilities. The upshot is that Hadoop, Cassandra, and some other NoSQL platforms—used by themselves or combined with the Apache Spark cluster computing framework—can inexpensively store and process relational and nonrelational data. Increasingly, NoSQL platforms are being used in production as repositories in which to ingest, store, and process data of all kinds.
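
A minimal sketch may help make the point concrete. The PySpark snippet below has a single engine read a relational extract (exported to the cluster as Parquet) alongside raw, nonrelational JSON logs and join the two in one query. The paths, column names, and join key are hypothetical; details will vary by distribution and workload.

```python
# Hypothetical sketch: one cluster engine (Spark) processing relational and
# nonrelational data side by side. Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mixed-data-demo").getOrCreate()

# A relational extract, landed in the cluster's file system as Parquet
orders = spark.read.parquet("hdfs:///warehouse/orders")

# Nonrelational data: raw JSON clickstream logs, schema inferred on read
clicks = spark.read.json("hdfs:///raw/clickstream/*.json")

# Join and aggregate the two in a single query, with no separate DI hop
joined = orders.join(clicks, on="customer_id", how="inner")
joined.groupBy("product_category").count().show()

spark.stop()
```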

Thus the so-called “data lake” comprises both a repository for raw source data and a multi-tenant context in which to process and analyze data of any provenance and derivation, not just raw data. The data lake is one of the structures by which the New DI aims to simplify and optimize data movement: To this end, it functions as a raw upstream source for a variety of downstream analytic practices, from conventional data warehouse and data mart systems to machine learning practices to analysts and data scientists working in or with analytic sandboxes.

By design, the data lake consolidates the bulk of data integration processing into a single context. At ingest, for example, DI in the context of the data lake might take the form of basic consistency checks and/or light transformations on raw data. Similarly, in preparing data for consumption by a downstream data warehouse, a data lake might perform most of the heavy-duty transformations associated with the “T” portion of the traditional ETL process—although the refined dataset will likely undergo additional ETL passes (either in the data lake itself or, alternatively, via downstream ETL/ELT or data quality tools) as it is cleansed and standardized prior to being loaded into the warehouse. What’s more, the data lake functions as both a site in which to prepare data for data mining, statistical analysis, machine learning, or other data scientific workloads, and—depending on the skills and preferences of data scientists—a context in which to host these workloads.
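
As an illustration of what that “light” DI at ingest might look like, here is a hedged PySpark sketch that applies basic consistency checks and light transformations to hypothetical raw sensor files before landing them in a refined zone of the lake; the heavier, warehouse-bound transformations would follow downstream. All paths, column names, and rules are illustrative.

```python
# Illustrative ingest-time DI in a data lake: de-duplicate, enforce types,
# drop obviously bad records, and tag each row with ingest metadata.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

raw = spark.read.option("header", True).csv("hdfs:///landing/sensors/")

cleaned = (
    raw
    .dropDuplicates(["sensor_id", "event_time"])             # basic consistency check
    .filter(F.col("reading").isNotNull())                    # drop empty readings
    .withColumn("reading", F.col("reading").cast("double"))  # enforce a numeric type
    .withColumn("ingest_date", F.current_date())             # lightweight lineage metadata
)

# Land the lightly refined data; the heavier "T" work happens downstream
cleaned.write.mode("append").parquet("hdfs:///refined/sensors/")
spark.stop()
```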

The data lake is also cloud-ready: NoSQL systems can ingest JSON, XML, YAML, and other data interchange standards as objects and store them in a file system layer. (In Hadoop, this is nominally the Hadoop Distributed File System, or HDFS—although not all Hadoop distributions use HDFS.) This is an ideal means by which to expose data for read and write access by non-BI applications, too: Rank-and-file developers generally disdain working with relational database interfaces (such as ODBC or JDBC), which entails the use of a technique called object-relational mapping, or ORM. (ORM describes the process of recomposing data from the rows and columns of database tables—or of shredding data into relational columns and/or rows.) Think of the data lake as a sop to rank-and-file coders, too.
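
The contrast can be illustrated with a small, purely hypothetical Python example: a stored JSON document maps directly onto a native application object, whereas the same data shredded into relational rows has to be recomposed by the application, which is the busywork an ORM automates.

```python
# Illustrative-only contrast of document-style vs. relational-style access.
# The "order" document and the table rows below are hypothetical.
import json

# Document-style access: the stored JSON object *is* the application object
doc = json.loads('{"order_id": 42, "items": [{"sku": "A1", "qty": 2}]}')
print(doc["items"][0]["sku"])  # no mapping layer needed

# Relational-style access: the same order shredded into rows and columns,
# which the application (or an ORM on its behalf) must recompose
order_rows = [(42,)]            # orders(order_id)
item_rows = [(42, "A1", 2)]     # order_items(order_id, sku, qty)

recomposed = {
    "order_id": order_rows[0][0],
    "items": [{"sku": sku, "qty": qty}
              for oid, sku, qty in item_rows if oid == order_rows[0][0]],
}
assert recomposed == doc
```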

Another key way in which the New DI departs from its predecessor is with respect to the priority it accords to self-service. Today, there’s no shortage of self-service DI tools, most of which marry data visualization capabilities—useful for data profiling and discovery—with a workbench for designing data integration data flows and workflows. In most cases, the people who use and interact with these tools aren’t designing data flows for data warehouse systems or for other traditional consumers; instead, they’re practicing self-service data integration to prepare data for consumption by visual data discovery tools, or for use in data-scientific analyses. A resource such as the data lake can likewise be used to feed downstream data discovery and data science practices. (Such practices typically have less stringent data consistency and cleanliness requirements. In this case, the data lake is a superior—and more easily accessible—alternative to extracting data from the warehouse.) Indeed, by serving both traditional and nontraditional information constituencies, a data lake-like resource promotes democratization.

The data lake, in combination with any of several closed or open source offerings, is also an efficient context in which to ingest streaming data (e.g., via Apache Flume, Apache Kafka, or similar data collection and message brokering systems), to process and analyze that data (via Apache Storm, the Spark Streaming library, Apache Flink, and other open source offerings), and to persist it. (Commercial Hadoop vendors tend to promote Apache HBase for just this use case.)
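
As one hedged sketch of such a pipeline, the snippet below uses Spark Structured Streaming to ingest events from a Kafka topic and persist them to the lake; Storm, Flink, or classic Spark Streaming could fill the same role. The broker address, topic name, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Hypothetical streaming ingest: Kafka -> Spark Structured Streaming -> data lake.
# Requires the spark-sql-kafka connector; broker, topic, and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "sensor-events")
    .load()
    .select(F.col("value").cast("string").alias("payload"),
            F.col("timestamp"))
)

# Continuously persist the raw stream to the lake's landing zone
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///landing/sensor-events/")
    .option("checkpointLocation", "hdfs:///checkpoints/sensor-events/")
    .start()
)
query.awaitTermination()
```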




Just to be clear: This is not to conflate the data lake with the New DI, or to argue that something such as a data lake is an essential tool for doing big data integration. It’s rather to advance the data lake as just one example of how data integration at big-data scale departs from the status quo ante.

The salient point is that NoSQL platforms such as Hadoop, Cassandra, MongoDB, and CouchDB are being used in production deployments as all-in-one repositories to persist data of all kinds. These may or may not take the form of formalized data lakes. In some cases, for example, NoSQL systems are being used as the all-purpose equivalents of the venerable operational data store (ODS). In other cases, they’re being used to extend the data warehouse-driven status quo ante; and in still others, they’re supplanting this status quo. This last has a lot to do with their ever-improving ANSI SQL stories: from ongoing efforts to improve Hive (a SQL-on-Hadoop interpreter that compiles SQL-like queries into MapReduce, Tez, and—inevitably—Spark jobs), to the maturation of Impala (an interactive SQL-like query engine for Hadoop), to the emergence of next-generation SQL query interfaces such as Presto and Spark SQL.
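
By way of example, the sketch below issues a familiar, ANSI-style SQL query through Spark SQL against files sitting in the lake; Hive, Impala, or Presto could serve an equivalent query through their own interfaces. The table, column, and path names are hypothetical.

```python
# Illustrative SQL-on-NoSQL query: ANSI-style SQL over Parquet files in the
# lake via Spark SQL. Paths, table, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-lake").getOrCreate()

spark.read.parquet("hdfs:///refined/sensors/").createOrReplaceTempView("sensors")

result = spark.sql("""
    SELECT sensor_id,
           AVG(reading) AS avg_reading,
           COUNT(*)     AS n_readings
    FROM   sensors
    WHERE  ingest_date >= '2016-01-01'
    GROUP  BY sensor_id
    ORDER  BY avg_reading DESC
""")
result.show(10)
spark.stop()
```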

What’s Ahead in Data Integration 

Tools, structures, and techniques in the New DI toolkit are still coalescing. The data lake, for example, is an intriguing and potentially valuable proposition, but its mainstream success will depend on the continued (rapid) maturation of NoSQL data management tooling—particularly metadata management and data lineage-tracking tools—to say nothing of the development of a substantive automation tool set for Hadoop, which is lacking today. At present, the process of building and managing a data lake—or, for that matter, of doing too many things with Hadoop—is a disproportionately command-line and script-driven affair. As far as Java programmers or sysadmins are concerned, this might be ideal; from the perspective of veteran data management practitioners, however, it’s unacceptable.

Data virtualization (DV) technology—née data federation—for its part, is also poised to play a critical role in knitting together all of the sources in the big data universe. (DV technology has quietly become more common too: Many BI tools now bundle a data virtualization abstraction layer, as do some RDBMS platforms.)

As for cloud, the boundaries between hosted and on-premises applications are, in key ways, becoming blurred: Hadoop, Cassandra, MongoDB, CouchDB, and other (commercial) NoSQL variants can be spun up in on-premises or hosted cloud contexts; most major RDBMS vendors likewise market cloud versions of their database platforms. (Some have also signaled plans to develop hybrid on-premises and cloud offerings that will be able to integrate one context with the other.) Traditional DI vendors, too, market cloud-focused versions of their DI software, and some even position these cloud services as vital to their hybrid on- and off-premises DI strategies. Lastly, specialty vendors market integration-platform-as-a-service (iPaaS) offerings that specialize in integrating cloud and on-premises environments.

This isn’t to say that DI in the cloud is in any sense a done deal; rather, it’s to situate cloud, too, among the diversity of still-coalescing New DI tools, structures, and techniques.

One final note: If any single technology can be said to contextually knit together the New DI landscape, it’s a semantic one: For example, the graph analytics that establish connections between and among different sources and types of data, or the semantic layers—codified explicitly via data virtualization or derived via a combination of natural language processing (NLP) and machine learning technologies—that synthesize and expose information for analysis. In the Old DI, the data warehouse was enshrined as the single version of truth in an enterprise. Its “truth” was inescapably deterministic—sales of this product come to this amount in this region for this channel—and, for this reason, correspondingly narrow. The world that semantics has the potential to reveal to us is that much bigger; its “truth” is at once richer and more probabilistic. And this is as it should be.
