Over the last half decade, we’ve watched SQL purists butt heads with NoSQL upstarts, Hadoop triumphalists clash with Hadoop pessimists, database geeks war with application developers, and so on. In the midst of all this warring, we’ve tried to fit—and, in many cases, to cram—the new into the old, the old into the new, with the result that at one time or another, we’ve asked the impossible of all of the components in our ever-expanding technology portfolios.
It looks as if the data integration (DI) wars might finally be petering out, however. This isn’t to say that rival gangs of SQL and NoSQL champions have put aside their differences, or that the battles they’ve been waging—battles as much ideological as technological—have been resolved, decisively, in favor of any one faction or interest. The warring is diminishing because—at long last—we’ve achieved a hard-won understanding of the lay of the data integration landscape, as well as of the strengths and weaknesses of the technologies that we can and should bring to bear to manage and integrate data.
Meet the New Boss, (Not at All the) Same as the Old Boss
Data integration is never done for its own sake but is always adjunct to some larger purpose.
In a basic sense, we integrate data in order to prepare it for use by applications and services, be they traditional BI front-end tools, self-service visual discovery apps, machine learning (ML) algorithms, analytic sandboxes, or the teeming ecosystem of non-BI applications and services. In another, larger sense, however, we integrate data in order to contextualize information: to establish facts in a certain context and to enrich—or to radically enlarge—that context by incorporating data from connected devices and sensors, social media, open or subscription datasets, and so on. How we integrate data is determined by the characteristics of the sources that produced it as well as by the applications for which this data, once integrated, will be used. For integrating relational data, the technologies, methods, and concepts we’ve traditionally used are, with critical qualifications, still sufficient; for integrating nonrelational data, we’ve learned to bring new tools, techniques, and structures to bear.
The upshot is that we now have a general sense of just how we’re going to reconcile the old with the new.
We can now distinguish between the Old DI, with its data warehouse-centricity and its much narrower set of requirements, and the New DI, which effectively subsumes it. The New DI doesn’t invalidate its predecessor, just as modern physics doesn’t invalidate classical physics; instead, it strictly delimits the range of use cases to which the Old DI applies.
The Old DI was designed and optimized to feed data into the data warehouse. This movement comprised the “E” and “L” phases of the extract, transform, and load (ETL) process that was its linchpin. Because the Old DI was focused on and by the data warehouse, it emphasized the production of clean and consistent data—if not of 100% clean, 100% consistent data, then of something close to that. It wasn’t uncommon for data to go through multiple ETL passes until—cleansed and consistent—it was loaded into the data warehouse. Lastly, the Old DI was optimized for a mostly relational monoculture and predicated on a batch ingest and processing model. It primarily acquired data from OLTP databases, CSV or flat files, spreadsheets, and the like. When it was tasked with integrating nonrelational data, it often did so via out-of-band mechanisms such as script-driven FTP transfers, script-driven transformations, and so on.
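To make the classic pattern concrete, here is a minimal sketch of an Old DI-style batch ETL pass, written in Python. The table, column, and file names are hypothetical, and sqlite3 merely stands in for the OLTP source and the warehouse; the point is that the transform step runs in the integration layer, outside both systems.

```python
import sqlite3

def run_batch_etl(source_path, warehouse_path):
    # sqlite3 stands in for the OLTP source and the data warehouse.
    source = sqlite3.connect(source_path)
    warehouse = sqlite3.connect(warehouse_path)

    # Extract: pull raw rows out of the operational system.
    rows = source.execute(
        "SELECT customer_id, country, revenue FROM orders"
    ).fetchall()

    # Transform: cleanse and conform in the integration layer, outside
    # both the source and the target, the hallmark of classic ETL.
    cleansed = [
        (cid, country.strip().upper(), float(revenue))
        for cid, country, revenue in rows
        if cid is not None and country and revenue is not None
    ]

    # Load: write the cleansed, consistent result into the warehouse.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(customer_id INTEGER, country TEXT, revenue REAL)"
    )
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleansed)
    warehouse.commit()
    source.close()
    warehouse.close()
```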
The New DI is much bigger. Its overriding priority is to minimize data movement, chiefly because it just isn’t economical or practically feasible to move data around at big data scale.
With respect to integrating data for the data warehouse, the New DI recognizes that if it’s possible to perform transformations in a source system, or to push transformations down into a target system, as with a technique such as ELT, it makes sense to do so, especially when the systems themselves can easily perform this processing. In contradistinction to its predecessor, the New DI understands that not all applications or use cases require data that is completely cleansed and consistent. Finally, and most importantly, the New DI isn’t focused on the data warehouse; the warehouse is just one of several data consumers in its universe. Instead of being at the center, the warehouse has been shunted to the periphery, where it coexists with a slew of other information consumers—including streaming analytics and operational intelligence systems; relational and nonrelational analytic sandboxes; machine-learning algorithms; visual self-service discovery practices; and the data lake-like repositories of “raw” data (i.e., data that hasn’t been derived, wrangled, or engineered) that can function as platforms for hosting data-integration and data-analytic workloads.
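By contrast, the pushdown approach the New DI favors, often labeled ELT, loads first and transforms inside the target. The following sketch reuses the hypothetical table names from the example above, with sqlite3 again standing in for the target platform.

```python
import sqlite3

def run_elt(target_path, raw_rows):
    # sqlite3 stands in for the target platform (warehouse, Hadoop, etc.).
    target = sqlite3.connect(target_path)

    # Load first: raw data lands in a staging table untransformed.
    target.execute(
        "CREATE TABLE IF NOT EXISTS stg_orders "
        "(customer_id INTEGER, country TEXT, revenue REAL)"
    )
    target.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

    # Transform last: the cleansing logic is pushed down into the target
    # as SQL, so the data never leaves the platform once it has landed.
    target.execute("DROP TABLE IF EXISTS fact_orders")
    target.execute(
        "CREATE TABLE fact_orders AS "
        "SELECT customer_id, UPPER(TRIM(country)) AS country, revenue "
        "FROM stg_orders WHERE customer_id IS NOT NULL"
    )
    target.commit()
    target.close()
```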
The New DI: Explained
So what does the New DI look like? From a data warehousing perspective, it looks a lot like the Old DI: Now as ever, the warehouse remains the destination for cleansed and consistent data.
But because the warehouse is no longer the organizing focal point of data integration, the monolithic, batch-driven ETL model that was so closely identified with it has been augmented, and in many contexts supplanted, by an assortment of tools, structures, and techniques designed to minimize data movement.
In the New DI, the data flows that feed the data warehouse are still critical and, with certain exceptions (for example, the ability to prepare data in upstream sources such as Hadoop; the use of ELT as an alternative to ETL) remain unchanged; they’re just no longer the orienting focus of data integration. They are, instead, accompanying processes.
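Where data can be prepared in an upstream platform such as Hadoop, the cleansing can happen before anything flows on to the warehouse feed at all. A sketch of that idea using PySpark follows, assuming a Spark environment is available; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("upstream-prep").getOrCreate()

# Raw data already lives in the cluster; it is cleansed and conformed here,
# in the upstream platform, rather than in a separate integration layer.
raw = spark.read.csv("hdfs:///landing/orders/", header=True, inferSchema=True)

prepared = (
    raw.dropna(subset=["customer_id", "revenue"])          # cleanse in place
       .withColumn("country", F.upper(F.trim("country")))  # conform in place
)

# Only the prepared result leaves the cluster for the downstream warehouse feed.
prepared.write.mode("overwrite").parquet("hdfs:///prepared/orders/")
```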