Data Integration Evolves to Support a Bigger Analytic Vision

<< back Page 2 of 4 next >>

This has everything to do with what might be called a “much bigger” analytic vision. Inspired by the promise of exploiting data mining, predictive analytics, machine learning, or other types of advanced analytics on a massive scale, the focus of DI is shifting from that of a static, deterministic discipline—in which a kind of two-dimensional “world” is represented in a finite number of well-defined dimensions—to a polygonal or probabilistic discipline with a much greater number of dimensions. The static stuff will still matter and will continue to power the great bulk of day-to-day decision making, but this will in turn be enriched, episodically, with different “types” of data. The challenge for DI is to accommodate and promote this enrichment, even as budgets hold steady (or are adjusted only marginally) and resources remain constrained.

Automatic for the People

What does this mean for data integration? For one thing, the day-to-day “work” of traditional DI will, over time, be simplified, if not actually automated. This “work” includes activities such as 1) the exploration, identification, and mapping of sources; 2) the creation and maintenance of metadata and documentation; 3) the automation or acceleration, insofar as feasible, of testing and quality assurance; and, crucially, 4) the deployment of new OLTP systems and data warehouses, as well as of BI and analytic applications or artifacts. These activities can and will be accelerated; in some cases (as with the generation and maintenance of metadata or documentation) they will, for practical, day-to-day purposes, be more or less completely automated.


This is in part a function of the maturity of the available tooling. Most DI and RDBMS vendors ship platform-specific automation features (pre-fab source connectivity and transformation wizards; data model design, generation, and conversion tools; SQL, script, and even procedural code generators; scheduling facilities; in some cases even automated dev-testing routines) with their respective tools. Similarly, a passel of smaller, self-styled “data warehouse automation” vendors market platform-independent tools that purport to automate most of the same kinds of activities, and that are also optimized for multiple target platforms. On top of this, data virtualization (DV) and on-premises-to-cloud integration specialists can bring intriguing technologies to bear, too. Most DI vendors offer DV (or data federation) capabilities of some kind; others market DV-only products. None of these tools is in any sense a silver bullet: custom-fitting and design of some kind are still required and, frankly, always will be. The catch, of course, is that even though such tools can help to accelerate key aspects of the day-to-day work of building, managing, optimizing, maintaining, or upgrading OLTP and BI/decision support systems, they can’t and won’t replace human creativity and ingenuity. The important thing is that they give us the capacity to substantively accelerate much of the heavy lifting of data integration.

Big Data Integration: Still a Relatively New Frontier

This just isn’t the case in the big data world. As Douglas Adams, author of The Hitchhiker’s Guide to the Galaxy, might put it, traditional data integration tools or services are mature and robust in exactly the way that big data DI tools—aren’t.

At this point, guided and/or self-service features (to say nothing of management-automation amenities) are still mostly missing from the big data offerings. As a result, organizations will need more developers and more technologists to do more hands-on stuff when they’re doing data integration in conjunction with big data platforms.

Industry luminary Richard Winter tackled this issue in a report entitled “The Real Cost of Big Data,” which highlights the cost disparity between using Hadoop as a landing area and/or persistent store for data versus using it as a platform for business intelligence (BI) and decision support workloads. As a platform for data ingestion, persistence, and preparation, the research suggests, Hadoop is orders of magnitude cheaper than a conventional OLTP or DW system. Conversely, using Hadoop as a primary platform for BI and analytic workloads is orders of magnitude more expensive than using a conventional system.
