Big Data Is Transforming the Practice of Data Integration


Cloud will change how we consume and interact with—and, for that matter, what we expect of—applications and services. From a data integration perspective, cloud, like big data, entails its own set of technological, methodological, and conceptual challenges. Traditional data integration evolved in a client-server context; it emphasizes direct connectivity between resources—e.g., a requesting client and a providing server. The conceptual model for cloud, on the other hand, is that of representational state transfer, or REST. In place of client-server’s emphasis on direct, stateful connectivity between resources, REST emphasizes abstract, stateless connectivity. It prescribes the use of new and nontraditional APIs or interfaces. Traditional data integration makes use of tools such as ODBC, JDBC, or SQL to query for and return a subset of source data. REST components, on the other hand, structure and transfer information in the form of files—e.g., HTML, XML, or JSON documents—that are representations of a subset of source data. For this reason, data integration in the context of the cloud entails new constraints, makes use of new tools, and will require the development of new practices and techniques.
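To make the contrast concrete, here is a minimal Python sketch; the in-memory table stands in for a real source, and the REST endpoint, its query parameter, and the shape of its JSON response are hypothetical.

```python
import json
import sqlite3
from urllib import request

# Traditional integration: a direct, stateful connection to the source,
# querying for exactly the subset you want (Python's DB-API here; ODBC and
# JDBC tools follow the same connection-and-cursor pattern).
conn = sqlite3.connect(":memory:")  # stands in for a real relational source
conn.execute("CREATE TABLE orders (customer_id TEXT, region TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES ('C1', 'EMEA', 1200.0)")
rows = conn.execute(
    "SELECT customer_id, amount FROM orders WHERE region = ?", ("EMEA",)
).fetchall()
conn.close()

# REST-style integration: a stateless HTTP request that returns a
# representation of the data (a JSON document), not a live cursor over the
# source itself. The endpoint below is hypothetical.
with request.urlopen("https://api.example.com/orders?region=EMEA") as resp:
    orders = json.loads(resp.read())
```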

That said, the shift doesn’t mean throwing out existing best practices: If you want to run sales analytics on data in your Salesforce.com cloud, you either have to load it into an existing on-premises repository or expose it to a cloud analytics provider. In the former case, you’re going to have to extract your data from Salesforce, prepare it, and load it into the analytic repository of your choice, much as you would with data from any other source. The shift to the cloud isn’t going to mean the complete abandonment of on-premises systems; the two will coexist.
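As a rough sketch of that extract-prepare-load step, assuming you already hold an OAuth access token, the flow looks something like the following in Python; the instance URL, API version, access token, and the local SQLite table standing in for the warehouse are all assumptions for illustration.

```python
import json
import sqlite3
from urllib import parse, request

# Extract: pull a subset of Salesforce data over its REST query interface.
# Instance URL, API version, and access token are placeholders.
soql = "SELECT Id, Amount, CloseDate FROM Opportunity WHERE IsClosed = true"
url = ("https://example.my.salesforce.com/services/data/v58.0/query?q="
       + parse.quote(soql))
req = request.Request(url, headers={"Authorization": "Bearer <ACCESS_TOKEN>"})
with request.urlopen(req) as resp:
    records = json.loads(resp.read())["records"]

# Prepare and load: shape the JSON representation into rows and land it in
# an on-premises repository (a local SQLite table stands in for the warehouse).
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS sf_opportunity (id TEXT, amount REAL, close_date TEXT)")
warehouse.executemany(
    "INSERT INTO sf_opportunity VALUES (?, ?, ?)",
    [(r["Id"], r["Amount"], r["CloseDate"]) for r in records])
warehouse.commit()
warehouse.close()
```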

Data Virtualization, or DV, is another technology that should be of interest to data integration practitioners. DV could play a role in knitting together the fabric of the post-big data, post-cloud application-scape. Traditionally, data integration was practiced under fairly controlled conditions: Most systems (or most consumables, in the case of flat files or files uploaded via FTP) were internal to an organization, i.e., accessible via a local area network. In the context of both big data and the cloud, data integration is a far-flung practice. Data virtualization technology gives data architects a means to abstract resources, regardless of architecture, connectivity, or physical location.

Conceptually, DV is REST-esque in that it exposes canonical representations (i.e., so-called business views) of source data. In most cases, in fact, a DV business view is a representation of subsets of data stored in multiple distributed systems. DV can provide a virtual abstraction layer that unifies resources strewn across—and outside of—the information enterprise, from traditional data warehouse systems to Hadoop and other NoSQL platforms to the cloud. DV platforms are polyglot: They speak SQL over interfaces such as ODBC and JDBC, integrate with procedural languages such as Java, and (of course) expose REST APIs.
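The business-view idea can be sketched in a few lines of Python: one canonical "customer revenue" view that stitches together a relational warehouse and a cloud CRM behind a single function, so the consumer never sees where the data physically lives. The sources, schemas, and endpoint here are hypothetical.

```python
import json
import sqlite3
from urllib import request

def customer_revenue_view(region: str) -> list[dict]:
    """Canonical business view: one logical result set drawn from two
    physically separate sources. Callers never see the plumbing."""
    # Source 1: on-premises warehouse, reached via SQL (hypothetical database).
    warehouse = sqlite3.connect("warehouse.db")
    revenue = dict(warehouse.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE region = ? GROUP BY customer_id", (region,)))
    warehouse.close()

    # Source 2: a cloud CRM exposed over REST (hypothetical endpoint).
    with request.urlopen(f"https://crm.example.com/customers?region={region}") as resp:
        customers = json.loads(resp.read())

    # Join in the virtual layer and return the unified representation.
    return [
        {"customer_id": c["id"], "name": c["name"],
         "revenue": revenue.get(c["id"], 0.0)}
        for c in customers
    ]
```

In a real DV platform, the filtering and aggregation would be pushed down to the sources wherever possible; the point of the sketch is only that the consumer works against a single canonical view.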

Moreover, DV’s prime directive is to move as little data as possible. As data volumes scale into the petabyte range, data architects must be alert to the practical physics of data movement. It’s difficult if not impossible to move even a subset of a multi-petabyte repository in a timely or cost-effective manner.
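A back-of-the-envelope calculation shows why: on a dedicated 10 Gbps link, and ignoring protocol overhead, contention, and retries, a single petabyte takes on the order of nine days to copy.

```python
# Rough physics of data movement: time to copy 1 PB over a 10 Gbps link.
petabyte_bits = 1e15 * 8          # 1 PB expressed in bits
link_bps = 10e9                   # 10 Gbps dedicated link
seconds = petabyte_bits / link_bps
print(f"{seconds / 86_400:.1f} days")  # roughly 9.3 days
```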
