Big Data Transforms the Practice of Data Integration



Big data is transforming both the scope and the practice of data integration. After all, the tools and methods of classic data integration evolved over time to address the requirements of the data warehouse and its orbiting constellation of business intelligence tools. In a sense, then, the single biggest change wrought by big data is a conceptual one: Big data has displaced the warehouse from its position as the focal point for data integration.

The warehouse remains a critical system and will continue to service a critical constituency of users; for this reason, data integration in the context of data warehousing and BI will continue to be important. Nevertheless, we now conceive of the warehouse as just one system among many systems, as one provider in a universe of providers. In this respect, the impact of big data isn’t unlike that of the Copernican Revolution: The universe, after Copernicus, looked a lot bigger. The same can be said about data integration after big data: The size and scope of its projects—to say nothing of the problems or challenges it’s tasked with addressing—look a lot bigger.

This isn’t so much a function of the “big-ness” of big data—of its celebrated volumes, varieties, or velocities—as of the new use cases, scenarios, projects, or possibilities that stem from our ability to collect, process, and—most important—to imaginatively conceive of “big” data management. To say that big data is the sum of its volume, variety, and velocity is a lot like saying that nuclear power is simply and irreducibly a function of fission, decay, and fusion. It’s to ignore the societal and economic factors that—for good or ill—ultimately determine how big data gets used. In other words, if we want to understand how big data has changed data integration, we need to consider the ways in which we’re using—or in which we want to use—big data.

Big Data Integration in Practice

In this respect, no application, no use case, is more challenging than advanced analytics. This is an umbrella term for a class of analytics that involves statistical analysis, machine learning, and the use of new techniques such as numerical linear algebra. From a data integration perspective, what’s most challenging about advanced analytics is that it combines data from an array of multistructured sources. “Multistructured” is a category that includes structured hierarchical databases (such as IMS or ADABAS on the mainframe or, a recent innovation, HBase on Hadoop); semistructured sources, such as graph and network databases, along with human-readable sources (including JSON, XML, and txt documents); and a host of so-called “unstructured” file types: documents, emails, audio and video recordings, etc. (The term “unstructured” is misleading: Syntax is structure; semantics is structure. Understood in this context, most so-called unstructured artifacts, including emails, tweets, PDF files, and even audio and video files, have structure. Much of the work of the next decade will focus on automating the profiling, preparation, analysis, and, yes, integration of unstructured artifacts.)
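To make the point that syntax is structure concrete, here is a minimal sketch in Python that profiles an email using only the standard library. The message text, the addresses, and the field names in `record` are all hypothetical, invented for illustration; the sketch assumes a plain, single-part message.

```python
from email import message_from_string

# Hypothetical message, invented for illustration.
RAW_EMAIL = """\
From: analyst@example.com
To: dba@example.com
Subject: Nightly load failed
Date: Mon, 06 Jan 2014 08:15:00 -0500

The ETL job for the sales mart stopped at step 3.
"""

# The header syntax is structure: the parser recovers named fields
# from what is usually filed under "unstructured" text.
msg = message_from_string(RAW_EMAIL)

# Field names below are illustrative assumptions, not a standard schema.
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "sent": msg["Date"],
    "body": msg.get_payload().strip(),  # a string for single-part messages
}

print(record)
```

Only the body remains free text after this step; everything else is already a set of named fields that could be profiled or joined against other sources.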

For more articles on maximizing big data, download DBTA's Big Data Sourcebook.

If all of this multistructured information is to be analyzed, it needs to be prepared; however, the tools or techniques required to prepare multistructured data for analysis far outstrip the capabilities of the handiest tools (e.g., ETL) in the data integration toolset. For one thing, multistructured information can’t efficiently or, more to the point, cost-effectively be loaded into a data warehouse or OLTP database. The warehouse, for example, is a schema-mandatory platform; it needs to store and manage information in terms of “facts” or “dimensions.” It is most comfortable speaking SQL, and to the extent that information from nonrelational sources (such as hierarchical databases, sensor events, or machine logs) can be transformed into tabular format, it can be expressed in SQL and ingested by the data warehouse. But what about information from all multistructured sources?
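The tabular transformation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular ETL product: the sensor events, the `reading_fact` table, and its columns are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import json
import sqlite3

# Hypothetical nested sensor events, one JSON document per line.
RAW_EVENTS = """
{"device": {"id": "pump-7", "site": "plant-a"}, "readings": [{"ts": "2014-01-01T00:00:00", "temp_c": 71.5}, {"ts": "2014-01-01T00:01:00", "temp_c": 72.1}]}
{"device": {"id": "pump-9", "site": "plant-b"}, "readings": [{"ts": "2014-01-01T00:00:30", "temp_c": 64.0}]}
"""


def flatten(event):
    """Yield one flat (device_id, site, ts, temp_c) row per nested reading."""
    device = event["device"]
    for reading in event["readings"]:
        yield (device["id"], device["site"], reading["ts"], reading["temp_c"])


def load(raw, conn):
    """Create the (assumed) fact table and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE reading_fact "
        "(device_id TEXT, site TEXT, ts TEXT, temp_c REAL)"
    )
    rows = [
        row
        for line in raw.strip().splitlines()
        for row in flatten(json.loads(line))
    ]
    conn.executemany("INSERT INTO reading_fact VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
loaded = load(RAW_EVENTS, conn)

# Once the data is tabular, it can be queried in plain SQL.
avg_by_site = dict(
    conn.execute("SELECT site, AVG(temp_c) FROM reading_fact GROUP BY site")
)
print(loaded, avg_by_site)
```

The sketch works only because the schema is known in advance and every event fits it. That is the schema-mandatory constraint in miniature: a source whose shape varies from record to record would need a new `flatten` step, and possibly a new table, each time.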

