Data Integration and Data Integration Patterns: An Overview

Sep 4, 2018


Data Propagation

The propagation pattern corresponds to the synchronous or asynchronous propagation of updates or, more generally, events in a source system to a target system. Most implementations provide some measure of guaranteed delivery of the update or event notification to the target system. In fact, the data propagation pattern can be applied at two levels in a system architecture. It can be applied in the interaction between two applications or in the synchronization between two data stores. In an application interaction context, we speak of Enterprise Application Integration (EAI). In a data store context, we speak of Enterprise Data Replication (EDR).

The idea of EAI is that an event in the source application requires some processing within the target application. For example, if an order is received in an order handling application, this may trigger the creation of an invoice in the invoicing application. The target system is notified of the event in the source system (an order is received), which triggers some processing there (the creation of an invoice). Besides triggering processing within the target application, such an exchange nearly always involves small amounts of data being propagated from the source to the target application as well.
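
The order-to-invoice example can be sketched in a few lines. This is only an illustration of synchronous event propagation between two applications, not how any particular EAI product works: the `EventBus`, `OrderReceived` and `InvoicingApp` names are hypothetical, and a real implementation would add guaranteed delivery and usually asynchronous messaging.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderReceived:
    """Event raised by the (hypothetical) order handling application."""
    order_id: int
    customer: str
    amount: float


class EventBus:
    """Toy in-process bus: delivers each event synchronously to its subscribers."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)


class InvoicingApp:
    """Target application: creates an invoice when notified of an order."""

    def __init__(self):
        self.invoices = []

    def on_order_received(self, event):
        self.invoices.append(
            {"invoice_for": event.order_id, "customer": event.customer, "total": event.amount}
        )


bus = EventBus()
invoicing = InvoicingApp()
bus.subscribe(OrderReceived, invoicing.on_order_received)

# The source application only announces the event; a small payload travels with it.
bus.publish(OrderReceived(order_id=42, customer="ACME", amount=99.5))
```

Note that the order handling side never calls the invoicing application directly; it only publishes the event, which keeps the two applications loosely coupled.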

In the case of Enterprise Data Replication (EDR), the events in the source system explicitly pertain to update events in the data store. Replication means copying the updates in the source system in (near) real time to a target data store, which serves as an exact replica.
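
As a rough sketch of the replication idea, the source store below records every update in an ordered log, and the replica applies the log entries it has not yet seen. The class names and the log format are assumptions made for illustration; real replication tools typically read the database's own transaction log rather than an application-level list.

```python
class SourceStore:
    """Source data store that records every update as an ordered event."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered update events: (operation, key, value)

    def upsert(self, key, value):
        self.rows[key] = value
        self.log.append(("upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.log.append(("delete", key, None))


class Replica:
    """Target data store that replays update events in their original order."""

    def __init__(self):
        self.rows = {}
        self.applied = 0  # number of log entries already applied

    def catch_up(self, log):
        for operation, key, value in log[self.applied:]:
            if operation == "upsert":
                self.rows[key] = value
            else:
                self.rows.pop(key, None)
        self.applied = len(log)


source = SourceStore()
replica = Replica()

source.upsert(1, "a")
source.upsert(2, "b")
replica.catch_up(source.log)

source.delete(1)
source.upsert(2, "c")
replica.catch_up(source.log)  # only the two new events are applied
```

Calling `catch_up` after every update approximates near-real-time replication; calling it less often makes the replica lag behind the source.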

Changed Data Capture (CDC), Near Real Time ETL and Event Processing

A technology complementary to ETL, which adds the event paradigm to ETL, is Changed Data Capture (CDC). CDC technology can detect update events in the source data store and trigger the ETL process based on these updates. In this way, a ‘push’ model of ETL is supported: the ETL process is triggered by any significant change in the underlying data store(s). This is in contrast with traditional ETL, where data extraction occurs at scheduled intervals or in periods of low system workload, but without considering actual changes in the source data.
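
The push model can be illustrated with a minimal sketch in which a write to the source store triggers the ETL step only when the data actually changes. The `CapturedSource` hook, the `transform` rule and the in-memory `warehouse` are all hypothetical stand-ins; actual CDC products usually detect changes from database logs or triggers.

```python
class CapturedSource:
    """Source data store with a hypothetical change-capture hook."""

    def __init__(self):
        self.rows = {}
        self._listeners = []

    def on_change(self, listener):
        self._listeners.append(listener)

    def write(self, key, row):
        if self.rows.get(key) == row:
            return  # nothing actually changed, so nothing is captured
        self.rows[key] = dict(row)
        for listener in self._listeners:
            listener(key, self.rows[key])


def transform(row):
    """Example transformation: normalize the name, convert cents to units."""
    return {"customer": row["customer"].strip().upper(), "amount": row["amount_cents"] / 100}


warehouse = {}  # stand-in for the ETL target
etl_runs = []   # keys for which the ETL step was triggered


def run_etl(key, row):
    warehouse[key] = transform(row)
    etl_runs.append(key)


source = CapturedSource()
source.on_change(run_etl)

source.write(1, {"customer": " acme ", "amount_cents": 1250})  # triggers ETL
source.write(1, {"customer": " acme ", "amount_cents": 1250})  # no change, no ETL
source.write(1, {"customer": " acme ", "amount_cents": 2000})  # triggers ETL
```

A scheduled ETL job would instead run `run_etl` over all rows at fixed times, whether or not anything had changed.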

Data Virtualization

Data virtualization is a more recent approach to data integration that also aims to offer a unified data view for applications to retrieve and manipulate data without necessarily knowing where the data is stored physically or how it is structured and formatted at the sources. Data virtualization builds upon the basic data integration patterns discussed previously, but also isolates applications and users from the actual (combinations of) integration patterns used. The technologies underlying data virtualization often avoid data consolidation techniques such as ETL: the source data remains in place, and real-time access is provided to the source systems of the data. This approach hence seems similar to data federation, but an important difference is that, contrary to a federated database as offered by basic EII, virtualization does not impose a single data model on top of the heterogeneous data sources. Virtual views on the data can be defined at will and can be mapped top-down onto relational and non-relational data sources.
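
A small sketch may make the idea of a virtual view more concrete. Here a view is defined over one relational source (an in-memory SQLite database) and one non-relational source (a list of JSON documents), and it is evaluated against the sources on every access instead of copying the data. The table, documents and `customer_orders_view` function are invented for this illustration and do not reflect the interface of any specific virtualization product.

```python
import json
import sqlite3

# Relational source: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ACME"), (2, "Globex")])

# Non-relational source: JSON documents, as a document store might hold them.
order_docs = [
    '{"order_id": 10, "customer_id": 1, "total": 99.5}',
    '{"order_id": 11, "customer_id": 2, "total": 20.0}',
]


def customer_orders_view():
    """Virtual view: computed from both sources at access time, nothing is copied."""
    names = dict(conn.execute("SELECT id, name FROM customers"))
    for doc in order_docs:
        order = json.loads(doc)
        yield {
            "order_id": order["order_id"],
            "customer": names.get(order["customer_id"]),
            "total": order["total"],
        }


before = list(customer_orders_view())

# A change in a source is visible the next time the view is read.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("ACME Corp", 1))
after = list(customer_orders_view())
```

Consumers of the view see a single list of order records and never need to know that one half comes from SQL and the other from JSON documents.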

In many real-life contexts, a data integration exercise is an ongoing initiative within an organization, and will often combine many integration strategies and approaches (see Figure 4).

Figure 4: Data integration practices often combine a variety of patterns and approaches.

Data as a Service and Data in the Cloud

The pattern of virtualization is often linked to the concept of Data as a Service (DaaS), in which data services are offered as part of the overall Service Oriented Architecture (SOA). The data services can be invoked by different applications and business processes, which are isolated from how the data services are realized regarding location, data storage, and data integration technology. Many commercial data integration suites adhere to the SOA principles and support the creation of data services. Data services can be read-only or updatable, in which case they must be able to map the updates as issued by consumers of the data service to the underlying data stores in an unambiguous way. Most data integration suites also provide features for data service composition, in which data from different services can be combined and aggregated into a new, composite service. Data as a Service is in turn often related to cloud computing. The ‘as a service’ and ‘in the cloud’ concepts are closely related, with the former putting more emphasis on the consumer perspective (invocation of the service) and the latter mainly emphasizing the provisioning and infrastructure aspect.
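
The requirement that an updatable data service map updates unambiguously to the underlying stores can be sketched as follows. In this hypothetical `CustomerService`, each exposed field is owned by exactly one store, so every update has a single destination and updates to unknown fields are rejected. The store names and fields are assumptions chosen for the example.

```python
class CustomerService:
    """Updatable data service composed over two underlying stores."""

    # Each exposed field is owned by exactly one underlying store.
    FIELD_OWNER = {"name": "crm", "email": "crm", "credit_limit": "billing"}

    def __init__(self, crm, billing):
        self._stores = {"crm": crm, "billing": billing}

    def get(self, customer_id):
        """Combine the data held by both stores into one composite record."""
        record = {}
        for store in self._stores.values():
            record.update(store.get(customer_id, {}))
        return record

    def update(self, customer_id, **changes):
        """Route each changed field to the single store that owns it."""
        unknown = set(changes) - set(self.FIELD_OWNER)
        if unknown:
            raise ValueError(f"no unambiguous mapping for fields: {sorted(unknown)}")
        for field, value in changes.items():
            store = self._stores[self.FIELD_OWNER[field]]
            store.setdefault(customer_id, {})[field] = value


# Plain dictionaries stand in for the underlying data stores.
crm = {7: {"name": "ACME", "email": "info@acme.example"}}
billing = {7: {"credit_limit": 1000}}

service = CustomerService(crm, billing)
service.update(7, name="ACME Corp", credit_limit=2500)
record = service.get(7)
```

Validating all fields before applying any of them means a rejected update leaves both stores untouched.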

This article is based on the book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, by Wilfried Lemahieu, Seppe vanden Broucke, and Bart Baesens, www.pdbmbook.com.