Data integration aims to provide a unified and consistent view of all enterprise-wide data. The data itself may be heterogeneous and reside in different sources (XML files, legacy systems, relational databases, etc.). The desired extent of data integration depends heavily on the required quality-of-service characteristics. Data will never be of perfect quality, so a certain level of inaccurate, incomplete, or inconsistent data may have to be tolerated for operational BI to succeed.
This article is based on the book, Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, by Wilfried Lemahieu, Seppe vanden Broucke, and Bart Baesens, www.pdbmbook.com.
The goal is to logically (and sometimes also physically) unify different data sources or data silos to provide a single unified view that is as correct, complete, and consistent as possible. Different methods exist to provide this unified view, as we discuss in what follows.
Figure 1. Data silos and sources
Data Consolidation: Extract, Transform, Load (ETL)
Data consolidation captures data from multiple, heterogeneous source systems and integrates it into a single persistent store. This is typically accomplished using extract, transform and load (ETL) routines (see Figure 2):
- Extract data from heterogeneous data sources
- Transform the data to satisfy business needs
- Load the transformed data into a target system (e.g., a data warehouse)
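The three steps above can be sketched as a minimal pipeline. This is an illustrative example only: the source records, table names, and cleansing rules are hypothetical, and the target "warehouse" is an in-memory SQLite database standing in for a real one.

```python
import sqlite3

# Hypothetical raw records from a source system; in practice these would
# come from heterogeneous sources (XML files, legacy systems, databases).
crm_rows = [("C001", "alice@example.com ", "120.50"),
            ("C002", "BOB@EXAMPLE.COM", "99.9")]

def extract():
    # Extract: pull raw records from the source system.
    return crm_rows

def transform(rows):
    # Transform: cleanse (trim/lowercase emails) and restructure
    # (parse amounts into numbers) to satisfy business needs.
    return [(cid, email.strip().lower(), round(float(amount), 2))
            for cid, email, amount in rows]

def load(rows, conn):
    # Load: write the transformed data into the target store.
    conn.execute("CREATE TABLE IF NOT EXISTS customer "
                 "(id TEXT PRIMARY KEY, email TEXT, revenue REAL)")
    conn.executemany("INSERT OR REPLACE INTO customer VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM customer ORDER BY id").fetchall())
```

A production ETL tool would add scheduling, error handling, and incremental loads, but the extract-transform-load shape remains the same.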
This approach, with an ETL process feeding a store with consolidated data, is very suitable for dealing with massive amounts of data and preparing them for analysis. There is room for extensive transformation, involving data restructuring, reconciliation, cleansing, aggregation, and enrichment steps. This pattern therefore has a positive impact on many data quality dimensions, such as completeness, consistency, and interpretability. Another important advantage is that data consolidation caters not only for present information but also for historical data, since a changed business state does not result in updates to the data but in additions of new data. On the downside, the ETL process typically induces a certain measure of latency, so the timeliness dimension may suffer, with the data being slightly out of date. Consolidation also requires a physical target, so additional storage capacity is consumed. Besides the traditional setup with ETL and a data warehouse, data lakes can also be an implementation of the consolidation pattern.
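The point about historical data can be made concrete: each load appends a new row rather than updating an existing one, so earlier business states remain queryable. A minimal sketch, with a hypothetical `customer_history` table and invented segment values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Append-only history table: a changed business state results in a new
# row (with its own validity date), never an in-place update.
conn.execute("CREATE TABLE customer_history "
             "(id TEXT, segment TEXT, valid_from TEXT)")

def load_snapshot(cid, segment, as_of):
    conn.execute("INSERT INTO customer_history VALUES (?, ?, ?)",
                 (cid, segment, as_of))
    conn.commit()

load_snapshot("C001", "silver", "2023-01-01")
load_snapshot("C001", "gold", "2024-01-01")   # state change: added, not updated

history = conn.execute(
    "SELECT segment, valid_from FROM customer_history "
    "WHERE id = 'C001' ORDER BY valid_from").fetchall()
print(history)  # both the old and the new state are preserved
```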
Figure 2. An extract, transform, load engine (ETL) as a data consolidation solution
Data Federation: Enterprise Information Integration (EII)
Data federation typically follows a pull approach, where data is pulled from the underlying source systems on demand. Enterprise Information Integration (EII) is an example of a data federation technology (see Figure 3). EII can be implemented by realizing a virtual business view over the dispersed underlying data sources. This view serves as a universal data access layer. No moving or replication of data is needed, since all data stays in the source systems. Hence, a federation strategy enables real-time access to current data, which was not the case with a data consolidation strategy.
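The virtual business view can be sketched as follows. The two "source systems" and all field names here are hypothetical stand-ins; the essential point is that the view translates each request into per-source queries at access time, so nothing is copied into a central store and results always reflect the current source state.

```python
# Two independent, hypothetical source systems, queried in place.
def query_crm(customer_id):
    crm = {"C001": {"name": "Alice", "email": "alice@example.com"}}
    return crm.get(customer_id, {})

def query_billing(customer_id):
    billing = {"C001": {"balance": 120.50}}
    return billing.get(customer_id, {})

def customer_view(customer_id):
    # Virtual business view: pull from each source on demand and
    # combine the results; no data is moved or replicated.
    record = {"id": customer_id}
    record.update(query_crm(customer_id))
    record.update(query_billing(customer_id))
    return record

print(customer_view("C001"))
```

If a source system updates its data, the next call to the view reflects that change immediately, in contrast to a consolidated store that waits for the next ETL run.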
Figure 3. Enterprise Information Integration (EII) as a data federation solution
EII can be beneficial because it leaves data in place that might otherwise dramatically increase overall storage requirements if a consolidated approach were followed. One important disadvantage to keep in mind is EII's lower overall performance: since queries against the business view must be translated into queries on the underlying data sources, a performance hit is unavoidable.