Decomposing Data Integration: Best Practices in Accelerating Pre-ETL Processes

Bookmark and Share

Every data integration initiative-whether it supports better decision making, a merger/acquisition, regulatory compliance, or other business need-requires a set of processes to be completed before the data can be made available to business users. Though this set of processes is fairly well understood by industry practitioners, there are still many areas left unaddressed and, therefore, the process is time-consuming, inefficient, unpredictable, and costly.

This initial portion of the data integration process can be decomposed into an actionable workflow, which may be automated and repeatable across projects. Specifically, the pre-ETL (extract, transform, and load) processes of discovery and mapping can be divided into a series of discrete steps which help drive the final ETL delivery and solution.

The advent and adoption of discovery and mapping tools will do much to ease the burden of pre-ETL steps in data integration projects. However, the actionable workflow which streamlines and executes this vision must be implemented via both best practices and through automation within the tools.

Step 1: Take Inventory of Your Data

Before any data integration project can be undertaken, an organization must first admit that it doesn't know what it doesn't know. While "perfect knowledge" of one's data is unrealistic, an effective discovery process is essential in order to derive an understandable baseline. Such a baseline provides the data knowledge needed to build enterprise applications.

Data discovery has traditionally been a manual and time-consuming exercise, consisting of analysts writing a series of SQL queries against each data source, and performing a copy/paste of the results into a number of spreadsheets. Aside from the inefficiency of this method, queries will only return what is asked for. In other words, if you're not aware that a data issue exists, a query won't be issued, and thus hidden data anomalies won't necessarily be brought to light.

A better practice is to leverage software designed specifically for automating data discovery, allowing it to be completed more thoroughly and in a fraction of the time. This also frees up analysts to focus on greater value-added functions.

Step 2: Understand Your Source Data

When evaluating potential data sources for integration, one must first completely understand how the source data is related to itself, as well as to other potential sources. For example, how does a field named "customer_num" relate to another field named "customer_id" within a given source, and how do both of these relate to "cust_id" in a second source? This part of the mapping process leads directly to the definition of table joins and data lookups, which are important in extracting the correct data to be later consumed by the target. Using automated tools with an emphasis on data matching frees time for analysts to spend evaluating results, instead of spending time manually composing SQL queries.

Step 3: Map Source Data to the Target Model

Once the source-to-source relationships are understood and relationship mappings defined, the source-to-target mappings need to be discovered and analyzed. It is during this phase that business rule discovery and definition takes place, and where target columns are mapped back to the source data columns via transformations and the implementation of business rules. The data analyst should be able to draw from a body of automated discovery and relationship information in a unified interface to accomplish this task. The best practice is usually to map from the target back to the source system(s), as not every field in each source system will need to be mapped.

Step 4: Leverage Previous Work

In many vertical industries, a body of mapping information will remain consistent in a manner which is dependent by vendor. When doing multiple data conversions of a vendor system to a new target, previous experiences and learnings should be maximized. Some tools offer a "reference hub," or some other similar reuse of mapping meta data. This automation will save data analysts and ETL developers many hundreds of hours of otherwise redundant mapping and discovery.

Pre-ETL Automation

Step 5: Leverage What You've Learned

Guided analysis. Actionable workflow. These are areas in which best-of-breed vendors are attempting to incorporate best practices and a codified methodology directly into their products, and this offers much promise to data integration projects, particularly in the areas of automated mapping and discovery. Automation should include heuristics which use both meta data and actual data values as inputs.

Critical to the process is how the results are then shared with downstream ETL tools, so that a maximum reuse of meta data is achieved. The concept of bi-directional meta data is becoming increasingly important. This allows a transformation which is modified in a downstream ETL tool to also be reflected seamlessly as a change to a business rule in the discovery and mapping toolset.

Sharing all of the generated mapping meta data, business rules, and annotations made by analysts during the process should be made available via reporting directly in the chosen toolset, or exportable into a common format such as Microsoft Excel or Word. This reduces the amount of time needed to produce design documentation, and ensures higher quality documentation, since it is based on actual mapping data.


With the explosion of data across the enterprise and in the business world, data integration projects will continue to play an increasingly important role. Combining best practices with "best-in-breed" tools will help organizations to deliver successful solutions in a more timely and cost-effective fashion. Automating "pre-ETL" processes helps to eliminate the costly, error-prone, and resource-intensive efforts of discovery and mapping of unknown data sources. Organizations should not have to rely on an army of analysts armed with only SQL queries, Excel spreadsheets, and disjointed analysis and design tools to perform these important tasks. Picking the right vendors and integrating them into your technology stack and project methodology is the key to success.