Designing an ETL Design Pattern

Mar 1, 2021

By Aaron Segesman, Solution Architect, Matillion

Design patterns are used throughout the computer programming world for numerous reasons, but most resonantly, because they are an informed technique that lends itself to increased innovation and quality, simultaneously with the development of software. This is all pretty abstract, so let’s use an analogy: automobiles.

Cars all have steering wheels, brake and gas pedals, doors, and engines. These characteristics are what combine to create a pattern. Because of this pattern every auto manufacturer can invest more substantially into the realm of innovation—what makes their car different or better. Similar to automobiles, this is a goal of software patterns, to embrace innovation and reap the benefits of reuse.

Get Started with a Strategy

To implement this pattern we need to establish the paradigm that we will be solving for. The most prevalent technique to execute ETL processes is by doing so in batches. Batch processing is not a new concept, and all the major players in the data space have mechanisms that implement batch processing.

A phased strategy to ETL processing is the preferred approach, breaking up processing into Extract, Transform and Publish. To save time (and give users time to change their minds), it’s a good idea to establish the rules/requirements for how enterprises will publish their data. This is because it’s the point of integration, where someone or something will be consuming the data. This will likely need some negotiating, so best to get started with it up front.

Phase One: Extract

We stage our source data as-is, meaning that regardless of the source (database, files, Application Interface) we write the data to our warehouse EXACTLY how it is presented to us. Enterprises need to have the data local to the processing and in its raw state. There are numerous reasons for this, such as data availability and the impacts on the source system and its users. Staging the data as-is isolates the control over the data, prevents data loss and is imperative for debugging.

Companies may be tempted to apply some quick “in-transit” transformations—don’t do it. Following the pattern will ultimately save time during initial development, but also later when troubleshooting. The more that logic is applied, the more that can go wrong, so companies should jot down observations and tackle them in the next phase.

Phase Two: Transform

The transformation phase will have numerous steps, but the first should be to cleanse the newly staged data. Triaging the data up front ensures that our subsequent steps will be processed in the most efficient way possible and give companies an opportunity to measure the quality of their source data set up front. This will save time and bring familiarity with the data at the beginning.

There are a few different approaches to cleansing, and they are often informed by the rules of the publishing step—do you need to quarantine data? Do you need to halt processing with a percentage of bad data? Do you have a requirement to publish the rejected data? One way to accomplish this is by using a “flag” that you set when the data has been cleansed. You might also have a “flag reason” attribute to help with qualifying the records that have been flagged as “bad”.

A company may also have an all-or-nothing requirement. In this case, an “insert into ... select from” will allow them to convert data types and apply clean-up rules in bulk. This is a popular technique because it keeps all data cleansing rules in one place, making it easier to support and test enhancements later.

Transformation Steps 2 through N

After the data is squeaky clean, it’s time to apply the necessary transformations. Transformations come in many forms, from the application of business rules to warehouse integration. A step-by-step approach can help measure progress as well as keep it easy to understand. Be thoughtful and logically group the varying transformations together (references, naming conventions, grain change) in a way that subsequent steps can build upon.

There is a balancing act between processing time and complexity. Combine transforms by condensing logic, but take care to not make the process prohibitively complex. Benchmark and measure how each step performs, then evaluate the cost-benefit of combining steps. Adding comments often will help diffuse complexity, but will likely be ignored if they don’t contain the “why”. We can see there is an outer join, but why is it necessary here?

Beyond debugging, an additional benefit to keeping the steps simple is they become reusable. We all are guilty of a little copy and paste, and fortunately for us code plagiarism isn’t a slight, but rather a compliment. Keeping the steps simple not only ensures that our processes are supportable, but also gives us a leg up on our next ridiculous timeline.

Phase Three: Publish

We finally get to the final stretch, where our transformations are complete and we need to unleash the fruits of our labors for consumers. Similar to the “stage as-is” rule, we also have publish as-is rule. So make sure that while publishing, all you are doing is publishing, not transforming nor applying a last minute data fix. Just publish.

Important as it is to be cognizant of our impact to source systems, we also need to be aware of how our data is consumed. Below are techniques that can be employed to publish data in a sensible way for companies and their target audience.

Truncate and Load

This methodology is the most invasive, destructive and generally the fastest. As the name suggests, remove all previously-loaded data and load in a new, prepared data. There are a number of variations to this depending on what business requirements need to be considered, but this technique is most suitable for fact and aggregate table outputs.

Data Surgery

This doesn’t require a scalpel to be effective, but is best applied when modifying an existing set of data. This makes the necessary changes (precise updates, inserts and deletes) to a target and is good for dimensional tables and rolling aggregates. A good measure to base this technique is the percentage of rows modified. If the percentage modified is less than 40%, try this technique.

Append

Similar to truncate and load, we unapologetically insert our transformed data to the target. It is a best practice to apply a delete before you load, giving you the ability to reload the data if necessary. This delete would only have something to delete in the case of a reload. This is also well-suited for fact and aggregate tables, but also is an easy and convenient way to perform “system snapshots.”

Blue/Green

Employing several possible publishing techniques, this methodology assumes that there are multiple “production” systems available all the time, but only one is active. Our processes to Extract, Transform and Publish will target the passive system until the moment the “switch” is flipped and our newly published data becomes “active”.

Build for the Future

Data has become a critical part of daily life and accuracy and consistency matter more than ever. Establishing a good design pattern as a foundation will help ensure quality, reduce the learning curve, and create space for innovation. Each phase of the pattern has a focus on quality and stability, based on the nature of the phase itself, while leaving room for later enhancements and innovations.