New Data (Almost) Always Rings Twice

Anything worth doing is worth doing again and again, right? When building out and standardizing new subject areas or new sources for one’s data warehouse, hub, or other analytics area, a task often overlooked at the start is the logic that brings in the data. Obviously, everybody knows that the new data must be processed. What many ignore is that the process to bring the data in often must be established twice, or more.

The issue causing the extra workload is history. The base processing is defined to ingest the data as it looks and arrives from the source today. As such, this processing is useful for running today and going forward. If starting from today is all that is necessary, then data history is of no concern and one is done. If, however, there is a business need to have historic data available right from the start, then one is not done, and more details need to be worked through.

The first question to answer when considering historic data for a new source is, “Is the desired history available?” Sometimes the answer is, “No,” and as undesirable as that answer may be, the lack of available history is an expectation that must be managed going forward. When history is available, it may come in a variety of forms.

In an ideal world, bringing in history could be as simple as receiving a data store that is a full copy of the source rather than the daily transactions. Ideally, it is in the same format, and it may happen that the logic defined for the daily transactions works fine against the full copy. Less ideally, the format may be different, or the context of a full copy requires logic changes; in that case, a secondary version of the processing logic is established to process the full copy. Alternatively, one may find that the only, or best, source of history is a set of daily transaction files going back to the point in time needed. The logic may need to be slightly altered so that the “current datetime” becomes an input parameter rather than a read of the system clock.
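As a minimal sketch of that last idea, assuming Python and hypothetical record fields: the daily logic takes the “current date” as a parameter instead of reading the clock, so the same code can process today’s file or replay a historic one.

```python
from datetime import date


def process_daily_file(records, as_of: date):
    """Apply the standard daily-transaction logic, with the 'current
    date' passed in as a parameter instead of taken from the clock.
    Illustrative sketch only; the record fields are hypothetical."""
    processed = []
    for rec in records:
        processed.append({
            **rec,
            "load_date": as_of,  # would otherwise be date.today()
        })
    return processed


# Today's run and a historic replay use the exact same logic:
todays_rows = process_daily_file([{"id": 1}], as_of=date.today())
old_rows = process_daily_file([{"id": 1}], as_of=date(2020, 1, 2))
```

The design choice is simply dependency injection of the clock: any logic that silently calls "now" has to be rewritten for history, while logic that accepts a date can serve both runs.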

The main challenge will be coordinating and stepping through a cycle-by-cycle run of each data set from the oldest to the most recent. Sometimes the vendor providing the daily files cannot or will not offer any history. That lack of assistance does not, in and of itself, mean history is unavailable. History may exist in a previous or alternate data hub or other structure. Under such differing circumstances, it is virtually guaranteed that the logic for processing the historic data will be completely new.
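That cycle-by-cycle coordination can be sketched as a simple replay loop, again assuming a parameterized daily process and a hypothetical mapping of dates to that day’s transaction set:

```python
from datetime import date, timedelta


def replay_history(files_by_date, start: date, end: date, process):
    """Step through each daily data set from oldest to most recent,
    running the parameterized daily logic once per cycle.
    'files_by_date' maps a date to that day's transactions; both it
    and 'process' are hypothetical stand-ins for real components."""
    day = start
    while day <= end:
        if day in files_by_date:          # skip days with no file
            process(files_by_date[day], as_of=day)
        day += timedelta(days=1)
```

Running oldest-first matters: each cycle’s output becomes the state the next cycle updates, just as it would have in real time.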

Sometimes one may be lucky and only have to write the logic once. But often enough, new source ingestion processing must be written at least twice: once for pulling in current data and data moving forward in time, and a second time to bring in designated history that must be available right from the start.

It may be that the current-and-forward feed comes from the “legitimate” source, while the history comes from an older version of the solution being built that used a different format, or the older legitimate source had multiple formats in place at various points in time. Either way, new code must be created to pull in this same data as history.

It is because of these dynamics that the ETL, ELT, or however one designates the processing, is created more than once. As one punk band once commented, “Quantity is quality.” The more times one does something, the better one gets at it. The extra work helps teach us how to get it done.