History Changes, Whether We Want It To or Not

Bookmark and Share

In the dim, dark past of data warehousing, there was a time when the argument was put forward that “history does not change.” It was posited that once a piece of data was received by the data warehouse, it was sacrosanct and nonvolatile. A fact record, once processed, was to remain unchanged forever.  Dimensions, due to their descriptive nature, could be changed following the prescribed Type 1, 2, or 3 update strategies, but that was all. It was the expectation that due to their very nature, fact tables would become huge and in being huge would give poor update performance; performance so poor that updates would be virtually impossible to enact. 

Or maybe we were just lying to ourselves all along. Holding the line on such an approach was a battle that was easily lost once such phrases as “needs of the business” were uttered. Those business needs included incomplete data that was more complete at a later date, or estimates that were recalculated more accurately at a later time, or one of many other possible data integrity sins. Even from the very beginning there have been processes created to allow history to change. Sometimes changing the history was easy; the fact table was truncated and replaced. Occasionally, fact table row volumes were so low that folks actually did create processes whereby row-by-row deletes and inserts were performed. Other times it was as hard as expected to update that history, and weekend processes were created to re-roll enormous cubes with updated content. But easy or hard, and for good or for ill, it was and still is done. Maybe the tremendous number of fact history update violations can be attributed to a lack of vision on the part of data warehouse designers and architects. 

Too often both users and developers think in an overly concrete fashion and can only consider a table in terms of a normalized entity. These users and developers feel the only way a count of occurrences may be done is by counting the number of rows. If many of these fact tables were actually established so that only inserts were allowed, then perhaps the fact tables could be approached like a balance sheet with multiple entries for an occurrence. Rows could be inserted containing a plus here and a minus there. Every query against such a table, even a count of occurrences, would be composed as a summation, so adjustment rows would impact totals but not occurrence counts. Maybe then, under such an accounting-biased approach, less history would change. Totals would alter, but the specific detailed changes also could be identified. However, the processing necessary to recognize and format precise balance sheet changes, in support of “Inserts-only” in this fashion, becomes more costly than simply replacing the data — so changing history becomes the more appealing option.  

Occasionally, users require the ability to view historic values in addition to the current values, causing even more confusion, plus continued resistance to a balance sheet approach. Instead, thoughts go to version numbers. And with this perspective, many times, those version numbers are placed on a fact row, forcing users interested in a current value to perform recursive joins; a first join to obtain the latest version found and a second join to get the appropriate current values. If users and users’ query complexity is considered, a way should be found to place version numbers by grouping within a dimension table, allowing any recursion to be performed against the dimension. But it might be nicer still to have a flag on the dimension making the latest version facts more clear. The harder work should rightfully be on the shoulders of the development staff; leaving heavy lifting for the end users simply harms overall corporate performance and responsiveness.