Updating Fact Rows in a Star Schema Is Not a Good Practice

Updating fact rows inside a star schema is never a best practice. Even so, some organizations travel down this path.

Consultants occasionally go so far as to talk about Type 1 and Type 2 slowly changing facts. With Type 1, a fact row is simply updated in place. With Type 2, the original fact row is end-dated and flagged as logically deleted, then a replacement row is inserted. All these Type 1 and Type 2 activities really accomplish is to translate the dimensional fact table into a more mundane normalized structure; one no longer has a dimensional fact at all.

This same shift into a normalized design occurs very often with dimensions. Reference tables, or even collections of normalized reference tables, are simply renamed “dimensions” and connected to the fact. Implementers often obfuscate their sins even more by calling the normalized references a “snowflake.” This approach leaves the data model as a surrealistic painting: a fact chunk surrounded by several normalized blobs of “dimensional” tables.

Fact tables should be established so that incoming data is added by insertion only, never by any other method. For those who need complexity, facts based on snapshots do exist, but even those snapshots are insert-only. If one truly has a circumstance where history is believed to change, a little extra work within the ETL can address that situation. Simply put, the answer is to adopt a more “accounting-based” perspective: instead of treating the incoming data as a replacement row, rework it into a delta row. For example, if the value of “10” needs to become “9,” leave the “10” row alone and add a “-1” row. If for some reason such manipulations are impractical, inserting a nullification of the initial row and then inserting the replacement row can work too: given the original “10” row, insert a negating “-10” row, then insert the replacement “9” row. Either way, summation queries will always arrive at the correct answer, and defining a query that retrieves values as they were at a specific point in time becomes relatively easy.

As a matter of perspective, if a circumstance is in place where a fact table is truncated and reloaded every cycle, one may consider that arrangement a “Type 1” fact. But fortunately, with truncate-and-reload there is no actual update processing.
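The delta-row idea above can be sketched in a few lines of Python, using an in-memory list as a stand-in for the fact table. The table name, column names, and sample values here are hypothetical, chosen only for illustration; in practice the same pattern applies to inserts against a real fact table.

```python
from datetime import date

fact_sales = []  # insert-only stand-in for a fact table

def insert_fact(order_id, amount, load_date):
    """Append a row; rows are never updated or deleted in place."""
    fact_sales.append({"order_id": order_id,
                       "amount": amount,
                       "load_date": load_date})

# Original load: order 1001 arrives with a value of 10.
insert_fact(1001, 10, date(2024, 1, 1))

# A correction arrives later: the value should have been 9.
# Delta approach: leave the "10" row alone and insert a "-1" row.
# (The negation alternative would insert -10 and then 9 instead.)
insert_fact(1001, -1, date(2024, 1, 2))

def total(order_id, as_of=None):
    """Summation query, optionally as of a point in time."""
    return sum(r["amount"] for r in fact_sales
               if r["order_id"] == order_id
               and (as_of is None or r["load_date"] <= as_of))

print(total(1001))                    # current value: 9
print(total(1001, date(2024, 1, 1)))  # as it stood on Jan 1: 10
```

Because every correction is just another insert, the current value and any historical point-in-time value both fall out of the same summation, filtered by load date.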

In the data warehousing world, the best practice is to only insert data into every fact table. The reasoning is straightforward: updates are slower than inserts, and fact tables are generally so large that looking up and updating individual rows is slower still. This cost is further compounded when users do not want history to change, or at minimum want both a perspective on what the data looks like today and on what it looked like yesterday.

Certainly, circumstances may exist where the data is so small (or the available processing so vast) that updating fact rows is not too slow, and users may not care about history changing, today. Under those conditions, updating fact tables looks like a quick, easy, and obvious solution. But just because such shortcuts can “work” does not mean these techniques are a best practice. Your circumstances may convince you that updates must be done, but do not forget that updating is sub-optimal. An architect of any flavor involved in data warehousing should be automatically predisposed to only insert fact rows. Updating fact rows is not the usual, expected, or desired tactic. Doing only inserts will always be the more efficient process; and even if efficiency is not an issue, why be inefficient?