Retaining Historical Change in Data Structures is Often Necessary but Rarely Simple

Tracking change activity over time requires complicating what might otherwise be simple data structures.  One approach that many like to employ involves simply adding a start or effective date onto a table design.  The added date is incorporated as part of the table’s primary key; then when content changes, a new row is inserted.  The existence of the “new” row for the “old” primary key value (sans effective date) obsoletes the “previous” row.  Where things become a bit squigglier is in handling the consequences of such a key change across related table structures. 

One school of thought is to simply leave things alone and make no changes to dependent tables.  This means that there cannot be a formal foreign key defined within the data model between the changed independent table and the unchanged dependent table, because a foreign key must include the full set of columns comprising a composite key.  Developers find this approach acceptable because, functionally, it is still possible to write a query joining the two tables using only the partial primary key and then adding in an appropriate restriction of the start date column to obtain a result. 

Adding to the complexities, if only the start date is added, then identifying the most “current” row involves a recursive join into the same table to select only the row containing the maximum start date value.  Also, enforcement of data integrity for these “almost foreign keys” rests entirely on the shoulders of the application logic instead of within the DBMS engine. 

While queries may be written, should one know the “secret handshake,” what is lost is documenting these relationships.  As mentioned above, because of the lack of a full replication of the columns comprising the referenced primary keys, no relationships may be presented within the data modeling tool.  Under this approach, knowing what tables may be joined and how to join them relies on information being shared outside of the formal methods available for documenting exactly these pieces of information.  Often this means relying on an oral tradition of passing information from one generation of developers to the next.

Adding in a start date, alone, and not embracing such a change across dependent tables is a short cut.  This technique expedites initial development.  People always like a short cut.  Short cuts are in keeping with a faster-cheaper mentality.  And while “faster and cheaper” seems good for the initial development, in the long term the “faster and cheaper” aspects are often questionable.  As demonstrated by every user’s querying becoming complicated by date range tests and recursive joins.

The alternative is a slower path that complicates development to simplify later queries.  Foreign keys are redefined to include the added start dates.  When new rows are added into a parent table, the dependent table may also require new rows to be added, continuing on the effective dependent rows.  But once the work is done, the database user will be able to write and execute simpler and more direct queries.

Historical content within data structures increases their complexity.  It is a matter of design that determines where that complexity comes to rest and who must bear the brunt of the existence of the complexity.  Making complexity “easy” for the initial developers and “harder” for the later users is a short-sighted approach.  When faced with a choice of having the initial developers deal with complexity or having the later users deal with complexity in every query, it seems self-evident that the better solution is to have the developers cope with that complexity once rather that an infinite number of times in the end users’ hands.  This additional complexity exists, regardless of whether natural or surrogate keys are involved.