The Rationality of Surrogate Keys

The original intention of surrogate keys across multidimensional database designs was to help optimize joins by keeping all keys, used for joining between facts and dimensions, both numeric and single-columned. Often such surrogates were generated as simple sequential numbers, 1, 2, 3…. Early versions of many relational database products had sub-optimal performance when joining data together via character strings—especially large quantities of data. Today, regardless of need, numeric surrogate keys are the standard. Well-formed dimensions have a generated surrogate key as their default primary key. Natural key columns are still included as additional columns with everything else in the structure. The natural key might even be defined to the DBMS as an alternate key. Natural keys on the dimension are intended to be used only for filtering the data, not for any joins. In a further effort to be simple and transparent, many standards have these surrogate keys named such that it includes the name of the dimension table followed by ID, or Key, or even SK (for Surrogate Key), like “customer_dim_sk.“ In this fashion, even people who perform very casual database queries can plainly identify the primary key of each dimension; and when that same named column shows up on a fact table, they know exactly how to make a join between any fact and any dimension.

Fact tables can be more of a challenge. Many fact tables also may have a simple generated surrogate key used as their primary key. However, unlike dimension tables, on facts such an arrangement can cause confusion rather than enhance clarity. The reason for the loss of clarity has to do with the idea of granularity. Granularity is important for understanding the nature of a fact table. If there were no single column generated surrogate key defined as the primary key of a fact, and instead the fact primary key was defined as the combination of all, or even some, of the foreign keys into the dimensions, anyone looking at an entity-relationship diagram could easily understand the level of detail found within the rows of the fact—its grain. For many reasons, folks across many organizations prefer to leave things more open, more flexible, i.e., sloppier. This sloppiness may be done with the best of intentions. It starts with an organization’s desire to have the natural keys from the dimensions appear on the body of the fact table. This desire may be driven by a fear of performance problems with joins into the fact, when, in actuality, the feared performance issue may not even exist.

Once heading down this path, practices often degenerate further. Dimensions may be established not using surrogates but using natural keys instead for both filtering and for joins. If any dimensions involve Type 2 Slowly Changing Dimension logic, queries then become more complicated. Users need to understand more, as potentially each dimension added into a query must now have point-in-time logic restricting joins to the single correct row desired for each fact. When done in a sloppy fashion such as this, queries still work, but the original intentions of simplicity and transparency have been ignored.

If one starts duplicating natural keys from the dimensions onto the facts, one has created a circumstance to confuse end users writing SQL SELECTs, because which version do they include in their queries? How does whatever interface present the differences between the two? Why create chaos? Simplicity is always the best, first choice. If queries only have a single option, the user’s job becomes easier. If performance problems actually arise during development, other options may be explored. It seems highly likely that those feared issues might never appear. Having simple and consistent rules provides a huge payback in fueling productivity.