Don’t Overuse Generated Primary Keys in Logical Data Models

One of the principles of relational theory is that each row, or tuple, of an entity be uniquely identifiable.  This means the defined structure includes some combination of attributes whose values identify an individual row within the table or relation.  Each such combination of attributes is a candidate key for the structure.  If a structure has only one candidate key, that key serves as the primary key; if it has several, one of them is designated the primary key.  When building up a logical design, primary keys should be identified from the actual data points in play.  Generated, or surrogate, keys should be held in abeyance until dealing with the physical design and its implementation.  There are exceptions to this, as in the case of business data that is expected to be generated: when building an account management system for a new company, for example, the expectation should be that the business will want to generate an account number.  Alternatively, there may be circumstances where all of an entity's attributes truly are necessary to serve as the primary key, but such circumstances should be rare (and hopefully the entity involved has very few attributes).  If instead one frequently runs into entities that have many columns, all of which are needed to comprise a unique identifier, then it is best to take a step back and review the logical data modeling process.
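The defining property of a candidate key can be stated very directly: projecting the rows onto the key's attributes must yield no duplicates. The sketch below, using a hypothetical account table (the data and function name are illustrative, not from the text), checks that property against sample rows.

```python
def is_candidate_key(rows, attrs):
    """Return True if the attributes in `attrs` uniquely identify
    every row in `rows` (i.e., no two rows share the same values)."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False  # two rows project onto the same key values
        seen.add(key)
    return True

# Hypothetical account data for illustration.
accounts = [
    {"account_number": "A-1001", "holder": "Alice", "branch": "North"},
    {"account_number": "A-1002", "holder": "Bob",   "branch": "North"},
    {"account_number": "A-1003", "holder": "Alice", "branch": "South"},
]

print(is_candidate_key(accounts, ["account_number"]))  # True
print(is_candidate_key(accounts, ["branch"]))          # False: "North" repeats
```

Here the generated account number qualifies as a key precisely because the business chose to generate one; an attribute like branch does not.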

Under far too many circumstances, those responsible for modeling data ignore this perspective on primary keys and simply define surrogate keys for everything in their data model.  One might be kind when confronted with such a model and assume it is a physical-only data model, and further assume that were someone to make the effort to establish a logical version of it, more semantically grounded primary keys would come to light.  The obvious flaw in that line of reasoning is that without establishing the logical model first, who can say whether any of the tables defined within the physical-only model are at a proper and defensible level of detail?  Without proper logical analysis of the functional dependencies across the data, some tables may actually be, semantically speaking, odd collections and very unnatural creatures.  Ubiquitous generated keys ultimately mean that the semantics of the data objects involved are not expressed within the data model at all.  Every object is a fiat object; entries exist simply because some code inserted them, for good or for bad, correctly or incorrectly.  No truth is being expressed at all.
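The "fiat object" problem can be made concrete with a small, hypothetical sketch (the table and helper below are illustrative assumptions, not from the text): when the only uniqueness the model enforces is on the generated id, the same business fact can be inserted twice and the model cannot object.

```python
def values_unique(rows, attrs):
    """Return True if no two rows share the same values for `attrs`."""
    projected = [tuple(r[a] for a in attrs) for r in rows]
    return len(projected) == len(set(projected))

# Hypothetical customer table keyed only by a generated id.
customers = [
    {"id": 1, "email": "pat@example.com", "name": "Pat"},
    {"id": 2, "email": "pat@example.com", "name": "Pat"},  # same person, re-inserted
]

print(values_unique(customers, ["id"]))     # True: the surrogate key is "unique"
print(values_unique(customers, ["email"]))  # False: the duplicate the model never expressed
```

The surrogate key reports everything as unique; only the natural key, which the model omitted, reveals the duplicate.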

There often seems to be an unnatural fear of exploring truly logical data models.  Too much fuss and bluster spews over implementation issues that are mislabeled as relational theory issues, and this confusion drives designers into physical choices and decisions long before the physical model should be considered.  Logical models are based on functional dependencies between the data elements, not on the number of reads.  Once the discussion turns to speed of response, or the number of reads, or the count of records retrieved in a single read, or the amount of storage used, and the like, the issues under discussion relate to implementation and NOT to theoretical issues at all.  Generated keys can be a useful tool in creating logical database designs, but they are easily overused.  The designer must consider whether the business truly wishes to drive a function that creates new key values and then uses and exposes those values to users.  It is the business-related aspects of a generated key's usefulness that establish the key as an item within the logical realm.
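A functional dependency, the raw material of a logical model, is itself a simple testable claim: X -> Y holds when every distinct value of X maps to exactly one value of Y. The sketch below (hypothetical order data; names are illustrative) checks a proposed dependency against sample rows.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`:
    every distinct lhs value maps to exactly one rhs value."""
    mapping = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if mapping.setdefault(x, y) != y:
            return False  # same determinant, different dependent value
    return True

# Hypothetical order data for illustration.
orders = [
    {"order_no": 1, "customer": "Acme", "region": "East"},
    {"order_no": 2, "customer": "Acme", "region": "East"},
    {"order_no": 3, "customer": "Bolt", "region": "West"},
]

print(holds_fd(orders, ["customer"], ["region"]))   # True in this sample
print(holds_fd(orders, ["region"], ["order_no"]))   # False: East maps to two orders
```

Note that nothing in this analysis mentions reads, storage, or response time; those concerns belong to the physical model, exactly as the paragraph above argues.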