Meaningless Tables Mean the Data Is Not Understood

Without a defined primary key, a table has no semantic under­pinning; it is meaningless. Most things are unique in that they can be identified. Transactions are unique because even when every­thing else might duplicate something else, there is a unique timestamp of when the transaction event occurred. A sensor reading has a place and a time. True non-uniqueness is rare, often occurring only because one is receiving partial information.

Sometimes, a data structure may have a unique key, but the receiver lacks the knowledge to identify that uniqueness. Data profiling can help analyze the structure’s contents to determine uniqueness. With­out a primary key, a table must be used cautiously. One cannot update individual rows; therefore, changes to such tables’ contents must be less discrete—truncate-and-reload or append. Joins into other tables can create Cartesian products, where a many-ness of the values on one side or the other of the joined tables can multiply results unexpectedly. Such non-keyed tables are best used in complete isolation. Even so, there are occasions when database designers are lobbied to not have a primary key for a table. Many DBMSs support the creation of tables with one or more completely duplicated rows. There are organizations that have many table structures without pri­mary keys. And, sadly, it is not because the structures lack uniqueness of data content; it is because the owners of the data simply do not see a need for even minimal governance. The old saying goes that “knowl­edge is power”; such environments that lack actual governance prefer to either squander such power or want that power to remain in the head of the one engineer who knows.

Generated Keys

Generated keys added to a structure provide the ability to audit and track a row once loaded. These keys are not exactly surrogate keys, as none are stepping in for another value. Data may be “anon­ymous” in the wild, but once data is ingested into an analytics or reporting environment such as a data lake, data warehouse, or something else, data is now moving into a controlled environment where things should be known and labeled. It would not be unreasonable to demand that generated keys be established simply for auditability and no other reason.

Not doing at least this minimal, forced, estab­lishment of a primary key for the data, albeit generated, is in a way saying the data really has no value. And if the data has no value, why is it being ingested in the first place? If one must allow for a table without a primary key, a stag­ing or landing area should be the only place such meaningless tables are allowed. And even in those staging or landing areas, the lack of any primary key only shows that, for these data structures, the data content is still in a process of being learned and is not yet well understood; this is an attitude that fits in with most current big data environments, where the goal is to load and/or ingest data even when it is not understood at all.

Meaningless Tables

Don’t allow meaningless tables to flow into other environments beyond the staging level. The meaning and interrelationships of and between data are important. If you are the designer of a database and being lobbied to allow the creation of a table without a primary key, make sure you understand how the table is to be used, and that peo­ple will not be writing queries against the structure that will poten­tially be multiplying data unexpectedly. Make sure auditing needs are well understood and generated keys are established when necessary. Make sure that the reason such meaningless tables were created is documented so that, as circumstances change, alterations can be made appropriately.