Meaningless Tables Mean the Data Is Not Understood

Aug 2, 2021

By Todd Schraml

Without a defined primary key, a table has no semantic underpinning; it is meaningless. Most things are unique in that they can be identified. Transactions are unique because even when everything else might duplicate something else, there is a unique timestamp of when the transaction event occurred. A sensor reading has a place and a time. True non-uniqueness is rare, often occurring only because one is receiving partial information.

Sometimes, a data structure may have a unique key, but the receiver lacks the knowledge to identify that uniqueness. Data profiling can help analyze the structure’s contents to determine uniqueness. Without a primary key, a table must be used cautiously. One cannot update individual rows; therefore, changes to such tables’ contents must be less discrete—truncate-and-reload or append. Joins into other tables can create Cartesian products, where a many-ness of the values on one side or the other of the joined tables can multiply results unexpectedly. Such non-keyed tables are best used in complete isolation. Even so, there are occasions when database designers are lobbied to not have a primary key for a table. Many DBMSs support the creation of tables with one or more completely duplicated rows. There are organizations that have many table structures without primary keys. And, sadly, it is not because the structures lack uniqueness of data content; it is because the owners of the data simply do not see a need for even minimal governance. The old saying goes that “knowledge is power”; such environments that lack actual governance prefer to either squander such power or want that power to remain in the head of the one engineer who knows.

Generated Keys

Generated keys added to a structure provide the ability to audit and track a row once loaded. These keys are not exactly surrogate keys, as none are stepping in for another value. Data may be “anonymous” in the wild, but once data is ingested into an analytics or reporting environment such as a data lake, data warehouse, or something else, data is now moving into a controlled environment where things should be known and labeled. It would not be unreasonable to demand that generated keys be established simply for auditability and no other reason.

Not doing at least this minimal, forced, establishment of a primary key for the data, albeit generated, is in a way saying the data really has no value. And if the data has no value, why is it being ingested in the first place? If one must allow for a table without a primary key, a staging or landing area should be the only place such meaningless tables are allowed. And even in those staging or landing areas, the lack of any primary key only shows that, for these data structures, the data content is still in a process of being learned and is not yet well understood; this is an attitude that fits in with most current big data environments, where the goal is to load and/or ingest data even when it is not understood at all.

Meaningless Tables

Don’t allow meaningless tables to flow into other environments beyond the staging level. The meaning and interrelationships of and between data are important. If you are the designer of a database and being lobbied to allow the creation of a table without a primary key, make sure you understand how the table is to be used, and that people will not be writing queries against the structure that will potentially be multiplying data unexpectedly. Make sure auditing needs are well understood and generated keys are established when necessary. Make sure that the reason such meaningless tables were created is documented so that, as circumstances change, alterations can be made appropriately.

Newsletters

Meaningless Tables Mean the Data Is Not Understood

Generated Keys

Meaningless Tables

White Papers

Sponsors