The DBMS as File Server

Bookmark and Share

Tables within a database management system (DBMS) need primary keys and defined indexes in order for the DBMS to have the opportunity to provide good query performance.  Without indexing, the worst possible query performance is guaranteed.  The content of these non-keyed tables remains unknown to the DBMS, a black box, where the only possible approach for query execution is to read every row one-by-one.  Under such scenarios, the DBMS is little more than a file server that operates more slowly than usual.  Without the primary key and indexing, the DBMS may expend even more processor cycles in moving data in and out of the DBMS proprietary storage areas than a file server uses in opening files.  Extremely small tables that can fit entirely within a single page read will not be a performance issue, but larger tables may see slowed responses to queries.  The worst offenders in creating such wide-open table structures are in support of ETL staging areas, because batch ETL processing is often designed such that the DBMS is the functional equivalent of a file server.

These types of practices are not a database anathema to be shunned; valid reasons justify this DBMS-as-file-server approach.  However, the lack of primary keys and other indexes is a sub-optimal practice to be avoided.  Since the DBMS is intended to support queries against the data, it should be no surprise that the DBMS is not optimized for use as a file server.  Regular proclamations are made repeatedly suggesting that adding primary keys and indexes slows down performance.  These anti-primary-key arguments are often misstatements.  While there may be some performance impact, well-planned keys and indexes should not significantly impact performance.  When primary keys are not defined, some DBMS create them "under-the-covers" using internal addresses rather than anything that may help downstream queries.  When this internal-primary-key-generation happens, the lack of an explicitly defined primary key is actually generating more work, and hurting rather than helping.

The DBMS must understand two things about each table - what makes a row unique, and how the rows within the table are physically ordered.  Armed with these two pieces of information, the DBMS starts making intelligent decisions that help optimize the queries executing against each table.  These two concepts, primary key and physical clustering, may be the same thing.  Under that scenario, a table's rows are sorted by the primary key values.  However, there is no requirement that these two things be the same.  The distinct nature of these two concepts often confuses people, who wrongly believe they must be one and the same thing.  This confusion is often the root cause of bad design choices. 

A primary key is the minimal number of columns that can uniquely identify a row.  Particularly with staging tables, the incoming data may have quality issues, and it is necessary for all records to be loaded, regardless of any bad data.  Therefore, it is common for staging tables to have a generated value for use as the primary key; this key could be a sequential or a random value.  Clustering is not related to uniqueness at all.  Clustering is meant to help facilitate physical data retrieval.  Clustering is best implemented when defined so that it assists the most frequently used access approach of queries to extract data.  For a staging table, this clustering could likely be the date or timestamp of a row's insertion, or possibly a processing batch identifier.  Both generated primary keys and load timestamps (meaning date and time values) are helpful for auditing.  Key values can flow downstream to assist in tracking records across their lifecycle.  Creating and using these kinds of columns fits naturally into a staging table function.  And by having defined primary keys and explicit clustering rules, one establishes an environment of well-formed table structures.