Before We Get to Normal - the Key Steps in the Data Modeling Process

As one works through the normal forms, be it a journey to the placid shores of third normal, or the more arid climes of Boyce-Codd normal form, or even the cloudy peaks of fourth normal and beyond—and before one starts thinking about normalizing the design—the database designer has covered a lot of ground work already. Before thinking of normalizing, one needs to have conceptualized the relations that might be within the solution’s scope.

One should be able to name these relations, whether they be CUSTOMER, ORDER, or PRODUCT, for example. As a general rule of thumb, most modelers try to keep the names of these relations singular rather than plural, i.e., CUSTOMER instead of CUSTOMERS. If the relation can be named, it should also be defined, with some explanation of why it exists and the kinds of data to be contained within it.

Once the idea of some relations exists, then one can start thinking about what attributes might go within them. The attributes need names, too. The names should be clear; they should follow within the organization’s naming standards. (Hopefully, this means each name ends with a class-word or other domain designation.) And, if one can name the attribute, then one should be able to define the attribute. By definition, this does not mean taking the abbreviated name and simply providing a fully fleshed-out English version of the attribute’s name, such as defining “CustCd” as “Customer Code.”

In point of fact, at this level of the design activities, it would be more appropriate to name the attribute “Customer Code” with an understanding that by using the organization’s standard abbreviations, any implemented column would be titled “CustCd” or “Cust_Cd,” for example. A definition is not a trivial rehashing of the name of an attribute; rather, it is a small excursion into why the column exists. There are plenty of websites offering advice on good versus bad definitions, so that discussion does not need to be surveyed here.

Next, each of these relations needs to have an identified primary key. These days, it seems that far too often a generated key is mindlessly plugged in solely to provide functionality as a primary key, but that approach actually is cheating. The designer should look for the natural candidate keys on each relation inside the nascent model. Composite keys can be fun.

If it happens that a relation has multiple candidate keys, choosing which one to use as the primary key can be entertaining. On the other hand, if no candidate key arises, then one needs to dig deeper for a better understanding of what is happening. Every relation should have a primary key, something that allows each row to be uniquely identified; it goes to the heart of what an entity means. If a relation does not have anything that is useful as a primary key, then how can that relation have a meaning? And, if the relation has no meaning, it should likely not be in your data model.

Before working through the process of normalization, we seem to have quite a list. An organization has a start on naming standards for entities and attributes, lists of standard abbreviations, lists of agreed upon class-words and domains. For the solution in focus, a listing is created of possible relations and their meaning, each relation’s attributes, and those attributes’ meaning. Each relation has an identified primary key.

Upon reaching this point, one may finally start normalizing those structures into a solution database design. Certainly, the intention is not to say that all relations and all attributes need to be defined first, but enough of them should be identified and defined to aid in the start of fleshing things out. Data modeling is a process, new attributes and relations will arise, and each element should be stepped through the process as one completes a design.