Data modeling has always been a task that seems positioned in the middle of white-water rapids with a paddle but no canoe. On one side of the data modeling rapids are the raging agilists who demand working software and decry virtually all documentation. To this group of agilists, data modeling is often seen as too simple to matter. But at the same time, their implementations will lack standardization in naming or data model patterns, and results may be so far off course that major rework is unavoidable. Sadly, far too many agile practices have been set up to place things under the technical debt umbrella, when in reality those practices never allow the refactoring closet door to be opened. Poor data models are "overcome" by creating ever more complex logic around the data in order to get to a more proper result, as developers learn (maybe) what really needs to be accomplished along the way. The results may work but can be a nightmare to maintain.
Posted July 07, 2022
Data modeling is the process of defining datapoints and structures at a detailed or abstract level to communicate information about the data shape, content, and relationships to target audiences. Data models can be focused on a very specific universe of discourse or an entire enterprise's informational concerns. The final product of a data modeling exercise ranges from a list of critical subject areas, to an entity-relationship diagram (ERD) with or without details about attributes, to a data definition language (DDL) script containing all the SQL commands to build a set of physical structures within some chosen database management system (DBMS).
Posted June 02, 2022
The value of normalization is in understanding the data well enough to create the normalized design. Pulling out the business rules, business terms, and relationships from the mass of jumbled-together raw content is critical. The business rules that result from performing the normalization exercise establish the requirements that need to be satisfied by solutions, whether those solutions are built or purchased. When an organization creates and maintains a normalized design for the data within the important areas of its business, it reduces work on all future systems.
Posted May 04, 2022
Data architects live in a world caged by bars of process, standards, and documented procedures—things many would consider a high ceremony lifestyle. As an industry, information technology has been migrating more and more into agile frameworks for some time now. High ceremony is often seen as an earmark of "waterfall" approaches, which constitute the evil empire that agile frameworks are fighting to replace. The result of this opposition is that formal data architecture groups often do not fold easily into agile approaches.
Posted April 07, 2022
Often one reads a book or hears a presenter making a pun about relational theory being called "relational" because of entities being "related." Such references are nothing but misplaced puns. Relational theory derives the "relational" in its name from the idea that "relation" is a mathematical term synonymous with "set," and each entity represents a set of some sort. However, relationships between entities are still a very important concept, albeit not an eponymous one.
Posted March 11, 2022
Folks relate to physical tables; even the most non-relational-minded person can picture a fixed structure file and equate that to a table and its columns. The spreadsheet image is ubiquitous. DBMS-defined views are logically similar to tables and in usage are certainly interchangeable with tables.
Posted February 08, 2022
Dealing with data warehouses, data marts, and even data lakes, can be awkward in an agile environment. While adding a single metric onto a dashboard can be very natural, no one builds a dimension table a few columns at a time. This awkwardness has caused many variants in how an agile methodology might be applied to one's analytics databases.
Posted January 03, 2022
Using surrogate keys within a database is often considered a technique to improve performance. The assumption is that using anything other than a numeric data value to join tables provides "bad" performance. Therefore, whatever the natural key may be—one column, multiple columns, alphanumeric, etc.—the surrogate key can be a 100% numeric single value, standing in for that natural key value set. Some DBMSs have key generators that are numeric, others may be more wide-ranging in values. Some organizations may choose to use surrogate keys generated from hashed natural key values. Will surrogate keys improve everyone's query performance? As with the stock market, specific circumstances differ everywhere, so the individual results may vary.
Posted December 08, 2021
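As a minimal sketch of the "surrogate keys generated from hashed natural key values" idea in the entry above: the function name `hashed_surrogate_key`, the choice of SHA-256, and the delimiter are all assumptions made for illustration, not something the column prescribes.

```python
import hashlib

# Hypothetical delimiter; it keeps ("ab", "c") distinct from ("a", "bc").
DELIMITER = "\x1f"


def hashed_surrogate_key(*natural_key_parts: str) -> str:
    """Derive a fixed-width surrogate key from a (possibly multi-column) natural key."""
    joined = DELIMITER.join(natural_key_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# The same natural key always maps to the same surrogate value.
key_a = hashed_surrogate_key("US", "ACME-001")
key_b = hashed_surrogate_key("US", "ACME-001")
key_c = hashed_surrogate_key("USA", "CME-001")
print(key_a == key_b, key_a == key_c)
```

A hashed key like this is a single value of fixed width, but it is wider than a sequence-generated integer, so whether it helps or hurts join performance is exactly the kind of "results may vary" question the entry raises.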
When normalizing data structures, attributes congregate around the business keys that identify the grain at which those attributes derive their values. Attributes directly related to a person, deriving their value from that individual, will appear on an Employee entity, or a Customer entity, and so forth. However, in the process of normalizing, the data modeler must identify the objects that are meaningful and useful to the organization.
Posted November 01, 2021
Marvel should have an evil villain named "Null." Nulls have always been trouble in the relational world. Certainly, nulls are used all over the place by virtually everyone. Still, that does not mean that nulls are harmless.
Posted October 05, 2021
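One small illustration of why nulls are not harmless, sketched with Python's built-in `sqlite3` module. The `reading` table and its contents are hypothetical; the point is only that a row holding a null satisfies neither a comparison nor its opposite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reading (sensor_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO reading VALUES (?, ?)",
    [(1, "ok"), (2, "failed"), (3, None)],  # sensor 3 has a null status
)


def count_where(predicate: str) -> int:
    """Count rows in the hypothetical reading table that satisfy a predicate."""
    return conn.execute(
        f"SELECT COUNT(*) FROM reading WHERE {predicate}"
    ).fetchone()[0]


total = count_where("1 = 1")
ok = count_where("status = 'ok'")
not_ok = count_where("status <> 'ok'")

# The null row is neither 'ok' nor not 'ok', so the two counts do not add up to the total.
print(total, ok, not_ok)
```

Anyone who assumes "ok" plus "not ok" equals "all rows" quietly loses the null row from the analysis.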
It is entertaining to be in a room filled with people all claiming that they love data. What becomes more entertaining is discovering how each one views data uniquely and enjoys doing different things with or to that data. The data community has become one that is far more diverse than many realize.
Posted September 16, 2021
The meaning and interrelationships of and between data are important. If you are the designer of a database and are being lobbied to allow the creation of a table without a primary key, make sure you understand how the table is to be used, and that people will not be writing queries against the structure that will potentially be multiplying data unexpectedly.
Posted August 02, 2021
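The "multiplying data unexpectedly" risk from the entry above can be sketched with `sqlite3`. The `sale` and `customer_region` tables are hypothetical; the lookup table has no primary key, so a duplicate row slips in and inflates a joined total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    -- Hypothetical lookup table created WITHOUT a primary key.
    CREATE TABLE customer_region (customer_id INTEGER, region TEXT);

    INSERT INTO sale VALUES (1, 10, 100), (2, 10, 50), (3, 20, 25);
    -- Customer 10 was loaded twice; nothing in the table prevents it.
    INSERT INTO customer_region VALUES (10, 'East'), (10, 'East'), (20, 'West');
    """
)

true_total = conn.execute("SELECT SUM(amount) FROM sale").fetchone()[0]
joined_total = conn.execute(
    """
    SELECT SUM(s.amount)
    FROM sale s
    JOIN customer_region r ON r.customer_id = s.customer_id
    """
).fetchone()[0]

# Every sale for customer 10 is counted once per duplicate lookup row.
print(true_total, joined_total)
```

Declaring `customer_id` as the primary key of the lookup table would have rejected the second row at load time rather than letting it surface later as an overstated total.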
Within IT, testing has matured as an industry. Many tools exist, and many IT shops have testing groups. But, often those testing groups are unable to assist on data-related projects. The heart of the problem is that the focus of the testing practice has been perverted. The testing industry is concerned primarily with validating code, specifically the kinds of code that interact with people. With data-related projects, the need is to test the accuracy of the data at rest within the structures.
Posted July 15, 2021
Inside a relational database management system, the principal persisted data structure is considered a logical relation. Operations performed against that data within the RDBMS result in a logical relation too. In other words, everything is a table. To step away a little from the term "table," let's use the word "grid."
Posted June 10, 2021
Anything worth doing is worth doing again and again. Right? When building out and standardizing new subject areas or new sources for one's data warehouse, hub, or other analytics area, a task often initially overlooked is the logic bringing in the data. Obviously, everybody knows that the new data must be processed. What many ignore is the idea that establishing the process to bring the data in often must be done twice, or more.
Posted April 29, 2021
All data architects should consider themselves change agents for the organizations in which they work. But at the same time, the business also wants to keep much the same. Such discoveries can be confusing when the business is saying that change is desired, but its actions seem focused on preventing change. It can bring to mind the Albert Einstein quote (or the Baba Ram Dass quote, based on which version of history one subscribes to) that "We cannot solve our problems with the same thinking we used when we created them."
Posted April 06, 2021
Dashboard users doing unsophisticated, largely repetitive, operational reports and analyses ask their IT support personnel to provide data in a simple way; they demand fast performance from their queries; and they demand new functionality be provided quickly. While often not stated explicitly, the "simple" data presentation implies several characteristics.
Posted March 01, 2021
The data lakehouse is a merging of the organization's data lake and data warehouse into one platform, eliminating data redundancy and loss of time moving data around from one place to another. Is this newest data savior really better than all the data saviors of the past?
Posted February 10, 2021
An architecture derives its strength from a level of consistency in how things are implemented. However, that is not to say that a mindless devotion to absolute consistency is a good thing. Times will arise when exceptions to almost any rule are necessary. The skill, the art, the balance in applying decisions that result in a good data architecture across an organization are based on a prudent use of when to conform and when an exception is needed. If there are too many exceptions, it can rightfully be declared by observers that there are no rules and that chaos reigns.
Posted January 07, 2021
A recent study indicated that IT professionals were three times more likely to disagree with their leadership than professionals in other industries. Identifying a three-times-more-likely difference of opinion seems a significant variance. I don't know what detailed insights the study arrived at to account for this level of disagreeableness. Maybe the study considered IT folks as being more educated, more logical, and therefore more difficult?
Posted December 10, 2020
Data modeling has an intimate relationship with abbreviations. Since the creation of the very first data model, there were circumstances where fully worded names for tables or columns were simply too long to implement within one tool or another. Occasionally one runs into an individual who cannot conceive that anyone on the planet would abbreviate something in a different fashion than they do; but more often data modelers tend to enjoy consistency, and when possible, employ rules to support consistent outcomes. Abbreviations are no exception.
Posted November 04, 2020
Plenty of analytics environments have landing areas, plenty have staging areas, and some have both. So, are landing and staging just synonyms for the same thing? A survey of usage would show there is much overlap in implementations, and even some confusion.
Posted October 08, 2020
Why set a trap to fail at some point in the future? When you arrive at a design wherein an entity needs to have a relationship to almost every other entity, stop and think about what is happening, and review your reasoning before proceeding.
Posted September 09, 2020
IT management can often succumb to repeated patterns of behavior. One of those repeated patterns is accusing developers of being "perfectionists." The developer or developers in question will be told that they should not strive for perfection, because perfection is not needed for the circumstance, and, more importantly, attempts at perfection take too long. Is the developer a closet perfectionist? Maybe, maybe not.
Posted August 11, 2020
An enterprise conceptual data model is often seen as a high mountain to be climbed, a journey that will last a lifetime. People have visions of 10 feet or more of wall in the corporate offices wallpapered with an entity relationship diagram (ERD) that has zillions of teeny, tiny boxes and more relationship lines than the combined lines of queuing patrons in all Disney Resorts, when full. In this context, an enterprise conceptual data model is a daunting task not to be taken lightly. But in today's world, that enterprise conceptual data model can simply be a list of subject areas.
Posted July 01, 2020
As the number of types of slowly changing dimensions (SCDs) has increased, things have ended with Types 0 through 7, making essentially eight of them. But it is unclear whether full consensus exists among current practitioners on what actually differentiates each of these eight SCD types. Some confusion may result from the fact that when the first three SCD types were defined, each could be equated to a result for a dimensional attribute: Type 1 had facts associated with dimension values as they are currently, or always current; Type 2 had facts associated with dimension values as were current when the facts were processed; and Type 3 had facts associated with both current values and values current when processed.
Posted June 10, 2020
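The three results described in the SCD entry above can be sketched in plain Python. This follows the entry's own wording for Types 1 through 3; the `RegionVersion` class, the sample history, and the function names are hypothetical, and the lookup assumes the fact was processed on or after the first recorded version.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RegionVersion:
    """One historical value of a hypothetical customer 'region' attribute."""
    region: str
    valid_from: date


# Assumed history for one customer, oldest first.
HISTORY = [
    RegionVersion("East", date(2019, 1, 1)),
    RegionVersion("West", date(2020, 3, 1)),
]


def current_value(history):
    """Type 1 result: the fact sees the dimension value as it is now."""
    return history[-1].region


def value_when_processed(history, processed_on):
    """Type 2 result: the fact sees the value in effect when it was processed."""
    in_effect = [v for v in history if v.valid_from <= processed_on]
    return in_effect[-1].region


def both_values(history, processed_on):
    """Type 3 result: the fact sees both the as-processed and the current value."""
    return value_when_processed(history, processed_on), current_value(history)


fact_processed_on = date(2019, 6, 15)
print(current_value(HISTORY))
print(value_when_processed(HISTORY, fact_processed_on))
print(both_values(HISTORY, fact_processed_on))
```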
One creates the potential for some interesting anomalies when building a star schema wherein the fact table contains future-dated metrics and any of the dimensions are Type 2. A Type 2 dimension tracks changes to the data items contained within it. Effectively, each dimension contains a surrogate key, a natural key with a start and stop date, and additional descriptor columns. If any of the descriptor column values change, the existing dimension row has the stop date populated while a new row is inserted with the same natural key, new start date, and new descriptor values.
Posted May 13, 2020
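The Type 2 mechanics described in the entry above (populate the stop date on the existing row, insert a new row with the same natural key) might look roughly like this. The `DimensionRow` class and `apply_change` function are hypothetical, and using the change date as both the old row's stop date and the new row's start date is an assumed convention.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DimensionRow:
    surrogate_key: int
    natural_key: str
    start_date: date
    stop_date: Optional[date]          # None marks the current row
    descriptors: dict = field(default_factory=dict)


def apply_change(rows, natural_key, new_descriptors, change_date):
    """Close the current row for natural_key and append a new version.

    Returns the new row, or None when the descriptors did not change.
    """
    current = next(
        (r for r in rows if r.natural_key == natural_key and r.stop_date is None),
        None,
    )
    if current is not None:
        if current.descriptors == new_descriptors:
            return None                  # nothing changed, keep the row open
        current.stop_date = change_date  # populate the stop date
    new_row = DimensionRow(
        surrogate_key=max((r.surrogate_key for r in rows), default=0) + 1,
        natural_key=natural_key,
        start_date=change_date,
        stop_date=None,
        descriptors=dict(new_descriptors),
    )
    rows.append(new_row)
    return new_row


dimension = []
apply_change(dimension, "CUST-1", {"segment": "Retail"}, date(2020, 1, 1))
apply_change(dimension, "CUST-1", {"segment": "Wholesale"}, date(2020, 5, 1))
for row in dimension:
    print(row.surrogate_key, row.natural_key, row.start_date, row.stop_date, row.descriptors)
```

Because each version gets its own surrogate key, a fact row points at one specific version of the dimension member, which is where future-dated facts can become interesting.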
As we have moved forward with APIs and microservices, every organization has even more data stores to manage and more sources of data to consider. Sorting through data structures for operational solutions can become mind-numbing due to the variety, or even frustrating due to a lack of detail from many vendors. Source systems are no longer the monoliths they once were.
Posted April 08, 2020
From CEO to presidents, VPs, directors, to any number of mid-to-low level managers, the concept of hierarchies is pervasive in organizations. But, if one is dealing with a relational DBMS, a hierarchy remains an awkward concept.
Posted March 05, 2020
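One common way to hold a hierarchy in a relational DBMS is an adjacency list walked with a recursive query; the sketch below uses `sqlite3` to show both the pattern and some of its awkwardness. The `org_position` table, its rows, and the `chain_below` function are hypothetical, and this is only one of several possible hierarchy patterns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Adjacency list: each row points at its manager in the same table.
    CREATE TABLE org_position (
        position_id INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        reports_to  INTEGER REFERENCES org_position (position_id)
    );
    INSERT INTO org_position VALUES
        (1, 'CEO', NULL),
        (2, 'President', 1),
        (3, 'VP', 2),
        (4, 'Director', 3),
        (5, 'Manager', 4);
    """
)


def chain_below(position_id: int):
    """Return (title, depth) for every position under the given one."""
    return conn.execute(
        """
        WITH RECURSIVE subordinates (position_id, title, depth) AS (
            SELECT position_id, title, 1
            FROM org_position
            WHERE reports_to = ?
            UNION ALL
            SELECT p.position_id, p.title, s.depth + 1
            FROM org_position p
            JOIN subordinates s ON p.reports_to = s.position_id
        )
        SELECT title, depth FROM subordinates ORDER BY depth
        """,
        (position_id,),
    ).fetchall()


print(chain_below(2))
```

A single-level question ("who reports to the President?") is an ordinary filter, but anything spanning an unknown number of levels needs the recursive form.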
A good data modeler must know when there is something the users need to analyze, even when what that something is may not be obvious. Subtle differences abound between things in a state, versus actions taken to create or change states, versus the duration of a business object within or across states, versus business objects and the workflows containing them. Each of these subtleties drives differing metrics with distinct uses.
Posted February 10, 2020
When applied implementation efforts are not efficient, more often than not, the inefficiencies are due to the interference of an imp known as "churn," i.e., implementation wheels spinning away and not actually making progress. Churn is bad. Churn is one of the most destructive circumstances for any IT project. Churn may raise its ugly head at any point where a project requirement or need is left unclear.
Posted January 02, 2020
What exactly is a data architecture? As the Zachman Framework exposed long ago, different people look for different kinds of details and documentation to answer fundamental questions about an enterprise's architecture. Someone involved with infrastructure will need to understand the tools used and the methods employed to move data and to be clear on concepts about how security will be enforced. But these aspects are only initial parts of the overall architecture, and as such, a simple diagram of tools used is incomplete and insufficient for a comprehensive view of data architecture.
Posted December 01, 2019
Every organization needs a data warehouse. A data warehouse has never been a one-size-fits-all kind of solution. Variations exist and should be accepted.
Posted October 31, 2019
Certain kinds of issues in data modeling need to be addressed in specific ways. Many options may exist, but it is very rare for all the possible choices to be equally appropriate. The reasons for using a less-than-satisfactory path may be many. They may include a misguided concern for speed-to-completion or be a matter of the areas of control for those "in the loop" on the existence of the issue. Largely, this involves communication or, more precisely, the lack of communication.
Posted October 01, 2019
In order to protect your organization, it is critical to watch over the elements that have been built, keep processes running, and be on top of change. Spend the time and resources necessary to properly maintain the solutions for which you are responsible. The amount spent in such endeavors will be less time than that spent trying to play catch up on too many changes after bad things have resulted.
Posted September 03, 2019
An effective approach to processing and transforming large datasets is likely comprised of multiple steps. The large data will likely be split apart into several smaller sets, maybe even in a couple of differing fashions with a common and understandable theme. But there should not be too many split-apart variants; rather, as with the three bears, it should be just the right number of smaller datasets. And then, similar to solving a Rubik's Cube, a twist or two at the very end brings all the new and old datapoints together in a complete and organized fashion.
Posted August 07, 2019
Data virtualization makes it possible to have one or more data stores that would otherwise break the bank processing-wise, because they can physically exist once but logically exist in multiple transformed structures. Occasionally, IT managers get the idea that data virtualization is a more generic answer, presuming that if it works for the big data, it can work for all data.
Posted July 18, 2019
At times, there is a need to have security within the database be a bit more sophisticated than what is available. On specific tables, there may be a need to limit specific users' access to a subset of rows or a subset of columns. Yes indeed, views have always existed, and yes indeed, views can be established limiting the rows or columns displayed. However, views can only go so far.
Posted June 10, 2019
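For reference, this is what a view limiting both rows and columns looks like, sketched with `sqlite3`. The `employee` table and `sales_directory` view are hypothetical. SQLite has no GRANT statement, so the sketch shows only the shape of the view; in a DBMS with privileges, access would be granted on the view rather than on the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT,
        department  TEXT,
        salary      INTEGER
    );
    INSERT INTO employee VALUES
        (1, 'Ana', 'Sales', 70000),
        (2, 'Ben', 'Sales', 65000),
        (3, 'Cy',  'Legal', 90000);

    -- Row subset (Sales only) and column subset (no salary).
    CREATE VIEW sales_directory AS
        SELECT employee_id, name, department
        FROM employee
        WHERE department = 'Sales';
    """
)

cursor = conn.execute("SELECT * FROM sales_directory ORDER BY employee_id")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Each distinct combination of visible rows and columns needs its own view definition, which hints at why this approach can become hard to manage as the access rules multiply.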
CDC can greatly minimize the amount of data processed; but the cost is that the processes themselves become more complicated and overall storage may be higher. Costs are moved around, the final level of processing becomes focused on the minimal changes, and this minimization is the efficiency to be gained. Moving forward, using the data becomes standardized and ultimately straightforward.
Posted May 01, 2019
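The trade-off in the CDC entry above can be sketched with a snapshot comparison, which is only one of several ways change data can be captured. The `capture_changes` function and the sample snapshots are hypothetical. Note that the previous snapshot has to be kept around to compute the delta, which is one way storage can grow.

```python
def capture_changes(previous: dict, current: dict) -> dict:
    """Compare two keyed snapshots and return only the rows that differ."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots keyed by a business key.
yesterday = {
    "A": ("Ann", "East"), "B": ("Bob", "West"), "C": ("Cal", "West"),
    "E": ("Eve", "East"), "F": ("Fay", "West"),
}
today = {
    "A": ("Ann", "East"), "B": ("Bob", "North"), "D": ("Dee", "East"),
    "E": ("Eve", "East"), "F": ("Fay", "West"),
}

changes = capture_changes(yesterday, today)
changed_row_count = sum(len(rows) for rows in changes.values())
print(changes)
print(changed_row_count, "changed rows out of", len(today), "current rows")
```

Downstream steps then handle only the changed rows instead of the full snapshot, at the price of the extra comparison logic.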
When working on a multidimensional design, every fact table within scope should be handled with care. In an ideal world, each low-level fact table represents the metrics related to a business event. The meaning of a fact table, ideally, should be evident based on the table name and the composition of the fact table's primary key. Deciding on a primary key for a fact table is an important choice.
Posted April 09, 2019
Clarity of vision is absolutely the most important part of database design. The data architect must understand the shape and patterns of the data being modeled. This lucidity arises when the designer understands the subject area, the goals of the target database, the nature of the data sources involved, and the internal lifecycle of the database objects in scope.
Posted March 04, 2019
In dimensional modeling, business events are typically designated as facts while descriptive information elements are dimensions. However, events (or information about them) occasionally serve as dimensions as well as facts. A good data architect must watch their p's and q's and be certain when it is appropriate for a fact to also serve as a dimension—or when the dual function is not appropriate.
Posted February 08, 2019
More harm than good has been done to software development by letting the planning dates drive the work instead of having the work drive the dates. This planning-date-driven approach causes more stress, more bad decisions, more rework, more failed projects than all other causes combined.
Posted January 02, 2019
Data mart builders must understand what they are working to accomplish. The DBMS is not going to magically guide them to a solution. The builder is responsible for knowing how dimensional techniques work, why they work, and what options may exist within the dimensional framework.
Posted December 04, 2018
While relational database management systems are still the workaday workhorse, we are now adding into the mix document, columnar, and graph datastores, and their variants. Each datastore has something at which it excels, and other things at which it may not. Similarly, the rules followed in composing data structures, based on the platforms selected, also vary greatly.
Posted November 01, 2018
There has always been a need to tightly control some data items, such as passwords and Social Security numbers. Today, with the rise of concerns over personally identifiable information (PII), the General Data Protection Regulation (GDPR), and other legal mandates, a much larger group of data elements must be controlled. These legal data governance issues may need to guide our hands as we establish database designs.
Posted October 10, 2018
In establishing a staging or landing area for a data lake, a data hub, or a quaint data warehouse environment, structures need to be established that will mimic source data in support of two very basic queries. The first is: "What does the current source dataset look like?" And the second: "What change activity has occurred against the source since the last time it was interrogated?"
Posted September 04, 2018
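The two basic queries named in the staging entry above can be sketched against an append-only landing table using `sqlite3`. The `stg_customer` table, its `load_seq` and `change_type` columns, and both function names are assumptions chosen for illustration; a real landing area may track changes quite differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical landing table: every captured source change is appended.
    CREATE TABLE stg_customer (
        load_seq    INTEGER PRIMARY KEY,   -- order in which rows were landed
        customer_id INTEGER NOT NULL,      -- source key
        name        TEXT,
        change_type TEXT NOT NULL          -- 'I', 'U', or 'D'
    );
    INSERT INTO stg_customer (load_seq, customer_id, name, change_type) VALUES
        (1, 100, 'Ann',  'I'),
        (2, 200, 'Bob',  'I'),
        (3, 100, 'Anne', 'U'),
        (4, 200, NULL,   'D'),
        (5, 300, 'Cal',  'I');
    """
)


def current_source_image():
    """Query 1: what does the current source dataset look like?"""
    return conn.execute(
        """
        SELECT s.customer_id, s.name
        FROM stg_customer s
        WHERE s.load_seq = (
                SELECT MAX(x.load_seq)
                FROM stg_customer x
                WHERE x.customer_id = s.customer_id
              )
          AND s.change_type <> 'D'
        ORDER BY s.customer_id
        """
    ).fetchall()


def changes_since(last_seen_seq: int):
    """Query 2: what change activity occurred since the last interrogation?"""
    return conn.execute(
        """
        SELECT load_seq, customer_id, name, change_type
        FROM stg_customer
        WHERE load_seq > ?
        ORDER BY load_seq
        """,
        (last_seen_seq,),
    ).fetchall()


print(current_source_image())
print(changes_since(2))
```

Keeping every change as its own row is what lets one structure answer both questions: the latest row per key gives the current image, and a sequence filter gives the activity since the last read.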
In the big data world of today, issues abound. People discuss structured data versus unstructured data; graph versus JSON versus columnar data stores; even batch processing versus streaming. Differences between each of these kinds of things are important. How they are used can help direct how best to store content for use. Therefore, deep understanding of usage is critical in determining the flavor of data persistence employed.
Posted August 08, 2018
When we hear the term "think outside the box," how often do we really examine what that phrase truly means? First, one needs a box. And it is on this issue where most folks fail. Before one can consider what is "outside the box," one must clearly understand what exactly is meant by "inside the box." People often consider random approaches the same as being "outside the box." However, just different is not enough.
Posted July 02, 2018
Under usual circumstances, the one-to-many or many-to-many relationship, alone, drives the pattern used within the database model. Certainly, the logical database model should represent the proper business semantics of the situation. But on the physical side, there may exist extenuating circumstances that would cause a data modeler to consider including an associative table construct for a one-to-many relationship.
Posted June 01, 2018
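An associative table construct for a one-to-many relationship, as mentioned in the entry above, could be sketched like this with `sqlite3`. The table names and the `assign` helper are hypothetical. The assumption here is that the key of the associative table is the "many" side alone, which is what keeps the relationship one-to-many rather than many-to-many.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (department_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee   (employee_id   INTEGER PRIMARY KEY, name TEXT);

    -- Associative table carrying a one-to-many relationship.
    -- Making employee_id the primary key means each employee can
    -- appear once, so an employee still belongs to at most one department.
    CREATE TABLE department_employee (
        employee_id   INTEGER PRIMARY KEY REFERENCES employee (employee_id),
        department_id INTEGER NOT NULL    REFERENCES department (department_id)
    );

    INSERT INTO department VALUES (1, 'Sales'), (2, 'Legal');
    INSERT INTO employee   VALUES (10, 'Ana'), (11, 'Ben');
    INSERT INTO department_employee VALUES (10, 1), (11, 1);
    """
)


def assign(employee_id: int, department_id: int) -> bool:
    """Try to add an assignment; return False if the key rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO department_employee VALUES (?, ?)",
                (employee_id, department_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


second_department_allowed = assign(10, 2)
print(second_department_allowed)
```

Widening the key to both columns would turn the same structure into a many-to-many relationship without touching the `department` or `employee` tables.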