Preserving Data Integrity

When designing a system an architect must conform to all three corners of the CIA (Confidentiality, Integrity and Accessibility) triangle.  System requirements for data confidentiality are driven not only by business rules but also by legal and compliance requirements.  As such, the data confidentiality (when required) must be preserved at any cost and irrespective of performance, availability or any other implications. Integrity and Accessibility, the other two sides of triangle, may have some flexibility in design.  

For example, must data be available 24/7 or can you allow some downtime to perform data validations and cleanup? Are you willing to potentially sacrifice data integrity to achieve maximum accessibility (and hence performance)?  Essentially, by answering those and similar questions you are making architectural decisions about which corner of the database triangle is going to be "more important" for your system. 

In practice, in order to achieve maximum accessibility and performance, systems are often designed to accept unverified data.  This leads to accumulation of bad data over time.  The definition of bad data can vary - from data that does not fit the existing business model to basic relational data integrity being broken.  If, for example, in a database only logical relational integrity constraints are defined (simply speaking, have foreign keys drawn on a database schema, but not define them on a physical level), data that violates relational integrity can and will be entered. 

Of course, data can be validated after it's been fully loaded, but since it's a time-consuming process, and no one wants to occupy resources and slow down the normal database operations, data validation is often not done at all.  As a result, systems accumulate data that violates relational and business integrity rules.  However, there are several system architectures in which the compromise of data integrity in favor of accessibility is minimal or none.

To look a bit deeper  into problems associated with the data integrity, and propose possible solutions, let's take an example of non-OLTP system (for example warehousing) which has 1) substantial batch data loads from different sources, and 2) all or some foreign keys have been dropped (for example, for performance reason).  It is self-evident that such a system is going to accumulate bad data over time.  Depending on the many real life constraints (budgetary, human resources, timing and others), one of the solutions can be preferred over the other.


Solution 1

The most common and widely implemented in the IT industry system design is to run validation process after the whole data set is loaded into the operational database. All validation rules are hard-coded in the validation code.  Bad data is identified and either flagged or isolated into a separate set of tables for future analysis.  Usually this method is implemented by hardcoding business rules into database code or other languages methods.  Rarely, almost never, this validation includes relational data integrity validation and as a result database still accumulates bad data.  As a result, orphan records are created and accumulated in the database.

Thus this process has number of weak points:

  • 1) In production implementation this method is not used to do the relational integrity validation, even when all foreign keys are dropped.
  • 2) The process is not flexible - every new rule requires coding and potentially substantial QA effort.
  • 3) The process might be resource intense, may affect the performance for normal processing.
  • 4) The operational data is not 24/7 accessible. The consistency and cleanness of the data cannot be guaranteed until the end of the validation process. Depending on the transaction definition during the validation process the data can be in worse shape then before the process started - some validation rules have been applied, while the others have not.


Let's consider an approach to enhance this widely used solution to address the issues above.


Solution 2


The first and easiest step is to extend the previous solution and address issues # 1 and #2.  Let's create a process that validates not only business data integrity but also relational integrity (issue #1).  There are several ways of doing that.  For example, all relational and business rules can be stored in a separate set of metadata tables. The validation code itself must be written in such a way that it's flexible enough to execute any new rules without additional coding and testing (note, that this is a very strict, but doable requirement). By making all constraints (business and relational) data driven, and thus issue #2 is resolved as well.  Same as in the previous case, run your process after the data has been loaded.  Since rules are data-driven and stored in the database as rows, there are no physical boundaries for creating more validation rules.  The validation code can be optimized by increasing the performance time for each rule.  However, even that optimization might not be sufficient to free up resources (issue #3).  It's being addressed that in the next solution.

Solution 3

Let's start with the case when we are only validating relational integrity of our data. 

In this solution we are proposing to have two separate databases (or schemas) - one for operational data DB1 and a secondary one DB2 with the set of integrity rules (foreign key constraints) stored in metadata format.  The same secondary database also will have mirror tables with only primary keys of operational tables in them.  Load operational data into the operational database DB1, and load all primary keys into the secondary database DB2 (step1). The validation process will run on DB2 using validation rules and keys that are stored in that database.  Since primary keys in most cases are numbers, this should be a very fast process. As a result of the validation process it fill a set of exception tables in a DB2 with list of primary keys that are breaking the integrity rules (step 2).  Since the validation process is running on secondary database DB2, the resources on the operational database DB1 are free, and database is available for normal processing.  After all bad data has been identified, have a clean up process to eliminate bad data from DB1 based on records in the exception tables in DB2 (step 3). This operation is not time-consuming at all (assuming the SQL is tuned correctly). 

Clearly, this solution can be extended to include not only relational integrity, but validation of business integrity as well.  In case of business integrity validation, the data set in DB2 will include not only primary key, but all additional data that is needed for validation. 

However, as in previous solutions 1 and 2 before the cleanup process has run, the dataset might have data that violates integrity constraint rules, and thus database cannot be considered 24/7.  This last issue (#4) is addressed in solution 4.

Solution 4

From all of the above we can derive our final solution. Let's create a system where the data will be distributed between three databases (DB1, DB2 and DB3) in the following manner:

  • DB1 is operational database.
  • DB2 is a validation database that only metadata - set of validation rules and exception tables.
  • DB3 is a staging database that has a copy of an operational database.

The DB3 contains the current operational set of data.  The data load process will load data into DB3. The validation process will run on the database DB3 and mark all the bad records.  Now the DB3 database contains only clean data and the operational processing should be switched to DB3 (similar as it's done with replication database).  For the next data load DB1 will be considered staging and DB3 is operational.  This solution is the most comprehensive one, and while it may look like an expensive overkill, for some systems it's the only way to go.

Based on the system requirements, an organization's need, budgets, resources and other considerations, any of the above four solutions can be implemented (the first solution is the one that is commonly used).  However, only the last solution allows an organization to create a 24/7 non-real-time system that fully validates data i