Data Quality Issues Leave Everyone Holding the Bag

Quality can be a hard thing to define. What is good and what is bad may not be easily identified and quantified.  When a data mart accurately reflects data exactly as found in the source, should that be considered a quality result? If the source data is bad, is the data mart of high quality or not? If the data mart differs from the source, when is the difference an improvement of quality and when is said difference evidence of diminished quality? While it may seem self-evident that correcting the source of load data would be the "right" thing to do, in practice that direction is not necessarily self-evident. The reasons supporting this nonintuitive approach are varied. Sometimes changes to the source impact other processes that must not change, or the changes will expose problems that may provoke undesired political fallout, or it may simply be that making the proper adjustments to the source application would prove too costly to the organization. For all these reasons and more, in the world of business intelligence, the dependent data often is expected to be of higher quality than the source data. In order for that improvement to occur, data placed within the dependent mart or data warehouse must be altered from the source. Sometimes these alterations become codified within the process migrating data from the source. Other times changes are made via one-time ad hoc updates. Either way, this alteration leads to a situation in which the dependent data will no longer equate one-for-one to the source data. Superficial comparisons of this altered content will highlight the disparity that what exists for analytics is not the same as what exists for the operational system.

In cases where data is consciously altered, the organization must have an official "memory" of these changes. This official source could be a metadata repository, if the organization has one, or it could be something as uncomplicated as a spreadsheet logging each change made. Without such knowledge retention, problems eventually will develop. New users or system auditors may step in, look at numbers, find differences, and start demanding answers. Alternatively, a more drastic circumstance might arise in which the data warehouse is reworked and all data must be re-initialized from the source. Certainly those data cleansing changes that were placed inside ETL (extract, transform, and load) code are retained, after a fashion, in the code itself. However, that code is unlikely something that an auditor might easily find and understand. One-time updates would conceivably be lost forever if not properly documented. Metadata that details what, how, and why data was changed becomes a critical component to the business.

Ultimately, quality issues are so vital that no one escapes responsibility. From the business operational users to the analytics consumers, straight though to IT personnel handling the operational solutions, and any data warehousing work-everyone must keep a watch on data quality. All involved must help track the decisions made, especially when they impact what will be visible to users. And most importantly, everyone must be wary of those choices that will result in visible discrepancies that will appear as comparisons are made between systems. These intersystem discrepancies are the ghosts that will pop up from time to time; and if the data divergence is caused by a conscious choice to "improve" quality, then productivity may be needlessly lost as people investigate the same problem over and over again. As a proper defense against wasting time investigating what should be known issues, everyone involved should be conscious of the importance of tracking and knowing where to find this kind of system information.