The importance of data to the modern world grows clearer every day. Organizations are creating, storing, gathering, and managing more data than ever before. If you are reading this article, chances are you will agree with this statement: “You are managing more data this year than you did last year … and your organization is planning to manage even more data next year.”
None of this should surprise you. We are said to be in the Information Age as we manage big data and analytics projects. And all of this is moving toward more cognitive processing with machine learning and AI capabilities that rely on the mounds of data we collect and manage.
But if we peek behind the machinations that drive all of this data collection and examine the accuracy, or lack thereof, of our stored data, things start to become a bit troubling. Of course, to understand data quality we must define what makes high-quality data. According to Thomas C. Redman, noted data quality expert, “Data are of high quality if they are fit for their intended uses in operations, decision-making, analytics and planning.” This means the data should be free of defects while being relevant, comprehensive, at the proper level of detail, and easy to interpret.
Nevertheless, data quality continues to be a pervasive problem. In his seminal book, Data Quality: Management and Technology, Redman reports that, on average, payroll record changes have a 1% error rate, billing records have a 2%-to-7% error rate, and the error rate for credit records is as high as 30%. I challenge you to look at some of your company’s production data (that you have authorization to see) and examine just the last couple of hundred records or so. Chances are, you’ll find problems either with incorrect data or missing data.
Then there is the bullwhip problem that can exacerbate data quality issues. To understand the bullwhip effect, consider this example: A retailer in your supply chain reports demand that is wrong by 10%. This means that instead of 10 units, the retailer reports 11 units. Now, if there are 50 shops being served by the distribution center, then the impact is an extra 50 units. And if there are 50 distribution centers, then the wrong data will trigger 2,500 extra units … all from what seems like one very small error!
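The arithmetic behind that example can be sketched in a few lines (the shop and distribution-center counts are the ones from the example above):

```python
# Bullwhip effect: one small per-retailer reporting error multiplies
# as it propagates up the supply chain.
true_demand = 10        # units each retailer actually needs
reported_demand = 11    # demand reported with a 10% error
error_per_shop = reported_demand - true_demand  # 1 extra unit

shops_per_dc = 50           # shops served by each distribution center
distribution_centers = 50

extra_per_dc = error_per_shop * shops_per_dc        # 50 extra units
total_extra = extra_per_dc * distribution_centers   # 2,500 extra units
print(total_extra)  # 2500
```

A 1-unit error at the shop level becomes a 2,500-unit error at the top of the chain, which is exactly why small data quality defects deserve attention.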
Data quality truly is a pervasive problem!
Why should we worry about data quality? The answer seems to be obvious, but let’s think about it. Customer satisfaction with your store, website, or product will certainly be lower if the data quality is poor. Every customer wants to pay a fair, correct price for the product they want without experiencing issues that bad data could cause. Poor-quality data also can increase costs. If transactions are denied or need to be backed out or repeated because of wrong information, that comes with a price. Furthermore, executives can make improper decisions when those decisions are based on the wrong information. Finally, consider the impact of bad data quality on regulatory compliance. Governmental and industry regulations such as GDPR, HIPAA, PCI-DSS, and many others place specific requirements on how certain types of data are managed, protected, and reported. If the data is not accurate, and therefore your organization does not properly manage it, there can be severe consequences, including significant fines and more. So clearly, bad data quality is something to be avoided.
Bringing this all back to a topic that affects DBAs, think about how many of the tables that you manage lack constraints. By building constraints into the database, overall data quality may be improved. Constraints enable databases to self-enforce data quality, at least to a certain extent. Referential constraints enforce primary key-foreign key relationships so that a foreign key cannot be entered that does not conform to its parent primary key. Unique constraints can be applied so that duplicate keys cannot be entered. Check constraints and triggers can enforce business rules onto data elements. All of these are desirable elements that all too often are ignored and thereby contribute to poor data quality.
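As a minimal sketch of how such constraints let the database self-enforce data quality, here is a toy example using Python’s built-in sqlite3 module (the table and column names are hypothetical, not any particular production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Parent table with a primary key and a check constraint (a business rule),
# child table with unique and referential constraints.
conn.execute("""
    CREATE TABLE customer (
        cust_id      INTEGER PRIMARY KEY,
        credit_limit INTEGER CHECK (credit_limit >= 0)
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_no INTEGER UNIQUE,                          -- no duplicate keys
        cust_id  INTEGER NOT NULL REFERENCES customer(cust_id)
    )""")

conn.execute("INSERT INTO customer VALUES (1, 5000)")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # valid: parent row exists

# Each of these violates a constraint and is rejected by the database itself:
bad_inserts = [
    "INSERT INTO customer VALUES (2, -10)",  # check constraint: negative limit
    "INSERT INTO orders VALUES (100, 1)",    # unique constraint: duplicate key
    "INSERT INTO orders VALUES (101, 99)",   # referential constraint: no parent
]
for stmt in bad_inserts:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as err:
        print("rejected:", err)
```

The bad rows never land in the tables; the constraints stop them at the door, which is the point of building these rules into the database rather than relying on every application to check them.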
So what can we do? The first step is to ensure that the executives at your organization recognize the data quality problem and endorse the need to rectify it. This means that data (perhaps, more accurately, information) needs to be valued as a corporate asset—and then treated accordingly. What are your other corporate assets? Capital, human resources, intellectual property, office buildings and equipment, and so on. You protect, manage, inventory, and even model all of these. Executives do not need to be told to manage these assets, but perhaps when it comes to data, they do. Have you defined and inventoried all of the critical data elements needed by your organization? Does your company know where every piece of data is? And yes, I am talking about copied data—even in Excel spreadsheets on your users’ desktops.
Only when you understand the issues you are dealing with can you ever hope to ensure that data is accurate. With that in mind, how is data governance implemented (if at all) in your organization? Data governance encompasses the people, processes, and procedures to create a consistent enterprise view of a company’s data in order to increase consistency and confidence in decision making, decrease the risk of regulatory fines, and improve data security.
Of course, you can always take some baby steps along the way and do not have to implement a grand data governance practice before doing anything. Procuring and implementing a data profiling tool can help to show you the existing state of your data—and perhaps help you to start cleansing it.
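To give a flavor of what a profiling tool reports, here is a toy illustration in plain Python (the records and column names are made up for the example; a real tool does this across entire production tables):

```python
from collections import Counter

# Made-up records standing in for production data.
records = [
    {"id": 1, "email": "a@example.com", "zip": "10001"},
    {"id": 2, "email": None,            "zip": "10001"},
    {"id": 3, "email": "a@example.com", "zip": "ABCDE"},  # duplicate email
]

# Profile each column: count missing, distinct, and duplicated values.
profile = {}
for col in ["id", "email", "zip"]:
    values = [rec[col] for rec in records]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    profile[col] = {
        "missing":    len(values) - len(present),
        "distinct":   len(counts),
        "duplicated": sum(1 for n in counts.values() if n > 1),
    }

print(profile["email"])  # {'missing': 1, 'distinct': 1, 'duplicated': 1}
```

Even this crude pass surfaces a missing email and a duplicated one; a commercial profiling tool adds pattern checks (is that zip code really a zip code?), value distributions, and cross-column rules, giving you a factual starting point for cleansing.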