
AI: Data Quality’s New Frontier


Given the burgeoning use of massive amounts of heterogeneous data for mission-critical and even life-critical applications such as autonomous vehicles and medical diagnostics, the pressure on companies to develop appropriate data quality strategies has become more acute than ever. An effective data quality strategy has several components, and doing careful work at the beginning of the process helps avoid major problems downstream.

Designing an Effective Data Quality Program

The first phase in an effective data quality program seems obvious but is too often overlooked: companies must understand their data. But what does “understanding the data” mean? Companies must identify and catalog the types of data they are using. Is it web data, IoT data, or transactional data? Is it structured, semi-structured, or unstructured? Is the data stored in flat files or relational databases, or does it flow from event streams? The next step is to understand where and how data flows into systems. Does the data come from internal or external systems? Does it arrive through a continuous feed or a batch process? Are there manual processes and validation steps associated with it?
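As a minimal sketch of what such a catalog entry might look like, the record below captures a dataset's type, structure, storage, source, and ingestion mode. The field names and category values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

# A minimal data catalog record; the fields and category values
# are illustrative assumptions, not a standard schema.
@dataclass
class DatasetRecord:
    name: str
    data_type: str      # e.g., "web", "IoT", "transactional"
    structure: str      # "structured", "semi-structured", or "unstructured"
    storage: str        # "flat file", "relational database", or "event stream"
    source: str         # "internal" or "external"
    ingestion: str      # "continuous feed" or "batch"
    manual_steps: bool  # any manual processing or validation involved?

orders = DatasetRecord(
    name="orders",
    data_type="transactional",
    structure="structured",
    storage="relational database",
    source="internal",
    ingestion="batch",
    manual_steps=False,
)
```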

Companies must also address the specific characteristics of the data. How frequently is the data updated? How long does the company retain it? What data models are being used? Are there known quality issues with the data being generated?
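These characteristics can be recorded alongside the catalog entry and checked mechanically. The sketch below, with intervals and dates that are assumptions chosen for illustration, flags data that has missed its expected refresh or exceeded its retention window.

```python
from datetime import datetime, timedelta

# Illustrative characteristics for one dataset; the intervals and
# dates are assumptions chosen for the example, not recommendations.
profile = {
    "update_interval": timedelta(hours=24),   # expected refresh cadence
    "retention_period": timedelta(days=365),  # how long the data is kept
    "last_updated": datetime(2024, 1, 1),
    "created": datetime(2023, 1, 1),
    "known_issues": ["duplicate customer IDs"],
}

now = datetime(2024, 1, 3)
if now - profile["last_updated"] > profile["update_interval"]:
    print("stale: data missed its expected refresh")
if now - profile["created"] > profile["retention_period"]:
    print("expired: data is past its retention window")
for issue in profile["known_issues"]:
    print(f"known issue: {issue}")
```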

Finally, the preliminary work associated with data quality must include identifying the key stakeholders. Only the people with the deepest and most complete knowledge of the data can identify the most critical issues and validate the overall state of the data.

Data Quality Trends

The pressure that the growth of machine learning has placed on data quality has sparked and accelerated several important trends. Data quality practices can no longer be confined to a company’s IT group. While central IT teams are still primarily responsible for cross-domain analysis, line-of-business managers have to integrate data quality metrics into their use cases and business scenarios.
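As one sketch of what such a metric might look like in a business scenario, the example below computes the completeness rate of a field a use case depends on. The records, field name, and 95% threshold are assumptions made for illustration.

```python
# Completeness of the email field in a customer list; the records,
# field name, and 95% threshold are illustrative assumptions.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]

populated = sum(1 for c in customers if c.get("email"))
completeness = populated / len(customers)
print(f"email completeness: {completeness:.0%}")  # 67%

THRESHOLD = 0.95
if completeness < THRESHOLD:
    print("metric below threshold; flag for review")
```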

The area in which business unit managers can add the most value is data profiling: determining what the data should represent. Through data profiling, companies establish a baseline against which data can be validated. Data profiling includes identifying the key entities, such as customer or product; important events, such as log-in or purchase; and other critical dimensions of the data. It also involves a statistical assessment of the data's range (maximums and minimums, for example) as well as a review of outlying and anomalous data points.
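A minimal profiling sketch along those lines might compute the range of a numeric field and flag anomalous points. The sample values and the two-standard-deviation cutoff below are illustrative choices, not recommended settings.

```python
import statistics

# Illustrative purchase amounts; the values and the two-standard-
# deviation cutoff are assumptions made for this sketch.
amounts = [19.99, 24.50, 22.10, 18.75, 21.30, 950.00, 20.45]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)
print(f"min={min(amounts)}, max={max(amounts)}")
print(f"mean={mean:.2f}, stdev={stdev:.2f}")

# Flag points more than two standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print("outliers:", outliers)  # [950.0]
```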

The need for efficient data profiling is also driven by the growth in the collection and application of new data types from new sources. Data from sensors and other IoT devices, along with new data management configurations such as data lakes, can enable more dynamic and efficient products and services. But data quality is central to that effort.
