
AI: Data Quality’s New Frontier

The expansion of machine learning applications, however, has dramatically raised the stakes for robust data quality. The effectiveness of machine learning systems is completely dependent on the data fed into them during training. Biased, incorrect, or incomplete data can limit the effectiveness of a machine learning system, or worse. Some studies have argued that early iterations of risk assessment systems used to determine the length of prison sentences for people convicted of crimes replicated the biases already present in the criminal justice system rather than making an accurate, well-informed prediction about a person's likelihood of committing another crime. And when machine learning fails in autonomous vehicles, the results can be fatal.

Deterrents to Successful Machine-Learning Systems

The fact is, machine learning is data-dependent. No matter how clever the algorithms, the data used to train machine learning systems and predictive modeling applications must be correct for the system or model to function effectively. The data must be accurate, properly labeled, free of duplicates, and up-to-date. But the spread and acceptance of machine learning systems raise issues that go well beyond the usual challenges of data quality. Training machine learning systems requires integrating data from a wide range of sources and a multitude of data types, and that data must be of very high quality. According to some reports, efforts to apply Watson technology to assisting in cancer research have been slowed by the use of acronyms and the need to incorporate handwritten notes into the training process.
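The basic checks named above (accuracy aside) can be automated before training begins. The sketch below is not from the article; it is a minimal, illustrative example in Python, with hypothetical field names, that screens a batch of training records for exact duplicates, missing labels, and missing values:

```python
from collections import Counter

def basic_quality_checks(rows, label_key="label"):
    """Screen a list of training records for a few of the problems
    described above: exact duplicates, missing labels, missing values.
    (Illustrative only; 'label' and 'feature' are hypothetical names.)"""
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicate_rows = sum(count - 1 for count in seen.values())
    missing_labels = sum(1 for r in rows if r.get(label_key) is None)
    missing_values = sum(1 for r in rows for v in r.values() if v is None)
    return {
        "rows": len(rows),
        "duplicate_rows": duplicate_rows,
        "missing_labels": missing_labels,
        "missing_values": missing_values,
    }

# A tiny dataset with one exact duplicate and one unlabeled record
records = [
    {"feature": 1.0, "label": "a"},
    {"feature": 2.0, "label": "b"},
    {"feature": 2.0, "label": "b"},
    {"feature": 3.0, "label": None},
]
print(basic_quality_checks(records))
# {'rows': 4, 'duplicate_rows': 1, 'missing_labels': 1, 'missing_values': 1}
```

In practice, a report like this would gate the training pipeline: records failing the checks are quarantined for review rather than silently fed to the model.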

There are many reasons why data may fail to meet the quality standards needed for machine learning. In many respects, they are the same challenges that data quality professionals confront daily. Both for machine learning and a wide range of analytical applications, a major cause of poor data quality is a mismatch between the purpose for which the data was created and the way it is now being applied: the context has changed, so the data type or format is unsuitable for the new application. In some applications, the device or instrument taking the measurements may be inaccurate, a problem that grows as data from the Internet of Things (IoT) becomes more significant. Data collection processes may be cumbersome or complex, introducing potential sources of data contamination. And finally, there is the most consistent source of poor data quality—human error.
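The inaccurate-instrument problem, in particular, can sometimes be caught statistically. The following sketch is not from the article; it is a minimal, illustrative example using the median-absolute-deviation screen (a standard robust-statistics technique) to flag sensor readings that deviate sharply from the rest of a batch:

```python
from statistics import median

def flag_outliers(readings, threshold=3.5):
    """Flag readings far from the batch median, measured in robust
    (MAD-based) units. A crude screen for a drifting or faulty sensor;
    the 3.5 cutoff is a common rule-of-thumb, not a universal constant."""
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales MAD to be comparable to a standard deviation
    return [x for x in readings if 0.6745 * abs(x - med) / mad > threshold]

# Four plausible temperature readings and one implausible spike
print(flag_outliers([20.1, 20.3, 19.8, 20.0, 85.0]))
# [85.0]
```

A screen like this cannot say whether the instrument or the environment changed; it only surfaces readings that merit inspection before they contaminate a training set.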

At their core, machine learning and other advanced analytical applications come down to the ability to query diverse and complex datasets and return an appropriate answer. Too often, however, that process is impeded by significant data quality issues. In fact, data quality can be considered the single most important point of failure for data-intensive applications such as machine learning. And that is why improving the quality of training data can consume as much as 80% of the time needed to develop machine learning applications.

But ensuring clean and accurate data upfront is only the starting point. Machine learning and other complex, data-driven applications run continuously: new data is constantly being fed into the system, and the quality of that data must be safeguarded as well. Moreover, the output from one predictive model may be used as input to another. In those cases, the quality of the data is only as good as the weakest link in the chain.
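Safeguarding data quality in an always-on pipeline usually means validating each incoming record against an expected schema before it reaches a model. The sketch below is not from the article; it is a minimal, illustrative guard in Python, with a hypothetical sensor-feed schema, that rejects records with missing fields, wrong types, or out-of-range values:

```python
def validate_incoming(record, schema):
    """Return a list of problems with an incoming record, checked
    against a schema of field -> (type, min, max). An empty list
    means the record may pass downstream. Illustrative sketch only."""
    errors = []
    for field, (ftype, lo, hi) in schema.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not isinstance(value, ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
        elif lo is not None and not (lo <= value <= hi):
            errors.append(f"{field}: {value} outside [{lo}, {hi}]")
    return errors

# Hypothetical schema for a temperature feed: field -> (type, min, max)
SCHEMA = {
    "temperature": (float, -50.0, 60.0),
    "device_id": (str, None, None),
}

print(validate_incoming({"temperature": 21.5, "device_id": "s-01"}, SCHEMA))
# []
print(validate_incoming({"temperature": 999.0}, SCHEMA))
# ['temperature: 999.0 outside [-50.0, 60.0]', 'device_id: missing']
```

Running the same guard between chained models, on each model's output before it becomes the next model's input, is one way to keep the "weakest link" from silently degrading the whole system.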


