Why we need to improve data quality to better train models
The volume of visual data in the world is simply exploding. Consider these facts: Over 1 billion surveillance cameras are estimated to have been installed worldwide.
The global self-driving cars market, which uses AI and integrated cameras to perform driving tasks, is projected to grow from over 20 million units in 2021 to more than 62 million units by 2030, at a compound annual growth rate (CAGR) of 13.3%. Even the average person is estimated to create 1.7 MB of data per second.
Visual data is expansive, everywhere, and critical to the functioning of industries such as healthcare, security, surveillance, sports and entertainment, manufacturing, automotive, and retail.
But, despite the ever-growing amount of visual data in our world, AI continues to rely on a model-centric approach. The problem with this approach is that it remains largely reliant on rules and heuristics.
We’re at a tipping point in the evolution of AI. With powerful ML/AI hardware infrastructure now available and general-purpose models having made significant progress in robustness, it’s time to shift AI’s center of mass toward data. The paradigm shift is from a model-centric mindset to a data-centric one that unlocks the potential of real-world AI applications. From a practical standpoint, this shift means spending more time, energy, and resources on both the selection and the curation of data.
Most AI models are spun off from “off-the-shelf” or “backbone” models that are trained on large public datasets. These datasets may or may not be close to the target domain of application, so the models are then customized on data and attributes specific to that domain.
To make production-grade AI, models need to be trained on data that covers the full operating range, including the corner cases, which implies improving the data used to train them.
This can be achieved in two ways: selecting better, higher-quality data to train the model, and curating that data.
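To make this concrete, here is a minimal sketch, assuming PyTorch and torchvision, of customizing an off-the-shelf backbone on a curated, domain-specific dataset; the folder path, class count, and training hyperparameters are placeholders rather than anything prescribed by the text.

```python
# A minimal sketch of customizing an off-the-shelf backbone on domain data.
# The dataset path, class count, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Load a backbone pretrained on a large public dataset (ImageNet here).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained feature extractor; only the new head will be trained.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for the target domain.
num_domain_classes = 5  # hypothetical number of domain-specific classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_domain_classes)

# Curated, domain-specific images arranged in class folders (hypothetical path).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("curated_domain_data/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

backbone.train()
for epoch in range(3):  # a few epochs is enough for a sketch
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(backbone(images), labels)
        loss.backward()
        optimizer.step()
```

In practice, the hard part is not this training loop but deciding which images belong in the curated training folder in the first place.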
Why We Must Place More Emphasis on Data
Data and code are the foundation on which AI is built. While both elements play a vital role in AI systems and the development of models, historically, code has received most of the time and attention.
Focusing on the model can feel more appealing to data scientists and ML engineers, since it requires them to use their knowledge and skills to approach the problem and create the solution, while focusing on the data can feel like an arduous, one-time job.
Data should be at the core of every decision made. Collecting and understanding higher-quality information creates more accurate, organized, and intentional results. Whether you are making a business decision or building a model, improving the data quality used to inform the decision will grant a more optimal outcome.
Fine-tuning a model based on flawed data can have serious repercussions. Wasted hours, labor, and costs can all result from focusing solely on the model, along with lower model accuracy and suboptimal performance. The data is crucial for the model to perform as intended.
If data is everywhere and vitally important, why have we focused so much on the model? To answer that, it’s best to start with the issue of data selection. Data selection is the process of finding data that best represents the unique characteristics of the domain to apply to a model. However, what constitutes unique or interesting data is often known only to the subject matter experts, who may or may not be skilled data scientists. It is also important to call out the challenges data scientists face when working with unannotated data, while trying to identify the relationships, latent structures, and implicit biases within a dataset. This laborious, manual process is drudgery for data scientists, but it is an integral part of the approach.
Still, no formalized tools exist for sifting through large visual datasets and building effective training datasets in a streamlined, standardized manner. Currently, these tasks are accomplished through ad hoc approaches: gluing together open source code and working in Jupyter-like notebooks.
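As an illustration of what these ad hoc workflows often look like, the following is a rough, notebook-style sketch, assuming PyTorch, torchvision, and NumPy, that embeds images with a pretrained backbone and greedily drops near-duplicates; the directory name and similarity threshold are invented for the example.

```python
# A rough sketch of the ad hoc selection work described above: embed images
# with a pretrained backbone, then drop near-duplicates so the training set
# covers more of the operating range. Paths and threshold are assumptions.
import glob
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # use the backbone as a feature extractor
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

paths = sorted(glob.glob("raw_visual_data/*.jpg"))  # hypothetical directory
embeddings = []
with torch.no_grad():
    for path in paths:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        embeddings.append(model(image).squeeze(0).numpy())
embeddings = np.stack(embeddings)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Greedily keep an image only if it is not too similar to anything kept so far.
threshold = 0.95  # cosine similarity above this is treated as a near-duplicate
selected = []
for i, emb in enumerate(embeddings):
    if all(np.dot(emb, embeddings[j]) < threshold for j in selected):
        selected.append(i)

print(f"Kept {len(selected)} of {len(paths)} images for the training set.")
```

Even this short script hides judgment calls, such as where to set the similarity threshold, which is precisely the kind of domain knowledge that today remains unformalized.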
Manually querying and searching through huge visual datasets is a tedious process, one many data scientists would rather avoid in favor of the more exciting task of working on the model.
But off-the-shelf models and neural network backbones have now reached a level of sophistication that requires data scientists to focus on the data on which these models are trained in order to achieve production-grade goals.
The model-centric approach to AI also lacks scalability. Preprocessing code or bespoke offline models can tackle specific issues, but this method often fails to scale up to real-world data applications.
Finally, there’s the concern of security. Visual data is often confidential and cannot be shared outside firewalls. However, visual data must be utilized in order to create better, more useful applications of AI. Questions about privacy and security present not only ethical conundrums for data scientists but practical issues as well, and they have slowed the pivot toward spending more time and energy on data selection and curation rather than on training the model.