Competitive business strategy increasingly relies on data analytics. The core techniques of data analysis are increasingly accessible thanks to commodity Business Intelligence packages, modern open source AI tools, and cloud services. Given this level playing field for software and algorithms, competitive advantage typically lies in the unique data that a business can gather and feed into its analytics pipelines.
The Underlying Challenges of Data Quality
The definition of data quality varies depending on the use case; what qualifies as high quality in one instance may not in another. For example, consider “completeness” of temperature readings: does the analysis depend on distinct readings every day? Every hour? Every second? Every microsecond? The answer depends on what the readings are being used to assess. There is no one-size-fits-all notion of data quality; the use case for the data determines the quality requirements.
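To make the completeness example concrete, here is a minimal sketch (the function name and sensor log are hypothetical) showing how the same data can score very differently depending on the sampling interval the use case demands:

```python
from datetime import datetime, timedelta

def completeness(timestamps, start, end, expected_interval):
    """Fraction of expected readings actually present in [start, end].

    The choice of expected_interval encodes the use case: an hourly
    analysis tolerates gaps that a per-minute analysis would not.
    """
    expected = int((end - start) / expected_interval) + 1
    observed = len({t for t in timestamps if start <= t <= end})
    return min(observed / expected, 1.0)

# Hypothetical sensor log: hourly readings for one day, two hours missing.
start = datetime(2024, 1, 1)
end = start + timedelta(hours=23)
readings = [start + timedelta(hours=h) for h in range(24) if h not in (5, 6)]

hourly = completeness(readings, start, end, timedelta(hours=1))
per_minute = completeness(readings, start, end, timedelta(minutes=1))
```

Here `hourly` is about 0.92 (22 of 24 expected readings), while `per_minute` is under 2 percent: the same log is nearly complete for one use case and hopelessly sparse for another.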
Keeping the specifics of a use case in mind, the following are some of the common data quality pitfalls that enterprises face:
- Questionable statistical validity: It’s common to end up with extreme values in your data that can significantly impact analysis. In one use case, those extreme values might be best treated as “dirty data” and eliminated for the purposes of analysis; in another they might actually be exactly what you're looking to uncover. For example, an outlier in temperature readings might skew your analysis of typical temperature patterns; on the other hand, it might tell you that you have a miscalibrated sensor, or that someone has been lighting matches near one of your sensors. The statistical properties of data—combined with an understanding of the use case—can dictate whether data is dirty or clean for a given purpose.
- Not meeting regulatory standards: Depending on your organization and industry, your data may need to conform to certain standards, as dictated by industry-wide or region-specific (such as GDPR) regulations. These regulations are constantly changing, which can add further complexity and customization into the mix.
- Missing values: Depending on how data is collected, it’s not uncommon to face issues with completeness. Missing information or values might introduce bias that leads to ineffective decision-making. And the fact that data is missing might itself be important—remember Sherlock Holmes and the dog in the nighttime: the “curious incident” was that the dog did not bark. It is all too easy to overlook the data that is not there.
- Outdated data: If data is not refreshed often enough, the data and therefore the subsequent analysis can be outdated, and potentially generate results that are no longer relevant. And in fast-moving organizations, old data is often a small piece of the bigger picture: estimates suggest that 90% of the world’s data was generated in the last two years.
- Non-standardized encodings: Real-world entities like people and even dates often have varied encodings, and need to be made canonical. Data analysis often stalls until the names of basic entities conform to a single standard.
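Several of these pitfalls can be checked mechanically. The sketch below uses made-up temperature readings; the two-standard-deviation rule and the list of candidate date formats are illustrative assumptions, not a prescription. It flags a statistical outlier, counts missing values, and canonicalizes varied date encodings to ISO 8601:

```python
import statistics
from datetime import datetime

# Hypothetical daily temperature readings; None marks a missing value,
# and the 81.0 spike might be dirty data or a miscalibrated sensor.
readings = [21.3, 20.8, None, 22.1, 81.0, 21.7, 20.9]

present = [r for r in readings if r is not None]
missing = len(readings) - len(present)  # completeness check

mean = statistics.mean(present)
stdev = statistics.stdev(present)
# Crude rule of thumb: anything beyond two standard deviations is suspect.
outliers = [r for r in present if abs(r - mean) > 2 * stdev]

# Non-standardized date encodings made canonical (ISO 8601).
# NOTE: "01/02/2024" is ambiguous (DD/MM vs MM/DD); choosing the
# format order below is itself a data quality decision.
raw_dates = ["01/02/2024", "2024-02-01", "Feb 1, 2024"]
formats = ["%d/%m/%Y", "%Y-%m-%d", "%b %d, %Y"]
canonical = set()
for d in raw_dates:
    for fmt in formats:
        try:
            canonical.add(datetime.strptime(d, fmt).date().isoformat())
            break
        except ValueError:
            continue
```

Whether the flagged 81.0 reading is discarded or investigated is exactly the use-case judgment described above; the code can only surface the candidate.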
Today’s data professionals are increasingly using new and modern solutions to tackle the various flavors of data quality challenges, but many organizations’ data efforts are still hampered by legacy technologies and processes for managing, ingesting, cleaning and ultimately using their data.
New Tactics to Ensure Data Quality
So what can organizations do? Here are three actionable steps your company can take to improve data quality and produce better analyses.
1. Activate collaboration between business teams and IT
- The challenge: Individual business teams have an in-depth understanding of how they use data and therefore what their specific quality requirements are for different use cases. Without their active engagement in the data preparation process—better yet, the ability to prepare the data themselves—they cannot have adequate input into how the data is prepared. This brings inefficiencies, potential errors, and an overreliance on IT resources, whose time would be better spent serving as a central clearinghouse: aligning needs across business units and monitoring data quality and practices over time.
- The solution: Collaboration between business teams and IT. Many forward-thinking organizations are shifting the responsibility for data quality toward business users, who are closer to the data and understand it better. This brings new efficiencies to data preparation and use. Allowing business users to see and interact with the data sooner lets them add valuable context to help address data quality issues—they bring a deeper understanding of the use case, and can better navigate the quality issues that are actually pertinent to the analysis. Meanwhile, IT’s role should be to work across business units, managing data quality tests and transformation processes over time on their behalf.
2. Empower business users to leverage external data resources
- The challenge: New use cases for analytics often motivate business groups to onboard data from external sources. IT professionals do not typically know the use case or related internal data as well as the ultimate business user; hence, they are unlikely to be as creative as the business user about which external data sources could help. Meanwhile, business users who do deeply understand the data and its use case may have the creative ideas, but likely can't easily get that external data into pipelines because it isn't sanctioned by IT.
- The solution: Self-service. Business users should have the ability to explore and onboard useful external data, and to incorporate that data into their pipelines. They should also have a way to let IT know which data sources need to be tapped on an ongoing basis.
3. Unite people and AI to help assess and fix data quality
- The challenge: No one would call Excel a modern—or sufficient—tool for preparing or working with data. Yet Excel is still the primary data preparation tool for 37 percent of data analysts and 30 percent of IT professionals. Manual processes like these hinder collaboration and efficiency.
Algorithmic automation—especially modern AI—can bring massive improvements to data quality. Many aspects of data quality can be automated: AI techniques can flag outlier values, standardize values, or integrate data sources. But AI techniques based on machine learning bring their own data quality requirements: they need clean, labeled training data to build their models. And, almost by definition, AI never gets things entirely right—AI is the technology we reach for when hard-and-fast rules do not work. This is familiar from well-known forms of AI like Google search: we typically don’t just press the “I’m feeling lucky” button; we review the ranked list of suggested matches and browse through to see what looks most useful.
As a result, even though AI can help with data quality transformation, it is critical for humans to stay in the loop of automation in efficient ways: assessing data quality before and after algorithms run, and having the ability to observe, steer and override automated choices at the right level.
- The solution: Increasingly, our tools and processes should lean on AI-driven solutions to automate data quality assessment and transformation. But those solutions must couple AI with a strong human-facing component for efficient and intuitive training, assessment and override. Advanced tools turn AI into interactive visual experiences for data quality that ensure quick feedback loops between algorithms and human domain experts. And with machine learning, the accuracy of AI-generated suggestions can improve over time based on that user interaction. This collaboration between people and computation is key to accelerating data preparation and landing on data outputs that inspire confidence in the analytic process.
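As one illustration of this human-in-the-loop pattern, the toy sketch below uses simple string similarity (Python's difflib) as a stand-in for a learned model: the algorithm proposes a canonical value, and every suggestion passes through a review callback that a person, or an interactive review UI, can accept, override, or reject. All names and thresholds here are hypothetical.

```python
import difflib

# Hypothetical list of canonical company names blessed by the business.
CANONICAL = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def suggest(raw, cutoff=0.6):
    """Algorithmic suggestion: closest canonical match, or None."""
    matches = difflib.get_close_matches(raw, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def standardize(raw_values, review):
    """Apply suggestions, routing every one through a human review
    callback that can accept, override, or reject the suggestion."""
    return {raw: review(raw, suggest(raw)) for raw in raw_values}

def auto_accept(raw, suggestion):
    # Stand-in for an interactive review step: accept each suggestion,
    # but flag unmatched values for manual follow-up.
    return suggestion if suggestion is not None else f"REVIEW:{raw}"

result = standardize(["Acme Corp.", "globex inc", "Umbrella Co"], auto_accept)
```

The key design point is that the algorithm only ever proposes; the review callback is the override hook, so swapping `auto_accept` for a real UI keeps the human decision at the right level without changing the pipeline.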
Improved Data Quality for Better Results
In any data lifecycle, there are many entry points where data quality problems can creep in. From incomplete or improper data entry, to incorrectly blending data from different sources, to selecting incomplete data sets or missing fields for analysis, to the analysis process itself—there’s no shortage of chances for data quality problems to arise. Data professionals need to anticipate, lessen and address the quality issues that occur.
Implementing these best practices will help organizations improve data quality and their resulting analysis, and ultimately drive better-informed, more strategic business decisions.