In business, data is often treated like the holy grail: ask a question, get an answer, make a decision. It sounds simple, but slow down for a moment. Before placing your trust in any person, platform, or model claiming to have all the answers, ask yourself: Do you really have all the data? And more importantly, can you trust it?
What Actually Matters?
If you’ve been in or around the data industry for a decade or more, you’ve probably heard a lot about data quality. But it’s more than a buzzword; it’s foundational. Let’s break it down:
- Accuracy: Is the data valid and free from errors?
- Completeness: Are all required records and fields present?
- Consistency: Does the data align across different systems and sources?
- Relevance: Is the data applicable to your business question?
- Uniqueness: Are there duplicates muddying the waters?
These five questions aren’t exhaustive, but they’re essential. Ignore them, and the picture you’re working with may not just be incomplete; it could be dangerously misleading.
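Two of these questions, completeness and uniqueness, lend themselves to quick automated checks. The following is a minimal sketch rather than a full data quality framework; the sample records, the field names, and the helper functions `completeness` and `duplicate_keys` are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "email": None},
    {"id": 3, "name": "Ada", "email": "ada@example.com"},
]


def completeness(rows, required):
    """Fraction of rows in which every required field is present and non-empty."""
    if not rows:
        return 0.0
    complete = sum(
        all(row.get(field) not in (None, "") for field in required)
        for row in rows
    )
    return complete / len(rows)


def duplicate_keys(rows, key_fields):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row.get(f) for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Share of rows with both a name and an email.
completeness_ratio = completeness(records, ["name", "email"])
# Rows that share the same (name, email) pair.
dupes = duplicate_keys(records, ["name", "email"])
```

Accuracy, consistency, and relevance are harder to automate because they depend on knowing what the data is supposed to say, which is why they usually need a person who understands the business question.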
Why Data Quality Really Matters
Consider this: You’re using a map to get from Point A to Point B. You’ve been told it’s accurate and complete, but in reality, it shows only 28% of the roads. Now imagine making high-stakes business decisions based on a similarly incomplete data set.
Incomplete data creates real problems for analytics. Let’s explore two scenarios that come up frequently:
- Duplicate records and mismatches. Data acquired from two or more systems is supposed to align, and sometimes does, but not always. The result can be duplicate entries that are incomplete, inaccurate, or both. You might collect data on “Steph Curry” in one system and “Stephen Curry” in another. Inconsistencies like this lead to duplication, missing data, and flawed analysis. If you’re building analytics on top of that data, your insights are compromised from the start.
- Missing data that goes unnoticed. This is arguably the most insidious issue. A user can be working with data that appears accurate and still draw false conclusions, simply because they don’t realize what’s missing. This is, in my opinion, especially pronounced in website analytics, where entire swaths of user interactions can disappear from view.
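The “Steph Curry” versus “Stephen Curry” mismatch can be surfaced with a simple string similarity pass. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the function name `likely_duplicates`, the sample names, and the 0.9 threshold are assumptions chosen for illustration, not a recommended setting.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical names pulled from two systems that should align.
names = ["Steph Curry", "Stephen Curry", "Klay Thompson"]


def likely_duplicates(values, threshold=0.9):
    """Return pairs whose case-insensitive similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(values, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# Candidate pairs for a person to review, not confirmed duplicates.
candidates = likely_duplicates(names)
```

A similarity score only flags candidates. Two different people can have nearly identical names, so matches like these should be reviewed, or confirmed against a second field such as an email address, before records are merged.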
Why Is Data Missing?
There are two big reasons for missing data:
Attribution Loss. The first reason relates to how data is captured and tied to users. When people talk about missing or incomplete attribution, this is what they mean. When a user clicks an ad or a similar link to reach your website, the URL usually carries special tags (UTM parameters) that explain where the user came from or why they arrived on the page. Once the user navigates to another page, those tags are no longer in the URL, and unless they were captured on arrival, the attribution is lost. This is a well-known source of missing data.
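One common way to limit attribution loss is to read the UTM parameters from the landing URL and store them for the rest of the visit. The following is a minimal sketch of that idea using Python’s `urllib.parse`; the URLs, the `extract_utm` helper, and the `session` dictionary (a stand-in for a cookie or server-side session) are hypothetical, and first-touch attribution is assumed.

```python
from urllib.parse import parse_qs, urlsplit


def extract_utm(url):
    """Return the utm_* query parameters of a URL as a flat dict."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items() if key.startswith("utm_")}


# Hypothetical landing URL with UTM tags, followed by an untagged page view.
landing = "https://example.com/docs?utm_source=newsletter&utm_medium=email&page=2"
next_page = "https://example.com/docs/install"

session = {}  # stand-in for a cookie or server-side session store

# Capture attribution once, on arrival.
session.setdefault("attribution", extract_utm(landing))

# Later page views carry no UTM tags, so fall back to the stored values.
attribution = extract_utm(next_page) or session["attribution"]
```

Without the stored copy, the second page view would have no attribution at all, which is the loss described above.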
Ad Blockers. This is the really big one. Ad-blocking technology prevents many websites from sending event data. If your website sends click events, timers, and other interesting details through requests that ad blockers intercept, that data never arrives, and you are missing it. It’s estimated that nearly one-third of U.S. adults and a staggering 72% of software developers globally use some form of ad blocking. If you’re a technical product manager in a B2B technology company trying to understand how developers interact with your documentation, you could miss over two-thirds of your traffic data. I hope your job, and your company, do not depend too much on that information.
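To get a rough sense of scale, the arithmetic can be sketched as follows. This assumes that every ad blocker user is invisible to your analytics and that blocked and unblocked visitors behave alike; both are simplifications, and the visit count and the function name `estimate_total_visits` are hypothetical.

```python
def estimate_total_visits(observed_visits, block_rate):
    """Estimate true visits when a fraction `block_rate` of visitors is invisible.

    Assumes observed = total * (1 - block_rate), which only holds if every
    ad blocker suppresses the analytics requests.
    """
    if not 0 <= block_rate < 1:
        raise ValueError("block_rate must be in the range [0, 1)")
    return observed_visits / (1 - block_rate)


observed = 2800      # hypothetical visits recorded by the analytics tool
block_rate = 0.72    # share of developers using ad blocking, per the figure above

estimated_total = estimate_total_visits(observed, block_rate)
estimated_missing = estimated_total - observed
```

Under these assumptions, 2,800 recorded visits would correspond to roughly 10,000 actual visits, with about 7,200 never appearing in the data. The real gap depends on your audience and on which requests their blockers actually stop.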
So, What Can You Do?
Start by asking these essential questions. Ignoring them could be costly to your business.
- Accuracy: Is the data valid and free from errors?
- Completeness: Are all required records and fields present?
- Consistency: Does the data align across different systems and sources?
- Relevance: Is the data applicable to your business question?
- Uniqueness: Are there duplicates muddying the waters?
Knowing what you don’t know is half the battle. Knowing whether you can rely on your data is the most critical step in performing analytics.
Data quality is as important now as it has ever been, and it is a persistent problem that will not go away. Data completeness is always a concern, and knowing which questions to ask is essential. While I have highlighted websites as an example, the problem exists across industries and disciplines. Ask hard questions of those who collect and manage your data so that you know how far you can trust it.