Moving From “Junk” Data to Data Integrity

It’s no secret that data is being created at an ever-increasing rate. According to IDC, the amount of data created and replicated grew even more in 2020 due to the sudden increase in the number of people working, learning and doing just about everything from home. Further, IDC predicts that the amount of digital data created over the next five years will be more than twice the amount created since the advent of digital storage.

But this raises the question: is this data useful? Or is it just “junk”? The answer lies in how organizations manage their data. Those that do it well are less likely to encounter junk data. But for those that have not put the right tools in place to manage all sources of data, the answer is definite: they are dealing with junk data.

Before we get into how companies can rid themselves of the trouble (and cost) of junk data, let’s take a deeper dive into what junk data is. First, we should address what junk data is not: original data is not junk data. Original data includes any data created by transaction systems, products, devices and other sources.

What is Junk Data?

Junk data, by contrast, is any data that is not governed. It occurs when you create a copy of data and manipulate it for a particular use case without returning the improvements to the original data store, where they would raise quality for the next use. Salesforce classifies junk data into four categories: missing information, inaccurate information, outdated data and duplicate data.
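As a rough illustration of those four categories, the sketch below flags records in a small in-memory list. The field names, the sample records and the staleness cutoff are all hypothetical, and the duplicate rule (same email after lowercasing) is just one possible assumption. Inaccurate information is left out because detecting it generally requires a trusted reference to compare against.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "city": "Chicago", "updated": date(2024, 5, 1)},
    {"id": 2, "email": "", "city": "Chicago", "updated": date(2024, 4, 1)},
    {"id": 3, "email": "Ann@Example.com", "city": "Chicago", "updated": date(2024, 3, 1)},
    {"id": 4, "email": "bob@example.com", "city": "Denver", "updated": date(2019, 1, 1)},
]


def flag_junk(records, stale_before):
    """Return {record id: [flags]} for missing, outdated and duplicate data."""
    flags = {r["id"]: [] for r in records}
    seen_emails = set()
    for r in records:
        # Missing information: any required field is empty.
        if not all(r[field] for field in ("email", "city")):
            flags[r["id"]].append("missing")
        # Outdated data: last update is older than the chosen cutoff.
        if r["updated"] < stale_before:
            flags[r["id"]].append("outdated")
        # Duplicate data: same email (case-insensitive) as an earlier record.
        key = r["email"].strip().lower()
        if key and key in seen_emails:
            flags[r["id"]].append("duplicate")
        seen_emails.add(key)
    return flags


flags = flag_junk(records, stale_before=date(2022, 1, 1))
for record_id, problems in flags.items():
    print(record_id, problems)
```

In practice these checks would live in a data quality tool rather than a script, but the logic is similar: each category becomes a rule that can be run against the governed source.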

Junk data starts to accumulate when individuals make copies of data from a larger dataset for a particular use case, make changes to it, and then do not integrate those changes into the larger set. For example, if you had an official database of customer addresses in your system of record, made a copy of only those based in the Chicago area, and updated that subset without updating the source data, you’d have created junk data. With junk data, you don’t have a clear lineage or provenance, and the data can’t be readily accessed and used by others. Worse, you have multiple versions of “truth” that do not agree. This ultimately creates multiple one-off sets of data within an organization that do not provide value to all users. Junk data introduces problems.
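The Chicago example can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a real governance tool: the customer table, the addresses and the find_divergence helper are all hypothetical.

```python
# Hypothetical system-of-record table: customer id -> (city, address).
source = {
    1: ("Chicago", "100 W Lake St"),
    2: ("Chicago", "25 E Ohio St"),
    3: ("Denver", "900 Blake St"),
}

# Someone copies only the Chicago rows for a one-off project...
chicago_copy = {cid: rec for cid, rec in source.items() if rec[0] == "Chicago"}

# ...and corrects an address in the copy without updating the source.
chicago_copy[2] = ("Chicago", "25 E Ontario St")


def find_divergence(source, copy):
    """Return ids whose record in the copy no longer matches the source."""
    return sorted(cid for cid, rec in copy.items() if source.get(cid) != rec)


# Customer 2 now has two versions of "truth" that do not agree.
diverged = find_divergence(source, chicago_copy)
print(diverged)
```

The correction made in the copy may well be the better value, but because it never flows back to the system of record, everyone else keeps working from the old address.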

Why is Junk Data a Problem?

Junk data can cause a number of problems for organizations, such as:

  • Inconsistent results: if the original data set and the copied, modified data set contain different information, you will get different results depending on which one you work from, ranging from differing match rates to operational failures and, maybe worst of all, bad customer experiences.
  • Inaccurate results: if a dataset is out of date, incomplete, or contains wrong information, the output will be too.
  • Privacy concerns: ungoverned copies of data that contain sensitive information are risky because regulatory compliance requirements may not be met. This exposure is often unknown to top management until a severe problem occurs and it is too late to avoid the consequences.
  • Information security: any environment in which junk data can be created has a security problem. The magnitude of the problem varies with the type of data. Common examples range from not following internal procedures, to license or IP violations, to data being hacked because it is stored outside of a company’s security operations.
  • Financial costs: creating and using junk data is inefficient for all of the reasons listed above.

However, what might be the biggest issue created by junk data is that it constructs a barrier to achieving data integrity. By establishing data integrity, an organization is better equipped to develop and manage a trusted foundation of data that’s accurate, consistent and contextual, and that allows for smarter business decisions.

Why Data Integrity Matters

Data integrity is the quality, reliability, trustworthiness and completeness of a dataset. It is built on four key pillars: enterprisewide integration, accuracy and quality, location intelligence, and data enrichment.

On a larger scale, if an organization’s data has integrity, business leaders can use that data to make accurately informed business decisions that ultimately drive better outcomes. In the context of junk data, a company that has achieved data integrity no longer needs to spend time resolving data inconsistencies or correcting and reviewing the data. The data of highest integrity is already at hand, reliable and ready to get to work.

Moving from Junk Data to Data Integrity

The best way to get rid of junk data is to eliminate the need for it. If an organization creates accessible data assets of high integrity, within a governed environment that ensures the data can be used in accordance with the company’s policies, rights, and guidelines, employees will no longer have to create and maintain copies of data to perform a specific task. By taking the time to invest in its data integrity up front, a company can ensure that its data assets are of high quality, secure and appropriately available to the business, ultimately saving time and money.