The ABCs of Data Deduplication: Demystifying the Different Methods

Jan 20, 2016

By Christophe Bertrand

The realities of exponential data growth are hitting hard in organizations of all sizes. Small-to-medium sized businesses are particularly vulnerable to this expensive and ubiquitous challenge: preserving the integrity of their data is just as critical as it is for a larger enterprise, but their budgets often cannot support continually adding more storage space. Further, it’s not enough to merely add more capacity; recovery point objectives must be met, and time is money when it comes to data recovery. The simple truth is that to be effectively managed, adequately protected and completely recovered, your data size must be shrunk.

Some degree of shrinking the amount of data - or deduplication - is necessary to address the permanent issue of ever-expanding data sets. Deduplication, or the removal of multiple copies of the same piece of data, is achieved via algorithms that identify and eliminate redundant data. Organizations have multiple options when it comes to optimizing their processes through deduplication, and each tackles the problem differently. But questions linger around how the various options work, which tradeoffs need to be considered, and which options yield the greatest benefit at the lowest cost.

Comparing Different Methods of Data Deduplication

To better understand deduplication and how it can best be used in your organization, it’s important to understand and compare the different methods.

The deduplication process begins by creating a unique digital signature (“hash”) for a given block of data. This hash value is saved in a database so it can be compared to hash values that are created for new incoming data blocks. By comparing hash values, it is determined if a block of data is unique or a duplicate.
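The hash-and-compare step described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed block size, the choice of BLAKE2b truncated to 16 bytes, and the names block_hash, deduplicate and restore are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def block_hash(block: bytes) -> bytes:
    # BLAKE2b truncated to 16 bytes: an illustrative choice of hash function.
    return hashlib.blake2b(block, digest_size=16).digest()


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}   # hash -> block contents (stands in for the hash database plus disk)
    recipe = []  # ordered hashes needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = block_hash(block)
        if digest not in store:  # unique block: keep it
            store[digest] = block
        recipe.append(digest)    # unique or duplicate, record where it belongs
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)
```

With a small block size, three blocks of which two are identical are stored as only two blocks, and restore(store, recipe) returns the original bytes.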

Creating a Hash Value

The process of creating a hash value is well understood and requires a nominal amount of compute resources. What takes a lot of compute resources is the process of comparing new hash values to each and every hash value stored in the hash database. When hash values number in the millions, the process of database lookup can be very compute intensive.
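To make the lookup cost concrete, the sketch below contrasts a naive scan of every stored hash with a lookup against a hash-table index. It is only an illustration under simple assumptions: the function names are hypothetical, and Python's built-in set stands in for whatever index structure a real product might use.

```python
def linear_lookup(known_hashes: list, candidate: bytes):
    """Compare the candidate against every stored hash.

    Returns (found, comparisons) so the amount of work is visible.
    """
    comparisons = 0
    for stored in known_hashes:
        comparisons += 1
        if stored == candidate:
            return True, comparisons
    return False, comparisons


def indexed_lookup(index: set, candidate: bytes) -> bool:
    """Membership test against a hash-table index.

    On average the cost does not grow with the number of stored hashes.
    """
    return candidate in index
```

For a hash that is not yet in the database, the linear scan performs one comparison per stored hash, which is why large hash databases are usually paired with fast memory and some form of index.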

Differences in Post-Process and Inline Data Deduplication

In terms of how well deduplication is performed, it’s important to consider the differences between “post-process” and “inline” deduplication. As its name suggests, post-process deduplication means that incoming data is first stored to disk and then processed for deduplication at a later time. Alternatively, when data is processed for deduplication before being written to disk, this is called inline deduplication.

Inline deduplication has the advantage of writing data to disk only once, but it runs the risk of slowing down disk write time if compute resources are not sufficient. With the increase in CPU power, system RAM and solid-state disk drives, inline deduplication has become the preferred method compared to post-process deduplication, which requires extra storage space and more disk writes.
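The difference in disk writes can be shown with a toy model. The sketch below is an assumption-laden simplification: a dictionary stands in for the disk, SHA-256 is an arbitrary hash choice, and the function names are hypothetical. It only counts the bytes each approach writes on ingest.

```python
import hashlib


def digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


def inline_ingest(blocks):
    """Deduplicate before writing: each unique block is written once."""
    disk = {}
    bytes_written = 0
    for block in blocks:
        d = digest(block)
        if d not in disk:
            disk[d] = block
            bytes_written += len(block)
    return disk, bytes_written


def post_process_ingest(blocks):
    """Write everything first, then deduplicate in a later pass."""
    staging = list(blocks)  # step 1: all incoming data lands on disk
    bytes_written = sum(len(block) for block in staging)
    disk = {}
    for block in staging:   # step 2: the later deduplication pass
        disk.setdefault(digest(block), block)
    return disk, bytes_written
```

Both functions end up with the same deduplicated contents, but the post-process version first writes every incoming block, duplicates included, which is the extra space and write activity noted above.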

“Target Deduplication” and “Source-Side” Deduplication

The second consideration that should be addressed is where the process of deduplication is performed. Given the existence of the hash database, it is reasonable to believe that the hash values, which are small 16-byte values, can be shared in a network environment more quickly and easily than the full blocks of data. This is where the definitions of “target deduplication” and “source-side” deduplication become important. Target deduplication means that the full set of data is sent over the network and is deduplicated when it reaches the target deduplication appliance. Target deduplication was the first method that achieved widespread success when combined with data protection. Purpose-Built Backup Appliances (PBBAs) are the target backup appliances that end users installed alongside their backup software to reduce the storage footprint of backup data.

On the other hand, source-side deduplication means that the process begins at the data source. Only when data is determined to be unique is it transferred to the backup storage device. In a traditional backup solution, this process is managed by the Backup Server. The Backup Server maintains the hash database and works with Agents installed on the Backup Clients. The Agent on the Backup Client (“the data source”) computes hash values of its local data and sends the hash values to the Backup Server to compare with all existing hash values stored in the hash database. The Backup Server then tells the Agent what data is unique and therefore what data to send to the Backup Server for storage.
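The exchange between Agent and Backup Server described above can be sketched as follows. This is a minimal in-process model, not a real backup protocol: the class and method names are hypothetical, SHA-256 is an arbitrary hash choice, and method calls stand in for network messages.

```python
import hashlib


def digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class BackupServer:
    def __init__(self):
        self.blocks = {}  # hash database and block store combined

    def find_unknown(self, digests):
        """Return the digests the server has not stored yet."""
        return {d for d in digests if d not in self.blocks}

    def store(self, new_blocks):
        self.blocks.update(new_blocks)


class Agent:
    def __init__(self, server: BackupServer):
        self.server = server

    def backup(self, blocks):
        """Send hashes first, then only the blocks the server asks for.

        Returns the number of bytes of block data actually sent.
        """
        by_digest = {digest(block): block for block in blocks}
        wanted = self.server.find_unknown(by_digest.keys())
        to_send = {d: by_digest[d] for d in wanted}
        self.server.store(to_send)
        return sum(len(block) for block in to_send.values())
```

On a first backup every unique block is sent once. On later backups, only blocks the server has never seen cross the network, while the small hash values are always exchanged.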

The advantage of source-side deduplication is the reduction of data that’s sent across the network, and the resulting performance gain. In particular, source-side applications with large data files, such as database applications, benefit enormously by not having to transfer very large files over the network. The challenge of source-side deduplication is that it requires a major upgrade to the Backup Server. Traditional Backup Servers do not process data for deduplication; rather, they depend on a PBBA for this purpose. To save money and headaches, it’s important to look for next-generation backup solutions that have integrated deduplication into the Backup Server and do not use a PBBA.

Global Data Deduplication

Finally, the method of global deduplication should be considered, as it is an optimized form of source-side deduplication. With this method, every computer, virtual machine or server across local, remote and virtual sites communicates with a backup server, which manages a global database index of all associated files and automatically determines what needs to be backed up. The Backup Server pulls only new data as required while eliminating duplicate copies, and shares the deduplicated “intelligence” across all source systems. Since backup data is globally deduplicated before it is transferred to the target backup storage device, only changes are sent over the network, which significantly improves performance and reduces bandwidth usage.
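A shared index across several sources can be modeled in a few lines. The sketch below is a simplification under stated assumptions: the GlobalIndex class and its methods are hypothetical, SHA-256 is an arbitrary hash choice, and hashing is done inside the index for brevity, whereas in a source-side design the hash would be computed on the source system.

```python
import hashlib


class GlobalIndex:
    """One shared index for every source system (an illustrative stand-in)."""

    def __init__(self):
        self.blocks = {}   # digest -> block contents, stored once
        self.catalog = {}  # source name -> ordered digests of its latest backup

    def backup(self, source: str, blocks):
        """Record a backup for `source`.

        Returns the bytes of block data that had to cross the network.
        """
        transferred = 0
        recipe = []
        for block in blocks:
            d = hashlib.sha256(block).digest()
            if d not in self.blocks:  # not seen from any source so far
                self.blocks[d] = block
                transferred += len(block)
            recipe.append(d)
        self.catalog[source] = recipe
        return transferred

    def restore(self, source: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.catalog[source])
```

If a laptop and a virtual machine share the same operating system image, the second system to be backed up transfers only the blocks the index has not already received from the first, yet each source can still be restored in full.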

For Best Overall Results

The question of whether or not organizations should leverage deduplication to effectively manage and protect data is somewhat of a no-brainer. Deduplication, when applied correctly, is a fantastic technology that can greatly benefit backup and recovery performance while reducing storage costs. The deduplication process itself is compute intensive and requires special consideration when deploying in your network. By leveraging the latest CPU, system RAM and SSD resources, the performance requirements of deduplication can be satisfied. This allows you to enjoy the benefits of inline deduplication, combined with global source-side deduplication for the best overall results.