What's the Fuss about Data Deduplication?


Listen to a group of database professionals talk for a while and someone will eventually bring up the topic of data deduplication. Data deduplication is a means of eliminating redundant data, through either hardware or software technologies. To illustrate, imagine you've drafted a new project plan and sent it to five teammates asking for input. That single file has now been reproduced, in identical bits and bytes, on a total of six computers. If everyone's email inbox is backed up every night, that's another six copies on the email backup server. With data deduplication technology, only a single instance of your project plan would be backed up, and all other instances of the identical file would simply be tiny on-disk pointers to the original.
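The project plan scenario can be sketched in a few lines of Python. This is only a toy model, not any vendor's implementation: the `SingleInstanceStore` class, its method names, the owner names, and the file name are all hypothetical, and SHA-256 is simply an assumed choice of hash.

```python
import hashlib


class SingleInstanceStore:
    """Toy backup target that keeps one copy of each distinct file body."""

    def __init__(self):
        self.blobs = {}    # content hash -> file bytes (stored once)
        self.catalog = {}  # (owner, filename) -> content hash (the "pointer")

    def backup(self, owner, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.catalog[(owner, filename)] = digest

    def restore(self, owner, filename):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.catalog[(owner, filename)]]

    def stored_bytes(self):
        return sum(len(body) for body in self.blobs.values())


# Hypothetical project plan sitting in six mailboxes (author plus five teammates).
plan = b"Project plan v1: scope, milestones, owners\n" * 1000
owners = ["author"] + [f"teammate{n}" for n in range(1, 6)]

store = SingleInstanceStore()
for owner in owners:
    store.backup(owner, "project_plan.doc", plan)

print("catalog entries:", len(store.catalog))
print("bytes without deduplication:", len(plan) * len(owners))
print("bytes with deduplication:", store.stored_bytes())
```

Six catalog entries point at one stored body, so the store holds the plan's bytes once rather than six times. A production system would also have to handle hash collisions and pointer bookkeeping, which this sketch ignores.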

Many vendors currently offer data deduplication in their products, including early pioneers Data Domain (now an EMC company), Quantum, FalconStor, and ExaGrid.  Many of the big storage hardware vendors, including EMC, Symantec, IBM, and NEC, also have offerings in this space.  However, not all techniques for data deduplication are created equal, so it's important to understand the underlying technology. In addition, many situations occur in which data deduplication is a bad choice for database backups. 

Deduplication can happen at two levels. First, data can be deduplicated at the source file system, where the file system scans for new files. When it finds one, it computes a hash of the file's contents; any file whose hash matches that of an existing file is removed and replaced with a pointer to the original. This approach supports multiple versions of a single file using a feature called copy on write. Deduplication also can happen on the secondary store, such as a deduplicating SAN or NAS from one of the vendors mentioned earlier. In this scenario, the individual user clients and servers may have duplicate data and files on their own hard disks, but when their data is backed up to the SAN or NAS, the data is assessed to determine whether duplicates exist. If they do, the duplicates are replaced with hash-based pointers. Depending on the vendor, the deduplication process might happen as files are copied to the SAN, or sometime later, based on a regularly scheduled hashing routine.
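As a rough illustration of the scheduled variant described above, the following Python sketch runs a single hashing pass over files that have already been stored in full. The function names `dedupe_pass` and `read_file`, the sample paths, and the use of SHA-256 are assumptions made for the example, and copy on write is not modeled.

```python
import hashlib


def dedupe_pass(files):
    """Scan stored files and replace duplicates with pointers.

    files: dict mapping path -> bytes.
    Returns (unique, pointers): the file bodies that are kept, and a map
    from each duplicate path to the path of the retained original.
    """
    first_seen = {}  # content hash -> path of the first file with that content
    unique = {}
    pointers = {}
    for path in sorted(files):
        digest = hashlib.sha256(files[path]).hexdigest()
        if digest in first_seen:
            # Identical hash: drop the body and keep only a pointer.
            pointers[path] = first_seen[digest]
        else:
            first_seen[digest] = path
            unique[path] = files[path]
    return unique, pointers


def read_file(path, unique, pointers):
    # Reads go through the pointer when the path was deduplicated.
    return unique[pointers.get(path, path)]


# Hypothetical contents of a backup volume before the scheduled pass.
volume = {
    "/home/ann/plan.doc": b"project plan",
    "/home/bob/plan.doc": b"project plan",
    "/home/bob/notes.txt": b"meeting notes",
}
unique, pointers = dedupe_pass(volume)
print("bodies kept:", sorted(unique))
print("pointers:", pointers)
```

An inline design would run the same hash check as each file arrives instead of in a later pass. Either way, every byte still has to be read and hashed.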

Data deduplication is a very CPU-intensive process. One reason deduplicating SAN and NAS devices are so expensive is that they need a large amount of their own internal memory and CPU to analyze and hash files.

While data deduplication has big benefits for email systems, file system backups, and collaboration systems like Microsoft SharePoint, it does not lend itself to database backups or live database files. The deduplication algorithms don't offer much value if you're already using a database backup compression feature like the one in SQL Server 2008 Enterprise Edition, or in products like Quest's LiteSpeed or Red Gate's SQL Backup. (Full disclosure: I work for Quest Software directly on the LiteSpeed for SQL Server and Oracle products.) You will typically see little or no deduplication of data if you're already using some form of database backup compression. Also, data deduplication technologies do not improve the speed of backup or recovery, and can actually slow those processes dramatically. Backups to a deduplicating storage device may take longer because of long network queues, and database recovery may take longer because the hashing algorithms might fragment the backup files.
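One way to see why compressed backups tend to deduplicate poorly is a small experiment. The Python sketch below is an illustration under stated assumptions, not a model of any product: `zlib` stands in for backup compression, fixed 4 KB blocks stand in for whatever chunking a deduplicating device uses, and `fake_backup` invents two nearly identical "backups" that differ in a single row.

```python
import hashlib
import zlib

BLOCK = 4096  # assumed fixed block size; real devices vary


def block_hashes(data, size=BLOCK):
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


def shared_blocks(old, new):
    """Return (blocks of `new` already present in `old`, total blocks in `new`)."""
    seen = set(block_hashes(old))
    new_hashes = block_hashes(new)
    return sum(1 for h in new_hashes if h in seen), len(new_hashes)


def fake_backup(changed_row=None):
    # Hypothetical table dump: every row is the same length, one may differ.
    rows = []
    for i in range(20000):
        status = "NO" if i == changed_row else "OK"
        rows.append(f"row {i:08d},customer-{i % 1000:05d},status={status}\n")
    return "".join(rows).encode("ascii")


monday = fake_backup()
tuesday = fake_backup(changed_row=10)

raw_shared, raw_total = shared_blocks(monday, tuesday)
zip_shared, zip_total = shared_blocks(zlib.compress(monday),
                                      zlib.compress(tuesday))

print(f"uncompressed: {raw_shared} of {raw_total} blocks already stored")
print(f"compressed:   {zip_shared} of {zip_total} blocks already stored")
```

In the uncompressed case, every block except the one containing the changed row matches the previous backup. After compression, a one-row change alters the compressed stream from that point on, so a smaller share of blocks match and the deduplicating device has less to eliminate.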

To summarize, data deduplication is a great feature for backing up desktops, collaboration systems, web applications, and email systems. If I were a DBA or storage administrator, however, I'd skip deduplicating database files and backups and devote that expensive technology to the areas of my infrastructure where it can offer a strong ROI.