IT and database managers face the backup window dilemma all too often as data volumes continue to grow, often by as much as 50% annually. As a result, day-to-day operations push organizations’ data protection objectives beyond their limits.
The expansion of structured and unstructured data storage seems never-ending. At the same time, database administrators’ need to reduce storage consumption is accelerating as its cost becomes more visible, and data store management and workflow operations are an increasing drain on IT budgets. Not only do data stores continue to expand, but workflow operations add multiple versions of each iteration of the database as it is updated, managed, and protected. The interesting part is that the differences between iterations are usually not significant, yet a whole new version is created, and the storage it consumes is costly, not to mention the added floor space, power, and cooling expenses of the additional hardware.
Today, however, data optimization technologies are available to help cope with continued data growth. Most notable are compression and data deduplication, each with its own capabilities. Compression reduces redundancy within a file by finding duplicate data strings and eliminating them, leaving a pointer in their place. Data deduplication reduces redundant data within and across files using a similar approach: data strings (chunks) are analyzed with standard hashing algorithms to create key identifiers. The keys are managed in a database and compared as new data chunks are added to determine whether a duplicate exists. If it does, again a pointer is left in place of the chunk, reducing storage consumption.
Data Compression Techniques
Compression works well on structured data and has long been viewed as an industry standard. LZ compression in particular has found its way into many storage systems, where it is available to reduce file size and storage consumption. It is known to be resource efficient and has not been a significant performance drag on the systems that use it. Its Achilles’ heel, unfortunately, is that it works only within a single file. As data stores have scaled out, it has become clear that duplicates span files, and that more space savings can be found if duplicate data strings are detected across all files.
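The within-file behavior described above can be seen with Python's standard zlib module, which implements DEFLATE, an LZ77-based scheme. This is a minimal sketch, assuming a synthetic file made of repetitive structured records; the record contents and sizes are illustrative, not from any particular system.

```python
# Sketch: within-file compression with zlib (LZ77-based DEFLATE).
# The record below is a hypothetical example of repetitive structured data.
import zlib

record = b"id=1042;status=OK;region=us-east;"
data = record * 1000  # repetitive structured data compresses well

compressed = zlib.compress(data, level=6)
ratio = len(data) / len(compressed)
print(f"original: {len(data)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {ratio:.0f}:1")

# Decompression restores the exact original bytes.
assert zlib.decompress(compressed) == data
```

Because the algorithm only sees one file's bytes at a time, an identical record stored in a second file would be compressed again from scratch, which is precisely the limitation deduplication addresses.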
Data deduplication has the ability to look for data chunks (strings) across millions or even billions of files. Each chunk of data is represented by a key derived from a hashing algorithm (today most often SHA-256), which is then filed in a database and compared as each new chunk of data enters the analysis. If a new chunk is a match (the same hash key is generated), that chunk is not stored; instead, a pointer is placed back to the original chunk, saving storage space. This process initially consumed processor, memory, and HDD resources to manage the key lookup. As new techniques evolved for representing and managing the keys, the process has become more efficient and less of a resource drain.
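The chunk-key-pointer mechanism can be sketched in a few lines. This is a simplified model, assuming fixed-size 4 KiB chunks and an in-memory dict as the key database; the `DedupeStore` class and stream names are hypothetical. Production dedupe engines typically use variable-size chunking and persistent, memory-efficient key indexes.

```python
# Sketch: chunk-level deduplication with SHA-256 keys.
# Assumptions: fixed-size chunks, in-memory key database.
import hashlib

CHUNK_SIZE = 4096

class DedupeStore:
    def __init__(self):
        self.chunks = {}   # SHA-256 key -> unique chunk bytes
        self.streams = {}  # stream name -> ordered list of keys (pointers)

    def write(self, name, data):
        keys = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            key = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if its key is new; otherwise the key
            # acts as a pointer back to the already-stored copy.
            self.chunks.setdefault(key, chunk)
            keys.append(key)
        self.streams[name] = keys

    def read(self, name):
        # Reassemble the stream by following the pointers.
        return b"".join(self.chunks[k] for k in self.streams[name])

store = DedupeStore()
store.write("backup-mon", b"A" * 8192 + b"B" * 4096)
store.write("backup-tue", b"A" * 8192 + b"C" * 4096)  # shares two chunks
print(f"logical bytes: 24576, unique chunks stored: {len(store.chunks)}")
```

Here two 12 KiB backup streams share their first two chunks, so only three unique chunks are stored instead of six, and each stream can still be reassembled exactly from its pointer list.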
In parallel, processors went from one or two cores to six or eight, enabling incremental compute power to be applied to deduplication. Dedupe engines have seen significant performance improvements as a result, and the technology continues to improve and evolve. With these incremental gains, data deduplication can scale and has shed its early criticism that it imposes processing latency on the systems that employ it.
Backup solutions have employed data deduplication engines for many years to save storage space in the backup data store. Some backup deployments use data optimization in a post-process mode to shrink the time required for backups and meet their backup window objectives. To do this, they must add data cache storage ahead of the optimization step, which adds cost and complexity to the IT equation.
The Need for Data Optimization Technologies
Rampant data growth affects budgets, operating costs, floor space and, of course, capital expenditure through the sheer amount of data created and stored. Efforts to shorten backup times may therefore be further limited by the need for cost containment.
To address these issues and to gain competitive market share and revenues, backup vendors need to implement data optimization technologies that satisfy the following key requirements:
- Performance — Data optimization must be extremely efficient and maintain a level of performance that does not impede overall storage/backup performance. If the optimization engine runs fast enough, it can run inline, eliminating the post-process optimization run along with the need and cost of its temporary cache storage. Storage vendors have made billion-dollar R&D investments to optimize storage performance as a means of differentiating their offerings; an optimization engine should never slow the overall backup process.
- Scalability – Less than 10 years ago, only a handful of IT organizations had a petabyte of data. Today, thousands of large organizations require more than that. Data optimization solutions must scale to multiple petabytes to meet these customers’ needs today, and to exabyte capacities in the future.
- Resource efficiency – Whether backup is done in a standalone appliance or with server-based software, the dedupe/data optimization process consumes RAM and CPU. Highly efficient resource utilization enables greater scalability and lets the backup process run as quickly as possible.
- Data Integrity — Data optimization technology must not interfere with the storage application software in a way that increases data risk. The storage software must maintain control over writing the data to disk and the data optimization software cannot modify the data format in any way. This has the benefit of eliminating the need for complex data reassembly processes (commonly called “rehydration”), and protects data against possible corruption.
Backup vendors need a solution that runs without latency or performance impact (which would further limit the ability to complete backups within the allocated window) while eliminating the post-process data cache and its cost. The solution must also work seamlessly across primary, archive, and backup storage, broadly addressing the needs of each of these tiers.
Backup windows are getting smaller and smaller, yet also more critical. When properly deployed, data optimization technologies help IT complete backup operations within the allotted time and at an improved TCO over alternative techniques.
About the author
Wayne Salpietro is the director of product and social media marketing at data storage and cloud backup services provider Permabit Technology Corp. He has served in this capacity for the past six years, prior to which he held product marketing and managerial roles at CA, HP, and IBM.