Big Data Backup - Effective Remote Replication and Disaster Recovery Strategies

For many years, enterprise data center managers have struggled to implement disaster recovery strategies that meet their RTO/RPO and business continuity objectives while staying within budget. The challenges of moving, managing, and storing massive data volumes for effective disaster protection have not changed, but exponential data growth and the advent of big data technologies have made disaster recovery more difficult than ever before.

According to a recent vendor survey of large enterprises, The Data Protection Index 2012, data growth in large companies continues unabated: thirty-three percent of respondents reported that their data was growing at 20 to 30 percent annually, and an additional twenty percent reported even higher annual growth rates. Respondents also reported a marked acceleration over the previous year, with nearly one quarter reporting a growth rate twenty-five percent higher than last year's.

Replication Efficiency

While a significant number of large enterprise backup environments (eighteen percent) are still making copies of physical tape and storing them off-site, more and more companies are replacing their physical tape systems with disk-based backup and electronic replication for DR. Nearly half (forty-seven percent) of respondents are replicating more than fifty percent of their data to a remote location for DR protection. Electronic replication not only adds the obvious benefits of speed and automation; it also enables companies to perform faster, more frequent, and more realistic DR testing than physical tape libraries allow.

Interestingly, twenty-one percent have an active-active remote replication strategy in place and forty-one percent have an active-passive replication strategy.

There is clearly a growing need for more efficient ways to move massive data volumes over a WAN and to manage backup, restore, replication, and DR from a holistic, enterprise-wide perspective.

Centralized Management Saves Money, Improves Efficiency

While many companies have moved to disk-based backup and recovery technologies to gain performance and capacity reduction through deduplication, many have implemented systems designed for medium-sized enterprises. While these systems offered a simple way to move from tape to disk, they are not designed to handle the volume of data or the complexity of backup requirements in a large enterprise or big data environment. They have limited capacity and performance, forcing companies to add a new system every time their data volumes grow. This practice soon leads to data center sprawl, with data divided among multiple “silos” of storage in the main data center and remote offices left with little or no backup or DR protection at all. In fact, fifty percent of survey respondents characterize their environments as having “moderate” or “severe” sprawl requiring them to routinely add data protection systems to scale performance or capacity. These systems also use a hash-based, inline deduplication technology that is quickly overwhelmed by large data volumes.

Not surprisingly, companies are increasingly focused on finding more efficient ways to protect massive data volumes from disaster and downtime. Massively scalable, manageable backup and disaster protection solutions are needed to run these large, complex environments in an efficient, centralized, holistic way. While limited-scale systems are effective for small-to-medium enterprises, they quickly become costly and inefficient in larger implementations. Systems designed specifically for very large enterprise and big data backup environments have several advantages over their non-scalable counterparts:

  • They allow IT managers to add capacity and performance as their needs grow, eliminating both costly data center sprawl and wasteful over-buying.
  • They reduce labor costs and human errors by enabling a single administrator to manage petabytes of data in a single automated system.
  • They cut bandwidth costs and capacity requirements by performing global deduplication of data and transmitting it to DR sites in a bandwidth-optimized, deduplicated form.
  • They leverage powerful new DR tools, such as NetBackup OST Auto Image Replication, to move massive data volumes over a WAN efficiently and to manage complex replication, retention, and restore policies through automated storage lifecycle policies.
  • They deduplicate data across volumes and disks and are optimized to deduplicate data types (Oracle, SQL, DB2) that inline, hash-based deduplication technologies cannot process.
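The bandwidth point above can be sketched conceptually. In the toy model below (hypothetical chunk size and fingerprint set, not any product's implementation), the DR site tracks the chunk fingerprints it already holds, so a repeated backup transmits almost nothing over the WAN:

```python
import hashlib

class DedupReplicator:
    """Toy global-deduplication model: the DR site keeps a chunk store
    keyed by content hash, so only chunks it has never seen are sent."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.remote_chunks = set()   # fingerprints already at the DR site

    def replicate(self, data: bytes) -> int:
        """Return the number of bytes actually transmitted over the WAN."""
        sent = 0
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.remote_chunks:
                self.remote_chunks.add(digest)
                sent += len(chunk)   # new chunk: ship the full bytes
            # known chunk: only a (negligible) hash reference is needed
        return sent

backup = b"A" * 8192 + b"B" * 4096   # 12 KB backup image
r = DedupReplicator()
print(r.replicate(backup))           # first run: 8192 (the two "A" chunks dedup)
print(r.replicate(backup))           # repeat run: 0 bytes cross the WAN
```

Real systems add resilience, variable chunking, and index management on top of this idea, but the economics are the same: once a chunk exists at the DR site, it never travels again.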

Remote Offices Under-Protected

Despite the increased use of electronic replication and the inroads made in implementing DR protection, companies have been slow to improve DR in remote offices. Fifteen percent of data in remote offices and eleven percent of data in main data centers is currently not backed up or protected. In addition, a full seventeen percent of respondents are still either working without a disaster recovery strategy or in the process of implementing one.

Capacity Reduction Imperative

Deduplication is a key technology for companies trying to control rampant data growth and reduce the bandwidth required for efficient replication to DR sites. However, the increasing use of very large databases (Oracle, SQL, and DB2) and big data analytics tools has pushed traditional inline, hash-based deduplication technologies to their limit. Databases pose two challenges to this deduplication technology. First, databases store data in very small segments (under 8 KB) that inline, hash-based deduplication technologies cannot process efficiently without slowing backup or replication performance to unacceptable speeds.
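To see why sub-8 KB segments strain an inline, hash-based system, consider the back-of-the-envelope sketch below (the chunk sizes and 64-byte index entry are illustrative assumptions, not vendor figures). A hash-based system must keep one fingerprint entry per unique chunk and consult that index on every incoming chunk, so shrinking the chunk size multiplies the index it must search at backup speed:

```python
# Rough size of the fingerprint index an inline, hash-based deduplication
# system needs: one entry per unique chunk stored. The 64-byte entry
# (hash plus metadata) is an illustrative assumption.

def index_bytes(stored_bytes, chunk_size, entry_size=64):
    """Approximate fingerprint-index size for a given unique-data volume."""
    return (stored_bytes // chunk_size) * entry_size

PB = 10 ** 15
small = index_bytes(PB, 8 * 1024)     # database-style 8 KB segments
large = index_bytes(PB, 128 * 1024)   # coarse 128 KB chunks

# The 8 KB case needs a 16x larger index for the same petabyte of data.
print(small // 10 ** 9, "GB vs", large // 10 ** 9, "GB")
```

An index measured in terabytes can no longer live in memory, and every cache miss stalls the inline pipeline, which is why small-segment database data slows these systems disproportionately.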

Second, inline, hash-based technologies do not deduplicate multistreamed and multiplexed data, two critically important performance accelerators for database backup, replication, and restore. As a result, massive data volumes within large enterprises are left untouched by deduplication and deduplication/replication technologies. Large enterprises and companies with big data backup environments therefore need a deduplication technology designed specifically for these very large database environments, such as ContentAware, byte-differential deduplication, which can process massive data volumes and multistreamed, multiplexed databases while maintaining high-performance backup, restore, and replication.
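A toy example (not the ContentAware implementation) shows why fixed-boundary, hash-based fingerprints miss the redundancy in multiplexed data: a small interleaved header (the hypothetical 3-byte "MPX" below) shifts every byte that follows it, so no chunk boundary lines up with the original stream, while a byte-level differential comparison still finds the two nearly identical:

```python
import hashlib
from difflib import SequenceMatcher

CHUNK = 8  # toy chunk size; real systems use kilobyte-scale chunks

def fingerprints(data):
    """Fixed-boundary chunk hashes, as an inline hash-based system computes."""
    return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

stream = b"0123456789" * 8            # a backup stream
# A multiplex header shifts every later byte, so every fixed chunk
# boundary lands in a new place relative to the original stream.
multiplexed = b"MPX" + stream

shared_chunks = fingerprints(stream) & fingerprints(multiplexed)
byte_similarity = SequenceMatcher(None, stream, multiplexed).ratio()

print(len(shared_chunks))             # 0: hash dedup sees nothing in common
print(round(byte_similarity, 2))      # ~0.98: byte-level comparison does
```

Byte-differential approaches pay for this by comparing content rather than fingerprints, which is the trade-off that lets them handle multistreamed, multiplexed database backups that defeat boundary-sensitive hashing.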

The Outlook for 2013

As we look ahead to unabated data growth and further adoption of data-intensive big data technologies, enterprises will increasingly look to more powerful backup and recovery technologies to deliver fast, scalable, cost-efficient DR.


About the Author:

Peter Quirk is a director of product management at Sepaton, Inc. He has spent most of his career working for vendors in systems engineering, product marketing, product management and project management roles, with responsibilities in operating systems, databases, languages, hardware platforms, storage, and social media. In his spare time he likes to code and explore the world of big data and all things related to Hadoop.