In recent years, disaster recovery (DR) has garnered attention from the company boardroom to the Office of the CIO. Despite this fact, many companies have yet to implement an effective DR solution to safeguard their applications and data. This inertia stems from two factors: the perception of the term “Disaster” (“when the disaster happens, we’ll deal with it then”) and the shortcomings of existing solutions (“we don’t have the budget for machines to sit idle”).
Due to the rarity of catastrophic disasters such as earthquakes, floods and fires, organizations rarely implement comprehensive disaster protection measures. However, there is another set of “technical disasters” caused by much more mundane events that regularly lead to significant system outages. These include faulty system components (server, network, storage, and software), data corruptions, backup or recovery of bad data, erroneous batch jobs, bad installations, upgrades, or patches, operator errors, and power outages, among others.
("Effective Disaster Recovery Planning Needs Centralization," published in DBTA in January 2007, outlined the importance of having a centralized IT organizational structure in order to design and implement an effective disaster recovery strategy. With that in place, organizations should evaluate DR solutions and implement the one that best aligns with their strategy.)
A comprehensive DR solution must adequately protect organizations from both catastrophic events and these more commonplace disasters to maintain business continuity.
Evaluation Framework for DR Solutions
Organizations should consider the following three categories when evaluating potential DR solutions: data availability, data protection and systems utilization.
Data availability means that even if an outage occurs, data remains available for continuous access by business-critical applications. A DR solution must ensure that outages are tolerated as transparently as possible and recovered from as quickly as possible; relevant outages include server failures and network failures. To provide uninterrupted data access, a DR solution must detect an outage rapidly and fail over in short order. This failover should occur to a fully synchronized replica of the production database, known as a “hot standby” database, preferably located at an alternate site, and it should happen automatically without any data loss. Application clients must also be redirected to the new production database seamlessly and instantaneously.
Using “active-active” server clustering, it is possible to tolerate server failures transparently as long as other servers are available in the cluster to handle the application workload. However, the DR solution must also provide an automatic failover capability to a remote site in case clustering is not configured for the database servers or the entire cluster fails. Similarly, if such server clustering is indeed configured, the DR solution should recognize that and avoid beginning a site failover process when there are still active nodes in the cluster.
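The decision rule described above, namely letting the local cluster absorb node failures and invoking a site failover only when the entire cluster is lost, can be sketched as a small function. This is a hypothetical illustration of the logic, not any vendor's API:

```python
def should_trigger_site_failover(active_nodes: int, primary_reachable: bool) -> bool:
    """Decide whether to fail over to the remote DR site.

    A site failover is warranted only when the primary database is
    unreachable AND no surviving cluster node can absorb the workload;
    otherwise the local cluster handles the failure transparently.
    """
    if primary_reachable:
        return False  # primary is healthy: nothing to do
    if active_nodes > 0:
        return False  # surviving nodes exist: let local clustering fail over
    return True       # whole cluster is down: fail over to the DR site
```

In practice a real DR product would gather `active_nodes` and `primary_reachable` from its own monitoring layer; the point here is only that site failover is the last resort, behind local cluster failover.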
Another best practice to consider is the ability to detect false failures. Customers must be able to configure their DR solution with appropriate timeout values, which can adapt based on historical trends, to ensure that transient network or server error conditions do not trigger false site failovers.
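One simple way such an adaptive timeout could work is to derive the threshold from recent heartbeat history, for example the mean interval plus a few standard deviations, with a configured floor. The function below is a minimal sketch of that idea under these assumptions; it is not taken from any specific product:

```python
from statistics import mean, stdev

def adaptive_timeout(recent_heartbeat_secs: list[float], k: float = 4.0,
                     floor_secs: float = 5.0) -> float:
    """Derive a failure-detection timeout from recent heartbeat intervals.

    Using mean + k standard deviations of observed intervals lets the
    timeout track current network conditions, so a transient slowdown
    does not immediately trigger a false site failover.
    """
    if len(recent_heartbeat_secs) < 2:
        return floor_secs  # not enough history: use the configured floor
    threshold = mean(recent_heartbeat_secs) + k * stdev(recent_heartbeat_secs)
    return max(threshold, floor_secs)
```

When the network is stable, the timeout settles near the floor; when heartbeat intervals become erratic, the timeout stretches, trading slower detection for fewer false failovers.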
Data protection means that even in the presence of data-related failure conditions such as storage failures, site failures, data corruptions and operator errors, the integrity of the underlying data is not compromised. Most DR solutions offer protection from storage and site failures by maintaining synchronous replication (or mirroring) to another storage array located remotely. But even then, the devil is in the details. Does the solution allow synchronous replication only up to tens of miles? Synchronous replication is needed to ensure zero data loss in the event of outages, and customers should consider a distance of approximately 200 miles or greater to be safe from outages that may affect the production site. This distance typically puts the remote storage array on a different power grid or flood zone. The sweet spot for any DR solution should be the ability to support zero-data-loss synchronous replication over hundreds of miles.
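As a back-of-the-envelope check on why replication distance matters, the minimum latency a synchronous commit must absorb can be estimated from signal speed in fiber (roughly 200,000 km/s). The constants and helper below are illustrative assumptions, not figures from any particular product, and real protocols add processing overhead on top of the pure round trip:

```python
# Speed of light through optical fiber is roughly two-thirds of c.
FIBER_KM_PER_SEC = 200_000.0
KM_PER_MILE = 1.609

def sync_replication_rtt_ms(distance_miles: float) -> float:
    """Minimum round-trip latency (ms) a synchronous commit must wait
    for acknowledgment from a standby at the given distance."""
    one_way_sec = (distance_miles * KM_PER_MILE) / FIBER_KM_PER_SEC
    return 2 * one_way_sec * 1000.0
```

At 200 miles this works out to roughly 3.2 ms of added round-trip latency per commit, which is why supporting zero-data-loss synchronous replication over hundreds of miles is an engineering achievement rather than a default capability.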
With regards to data protection, reducing or eliminating the issue of data corruptions is critical. In today’s complex IT stack, any component can fail – file system, volume manager, device driver, host bus adapter, storage controller, disk drive, and software/firmware. Such a failure can potentially corrupt the underlying data. A DR solution must deliver adequate fault isolation from such data corruptions. For example, if the solution employs a replication scheme, it must ensure that primary site data corruptions do not propagate to, and impact the data at secondary sites.
A DR solution typically consists of redundant components such as storage, servers, and software. If these components are not performing any useful work, it is difficult to justify the cost of the DR solution, which becomes a roadblock to implementing an effective DR strategy. A DR solution should therefore allow its components to be used for productive work, adding value to an organization’s DR investment.
What other projects can these DR systems support? They can facilitate hardware and software upgrades with minimal application downtime. For hardware upgrades, an organization can switch production applications over to the DR system, upgrade the hardware at the production site, and then switch applications back to the original production server, thereby reducing overall downtime. Similar mechanisms can be used for rolling database upgrades, SAN migrations, data center relocations, and platform migrations. With the right set of capabilities to support it, a DR solution can reduce, and potentially eliminate, downtime for planned maintenance activities.
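The hardware-upgrade scenario above can be summarized as a three-step switchover sequence. The sketch below uses entirely hypothetical names to make the ordering explicit; it does not correspond to any real DR product's commands:

```python
def rolling_hardware_upgrade(primary: str, standby: str) -> list[str]:
    """Return the ordered steps of a planned switchover for a hardware
    upgrade, as described in the text: switch over, upgrade, switch back."""
    return [
        f"switchover: route applications from {primary} to {standby}",
        f"upgrade: replace hardware at {primary} while {standby} serves traffic",
        f"switchback: return applications to {primary} once it resynchronizes",
    ]
```

The key property is that applications are down only for the two brief switch operations, not for the duration of the hardware work itself.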
Another aspect of effective DR systems utilization involves offloading ancillary work from production to DR servers, resulting in increased performance of the production system. Reporting, backups and testing can all be offloaded to DR servers. If the DR solution keeps the DR server synchronized in real time, and that server is open and online while synchronization takes place, it can serve as a real-time reporting platform. Organizations should also examine offloading backups to the DR server to free up processing cycles on the production server. Similarly, since the DR server is synchronized with the production data set, it should be possible to conduct testing on the DR server using production data, without compromising disaster protection capabilities.
While implementing a DR solution is now a corporate-wide mandate, choosing the right solution is a challenge. This article outlined a DR solution evaluation framework based on three categories:
Data availability – “Does the DR solution transparently tolerate server/network failures and recover from them quickly?”
Data protection – “Does the DR solution protect from a wide variety of data-related failures such as storage failures, site failures, human errors, and data corruptions?”
Systems utilization – “Does the DR solution allow effective use of system resources, for real-time access by applications and offloading processing from the production server, and also for planned maintenance activities by administrators?”
A DR solution is not comprehensive unless it provides robust capabilities across all three of these areas. Additionally, the solution should be easy to deploy and implement, requiring zero-to-minimal integration with existing systems. A careful evaluation of DR solutions will go a long way towards ensuring organizations have a bulletproof DR strategy in place.