IT managers at organizations of all sizes know the importance of maintaining access to critical applications and data. Failures range from irritating "system unavailable" messages to natural and man-made disasters in which entire systems may be lost, and the challenge is particularly acute for database-driven, transactional applications and data, the lifeblood of the business. These dynamic applications and data, which comprise, process, and manage critical customer accounts and history, sales, marketing, engineering, and operational components, keep the organization thriving.
A comprehensive disaster recovery solution for both critical database applications and data in today's non-stop business environment must provide integrated application and data protection, including recovery against hardware, software, and site failures. In order to implement a comprehensive disaster recovery solution for your critical database application systems, it is important to first look at the comparative benefits of various technologies and solutions.
Today's spectrum of technologies can help organizations recover from varying levels of system failure, whether hardware, software, or complete site failures. The technologies and approaches generally fall into two categories: data protection and recovery, and application protection and recovery. Data protection technologies include traditional tape backup, file replication, disk mirroring, and data replication. Application protection technologies include simple server failover, high-availability clustering, load-balancing clustering, and fault-tolerant server failover. Integrated business continuity solutions protect both.
Data Protection and Recovery
Tape backup consists of solutions that back up important data to tapes on a regular or ad hoc schedule to guard against system failures or simple data loss and corruption. It goes back to the earliest days of computing, so it has a decades-long history.
Backup tapes are typically manually transported and stored at a different site to guard against primary site failures. Tape backups are relatively inexpensive to deploy and offer additional benefits, such as human error recovery and historical archive capabilities.
As much as tape backup solutions have developed over time, they still offer only limited recovery capabilities against system failures because the underlying technology is inherently asynchronous: a significant amount of business-critical data could be lost between backups. Tape is also comparatively slow. Data recovery from tape to disk can take hours, and data backup and recovery alone is not enough for system recovery because applications must also be recovered. In many respects, especially for business-critical systems, the real value of tape backup solutions lies in point-in-time snapshots for dealing with data corruption caused by human error.
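The data-loss window of a scheduled backup can be illustrated with a minimal sketch. The `Write` record, the function names, and the hour-based timeline are all assumptions made for this illustration, not part of any backup product.

```python
from dataclasses import dataclass


@dataclass
class Write:
    hour: float   # time of the write, in hours since the first backup (hypothetical timeline)
    record: str   # label for the data that was written


def last_backup_before(failure_hour: float, backup_interval_hours: float) -> float:
    """Hour of the most recent scheduled backup at or before the failure."""
    completed_backups = int(failure_hour // backup_interval_hours)
    return completed_backups * backup_interval_hours


def lost_writes(writes, failure_hour: float, backup_interval_hours: float):
    """Writes made after the last backup and up to the failure are not on tape."""
    cutoff = last_backup_before(failure_hour, backup_interval_hours)
    return [w for w in writes if cutoff < w.hour <= failure_hour]
```

With a 24-hour schedule and a failure at hour 47, every write made after hour 24 is missing from the most recent tape, which is the exposure that asynchronous approaches share.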
File replication copies important disk files to a backup disk on a regular or ad hoc schedule. Its advantages over tape backup include higher data I/O speed and random access capabilities. Its disadvantages include higher media costs (although new data reduction technologies have helped reduce costs) and lack of portability. This approach is still based on asynchronous data replication, so it has the same deficiencies as tape backup with respect to recovery from business-critical system failures.
Disk mirroring typically offers real-time synchronous replication of entire disks that store important data. It virtually guarantees against data loss because any data written to the primary disk is concurrently written to the backup disk.
In the event of system failures, disk mirroring is the most effective data recovery solution because of its fast recovery time without data loss. However, disk mirroring does not have the capability to recover from data corruption because it does not retain point-in-time data snapshots.
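The trade-off described here can be shown with a toy model of synchronous mirroring. The `MirroredDisk` class and its in-memory dictionaries are hypothetical stand-ins for real disks, intended only as a sketch.

```python
class MirroredDisk:
    """Toy model: every write goes to both disks before it is considered done."""

    def __init__(self):
        self.primary = {}           # block number -> data
        self.mirror = {}            # synchronous copy of the primary
        self.primary_failed = False

    def write(self, block: int, data: str) -> None:
        if not self.primary_failed:
            self.primary[block] = data
        # The mirror always receives the same write, good or bad.
        self.mirror[block] = data

    def fail_primary(self) -> None:
        """Simulate loss of the primary disk."""
        self.primary_failed = True
        self.primary.clear()

    def read(self, block: int) -> str:
        source = self.mirror if self.primary_failed else self.primary
        return source[block]
```

After `fail_primary()`, reads are served from the mirror with no data loss. If a block was overwritten with bad data beforehand, however, the mirror holds the same bad data, because no earlier version is retained.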
In database replication, the same query is executed on two or more servers to produce the same result, and each result is stored on different physical media. The data is thus replicated indirectly on separate systems. Compared with the other data protection approaches, this one is quite restrictive: it works only for database data, is proprietary to each database vendor, and leaves few system resources for other work.
Despite its limitations, database replication does have some interesting capabilities. Administrators can replicate a subset of a database to a remote system to greatly increase access performance, so it can complement other data protection technologies.
Application Protection and Recovery
Simple Server Failover
Simple server failover provides indirect application protection and recovery by automatically assigning the identity of a primary server to a standby server, then restarting the application on the standby server when a server fails.
While this approach can provide basic application recovery, it has significant functional and manageability limitations. One is that the original primary server can no longer run on the same network, because a server identity conflict could cause system corruption. Another is that it offers no recovery from individual application crashes.
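A minimal sketch of the identity takeover follows. It assumes that server failure is detected by counting missed heartbeats, which the text does not specify; the `Server` class, the `failover_if_needed` function, and the threshold of three are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    identity: Optional[str] = None   # e.g. the network address clients connect to
    app_running: bool = False


def failover_if_needed(primary: Server, standby: Server,
                       missed_heartbeats: int, threshold: int = 3) -> bool:
    """Move the primary's identity to the standby once the primary looks dead."""
    if missed_heartbeats < threshold:
        return False                     # server still answers; nothing happens
    standby.identity = primary.identity  # standby assumes the primary's identity
    primary.identity = None              # primary must not reuse it on this network
    primary.app_running = False
    standby.app_running = True           # application is restarted on the standby
    return True
```

Because the check looks only at server-level heartbeats, an application that crashes on an otherwise healthy primary never triggers the takeover, which matches the second limitation noted above.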
High-availability clustering (HAC) offers granular monitoring and recovery of applications since servers are organized as groups of resources that share the compute load, including individual application and data resources. In the event of a failure of an entire server or of individual resources, the software can automatically fail over all resource groups from the primary server to the standby server.
One unique feature of high-availability clustering is that it offers easy failback of target applications from the standby server to the primary server. HAC is an effective way to ensure fast application recovery in the event of system failures, and it is generally attractive because it does not require target application source code to be modified. However, a typical limitation of most failover clustering solutions is the shared-disk requirement, which calls for both the primary and standby servers to be connected to a common disk array and is only practical when the servers reside at the same site.
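The resource-group behavior described above can be sketched as follows. The `ResourceGroup` and `Cluster` classes, and the use of simple boolean health flags, are assumptions made for illustration rather than the design of any particular clustering product.

```python
from dataclasses import dataclass


@dataclass
class ResourceGroup:
    name: str
    resources: dict   # resource name -> True if healthy
    owner: str        # name of the server currently hosting the group


class Cluster:
    def __init__(self, primary: str, standby: str, groups):
        self.primary = primary
        self.standby = standby
        self.groups = list(groups)

    def monitor(self, primary_up: bool = True) -> bool:
        """Fail over all groups if the server or any single resource has failed."""
        resource_failed = any(
            not healthy for g in self.groups for healthy in g.resources.values()
        )
        if primary_up and not resource_failed:
            return False
        for g in self.groups:
            g.owner = self.standby
            for name in list(g.resources):
                g.resources[name] = True   # resource restarted on the standby
        return True

    def fail_back(self) -> None:
        """Return every group to the primary once it has been repaired."""
        for g in self.groups:
            g.owner = self.primary
```

Unlike a server-level heartbeat, the per-resource flags let a single failed application trigger recovery while the server itself is still running.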
Load-balancing clustering (LBC) originated as a scalability solution for large scientific applications where each installation had many identical components that resided on different servers. Typically, the application components can be added or removed dynamically to accommodate the application workload and maintain the overall system performance level.
LBC also has the benefit of making the entire application system resistant to single-server failures. However, LBC has yet to be adopted broadly for enterprise applications because most applications have to be redesigned and re-implemented from scratch, although IT outsourcing and cloud computing are starting to shift more compute load to clusters. LBC also requires shared storage, so it does not offer protection against site failures.
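As an illustration of components being added and removed dynamically, here is a minimal round-robin dispatcher. Round-robin is only one possible distribution policy and is chosen here as an assumption; the `LoadBalancer` class and node names are hypothetical.

```python
class LoadBalancer:
    """Round-robin dispatch across identical application components."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0

    def add(self, node: str) -> None:
        """Bring another identical component into the rotation."""
        self.nodes.append(node)

    def remove(self, node: str) -> None:
        """Take a failed or surplus component out of the rotation."""
        self.nodes.remove(node)

    def route(self, request: str) -> str:
        """Return the node that should handle this request."""
        if not self.nodes:
            raise RuntimeError("no nodes available")
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node
```

Removing a failed node simply shrinks the rotation, so later requests go to the survivors. The sketch says nothing about where the nodes keep their data, which is where the shared-storage limitation comes in.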
Fault Tolerant Server Failover
A fault-tolerant server typically includes two identical modules, each containing identical hardware components (e.g., processors and memory), to ensure that all processing is done in parallel in strict lock-step fashion. When a hardware component fails, the remaining functional module continues processing with sub-second recovery time. Fault-tolerant servers provide a very high level of protection against hardware failures, but they are generally more costly, are limited in OS platform support, provide no protection against software failures, and do not address planned downtime for system maintenance.
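Lock-step processing can be approximated in a few lines. This is a software analogy only, since real fault-tolerant servers do this in hardware; the `LockStepServer` class and its running-total state are hypothetical.

```python
class LockStepServer:
    """Two modules apply every operation in parallel and hold identical state."""

    def __init__(self):
        self.state = {"A": 0, "B": 0}   # module name -> running total
        self.healthy = {"A", "B"}

    def execute(self, value: int) -> int:
        """Apply the same operation on every healthy module and return the result."""
        if not self.healthy:
            raise RuntimeError("both modules have failed")
        for module in self.healthy:
            self.state[module] += value
        # Any healthy module can answer, since they are kept in step.
        return self.state[sorted(self.healthy)[0]]

    def fail(self, module: str) -> None:
        """Simulate a hardware failure in one module."""
        self.healthy.discard(module)
```

Processing continues uninterrupted after `fail("A")` because module B already holds the same state. A software bug, by contrast, would produce the same wrong result on both modules, which is why this design offers no protection against software failures.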
Comprehensive Protection and Recovery
An integrated business continuity system, whether integrated in-house or purchased as pre-integrated software, incorporates the best approaches and technologies of both data and application protection solutions. Organizations looking for a comprehensive and cost-effective solution to protect database-driven transactional applications and data should consider integrated solutions. One such solution could combine disk mirroring for data protection and recovery, to minimize data loss, with high-availability clustering for application protection and recovery. Pre-integrated software is very cost-effective and provides a high level of protection, with the additional benefit of minimizing planned downtime.
In the end, there is no one-size-fits-all solution, so an organization should carefully evaluate available technologies to determine the optimal disaster recovery solution for its critical database application systems.