How Best To Disaster-Proof SQL Server?

Let’s start by admitting that the title of this article is a tease. It’s a valid question and one that thinking people ask all the time. But in truth it’s not the first question you should be asking. More importantly, the answer to the question really depends on how you answer the questions that you should be asking first. Those questions are these:

  • How long can your SQL Server system be offline before it becomes problematic for your organization?
  • How much data can you afford to be missing when SQL Server comes back online? Two seconds’ worth? Two hours’ worth?
  • What is your budget for disaster-proofing your SQL Server infrastructure?

The best solution for you will meet your needs in each of these dimensions—but until you go through the exercise of answering these questions you won’t have the insights you need to make an informed choice about how best to protect your infrastructure.

Quantifying the cost of downtime

Everyone wants SQL Server back online immediately, but what is the actual business impact of being offline? It’s important to quantify the cost of downtime because that enables you to understand where a critical tipping point lies. As your tolerance for enduring downtime approaches zero minutes, the cost of a DR solution that can ensure recovery within that number of minutes increases exponentially. While it is possible to configure a DR solution for near-instantaneous failover and recovery—the cost of such a solution is hard to justify for most organizations. The longer you can afford to be offline—even if that’s just a few additional minutes—the more affordable the DR solution that can put you back online within that period.

So, what does it cost your business to be offline? One might measure that cost in terms of, say, the value of total database transactions committed per minute. For one company, an average minute of downtime might represent $1M in lost revenue; for another, it might represent only $100 in lost revenue. You need to quantify the cost of an extended period of downtime because after some point in time the cost to you of downtime will exceed the cost of a DR solution that would have SQL Server up and running before that point in time.

That point forms the basis for what is known as your recovery time objective (RTO). RTO quantifies the amount of time within which another instance of SQL Server must be brought online and into service so that your business can operate without further interruption.

Note one thing, though: depending on the environment in which your business operates, there may be other costs you may need to consider beyond simply the value of transactions occurring during an average period. What’s the reputational cost to your business if you are offline? That may be much harder to quantify, but it also may be no less important, depending on your business.

Quantifying acceptable data loss

Disaster-proofing SQL Server involves more than bringing another instance of SQL Server online within a specified RTO. You need to ensure that the second instance of SQL Server can interact with the same data that the primary instance of SQL Server had been using before it went offline. If you’re running SQL Server in the cloud, your backup instance of SQL Server needs its own copy of the data that the primary instance had been using.

How you ensure that your backup SQL Server infrastructure has a copy of all the data that the primary system is using becomes a crucial question in a DR scenario. Using log shipping techniques or relying on a recent backup shared to the secondary infrastructure may expose your organization to data gaps. Those gaps reflect the period of time that elapsed between the last backup or log snapshot and the moment the primary SQL Server instance went offline. That gap could be measured in anything from seconds to hours or days. If you’re using a high speed synchronous data replication system to keep the primary and secondary storage systems in sync, the database files in your secondary SQL Server infrastructure may perfectly mirror the database files in the primary infrastructure and there will be no data gap at all.

The question for you is how great a data gap your organization can accept. Clearly, no one wants to lose any committed transactions but, as above, the cost of disaster-proofing your infrastructure increases as your tolerance for data loss approaches zero. The degree to which your secondary database fully mirrors your primary database is represented by what is known as your recovery point objective (RPO). The RPO represents the number of seconds or minutes of data loss that you consider acceptable in a disaster recovery scenario. As with RTO, you can configure DR solutions with an RPO that approaches zero minutes of data loss. The question is whether the cost of such a solution can be justified—and that really depends on the nature of the transactions and their importance to your business.

Fitting your needs into your budget

As you can see, the answers to questions about RTO and RPO will help you determine how best to approach disaster-proofing SQL Server. A realistic assessment of the value of lost time and lost data can help you determine what your budget should be—if for no other reason than a failure to meet those objectives will prove more costly to your business than the cost of a DR solution engineered to meet those requirements.

Given a variety of RTO and RPO needs, there are multiple approaches to configuring for DR. If your organization can tolerate a high RTO and RPO, log shipping and backup techniques may enable you to disaster-proof your SQL Server infrastructure sufficiently and without extra costs. If your recovery time and recovery point objectives demand a rapid recovery and minimal-to-no data loss, you’ll need to look at more sophisticated solutions.

One approach would be to use Windows Failover Clustering Services (WFCS) to build out a multi-node failover cluster in the cloud. Two of the nodes might be built out in separate availability zones within a single region, with a third built out in a remote region (thereby fully protecting yourself from an outage affecting multiple AZs in a single region). WFCS supports rapid failover among cluster nodes, ensuring a very low RTO.

You could achieve a low RPO by using the Always On Availability Group (AG) replication features built into SQL Server. These replication features ensure that data is fully replicated between the primary and secondary infrastructures in the same region (and asynchronously with the infrastructure in the remote region). If a disaster strikes the primary instance of SQL Server, the secondary instance in the same region would immediately take over with minimal RTO and zero RPO. If the secondary in the same region were also affected by the disaster, recovery on the 3rd node in the remote region would be a manual process.. There is a risk of a data gap (a reduced RPO) if the DR solution relies on that remote region’s infrastructure because data replication to the remote infrastructure was asynchronous and data replication to the remote node may not have completed when the primary infrastructure went offline. Still, such a data gap might be a matter of seconds rather than minutes or hours.

The downside of the AG approach is its cost. To replicate SQL Server databases among three separate cluster nodes using AG requires use of SQL Server Enterprise Edition. If that requires you to upgrade from SQL Server Standard edition, the cost of disaster-proofing your infrastructure would increase considerably. Alternatively, you could use a third-party SANless Clustering tool with WFCS to achieve the same DR results without the high price tag associated with AGs. The SANless Clustering approach involves block-level replication of data from the primary instance to the secondary instances (either synchronously or asynchronously). It offers a slight advantage over AGs insofar as the approach will replicate everything in the target storage volumes (whereas AG only replicates the user-defined SQL Server databases). The cost advantage of a SANless Cluster approach lies in the fact that you don’t need to use SQL Server Enterprise edition to replicate your data effectively. You can replicate data between two local nodes and one remote node while continuing to rely on SQL Server Standard Edition if that’s the edition that’s driving your organization.

Ask the questions. Determine your organization’s real needs for RTO and RPO. Only when you know what you really need is it time to start shopping for solutions.