Clustering for SQL Server High Availability

By David Bermingham

May 9, 2019

DBAs have many options for how to ensure high availability (HA) for SQL Server. Some of these options are available in the different versions and editions of SQL Server itself. Some are available in the underlying Windows Server and Linux operating systems. Some are available from cloud service providers, including as options for PaaS SQL Server offerings. And some are provided by third parties in purpose-built failover clustering software.

A common denominator in all of these options is the cluster, which consists of an active and one or more standby instances of SQL Server. There are significant differences, however, in how each solution replicates data, detects failures, and handles failover and failback. There can also be substantial differences in costs, including for both initial implementation and ongoing operational expenditures.

Let’s consider the HA clustering options available for SQL Server to help DBAs make more informed choices. Some of these options are not actually suitable for HA purposes; they are included here only to show how they fit into broader backup, archiving, and disaster recovery (DR) strategies.

What is ‘High Availability’?

Ask 10 people “What is high availability?” and you might get 15 different answers. For the purposes here, HA will be defined as ensuring SQL Server is operating when and as needed. “When” takes into account the percentage of time SQL Server is up and running, while “as” takes into account proper operation with no data loss and/or corruption.

HA has many moving parts, and this partly (no pun intended) explains why it means different things under different circumstances to different people. The foundation of HA configurations is formed by eliminating as many single points of failure as possible at all levels, including in the servers, storage, network, and software (applications and OS), as well as in the data center’s power, cooling, and security infrastructure.

The redundant resources must be configured in a way that makes it feasible to automatically detect and recover from all possible failure scenarios with minimal downtime and no data loss. In other words, having a “spare” is not the same as having a “hot” standby instance of the application ready to take over immediately should there be a failure in the active instance. Ensuring that the standby instance is fully up-to-date with the current data is what makes it hot, and this could be achieved using a shared drive on a storage area network, or more commonly, a replicated copy of the data stored on a local drive.

For HA for SQL Server, the replication should be synchronous; that is, both the active and standby instances are being updated simultaneously, and this requires a high-performance, low-latency network. For DR purposes, where the active and standby instances should be separated geographically, asynchronous replication is used to avoid adversely impacting on write performance.

Options Available With SQL Server

Always On Failover Cluster Instances (FCIs) is a feature that has been standard since SQL Server 7 running on what was then called Microsoft Cluster Server available on Windows NT 4.0. FCIs afford two major advantages: inclusion in the Standard Editions of SQL Server, and protection for the entire SQL Server instance, including system databases. A notable disadvantage is the need for cluster-aware shared storage, which is not available in the public cloud. On-premise, by contrast, where shared storage can and often does exist, FCIs leverage Windows Server Failover Clustering (also a standard feature).

Always On Availability Groups is a feature that was introduced in SQL Server 2012 Enterprise Edition as a replacement for database mirroring (covered below) and is also included in SQL Server 2017 for Linux. This is SQL Server’s more robust HA/DR offering, capable of delivering rapid, automatic failovers with no data loss. Among its disadvantages are the lack of protection for the entire SQL instance and the need to license the more expensive Enterprise Edition, which can be cost-prohibitive for many applications. For Linux, which lacks the equivalent of Windows Server Failover Cluster (WSFC), there is a need for additional open source and/or commercial software to create fully functional HA failover clusters.

A significant disadvantage with all application-specific options such as Always On Availability Groups is the need to use different HA and/or DR solutions for different applications. Having multiple HA/DR solutions inevitably increases complexity and costs (for licensing, training, implementation, and ongoing operations), which is why many organizations prefer using separate general-purpose or application-agnostic solutions.

Separate Options Available for SQL Server

Backup and recovery procedures are commonly used with all applications for archiving purposes and may be suitable for the DR needs of some applications. But full and incremental backups do not provide the continuous, real-time data replication needed for HA.

Database mirroring is a form of data replication, and this feature was replaced by the much more capable Always On Availability Groups in SQL Server 2012. Mirroring is currently in maintenance mode, making it likely to be removed in a future version. The feature was not included in SQL Server 2017 for Linux—and never will be. If being used successfully for an existing application, there is no need to make a change at this time. But mirroring is a risky choice for any new HA configurations.

Log shipping is a way to reconstruct the current edition of a database by applying the transaction logs asynchronously from the active to an earlier copy on a “warm” (versus hot and ready-to-go) standby instance. Because replaying logs can take a considerable amount of time to complete and verify, log shipping might be suitable for some applications for DR, but not for HA.

Hypervisors provide their own “high-availability” features to facilitate a reasonably quick recovery from a failure at the host level. But they do nothing to protect against failures of the operating system of the guest VM or the applications running in it. In effect, these features assure “dial tone” to a VM. That’s it. Nothing less but also nothing more—and much more is needed for HA at the application level.

Public clouds all offer features to maximize the availability of services, and these are usually accompanied by a money-back guarantee in a service-level agreement. But as with the hypervisors, the SLAs only guarantee “dial tone” at the server level. SQL Server needs additional provisions to ensure the data is being replicated from the active to standby instances, preferably across availability zones, and that clients are automatically reconnected to the active instance after a failover. So while a cloud service provider’s infrastructure can be leveraged in HA configurations, additional steps must be taken to ensure SQL Server remains online.

Storage Spaces Direct (S2D) is a form of software-defined storage that makes it possible for direct-attached drives to be pooled and shared among all the nodes in a cluster. S2D makes it possible to create failover clusters in public clouds, which all lack storage area networks (SANs) or other forms of physical shared storage. S2D first became available with Windows Server 2016 Datacenter Edition and is supported in SQL Server 2016 and later. It is compatible with SQL Server FCIs, but there are some substantial limitations, including the lack of a DR solution and its inability to work across availability zones or regions, which is needed to protect against failures at the data center level.

Purpose-built failover clustering solutions are designed to provide HA and DR protection for most applications, including SQL Server databases, running on Windows Server and Linux across public, private, and hybrid clouds. These solutions are implemented entirely in software that creates, as their designation implies, a cluster of servers and storage with automatic failover. All include, at a minimum, real-time data replication, continuous monitoring for detecting failures at the application level, and configurable policies for failover/failback. Most also offer a variety of value-added capabilities to simplify implementation and management.

Choosing the Best Options

The many HA/DR options described here are not mutually exclusive; that is, multiple options can be combined to provide the best possible solution. Combining FCIs with Availability Groups, for example, can facilitate the robust HA protection of an FCI, while also enabling readable secondaries, a feature of Availability Groups. Alternatively, cloud DR solutions, such as Azure Site Recovery, can be used to replicate a two-node SANless FCI HA cluster to create a cost-effective HA/DR solution using the Standard Edition SQL Server.

Similar to most HA configurations, these examples of failover clusters have many moving parts. But they are proven to work well for SQL Server and other applications requiring high availability.