Storage Strategies for Meeting SQL Server SLAs in VMware Environments

Almost all database workloads are now (or eventually will be) virtualized. High availability (HA) clusters using virtual servers provide the application protection needed to satisfy even the most demanding service-level agreements (SLAs).

Configuring the shared storage needed for HA in VMware can make these clusters difficult to build, and SLAs correspondingly harder to satisfy. These challenges can create a barrier to migrating business-critical database applications to VMware. SQL Server also presents challenges that can make satisfying SLAs substantially more expensive.

Let’s look at the issues involved in using Windows Server Failover Clustering (WSFC) to provide HA for SQL Server applications in VMware environments, and explore some storage strategies capable of meeting SLAs more cost-effectively.

Challenges to Using VMware Clusters With Shared Storage

In a traditional failover cluster, two or more physical servers (cluster nodes) are connected to a shared storage system. The application runs on one server, and in the event of a failure, clustering software, such as WSFC, moves the application operation to the standby server. It is possible to do this with virtual machines (VMs) in a vSphere environment, but this requires a special configuration for the storage that can limit the ability to use important features in vSphere such as vMotion and VM Cloning.

A typical WSFC cluster in vSphere utilizes a storage configuration technique known as raw device mapping to connect VMs directly to the storage area network (SAN). Raw device mapping makes a physical storage device or subsystem appear to the guest operating system as if it were a virtual disk file in a VMware Virtual Machine File System (VMFS) volume. Such mapping enables the use of specialized SAN SCSI commands needed to support HA clustering, making virtualized storage access seamless to the operating system, the clustering software, and the applications.

The problem is that a failover cluster that uses raw device mapping complicates or prevents the use of several important VMware features that employ virtual machine disk (VMDK) files. For example, raw device mapping prevents the use of VMware snapshots, which in turn prevents the use of every feature that requires snapshots, such as VMware Consolidated Backup (VCB).

Raw device mapping also complicates VM mobility, which creates impediments to using the features that make server virtualization so beneficial, including converting VMs into templates to simplify deployment and using vMotion to optimize performance by migrating VMs dynamically among hosts. These restrictions associated with the use of raw device mapping can undermine the potential gains that most IT departments hope to achieve with vSphere.

Challenges Using SQL Server’s Failover Clustering

SQL Server provides two of its own options for clustering: AlwaysOn Availability Groups and AlwaysOn Failover Clustering. The former offers enterprise-class HA but also requires the more expensive Enterprise Edition licensing. Similarly robust HA protection is available for other databases, such as Exchange Server Database Availability Groups and Oracle Active Data Guard. But these “premium” HA solutions also come with a premium price that can make them prohibitively expensive for many database applications. AlwaysOn Availability Groups, similar to many other database protection techniques, protect only data stored within the actual database. Files outside the SQL database are not protected. In contrast, AlwaysOn Failover Clustering, used in conjunction with storage mirroring technologies, protects all application data.

The Failover Clustering feature in SQL Server Standard Edition works with Microsoft’s WSFC to provide automatic and seamless failover and failback on a fully redundant configuration. Should any server (or virtual machine) fail for whatever reason, another takes over using an up-to-date version of the data. Seamless failover/failback also enables software updates and patches to be installed with minimal application downtime.

However, assuring immediate and seamless failover requires that each instance of the application has access to the same data. That is typically accomplished by using a single dataset in a shared storage (SAN) configuration (which also introduces the risk of a single point of failure).

The problem is that WSFC and other failover clustering solutions require some form of shared storage, and fully redundant, cluster-aware shared storage (e.g., SANs) can be quite expensive. The requirement for shared storage also means being unable to provide disaster recovery across data centers. This gives database administrators two basic options: use the more expensive AlwaysOn Availability Groups available only in the SQL Server Enterprise Edition, or add cost-effective SANless clustering software to provide real-time, guest-based, block-level replication that is compatible with WSFC.

Overcoming the Challenges With Cluster-Aware ‘Shared Nothing’ Storage

The popularity of VMware and SQL Server, combined with the challenges involved in using raw device mapping and the high cost of Enterprise Edition, has given rise to third-party solutions purpose-built for providing high availability and high performance more cost-effectively. Indeed, such software-based data replication and synchronization solutions designed for the high availability and disaster recovery needs of business-critical applications have been available since the 1990s.

A multi-site high-availability configuration is the only way to protect applications from outages that affect an entire data center. 

The best of these solutions use efficient, real-time replication to synchronize data across “local” storage between each of the VMs in the cluster. The synchronized storage is presented to WSFC as if it were a shared disk.

Efficient guest-based replication makes it possible to create a shared-nothing, hardware-agnostic storage cluster in a vSphere environment without the need for, or the limitations of, raw device mapping. As shown in the diagram, different VMDKs are attached to different VMs in a multi-site N-node cluster using each VM’s independent storage. Some of these solutions also make it possible to implement LAN/WAN-optimized block-level replication in either a synchronous or asynchronous manner. In effect, these solutions are capable of creating a RAID 1 mirror across the network, automatically changing the direction of the data replication (source and target) as needed after failover and failback.
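The mirroring behavior described above can be illustrated with a toy model. The following Python sketch is not taken from any replication product; the names Node, Mirror, flush, and failover are hypothetical, and storage is simulated with in-memory dictionaries. It shows only the two ideas in the paragraph: synchronous versus asynchronous delivery to the target, and the reversal of source and target after a failover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node with its own local block store."""
    name: str
    blocks: dict = field(default_factory=dict)  # block number -> bytes


class Mirror:
    """Toy two-node mirror: writes land on the source, then on the target."""

    def __init__(self, source: Node, target: Node, synchronous: bool = True):
        self.source = source
        self.target = target
        self.synchronous = synchronous
        self.pending = []  # queued (block_no, data) pairs in asynchronous mode

    def write(self, block_no: int, data: bytes) -> None:
        self.source.blocks[block_no] = data
        if self.synchronous:
            # Synchronous: the target is updated before the write returns.
            self.target.blocks[block_no] = data
        else:
            # Asynchronous: the update is queued and shipped later.
            self.pending.append((block_no, data))

    def flush(self) -> None:
        """Ship queued asynchronous updates to the target, in order."""
        for block_no, data in self.pending:
            self.target.blocks[block_no] = data
        self.pending.clear()

    def failover(self) -> None:
        """Swap roles so the former target becomes the new source."""
        # Assumption of this toy model: updates still queued at failover
        # time never reached the target and are simply dropped.
        self.pending.clear()
        self.source, self.target = self.target, self.source
```

In this model a synchronous mirror never has queued updates, so a failover loses nothing, while an asynchronous mirror can lose whatever is still in the queue. That is the usual trade-off between the two modes: asynchronous replication tolerates WAN latency better but leaves a window of unreplicated writes.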

These shared-nothing (SANless) clusters provide a cost-effective way to create high-availability vSphere clusters without the need for raw device mapping or the limitations that raw device mapping imposes. SANless clustering software is hardware-agnostic; any storage device or subsystem is presented to the Windows operating system as a block-level device, and appears in Windows Disk Management as a drive letter for use by any application. Most data replication software also integrates with WSFC to enable administrators to configure high-availability storage clusters using this familiar Windows feature, while avoiding the use of shared storage as a potential single point of failure. Once configured, the software automatically synchronizes the local storage in two or more servers (in one or more data centers), making them appear to WSFC as a local or shared storage device.

In addition to high availability, most SLAs also have a requirement for high performance. And here, too, data replication software is capable of delivering superior results, especially for highly transactional applications such as SQL Server that require a very high number of I/O operations per second (IOPS).

Part of the performance advantage derives from being hardware-agnostic, which facilitates the use of direct-attached storage, optionally with solid state drives, that is far faster than network-attached storage or a SAN.

Performance is enhanced even further by the way replication software integrates with the Windows file system. As writes occur on the primary server, the driver, which sits immediately below NTFS, writes one copy of the block to the local VMDK and another copy simultaneously across the network to the VMDK on the remote secondary server. The throughput performance achieved can be truly impressive. Testing has shown that data replication software is able to deliver SQL Server transactional throughput far higher than with AlwaysOn Availability Group replication and nearly as high as storage configurations not protected with any data replication or mirroring.
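As a rough illustration of the write path just described, the sketch below dispatches one block to two destinations at the same time and acknowledges the write only when both copies have completed. It is a simplification under stated assumptions: the two "disks" are in-memory byte arrays standing in for the local and remote VMDKs, the 4,096-byte block size is arbitrary, and write_block and replicated_write are hypothetical names, not part of any real driver.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def write_block(disk: bytearray, block_no: int, data: bytes) -> int:
    """Write one block into an in-memory stand-in for a VMDK."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    start = block_no * BLOCK_SIZE
    if block_no < 0 or start + BLOCK_SIZE > len(disk):
        raise IndexError("block number out of range")
    disk[start:start + BLOCK_SIZE] = data
    return block_no


def replicated_write(pool: ThreadPoolExecutor, local: bytearray,
                     remote: bytearray, block_no: int, data: bytes) -> None:
    """Send the same block to both disks concurrently, then wait for both."""
    futures = [pool.submit(write_block, disk, block_no, data)
               for disk in (local, remote)]
    for future in futures:
        # result() re-raises any error, so a failed copy is never
        # silently acknowledged.
        future.result()


if __name__ == "__main__":
    local_disk = bytearray(4 * BLOCK_SIZE)
    remote_disk = bytearray(4 * BLOCK_SIZE)
    with ThreadPoolExecutor(max_workers=2) as pool:
        replicated_write(pool, local_disk, remote_disk, 2, b"\x01" * BLOCK_SIZE)
    print(local_disk == remote_disk)
```

Because the two copies are issued in parallel rather than one after the other, the added latency of a replicated write in this model is the slower of the two paths, not their sum.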

Data replication software approaches can have other advantages as well. For example, those that use block-level replication technology that is fully integrated with WSFC are able to protect the entire SQL Server instance, including the database, logins, and SQL Agent jobs, all in an integrated fashion. Contrast this approach with AlwaysOn Availability Groups, which fail over only user-defined databases and require IT staff to manage every cluster node separately and manually.

High availability and high performance without the high cost of SQL Server Enterprise Edition and without the limitations imposed by raw device mapping: that is why data replication software has a role to play in virtually every virtualized data center.