Application Versus Infrastructure: How to Collaborate for High Availability

By David Klee and Jerry Melnick

May 7, 2018

We’re all familiar with the IT “war room” scenario. When there’s an issue, members of the IT team come together to troubleshoot and find a solution. For organizations with important applications, such as SQL Server running in virtual infrastructures, these IT teams are no strangers to the IT war room environment. Too often, input from the SQL Server system administrators and DBAs isn’t considered (or even requested) when the infrastructure team implements its virtualization plan.

Inevitably, the SQL Server or application team will get a call: the mission-critical SQL Server is slowing to a crawl or is down completely. While the IT infrastructure managers believed they had high availability, they never asked about the requirements of the application stack.

Silos Don’t Work

Current IT departments are organized into silos. As different teams manage the network, storage, etc., each team has a particular method and best practice guidelines for delivering high availability to their own layer. They may have implemented those methods for other applications and other parts of the IT stack before SQL Server was virtualized.

The problem is that what works best for one IT silo might not work well for another. For instance, a good HA plan for one IT silo can actually result in lower availability or slower performance for the database. Whatever the circumstance is, it often results in conflict between the various constituents. And, typically, a SysAdmin and a DBA have opposite opinions on whether clustering is the best solution for SQL Server HA or not.

The bottom line is: A high-availability plan for the infrastructure team rarely aligns with what will work for the application or DBA team. And, when those critical SQL Servers are down, productivity plummets, the bottom line suffers, and the finger-pointing begins. No one knows for sure if the problem is a database/application issue or an infrastructure issue.

How do these teams collaborate to eliminate issues?

The first step is for SysAdmins and DBAs to understand the anatomy of the virtual server stack where SQL Server runs. This basic understanding can help you identify and fix problems faster, communicate with IT staff better, and eliminate performance issues in your SQL Server database before they arise. Here’s a high-level overview:

The shared storage area network (SAN) is at the bottom of the infrastructure stack. This layer is connected to your physical servers (hosts) through the network and/or Fibre Channel layer.

A hypervisor sits on top of each physical server, virtualizing the CPU, memory, storage capacity, and network resources that will be provisioned to your application. The virtual machines (VMs) and their operating systems run on top of the hypervisor.

Your SQL Server instance and critical databases are at the base of the application layer. They sit just above the VM stack. Your applications sit at the very top. They fetch and present data from the database to the end users.

Note two key differences between a virtual environment and a traditional physical server environment that directly affect application availability and performance:

Shared virtual resources: CPU, memory, storage, and network are all shared among virtual servers. That means your SQL Server performance or availability issue may be caused by a different workload running on a different VM.
Frequent movement and change: Virtual machines are often moved from one physical host to another. Finding the root cause of your issues may become a glorified game of “whack-a-mole.”

How Do We Fix This?

The first step is to understand the business requirements and then apply the appropriate technology. Get the business people and members of all of the infrastructure layers in the same room together (the classic IT war room) and talk through the actual business needs around the database and application stack themselves.

Look at the issue from a business perspective and define service-level agreements (SLAs). If you don’t have these defined in your organization, take the initiative and work to get them established. What you might find after you get them nailed down is that the business expectations for availability are much different from what the current systems architecture can support.
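To make the gap between expectation and architecture concrete, it can help to translate an availability target into a yearly downtime budget. The sketch below is only an illustration: the function name and the sample percentages are assumptions chosen for the example, not targets taken from any particular SLA.

```python
# Minutes in a non-leap year; a simplifying assumption for this sketch.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Illustrative targets only; real targets come from the business.
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.1f} minutes of downtime per year")
```

Putting a number like this in front of both the business and the infrastructure team tends to show quickly whether the current design can plausibly meet the expectation.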

Consider using technologies that can pinpoint the root causes of SQL Server or other application performance issues to resolve the issue of “who owns it” from the start. New machine learning analytics technologies can be a cost-effective investment that nips conflicts in the bud. They can correlate performance problems in SQL Server or other applications to the changes in the infrastructure that are at the root cause. They can even recommend specific changes that will resolve issues and/or prevent problems from arising in the first place. Also consider high availability solutions that minimize the sources of conflict in the first place, such as SANless clustering software that eliminates the need for SAN storage, which reduces DBAs’ reliance on IT infrastructure expertise to manage and troubleshoot.

These are the applications that your organization depends on, and you can’t put their availability at risk.

Creating a High Availability Strategy

IT teams need to remember that a database server is not a test file server. Rather, it’s something that multiple business-critical applications depend on every second of every day. And these applications often have a direct line into how your business makes its revenue, making their protection and availability an executive-level priority for every organization. With so much on the line, how should you approach your high availability strategy?

Understand Your Application Requirements

First, the team should discuss the application requirements. The database layer is the base of the application stack, and if a database instance is down, multiple applications can be affected. So, what sort of outages can companies expect? By defining the scenarios that the application stack might encounter, such as a physical server failure, OS blue screen, or networking port issues, teams can be better prepared to remedy incidents quickly, ideally before they turn into outages.

Know What You Can Tolerate

Next, define the outage windows that the business can actually tolerate. Downtime tolerance policy may be as simple as “no greater than a five-minute window of interruption during the business day.” Downtime tolerance may also vary depending on the time of day, day of the week or period of the year. Some teams may have extended flexibility if it is after hours or on the weekends. But if it’s a web server that’s up servicing clients 24 hours a day, that’s a different story. So, define three key things:

Recovery Time Objective (RTO) is the maximum acceptable time to get the application stack and the SQL Server up and accepting connections if a failure were to occur.
Recovery Point Objective (RPO) is the amount of data loss that is acceptable (or not acceptable) if the system were to fail.
Mean Time to Recovery (MTTR) is the average time that a device takes to recover from a given failure. MTTR could be zero for a single drive failure in a RAID array or a networking path failure in a multi-pathed configuration. However, MTTR might be higher than you think. The rebuild or HBA replacement process can seriously affect overall storage performance. The system may not necessarily be down; it could just be running slowly.
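
As a minimal sketch of how these three measures relate, the following example computes MTTR from a list of incident records and checks one incident against an RTO and an RPO. The `Incident` record, its field names, and the sample thresholds are hypothetical and exist only for illustration; real numbers would come from your own monitoring and from the business.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record used only for this illustration."""
    failed_at: datetime          # when the failure began
    recovered_at: datetime       # when SQL Server accepted connections again
    last_good_data_at: datetime  # timestamp of the newest data that survived


def mean_time_to_recovery(incidents):
    """Average time from failure to recovery across the given incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((i.recovered_at - i.failed_at for i in incidents), timedelta())
    return total / len(incidents)


def check_objectives(incident, rto, rpo):
    """Compare one incident's downtime and data loss against the RTO and RPO."""
    downtime = incident.recovered_at - incident.failed_at
    data_loss = incident.failed_at - incident.last_good_data_at
    return {"rto_met": downtime <= rto, "rpo_met": data_loss <= rpo}


if __name__ == "__main__":
    # Made-up sample incidents.
    history = [
        Incident(datetime(2018, 5, 1, 9, 0), datetime(2018, 5, 1, 9, 4),
                 datetime(2018, 5, 1, 8, 59, 30)),
        Incident(datetime(2018, 5, 3, 14, 0), datetime(2018, 5, 3, 14, 10),
                 datetime(2018, 5, 3, 13, 58)),
    ]
    print("MTTR:", mean_time_to_recovery(history))
    for incident in history:
        print(check_objectives(incident, rto=timedelta(minutes=5),
                               rpo=timedelta(minutes=1)))
```

Note that this only measures full outages. As described above, a degraded but running system (for example, during a RAID rebuild) would need a separate performance threshold to be counted.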

Some Downtime is Deliberate

Differentiate between planned and unplanned outages. This is a major point of conflict between virtual machine (VM) admins and database admins (DBAs). Some VM-level high availability technologies can protect against hardware failure, but they don’t protect the operating system if someone needs to apply the constant stream of Windows patches. So, plan a strategy around the organization’s maintenance windows for routine maintenance of that application stack. Production can be different from pre-production systems; some production systems might not be as critical as others, such as a database server underneath a corporate antivirus deployment. Minimize the complexity of the app stack whenever possible. The more complex a given system is, the harder it is to triage in the event of an emergency.

Putting all of this information together into an application classification matrix will make it very easy to visualize the impact of the application stack on the business. IT teams should include the storage, network, database and application layers and add other classifications according to the business requirements. Lastly, define the availability requirements for each component of the application stack, such as web servers, load balancers, application servers, and of course, the database layer.
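
One possible shape for such a matrix is sketched below, under stated assumptions: the application names, tier labels, server names, and RTO/RPO values are all invented for illustration, and a spreadsheet would serve the same purpose. The point is simply that each row ties an application to its layers and its availability requirements, so shared dependencies become visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppClassification:
    """One row of a hypothetical application classification matrix."""
    application: str
    tier: str          # illustrative criticality label
    rto_minutes: int   # maximum acceptable recovery time
    rpo_minutes: int   # maximum acceptable data loss window
    layers: dict       # layer name -> component the application depends on


# Invented sample rows; real entries come from the business requirements.
MATRIX = [
    AppClassification("reporting", "important", 60, 15,
                      {"storage": "SAN-A", "network": "core", "database": "SQL01"}),
    AppClassification("order-entry", "business-critical", 5, 0,
                      {"storage": "SAN-A", "network": "core", "database": "SQL01"}),
    AppClassification("antivirus-console", "low", 480, 1440,
                      {"storage": "SAN-B", "network": "core", "database": "SQL02"}),
]


def most_critical_first(matrix):
    """Order rows by the tightest recovery requirements."""
    return sorted(matrix, key=lambda row: (row.rto_minutes, row.rpo_minutes))


def apps_depending_on(matrix, layer, component):
    """List applications affected if a given component in a layer fails."""
    return sorted(row.application for row in matrix
                  if row.layers.get(layer) == component)


if __name__ == "__main__":
    for row in most_critical_first(MATRIX):
        print(f"{row.application:20} {row.tier:18} "
              f"RTO={row.rto_minutes}m RPO={row.rpo_minutes}m")
    print("Affected by SQL01 outage:", apps_depending_on(MATRIX, "database", "SQL01"))
```

Querying by component makes the earlier point explicit: one database instance going down can take several applications with it, and the strictest requirement among them sets the bar for that instance.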

For business-critical applications that need high availability protection, consider using machine-learning-based analytics software that can help eliminate unnecessary complexity and costly provisioning. By clarifying the root causes of issues and recommending specific improvements to the infrastructure, these tools eliminate the sources of conflict, enabling DBAs and IT infrastructure managers to work collaboratively.

With this business perspective, applications and infrastructure teams will not only be able to work constructively together, but will be more effective in protecting against and preventing outages and high availability failures.