The Problem with Quality in the Cloud

Cloud computing has taken the enterprise IT world by storm. IT managers and CIOs are evaluating private, public and hybrid cloud infrastructures for running corporate applications and services. Many are running pilots and evaluating large-scale migrations to the cloud, in the hope of not only saving money but also improving services for users.

Business people don't care about the infrastructure, but they continually demand more from IT: more applications, more data, faster response times and better reliability. Through virtualization, IT can get more out of existing hardware and deliver more in a shared services environment. CIOs need to meet these rising business demands without increasing their budgets; often, virtualization and cloud computing are the only answer.

Yet when it comes to hosting mission-critical applications, which the business depends upon for revenues and customer support, IT is wary.

Will IT have enough control and visibility if something goes wrong? Can IT deliver the same quality of service (QoS) in the cloud that business users expect: fast response times and successful transactions? If an application is slower when hosted in the cloud, or experiences more errors, CIOs get the blame.

The Problem with QoS

Ensuring success in the cloud means coming to terms with quality of service. CIOs are not keen to guarantee an end-user application QoS metric, because no one in IT wants to take ownership of service levels at the application and transaction level. The network, storage, and server teams each monitor their own SLAs, and none of them has visibility into the entire application chain of events from transaction to disk. Unsurprisingly, people don't want to be held accountable for issues outside their direct control.

In the cloud, managing and ensuring application QoS is even harder because the infrastructure is highly dynamic. Virtual machines are provisioned and decommissioned as needed by application and user demand. Applications compete for shared resources not only on the compute stack but also in storage, creating intermittent performance issues. Furthermore, when a problem occurs, virtualization is the enemy of visibility: the origin of a problem is often not obvious. For example, if you are monitoring a database that shows high I/O, which is slowing response time for users, you might blame storage or poorly optimized data access. But the problem could originate from a simple glitch in the virtualization software configuration.

Virtualization and cloud technologies create unique challenges for applications, which need to be adjusted for dynamic provisioning: how applications start and connect to servers will change as the virtual environment allocates and shifts resources based on policies and hardware availability. For instance, some applications have hard-coded server names, which doesn't work in the cloud, where a workload may be moved to a different server at any time.
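As a rough illustration of the hard-coded server name problem, the sketch below reads the server address from the environment at startup instead of embedding it in the code. The variable names APP_DB_HOST and APP_DB_PORT and the fallback values are hypothetical, chosen only for this example; a real deployment might use a service registry or DNS alias instead.

```python
import os

# Hypothetical fallbacks for local runs; not taken from any real product.
DEFAULT_DB_HOST = "localhost"
DEFAULT_DB_PORT = "5432"


def resolve_db_endpoint(env=None):
    """Return (host, port) looked up at startup rather than hard-coded."""
    env = os.environ if env is None else env
    host = env.get("APP_DB_HOST", DEFAULT_DB_HOST)
    port = int(env.get("APP_DB_PORT", DEFAULT_DB_PORT))
    return host, port
```

When the virtual environment relocates the database, only the environment setting changes; the application code stays the same.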

The following scenarios are three of the top IT concerns for applications in the virtual world:

  • Degradation of service during the phases of transition from physical to virtual infrastructure (and thereafter);
  • The potential negative impact on QoS of inter-application shared resource contention;
  • Degradation of QoS at peak application load times.

If IT begins to notice an uptick in the number of support requests and user complaints around an application after it moves to the cloud, what's the fix? Applying targeted automation to the cloud can help prevent and more quickly repair application issues. The fluid nature of the cloud environment requires that IT managers receive more frequent feedback about transaction performance and user behavior, in order to ensure service levels for the business from day one.

Defining QoS for Your Business

There is no gold standard for QoS: it all depends upon your user needs, business goals and any industry or regulatory requirements. Transaction performance management (TPM) software can help measure and define QoS in the cloud by recording how your transactions are performing and which components of the infrastructure they rely upon. By reviewing this data before moving the application to the cloud, you can understand what's normal and then accurately define the notion of "quality" for your business.
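One way to turn recorded transaction data into a working definition of "normal" is to summarize it as a small baseline. The sketch below is an assumption about what such a summary could look like, not a description of any particular TPM product; the Baseline fields and the choice of median, 95th percentile and success rate are illustrative.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass
class Baseline:
    median_ms: float
    p95_ms: float
    success_rate: float


def build_baseline(samples):
    """Summarize (response_ms, succeeded) samples; needs at least two samples."""
    times = [ms for ms, _ in samples]
    # With n=20 the last cut point is the 95th percentile.
    p95 = quantiles(times, n=20, method="inclusive")[-1]
    succeeded = sum(1 for _, ok in samples if ok)
    return Baseline(median(times), p95, succeeded / len(samples))
```

A baseline captured before the migration gives IT and the business a shared, numeric starting point for agreeing on service levels.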

For example, a global financial organization is using a TPM package to measure the quality of its ERP and financial systems. Each time the organization changes the application or infrastructure, it uses a performance management database to review past performance and set expectations for new performance. It then uses the same metrics and data to prove that quality has not deteriorated after the change, such as when transitioning an application to the cloud. What matters most in the cloud is not CPU utilization rates or even cost savings, but how well transactions run and their impact on the bottom line.
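The before-and-after comparison described here can be sketched as a simple check over named metrics. The metric names, the 10 percent tolerance and the assumption that lower values are better are all illustrative choices, not details from the organization in the example.

```python
def regressions(before, after, tolerance=0.10):
    """Return the metrics that worsened by more than `tolerance`.

    Both arguments map metric names to values where lower is better
    (for example response time or error rate).
    """
    return sorted(
        name
        for name, old in before.items()
        if name in after and after[name] > old * (1 + tolerance)
    )
```

An empty result after a migration is the kind of evidence that quality has not deteriorated; a non-empty one tells IT which metrics to investigate first.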

With TPM, an organization can associate transactions with users and trace transaction paths through both virtual machines and physical servers, in real time. Here are a few examples of how this automation works:

Co-provisioning: When a new virtual machine is provisioned, the application monitoring software will automatically appear and follow that VM wherever it goes, making the process of monitoring automatic from the physical to the virtual world.

Historical analysis: One of the problems with virtual systems is that they come and go. A virtual machine here in the morning may be decommissioned in the afternoon, and all of the relevant monitoring data within it also disappears. However, monitoring automation should retain that performance data so that you can go back and understand what happened during a transaction window, helping you determine the source of the problem. This is akin to stopping a video and replaying a scene. Without postmortem analysis, application monitoring is unable to effectively pinpoint the cause and suggest remedies.
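A minimal sketch of the retention idea follows, assuming the monitoring data is kept in a store outside the virtual machine. The MetricHistory class and its method names are hypothetical; the point is only that decommissioning a VM marks it as retired without discarding its samples.

```python
from collections import defaultdict


class MetricHistory:
    """Keeps response-time samples outside the VM so they outlive it."""

    def __init__(self):
        self._samples = defaultdict(list)  # vm_id -> [(timestamp, response_ms)]
        self._retired = set()

    def record(self, vm_id, timestamp, response_ms):
        self._samples[vm_id].append((timestamp, response_ms))

    def decommission(self, vm_id):
        # Mark the VM as gone but keep everything it reported.
        self._retired.add(vm_id)

    def window(self, start, end):
        """Return (vm_id, timestamp, response_ms) rows in [start, end], oldest first."""
        rows = [
            (vm_id, ts, ms)
            for vm_id, samples in self._samples.items()
            for ts, ms in samples
            if start <= ts <= end
        ]
        return sorted(rows, key=lambda row: row[1])
```

Querying a past time window then returns data from retired VMs alongside live ones, which is what makes the replay possible.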

Linking a virtual event with application performance: TPM should allow you to determine the impact of the virtual environment on your applications. For instance, if an application calls for a new virtual machine, you should be able to find out when that happened and if it caused any changes to the application or its performance.
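To make the link between a virtual event and application performance concrete, the sketch below compares average response time just before and just after an event such as a VM being provisioned. The function name, the fixed comparison window and the use of a simple mean are assumptions made for illustration.

```python
from statistics import mean


def impact_of_event(samples, event_time, window):
    """Change in mean response time across an event.

    `samples` is a list of (timestamp, response_ms). Returns the mean of
    the `window` after the event minus the mean of the `window` before it,
    or None if either side has no data.
    """
    before = [ms for ts, ms in samples if event_time - window <= ts < event_time]
    after = [ms for ts, ms in samples if event_time <= ts < event_time + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)
```

A positive result suggests response times rose after the event; it does not prove the event caused the change, only that the two coincide.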

Problem isolation: In the cloud, servers are typically arranged in a cluster architecture, which can make it difficult to determine which server or VM is to blame when troubleshooting. Yet, just as in the physical environment, monitoring tools need to distinctly isolate where, when, and how an application problem occurred to facilitate a quick fix.
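As a simplified picture of isolating the "where", the sketch below groups transaction response times by the node that served them and reports the node with the highest median. The function name and the (node, response_ms) input shape are hypothetical, and real tools would also consider when and how the problem occurred.

```python
from collections import defaultdict
from statistics import median


def slowest_node(transactions):
    """Return (node, median_ms) for the node with the highest median response time."""
    by_node = defaultdict(list)
    for node, response_ms in transactions:
        by_node[node].append(response_ms)
    medians = {node: median(values) for node, values in by_node.items()}
    worst = max(medians, key=medians.get)
    return worst, medians[worst]
```

Using the median rather than the mean keeps a single outlier transaction from pointing the finger at the wrong server.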

Defining and managing QoS in the cloud is a moving target. But it's also a natural evolution of the cloud computing adoption cycle. As cloud technologies mature, so will the means and best practices to ensure that the cloud environment is as reliable, secure and manageable as the physical infrastructure.