Virtualization of x86 servers is ubiquitous for obvious reasons. Cost savings, efficiencies in provisioning virtual machines (VMs), recoverability, and the ability to move workloads are a few of them. There are, however, key VM and host metrics and events you should keep an eye on if you suspect your database performance is being impacted by running in a virtual machine. Let’s walk through those metrics using VMware ESXi as the basis of the discussion.
The beauty of hypervisors such as VMware’s ESXi is that they do a good job of hiding from the OS and the database the fact that they’re running in a VM rather than on a physical server. That same transparency is why the following CPU-related metrics are important to monitor: contention on the host may not be visible from inside the guest.
CPU Ready: This metric shows how long the VM (and the database trying to run inside it) was ready to run but instead sat idle, waiting to be scheduled on a physical CPU while other VMs contended for the same shared processors. A common cause is oversubscription, which means you’ve assigned more virtual CPUs than there are physical CPUs available to run all VMs concurrently. It may seem counterintuitive, but reducing the number of vCPUs assigned to a VM may dramatically improve its performance. As a general guideline, CPU Ready shouldn’t exceed about 5% per vCPU.
Host CPU Usage: This is the CPU actively used on the host as a percentage of the total CPU available on the machine. If this number is high, you might see VMs with high CPU Ready and/or Co-Stop values. Active CPU is approximately the ratio of used CPU to available CPU, where Available CPU = number of physical CPUs × clock rate.
Co-Stop: This is the amount of time a multi-vCPU VM was ready to run but was held back because the hypervisor couldn’t schedule its vCPUs together due to a lack of physical CPU resources. In other words, your VM can be waiting for physical CPUs that are in use by other VMs. If you see high Co-Stop together with high Host CPU Usage, it is probably a sign that there are too many VMs on this host and/or you need more physical CPU resources.
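To make the CPU metrics concrete, here is a minimal Python sketch that converts raw counter values into the percentages discussed above and applies the checks described in the text. It assumes CPU Ready and Co-Stop arrive as millisecond “summation” values over a 20-second sample (the interval vCenter uses for real-time charts) and that each physical core counts as one physical CPU. The 5% CPU Ready guideline comes from the text; the 3% Co-Stop and 90% Host CPU Usage thresholds are hypothetical placeholders, as are all function and constant names.

```python
"""Sketch: turn raw ESXi CPU counters into the percentages discussed above.

Assumptions (not a VMware API): CPU Ready and Co-Stop are millisecond
"summation" values for one sampling interval; thresholds are placeholders.
"""

REALTIME_INTERVAL_S = 20.0      # assumed real-time chart sample length
READY_THRESHOLD_PCT = 5.0       # rule of thumb from the text
COSTOP_THRESHOLD_PCT = 3.0      # hypothetical threshold, tune for your site
HOST_CPU_THRESHOLD_PCT = 90.0   # hypothetical threshold, tune for your site


def summation_to_percent(value_ms: float, vcpus: int = 1,
                         interval_s: float = REALTIME_INTERVAL_S) -> float:
    """Share of the interval, per vCPU, spent in the measured state."""
    if vcpus <= 0 or interval_s <= 0:
        raise ValueError("vcpus and interval_s must be positive")
    return value_ms / (interval_s * 1000.0 * vcpus) * 100.0


def host_cpu_usage_percent(used_mhz: float, physical_cpus: int,
                           clock_mhz: float) -> float:
    """Used CPU divided by available CPU (physical CPUs x clock rate)."""
    available_mhz = physical_cpus * clock_mhz
    if available_mhz <= 0:
        raise ValueError("available CPU must be positive")
    return used_mhz / available_mhz * 100.0


def diagnose_cpu(ready_ms: float, costop_ms: float, vcpus: int,
                 host_usage_pct: float) -> list[str]:
    """Return human-readable findings for one VM sample."""
    findings = []
    ready_pct = summation_to_percent(ready_ms, vcpus)
    costop_pct = summation_to_percent(costop_ms, vcpus)
    host_busy = host_usage_pct >= HOST_CPU_THRESHOLD_PCT
    if ready_pct > READY_THRESHOLD_PCT:
        findings.append(f"CPU Ready {ready_pct:.1f}% per vCPU exceeds "
                        f"{READY_THRESHOLD_PCT:.0f}%")
    if costop_pct > COSTOP_THRESHOLD_PCT:
        if host_busy:
            # Matches the text: too many VMs or too few physical CPUs
            findings.append("High Co-Stop on a busy host: too many VMs "
                            "or too few physical CPUs")
        else:
            findings.append("High Co-Stop: review this VM's vCPU count")
    return findings


if __name__ == "__main__":
    host_pct = host_cpu_usage_percent(used_mhz=33_280, physical_cpus=16,
                                      clock_mhz=2_600)
    # A 4-vCPU VM with 8,000 ms ready and 2,800 ms co-stop in one sample
    for line in diagnose_cpu(8_000, 2_800, vcpus=4, host_usage_pct=host_pct):
        print(line)
```

With these sample values the host sits at 80% usage, so the VM’s 10% CPU Ready per vCPU is flagged and its 3.5% Co-Stop is attributed to the VM’s own vCPU count rather than to host overload.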
Memory management is a strength of ESXi and other hypervisors, assuming you have enough physical memory on the server to support the virtual machines running on it. Generally, you shouldn’t see an impact on your database’s VM as long as the VM memory swap rate stays low; a high rate means the VM is paying the painful overhead of having its memory swapped in and out from disk. Watch the following five metrics to make sure your VM isn’t suffering from memory pressure.
VM Memory Swap Rate: Nonzero “swap in” and “swap out” rates generally mean the host is short of physical memory, so it is swapping VM memory to and from disk.
VM Active Memory Usage: This is the memory in use as a percent of the memory configured for the VM.
Host Memory Usage: This is the memory usage on the host (consumed memory / total machine memory). If this is high (e.g., greater than 90%), it could indicate host memory over-commit, which could lead to high VM swap rates.
VM Memory Overhead: This is the extra memory ESXi needs to run the VM on top of the memory configured for the guest, for example for the virtual machine frame buffer. Over-configuring memory (or vCPUs, for that matter) unnecessarily increases this overhead.
VM Memory Balloon: The balloon driver, running inside the guest, reclaims the pages the guest OS considers least valuable. The goal of this VMware technique is to let the guest OS choose which pages to give up, just as it would under its own memory pressure. You should only see ballooning when the host is running low on or out of physical memory. If some percentage of the memory in your database’s VM has been claimed by the balloon driver, check for memory swapping, which could affect the VM’s performance. If you don’t see any swapping, you don’t necessarily have a performance problem.
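The memory checks can be sketched the same way. This minimal Python example assumes you already collect the counters from some monitoring source (the MemorySample field names and units are hypothetical) and encodes the rules above: any swapping is a red flag, host usage above 90% suggests over-commit, and ballooning matters mainly when it coincides with swapping. VM Memory Overhead is left out because the text gives no threshold for it.

```python
from dataclasses import dataclass

HOST_MEM_THRESHOLD_PCT = 90.0  # from the "greater than 90%" example above


@dataclass
class MemorySample:
    """One sample of VM and host memory counters (hypothetical fields)."""
    vm_configured_mb: float
    vm_active_mb: float
    vm_balloon_mb: float
    vm_swap_in_kbps: float
    vm_swap_out_kbps: float
    host_consumed_mb: float
    host_total_mb: float


def percent(part: float, whole: float) -> float:
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole * 100.0


def report_memory(s: MemorySample) -> list[str]:
    """Summarize the memory metrics and flag likely trouble."""
    active_pct = percent(s.vm_active_mb, s.vm_configured_mb)
    lines = [f"Active memory: {active_pct:.0f}% of configured"]

    # Any nonzero swap rate means the host is short of physical memory
    swapping = s.vm_swap_in_kbps > 0 or s.vm_swap_out_kbps > 0
    if swapping:
        lines.append("Swapping: host is short of physical memory")

    host_pct = percent(s.host_consumed_mb, s.host_total_mb)
    if host_pct > HOST_MEM_THRESHOLD_PCT:
        lines.append(f"Host memory usage {host_pct:.0f}%: possible over-commit")

    # Ballooning alone is not necessarily a problem; with swapping it likely is
    if s.vm_balloon_mb > 0:
        balloon_pct = percent(s.vm_balloon_mb, s.vm_configured_mb)
        if swapping:
            lines.append(f"Balloon {balloon_pct:.0f}% with swapping: "
                         "likely performance impact")
        else:
            lines.append(f"Balloon {balloon_pct:.0f}% without swapping: "
                         "not necessarily a problem")
    return lines


if __name__ == "__main__":
    sample = MemorySample(vm_configured_mb=40_000, vm_active_mb=10_000,
                          vm_balloon_mb=4_000, vm_swap_in_kbps=120,
                          vm_swap_out_kbps=0, host_consumed_mb=245_000,
                          host_total_mb=256_000)
    for line in report_memory(sample):
        print(line)
```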
Disk latency at the physical host can have a negative impact on VM and database performance.
Host Max Disk Latency: This is the highest latency value across all disks used by this host.
Host Disk Latency: Read latency is the average time taken to process a read command issued to a disk by the host (across all VMs). High disk latency indicates that storage may be slow or overloaded. Write latency is the equivalent for writes: the average time taken to process a write command issued to the disk, across all VMs, where Disk Write Latency = Kernel Write Latency + Device Write Latency. Expected disk latencies depend on the nature of the I/O, such as the read/write mix, randomness, and I/O size, along with the capability of the storage subsystem.
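For disk latency, a short Python sketch can show how the write figure is assembled from its kernel and device parts and how the host-wide maximum is picked. The DiskLatency fields and disk names are hypothetical, and the code deliberately applies no fixed latency threshold, since acceptable values depend on the workload and the storage subsystem.

```python
from dataclasses import dataclass


@dataclass
class DiskLatency:
    """Average latencies for one disk, across all VMs (hypothetical fields)."""
    name: str
    read_ms: float
    kernel_write_ms: float   # time spent in the ESXi kernel
    device_write_ms: float   # time spent at the physical device

    @property
    def write_ms(self) -> float:
        # Disk Write Latency = Kernel Write Latency + Device Write Latency
        return self.kernel_write_ms + self.device_write_ms

    @property
    def worst_ms(self) -> float:
        return max(self.read_ms, self.write_ms)


def host_max_disk_latency(disks: list[DiskLatency]) -> tuple[str, float]:
    """Return the disk with the highest latency and that latency in ms."""
    if not disks:
        raise ValueError("no disks supplied")
    worst = max(disks, key=lambda d: d.worst_ms)
    return worst.name, worst.worst_ms


if __name__ == "__main__":
    disks = [
        DiskLatency("disk-a", read_ms=4.0, kernel_write_ms=0.5, device_write_ms=6.0),
        DiskLatency("disk-b", read_ms=12.0, kernel_write_ms=1.0, device_write_ms=20.0),
    ]
    for d in disks:
        print(f"{d.name}: read {d.read_ms} ms, write {d.write_ms} ms")
    print("Host max disk latency:", host_max_disk_latency(disks))
```

Splitting write latency into kernel and device parts is useful in practice: a high device component points at the storage array, while a high kernel component points at queuing inside the host.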
Understand the Metrics
DBAs and those responsible for database performance don’t have to be virtualization administrators or “VM gurus,” but they do need to understand these metrics well enough to confirm that their database instance isn’t being impacted by the VM it runs in.