Avoiding Slow VM Performance - Identifying Issue Root Cause (Part 2)

"We're hitting a performance issue on our 4.1 ESXi Cluster issue with storage. We are using EMC Clariion CX3-80 RAID5 and have some metaLUN's with datastores and some VM's running with RDM. What were finding on these RDM's is very high 'disk queue length figures' showing on both the SAN LUNs and the VM's themselves...

Any suggestions on what the cause might be..."

- From "Storage Performance Monitoring help pls....", posted April 19, 2011, on the VMware Community Forums

Last week, I wrote about how proactively monitoring VMware performance could help avoid slow VMs. However, as the VMware Community Forum post above shows (and similar performance-related postings appear every week), even with the most thorough monitoring, VM or VMware storage performance issues can still occur and ruin a virtualization administrator’s day. Depending on which applications or datastores are affected, an organization may be losing hundreds, thousands, or even millions of dollars for every hour that an application is either down or so stricken by slow VMware performance that it is nearly unusable. As the data center scrambles to resolve a VM performance issue, hours can seem like days as system logs, resource utilization data, and VMware knowledge base articles are scoured in an attempt to get to the heart of the issue.

What makes this troubleshooting process so difficult and tedious is that virtualized environments are very complex, produce huge amounts of data to document what is occurring within the system, and are constantly changing through dynamic resource re-allocation actions such as DRS and vMotion. To add even more intricacy, VMs within a host are "connected" through the memory and CPU resources they share, and several VMs or hosts may be connected to entirely separate areas of the virtualized infrastructure through datastores that are shared within the SAN. Importantly, this means that:

  1. When a VMware performance issue strikes a VM, that issue may also affect other VMs or datastores. When this occurs, several vCenter alerts may be triggered around the same time, resulting in an “alert storm”.
  2. A VM performance problem may be caused by another VM that appears to be performing normally. For example, a VM that needed more memory may have caused the host to balloon another VM on the same host, reclaiming memory from it; that second VM then experienced slow VMware performance once its own utilization increased. (A quick way to check for this situation is sketched below.)
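
As a concrete illustration of that second point, active ballooning can be spotted programmatically through the vSphere API. The snippet below is a minimal sketch using pyVmomi; the vCenter hostname and credentials are placeholders, and disabling certificate verification is for lab use only.

```python
# Minimal sketch: list VMs that are currently being ballooned, using pyVmomi.
# "vcenter.example.com" and the credentials below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=context)
try:
    content = si.RetrieveContent()
    # Walk every VM in the inventory.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        stats = vm.summary.quickStats
        if stats.balloonedMemory:  # MB reclaimed from the guest by the balloon driver
            print(f"{vm.name}: {stats.balloonedMemory} MB ballooned, "
                  f"{stats.swappedMemory} MB swapped")
finally:
    Disconnect(si)
```

Any VM that shows nonzero ballooned or swapped memory here is a candidate "hidden" contributor to a neighbor's slowdown, even if its own alerts look clean.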

Due to this interconnected nature of virtual machines, the first thing that must be done when a VMware performance problem occurs is to identify the “root cause” of the performance problem. In many cases, finding and implementing a fix for this one issue may resolve all related vCenter alerts in an “alert storm”.

What Needs to Be Done

Getting to the root cause can require a significant amount of investigation. The first step is to evaluate all alerts and collect system metrics for the related virtual objects that have been affected. (There are 20 metrics that should be evaluated for each VM that has an alert.) Next, by looking at which areas are flagged as having issues and eliminating resource areas that are not problematic, administrators can narrow down the possible causes. This is particularly important for data centers that have separate network or storage teams, since those teams may need to be called in to help with troubleshooting. After the infrastructure area where the issue is occurring has been identified, virtualization administrators can assess the individual issues chronologically, analyzing resource usage metrics to trace back through the history and spot the initial "root cause" of the problem. A sketch of how these metrics can be pulled programmatically follows.
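
To make the "collect metrics and review them chronologically" step more concrete, the sketch below pulls a few commonly watched counters for a single VM from vCenter's PerformanceManager and prints each sample with its timestamp. It assumes a pyVmomi service-instance connection (`si`) like the one in the earlier snippet; the counter list and the `query_recent_metrics` helper name are illustrative choices, not an official set.

```python
# Hedged sketch: fetch recent samples of a few key per-VM counters so they
# can be reviewed in time order when tracing back to a root cause.
from datetime import datetime, timedelta
from pyVmomi import vim

WATCHED = ["cpu.ready.summation", "mem.vmmemctl.average",
           "disk.maxTotalLatency.latest"]  # illustrative counter choices

def query_recent_metrics(si, vm, hours=1):
    perf = si.RetrieveContent().perfManager
    # Map "group.name.rollup" strings to numeric counter IDs.
    by_name = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
               for c in perf.perfCounter}
    metric_ids = [vim.PerformanceManager.MetricId(counterId=by_name[n], instance="")
                  for n in WATCHED if n in by_name]
    spec = vim.PerformanceManager.QuerySpec(
        entity=vm,
        metricId=metric_ids,
        startTime=datetime.utcnow() - timedelta(hours=hours),
        intervalId=20)  # 20-second "real-time" samples
    for entity_metric in perf.QueryPerf(querySpec=[spec]):
        timestamps = [s.timestamp for s in entity_metric.sampleInfo]
        for series in entity_metric.value:
            name = next(n for n, k in by_name.items()
                        if k == series.id.counterId)
            # Pair each sample with its timestamp so spikes can be traced in order.
            for ts, val in zip(timestamps, series.value):
                print(f"{ts}  {name} = {val}")
```

Sorting the output of several affected VMs by timestamp is one simple way to see which object degraded first, which is usually the best lead on the root cause.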

Much more detail about how to go through the troubleshooting process methodically, which VMware metrics to assess, and how to access them is available in our recently released whitepaper, The Top 5 VM Performance Problems, written by David Davis of VMwareVideos.com.

Automating the Process

The process described above can be very time-consuming. With vOPS Performance Analyzer, VKernel set out to dramatically speed up troubleshooting steps that can take hours when performed manually. Performance Analyzer polls vCenter every 2 minutes to see whether any issues have been discovered. If so, it immediately retrieves all relevant system metrics, isolates the problem area, and runs a decision algorithm to locate the root cause of the issue. Importantly, Performance Analyzer also recommends a fix for the root cause, and in some cases that fix can be implemented automatically through the vCenter API with the click of a button. The end result is a single screen showing the one root cause behind what may be multiple vCenter alerts, found in seconds rather than the hours the same work would take by hand.
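
For readers curious what a 2-minute polling loop against vCenter might look like in the abstract, here is a rough sketch built on pyVmomi's triggered-alarm state. It only illustrates the general polling idea, not how Performance Analyzer is actually implemented, and the `poll_triggered_alarms` helper name is hypothetical.

```python
# Illustrative sketch only: poll vCenter for newly triggered alarms on a fixed
# interval. This is a generic example, not the Performance Analyzer internals.
import time

POLL_SECONDS = 120  # check every 2 minutes

def poll_triggered_alarms(si):
    seen = set()
    while True:
        root = si.RetrieveContent().rootFolder
        # triggeredAlarmState rolls up triggered alarms for the whole inventory tree.
        for state in root.triggeredAlarmState:
            if state.key not in seen:
                seen.add(state.key)
                print(f"{state.time}  {state.entity.name}: "
                      f"{state.alarm.info.name} ({state.overallStatus})")
                # A real tool would now gather metrics for state.entity and its
                # neighbors, isolate the problem area, and work back to the root cause.
        time.sleep(POLL_SECONDS)
```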

Stay tuned for the concluding VM performance problem blog where we’ll delve into assessing the impact of a VM performance issue on the environment.
