Sudden changes in system behavior can often be traced back to “Change Events” that triggered them. Viewing Change Events alongside key performance metrics is a powerful troubleshooting technique for identifying the infrastructure changes that caused a shift in system behavior.
Foglight for Virtualization, Enterprise Edition includes event tracking functionality that not all users may know about. The “Event Analytics” tabs for Virtual Machines and ESX hosts can overlay historical Alarms and Infrastructure Changes on the trends of selected performance metrics for CPU, Memory, Network and Disk.
Let’s walk through a use case to explore the value of the Event Analytics function when troubleshooting performance problems. An application user reports that their application is slow. We start by looking at the VM in question and the ESX host it runs on. By selecting the Event Analytics tab for the ESX host and enabling the Infrastructure and Alarms overlays on the selected metric(s), we can check whether any alarms fired or any infrastructure changes occurred that might be the root cause of the performance problem. As the screenshot below shows, we can quickly correlate Change Events with historical metric behavior.
In this case, the graph shows several Infrastructure changes and Alarms around 12:00 PM. In the above diagram, the “Active Memory” metric shows a sudden change in behavior, and several Alarms fired to notify the user of that change.
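To make the idea of correlating Change Events with a metric concrete, here is a minimal Python sketch of the same reasoning done by hand. It is not how Foglight implements Event Analytics; the sample values, the event list, the function names `largest_shift` and `events_near`, and the 30-minute tolerance are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean

def largest_shift(samples, window=3):
    """Find the sample time where the metric's level changes the most.

    Compares the mean of the `window` samples before each point with the
    mean of the `window` samples starting at that point.
    """
    best_time, best_delta = None, 0.0
    for i in range(window, len(samples) - window + 1):
        before = mean(v for _, v in samples[i - window:i])
        after = mean(v for _, v in samples[i:i + window])
        delta = abs(after - before)
        if delta > best_delta:
            best_time, best_delta = samples[i][0], delta
    return best_time, best_delta

def events_near(events, when, tolerance=timedelta(minutes=30)):
    """Return the events whose timestamp is within `tolerance` of `when`."""
    return [e for e in events if abs(e[0] - when) <= tolerance]

# Hypothetical "Active Memory" samples (GB), one every 30 minutes from 09:00.
start = datetime(2024, 1, 1, 9, 0)
values = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 6.0, 6.2, 6.1, 5.9, 6.0, 6.1, 6.0]
samples = [(start + timedelta(minutes=30 * i), v) for i, v in enumerate(values)]

# Hypothetical change events: (timestamp, category, description).
events = [
    (datetime(2024, 1, 1, 9, 30), "infrastructure", "Unrelated configuration change"),
    (datetime(2024, 1, 1, 11, 55), "infrastructure", "New VM created on host"),
    (datetime(2024, 1, 1, 12, 5), "alarm", "Active Memory outside normal range"),
]

shift_time, shift_size = largest_shift(samples)
candidates = events_near(events, shift_time)
for when, category, description in candidates:
    print(f"{when:%H:%M} [{category}] {description}")
```

The point of the sketch is the two-step shape of the investigation: first locate when the metric changed, then narrow the event list to what happened around that time.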
We can investigate further by selecting the “Active Memory” metric and clicking on a star to see the root cause of the Alarm. The following screenshot shows that the Alarm fired because memory utilization was outside its normal operational range. As you may know, IntelliProfile analytics in Foglight provides the baselining capabilities used to identify normal behavior. When the “Active Memory” metric breaks out of its normal pattern, an Alarm is generated to notify the user of the behavioral change.
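The notion of a "normal operational range" can be illustrated with a very simple baseline. The sketch below is not IntelliProfile's algorithm, which the product does not expose here; it assumes a plain mean plus or minus three standard deviations over a hypothetical history, and the names `normal_range` and `out_of_range` are invented for this example.

```python
from statistics import mean, stdev

def normal_range(history, k=3.0):
    """Return (low, high) as mean -/+ k sample standard deviations.

    This is a simplified stand-in for a learned baseline, not the
    method used by Foglight.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma

def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the [low, high] band."""
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical historical "Active Memory" readings (GB) under normal load.
history = [2.0, 2.1, 1.9, 2.0, 2.2, 2.1, 1.8, 2.0]
low, high = normal_range(history)

# Hypothetical recent readings; the last two follow the behavior change.
recent = [2.1, 2.0, 6.0, 6.2]
flagged = out_of_range(recent, low, high)
for index, value in flagged:
    print(f"sample {index}: {value} GB is outside {low:.2f}-{high:.2f} GB")
```

A real baseline would typically also account for time-of-day and day-of-week patterns, which this fixed band ignores.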
Continuing the investigation, the following screenshot shows the impact of the change on Disk Metrics. The latency associated with “Write Rate” clearly indicates a performance problem.
The next screenshot shows the impact of this infrastructure change on network metrics, where the Network Transfer Rate is clearly abnormal.
By looking at the infrastructure changes, we can quickly see that the ESX host performance change is related to the creation of a new Virtual Machine on that host. The new VM has changed the dynamics of resource utilization on that ESX host, which has resulted in performance degradation in the neighboring VMs. Now that the root cause of the problem is identified, we can choose among several ways to correct it, such as moving the new VM to another ESX host.