Troubleshooting VM Performance with Foglight for Storage Management 3.0

In my last blog post, I covered the new storage array support being added to the upcoming release of Foglight for Storage Management (FSM) 3.0. Now I’d like to talk about the Storage Troubleshooting feature that we’ve added to FSM 3.0.

This new Storage Troubleshooting feature enables you to determine whether slow VM performance is caused by your storage subsystem and helps you to fix the problem. Imagine you experienced slow VM performance yesterday around 2:30pm. Using our time travel feature, you can scroll back to that exact timeframe to see what was happening in your storage subsystem when VM performance was slow.

Then you can select the slow VM from a drop-down box, edit the latency thresholds if you wish, and click the “Perform Analysis” button. Foglight responds with a performance analysis for every datastore and RDM attached to that particular VM. In this case, your VM is attached to a single datastore. We can see that during the timeframe you examined, your datastore experienced latency that exceeded the thresholds you set. If the “vs threshold” line is green, your latency is below the thresholds. If it’s yellow, you’ve exceeded the Warning threshold. And if it’s orange, you’ve exceeded the Critical threshold. We can see that your VM was indeed experiencing latency above the Critical threshold at 2:30pm, but that it dropped back to acceptable levels by 3:30pm.
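To make the color coding concrete, here is a minimal sketch of how a latency sample could be mapped to the “vs threshold” colors described above. This is not Foglight’s code; the function name, the threshold values, and the sample data are all assumptions made up for illustration.

```python
def classify_latency(latency_ms, warning_ms, critical_ms):
    """Map one latency sample to a "vs threshold" color.

    green  = at or below the Warning threshold
    yellow = above Warning, but not above Critical
    orange = above the Critical threshold
    """
    if latency_ms > critical_ms:
        return "orange"
    if latency_ms > warning_ms:
        return "yellow"
    return "green"


# Hypothetical datastore latency samples (time of day, milliseconds).
samples = [("14:00", 12.0), ("14:30", 55.0), ("15:00", 28.0), ("15:30", 9.0)]

# Hypothetical thresholds: Warning at 20 ms, Critical at 40 ms.
colors = {
    time: classify_latency(ms, warning_ms=20.0, critical_ms=40.0)
    for time, ms in samples
}

for time, color in colors.items():
    print(time, color)
```

With these made-up numbers, the 2:30pm sample lands in orange and the 3:30pm sample is back in green, which mirrors the scenario in this post.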

We can also compare the VM latency during this timeframe against the latency the VM has typically experienced at this same time of day over the past 30 days. This matters because your VM may always exceed the latency thresholds at this time of day due to some periodic event in your storage subsystem (scheduled backups, for example). If the “vs typical” line is green, your latency is equal to or less than what you normally experience. If it’s pink, your latency is worse than what you normally experience. We can see that your VM was experiencing latency that was worse than usual at 2:30pm, but that it dropped back to within normal levels by 3:30pm.
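The “vs typical” idea can be sketched in a few lines. I’m assuming here that “typical” means the simple average of earlier samples taken at the same time of day; how FSM actually computes its baseline isn’t covered in this post, and the function names and numbers below are hypothetical.

```python
from statistics import mean


def typical_latency(history_ms):
    """Baseline for one time of day: the plain average of past samples (an assumption)."""
    return mean(history_ms)


def vs_typical(current_ms, history_ms):
    """Return "green" if current latency is at or below typical, "pink" if worse."""
    return "green" if current_ms <= typical_latency(history_ms) else "pink"


# Hypothetical latency (ms) seen at the same time of day on earlier days.
history_1430 = [18.0, 22.0, 20.0]
history_1530 = [9.0, 11.0, 10.0]

result_1430 = vs_typical(55.0, history_1430)  # yesterday at 2:30pm
result_1530 = vs_typical(9.0, history_1530)   # yesterday at 3:30pm

print("14:30", result_1430)
print("15:30", result_1530)
```

A real baseline would draw on the full 30 days of history, but the comparison itself stays the same.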

Finally, we can run a similar analysis for latency at the disk extent connected to this datastore. In this case, latency at the disk extent exceeds the thresholds that you set.

Based on the analysis, the Storage Troubleshooting feature can pinpoint whether the slow VM performance is due to inadequate resources allocated within the ESX host or whether it’s related to the storage subsystem. If the performance issue is caused by your storage subsystem, you can click the “Analyze SAN Storage” button for a deeper analysis.

This view displays latency measurements for the LUN attached to this VM. If the storage array is reporting no excessive latency at the LUN level, you know that the problem resides within your fabric. However, if the storage array is reporting that the LUN is experiencing high latency, the problem resides within the array. We see here that the LUN is reporting latency that exceeds the user-defined thresholds and is higher than what we typically see at this time of day.

In this case, we also report a “vs typical” comparison of disk IOPS for the storage pool within which the LUN resides. We see here that the pool for this VM’s LUN is experiencing higher than normal IOPS. This could mean that other LUNs within the pool are consuming excessive IOPS, which results in high latency for your VM’s LUN. To dig deeper, you can click either the “Perform Pool Change Analysis” or the “Perform Pool Load Analysis” button.

The Pool Change Analysis shows you which LUNs and NAS Volumes in this pool have experienced significant changes in IOPS compared to a user-configurable time period in the past. This helps identify, for example, if another LUN in the pool started driving more traffic than normal, causing your VM to suffer poor performance. In this case, we see that the LUN named Kappa-Lun/LUN1 (carved out of NAS volume /vol/vol1 on this NetApp filer) has experienced a large increase in average IOPS consumed during this time period.
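Here is a rough sketch of the kind of comparison a pool change analysis performs. It is only an illustration: the function name, the 50% cutoff for a “significant” change, the IOPS figures, and the neighbor names lun-a and lun-b are all assumptions, not values taken from FSM.

```python
def pool_change_analysis(current_iops, past_iops, min_change_pct=50.0):
    """List pool members whose average IOPS changed significantly.

    current_iops / past_iops: dicts mapping a LUN or NAS volume name to its
    average IOPS in the examined window and in the comparison window.
    Returns (name, percent_change) pairs, largest change first.
    """
    changes = []
    for name, now in current_iops.items():
        before = past_iops.get(name)
        if not before:
            # No baseline (missing or zero), so a percent change is undefined.
            continue
        pct = (now - before) / before * 100.0
        if abs(pct) >= min_change_pct:
            changes.append((name, pct))
    return sorted(changes, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical average IOPS per pool member.
current = {"Kappa-Lun/LUN1": 900.0, "lun-a": 210.0, "lun-b": 95.0}
past = {"Kappa-Lun/LUN1": 300.0, "lun-a": 200.0, "lun-b": 100.0}

significant = pool_change_analysis(current, past)
for name, pct in significant:
    print(f"{name}: {pct:+.0f}% change in average IOPS")
```

With these sample numbers, only Kappa-Lun/LUN1 crosses the cutoff, so it is the only entry reported.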

The Pool Load Analysis identifies the LUNs and NAS Volumes that are typically the busiest during this time period, based on the last 30 days of history. Again, this can identify other occupants of the same pool as your LUN that could be contributing to the poor performance of your VM. And, again, we see that the LUN named Kappa-Lun/LUN1 is a heavy IOPS consumer and a possible culprit behind your VM’s performance problem.
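The ranking behind a pool load analysis can be sketched the same way. I’m assuming that “busiest” means the highest average IOPS across the historical samples for this time period; the function name, the sample data, and the neighbor names lun-a and lun-b are hypothetical.

```python
from statistics import mean


def pool_load_analysis(history, top_n=3):
    """Rank pool members by average historical IOPS, busiest first.

    history: dict mapping a LUN or NAS volume name to a list of IOPS samples
    taken during the same time period on earlier days.
    """
    averages = {name: mean(samples) for name, samples in history.items() if samples}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical IOPS samples for the same time period on three earlier days.
history = {
    "Kappa-Lun/LUN1": [800.0, 900.0, 1000.0],
    "lun-a": [200.0, 220.0, 180.0],
    "lun-b": [90.0, 100.0, 110.0],
}

busiest = pool_load_analysis(history, top_n=2)
for name, avg in busiest:
    print(f"{name}: {avg:.0f} average IOPS")
```

The change analysis asks what is different from before, while this ranking asks who is usually the heaviest consumer; a LUN that shows up in both is a strong suspect.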

As you can see, the new Storage Troubleshooting feature being released with Foglight for Storage Management 3.0 is an easy-to-use tool for explaining and resolving slow VM performance. For more information about FSM, please visit our product page located at