Understanding Foglight Metrics

In this article, Quest Principal Architect Brian Wheeldon discusses Foglight metrics: what they are, what elements they contain, how they're managed, what they mean and how they can be visualized.

Metric Definition

A metric is a particular type of observation representing a numeric value over time.

A metric may originate as a series of observations collected by an agent (raw metric), or it may be calculated in the FMS (Foglight Management Server) based on other observations (derived metric).

The most basic way to examine a metric in Foglight is to open the Data browser under "Configuration" and open one of the branches of the data tree until you find and select a metric.

This is what it looks like:

In this screenshot, the Host/Context Switches metric is selected in the left panel and displayed on the right side of the page.

This metric has four sub-elements: Current, Latest, Period, and History.

Each of these sub-elements is a MetricValue or a list of MetricValues.

Current

When a metric observation arrives in the FMS, it looks like the Current value in this screenshot.

The current property is the most recent entry that applies to the current time range, as long as it hasn't gone stale. The most recent value is the one with the greatest endTime that is equal to or before the end of the selected time range, and could be before the start of the selected time range.

If an agent is deactivated, then new metrics from that agent will stop arriving in the FMS. When that happens, the "Current" value will go stale and disappear when the last observation is less recent than the end of the selected time range minus three times the sample period.
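
The selection and staleness rules above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not Foglight code: the MetricValue class, its field names and the current_value function are hypothetical, and times are assumed to be epoch milliseconds.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MetricValue:
    """Hypothetical, simplified stand-in for a Foglight MetricValue."""
    start_time: int      # epoch milliseconds
    end_time: int        # epoch milliseconds
    sampled_period: int  # milliseconds
    average: float


def current_value(values: Sequence[MetricValue],
                  range_end: int) -> Optional[MetricValue]:
    """Return the non-stale value with the greatest end_time <= range_end."""
    candidates = [v for v in values if v.end_time <= range_end]
    if not candidates:
        return None
    latest = max(candidates, key=lambda v: v.end_time)
    # Stale: the observation ended earlier than the end of the selected
    # time range minus three times the sample period.
    if latest.end_time < range_end - 3 * latest.sampled_period:
        return None
    return latest
```

With a five minute sample period, a value in this sketch stays "current" for up to 15 minutes past its End Time and then drops out.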

Latest

The Latest property will continue to show the most recently collected timestamp and value for the metric. If you stopped data collection of an agent yesterday and looked at a metric collected by that agent today, Current would be null (n/a, or not available) and Latest would show the last observation yesterday before data collection stopped.

Even if you're looking at a historic time range, the Latest property will still contain the most recent value available.

Period

The Period MetricValue is an aggregate of all observations collected in the selected time range. In the one hour period shown in the screenshot, 12 observations (five minutes apart) are combined with Min, Max, Average and Sum values aggregated across those observations.
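
As a rough illustration of that aggregation, the following Python sketch combines a list of raw observation values into Count, Min, Max, Average and Sum. The aggregate_period function and the sample values are made up for this example, and it assumes every entry is a single raw observation rather than an already rolled-up value.

```python
from typing import Dict, Optional, Sequence


def aggregate_period(observations: Sequence[float]) -> Optional[Dict[str, float]]:
    """Aggregate raw observation values for one selected time range."""
    if not observations:
        return None  # nothing was collected in the time range
    total = sum(observations)
    return {
        "count": len(observations),
        "min": min(observations),
        "max": max(observations),
        "average": total / len(observations),
        "sum": total,
    }


# Twelve hypothetical five minute samples covering one hour.
samples = [10.0] * 6 + [20.0] * 6
period = aggregate_period(samples)
```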

History

The metric's History is a list of MetricValues collected in the selected time range. A value will appear in the history if its endTime is after the start of the selected time range, and equal to or before the end of the selected time range.
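
Note that the rule is exclusive at the start of the range and inclusive at the end. A minimal Python sketch of that filter follows, using a hypothetical MetricValue tuple and history function that are not part of any Foglight API.

```python
from typing import List, NamedTuple, Sequence


class MetricValue(NamedTuple):
    """Hypothetical, simplified stand-in for a Foglight MetricValue."""
    end_time: int   # epoch milliseconds
    average: float


def history(values: Sequence[MetricValue],
            range_start: int, range_end: int) -> List[MetricValue]:
    """Keep values whose end_time is after range_start and at or before range_end."""
    selected = [v for v in values if range_start < v.end_time <= range_end]
    return sorted(selected, key=lambda v: v.end_time)
```

A value whose End Time falls exactly on the start of the range is excluded, while one that ends exactly at the end of the range is included.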

Automatic Aggregation

A metric actively collected every five minutes will have a "Current" End Time within the last five minutes and a Start Time five minutes before that. The Sampled Period will be 300,000 milliseconds, or five minutes.

The Count will be "1" because there is one observation in that five minute time range and Min, Max, Average and Sum will all be equal because there is only one observation. Sum Squares will be the square of the single observation value and Standard Deviation will be zero.

As the observations age, they will be rolled up according to the configured persistence policies defined in the FMS. You can see the rollups begin to take effect in the oldest MetricValues under History in the screenshot above.

These two MetricValues show a Sampled Period of 900,000 milliseconds or 15 minutes. The Count is "3" because three raw observations were rolled up into this time range. The Min, Max, and Average are slightly different. The Sum and Sum Squares reflect the aggregation of these three different values. The Standard Deviation is non-zero.
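
The sketch below shows one way such a rollup could work, and why Count, Sum and Sum Squares are worth storing: together they allow the average and standard deviation of the combined value to be recomputed without the raw observations. Everything here is an assumption made for illustration (the class, the field and function names, the use of the population standard deviation, and summing the sampled periods); it is not the FMS implementation.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MetricValue:
    """Hypothetical, simplified stand-in for a Foglight MetricValue."""
    start_time: int      # epoch milliseconds
    end_time: int        # epoch milliseconds
    sampled_period: int  # milliseconds
    count: int
    minimum: float
    maximum: float
    total: float         # "Sum"
    sum_squares: float   # sum of each observation squared

    @property
    def average(self) -> float:
        return self.total / self.count

    @property
    def standard_deviation(self) -> float:
        # Assumption: population standard deviation, sqrt(E[x^2] - E[x]^2).
        variance = self.sum_squares / self.count - self.average ** 2
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


def raw(start_time: int, end_time: int, value: float) -> MetricValue:
    """A single raw observation: Count is 1 and Min, Max and Sum are equal."""
    return MetricValue(start_time, end_time, end_time - start_time,
                       1, value, value, value, value * value)


def roll_up(values: Sequence[MetricValue]) -> MetricValue:
    """Combine adjacent MetricValues into one wider MetricValue."""
    ordered = sorted(values, key=lambda v: v.start_time)
    return MetricValue(
        start_time=ordered[0].start_time,
        end_time=ordered[-1].end_time,
        sampled_period=sum(v.sampled_period for v in ordered),
        count=sum(v.count for v in ordered),
        minimum=min(v.minimum for v in ordered),
        maximum=max(v.maximum for v in ordered),
        total=sum(v.total for v in ordered),
        sum_squares=sum(v.sum_squares for v in ordered),
    )


# Three hypothetical five minute observations rolled up into 15 minutes.
rolled = roll_up([raw(0, 300_000, 3.0),
                  raw(300_000, 600_000, 4.0),
                  raw(600_000, 900_000, 5.0)])
```

Because roll_up only adds counts and sums and takes the min of mins and max of maxes, the same function can also combine 15 minute values into hourly ones in this sketch.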

As the observations age further, the 15 minute aggregations will be combined into hourly and then 4-hour MetricValues until they are ultimately purged (or not) according to the persistence policies.

Aspects of a Metric Observation

Each MetricValue has a set of properties that you can see on the right of the screenshot above.

Start Time: the starting time of the observation

End Time: the end time of the observation

Sampled Period: the collection interval of the metric, in milliseconds. This will usually match End Time - Start Time.

Count: the number of raw observations rolled up into the MetricValue

Min: the minimum value of the metric in the time range

Max: the maximum value of the metric in the time range

Average: the average value in the time range

Sum: the total of all observations in the time range

Sum Squares: the sum of the squares of all observations in the time range

Standard Deviation: the statistical standard deviation of the observations in the time range, a measure of how widely they spread around the average

The Sampled Period, Count and Sum Squares are used for calculating roll ups and are not typically interesting to display.

The other values are useful to display. If you drag and drop a metric on a custom dashboard, you can choose the Metric Analyzer or Metric Analyzer (detailed) views to visualize it.

To see this view in a drag and drop dashboard or report, drop the parent of the target metrics on the page and choose "Select a view..."

Select the template Management Server/Commons/Metric Analyzer - Detailed

Best Options for Visualizing Metrics

When you think of a numeric value that varies over time, you probably imagine a chart, or more specifically, a time chart. Most of us will usually select a Time Plot Chart to display a metric on a custom dashboard for good reason: it's the clearest and most flexible display of a value over time.

To keep charts simple and easy to read while also communicating a sense of the variability of the target metric(s), a Time Plot Chart displays the "Minimum/Maximum" metric value by default. "Minimum/Maximum" compares each time slice average to the period average. If the time slice average is greater than the period average, the time slice maximum is plotted. If the time slice average is less than the period average, the time slice minimum is plotted.
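
The "Minimum/Maximum" rule can be expressed as a small function. This Python sketch follows the description above; the function name and parameters are hypothetical, and the behavior when the two averages are exactly equal is an assumption, since that case is not described here.

```python
def min_max_point(slice_min: float, slice_max: float,
                  slice_avg: float, period_avg: float) -> float:
    """Pick the value plotted for one time slice under "Minimum/Maximum"."""
    if slice_avg > period_avg:
        return slice_max   # above the period average: plot the slice maximum
    if slice_avg < period_avg:
        return slice_min   # below the period average: plot the slice minimum
    return slice_avg       # assumption: equal averages fall back to the average
```

The effect is that each plotted point is pushed away from the period average, so peaks and troughs stay visible even when only one line is drawn.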

You can configure a chart to show a more detailed representation of the minimum, maximum and average in each time slice.

This chart is configured to display the Average metric value, indicating the value at the start and end of each time slice, with the min/max displayed as an envelope.

Notice that the time slices at the right edge show "current" observations with small time slices and min==max==average.

As automatic rollups are applied, the time slices towards the left get wider and you can see the envelope from min to max with the average value between them.

Charts are the most commonly used view for metric display, but other visualizations are also available.

Time Bar Charts emphasize the individual data samples while area charts highlight the magnitude of the value.

Stacking each of these types is useful when you want to visually sum different metrics. For example, stacking User CPU time and System CPU time shows not only these two values, but also the total CPU usage over time.

Other views display the current value of a metric. Here are some example metric views in a drag and drop dashboard.

Best Options for Presenting Metrics in Tables

Beyond charts, the workhorse for displaying metrics is a table. Which aspects of a given metric to display depends on how the table will be used.

For monitoring, the current value of a metric is most important to emphasize.

For reporting, the period average is usually the most important value.

Other aspects of a metric are also worthy of inclusion in a table.

A "Sparkline" gives a sense of the variability, shape and direction of the metric. A sparkline is a "data-intense, design-simple, word-sized graphic" invented by data visualization pioneer Edward Tufte. This renderer provides a tremendous amount of information about the metric value history within a small number of pixels.

Supporting the average value with the Max or Max and Min provides context for understanding the range of the value.

Sum is useful for metrics like error count: how many errors occurred in the last 4 hours, last day, last week? How much time does the app spend processing logins compared to other requests?

There are some good examples of all these displayed in the Java Requests dashboard.

Can you spot which aspects of the RequestType metric are displayed in this table?

Completed=Period/Sum, Incomplete=Period/Sum, Current=sparkline+Current/Average, Avg=Period/Avg, Total=Period/Sum, Exception Exits=Period/Sum, Traces=Period/Sum

A good understanding of metrics will help you make better sense of your monitored environment.

Thoughtful selection of the visualization and aspects of the metrics to display will help you build better, more informative dashboards and reports.
