How Services and Service Levels Work

Foglight allows anything to be grouped into a service. A service level will measure the availability of that group. How does this work, under the covers? And how can you customize it?

(This is a repost of an old blog article - by request.)

How Do Services and Service Levels Work?

At the most basic level, a service is a group of objects that have some meaning. A service can be a list of related hosts, the elements of an application, or even just a list of things for which someone is responsible. Using the Service Operations Console (SOC), any user can create and monitor services. For more details on services, consult the documentation.

What's interesting about services and service levels in Foglight is that they are implemented using standard model mechanisms. All of the services code is an add-on to the core.

A service is implemented using the FSMService class. A service can have a name, a short description, and a long description. But the action happens inside the definitions collection.

The definitions collection contains references to all the objects that are considered part of the service. This is what does the grouping. Then, the aggregate state of the service is calculated by taking the worst state of the objects in the definitions collection. This means that a service's state is the worst of any member of the group.

By default, we create a service level for every FSMService instance. This means service level measurement is turned on by default.

A service level is implemented using the FSMServiceLevelPolicy class. This class references a service using the fsmTarget property. It uses this reference to a service to measure the service level. This is shown in the diagram below:

The diagram shows that the worst state of the service is used to determine the availability of the service. Availability is a binary metric: 1 means the service is considered up, and 0 means the service is considered down. The availability metric is represented by the green arrow. Worth noting: the availability metric uses a state threshold variable called FMSServiceSLP_StateThreshold to determine whether a particular state is considered "unavailable". By default, the state threshold is set to Fatal (4). This means it will take a fatal alarm to make a service level register the service as unavailable.

A second metric called baselineAvailability is calculated as a moving average of availability over an hour, converted to a percentage. In plain English, baselineAvailability takes the series of up and down indications from availability and converts them into a percentage that tells us approximately how available the service has been. The baselineAvailability metric is represented by the purple circle.

Finally, a rule bound to FSMServiceLevelPolicy called "ServiceLevelEvaluation - FMSServiceSLP" evaluates the baselineAvailability to make sure it is meeting expectations. Remember that baselineAvailability produces a percentage availability like "98%". This rule compares that value to thresholds of acceptable availability. This is where users can plug in their own availability thresholds. Do you require that your service is available 80% of the time? 99%? You can change the thresholds and get the rule to fire accordingly. The "ServiceLevelEvaluation - FMSServiceSLP" rule is represented by the orange circle with an R in the middle.
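The worst-state-to-availability path can be sketched in a few lines of Python. This is a minimal sketch, not Foglight code: the state codes follow the article (Warning = 2, Critical = 3, Fatal = 4; Normal = 1 is an assumption), and the function names are illustrative.

```python
# State codes as used in this article; Normal = 1 is an assumption.
NORMAL, WARNING, CRITICAL, FATAL = 1, 2, 3, 4

def aggregate_state(member_states):
    """Worst (highest) state among the objects in the definitions collection."""
    return max(member_states, default=NORMAL)

def availability(member_states, state_threshold=FATAL):
    """1 if the service is considered up, 0 if it is considered down.

    The service is down when its worst state reaches the threshold
    (FMSServiceSLP_StateThreshold, Fatal by default).
    """
    return 0 if aggregate_state(member_states) >= state_threshold else 1
```

With the default Fatal threshold, a Critical member still leaves the service available; lowering the threshold to Critical flips that result.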

Service Level Mysteries

Ever seen this?

How about this?

These seemingly inconsistent reports are a result of the time-driven nature of an SLA. There are two parts to an SLA that can introduce lag:

  • The baselineAvailability calculation looks at availability over an hour. A condition can be cleared, resulting in the service being considered available. However, the baselineAvailability may still be low because it considers all values over the last hour.
  • The rule that evaluates the state of the service level (ServiceLevelEvaluation - FMSServiceSLP) compares the baselineAvailability to a threshold. That means a condition can clear, availability pops back up to 1, and baselineAvailability creeps back up over time. However, until baselineAvailability crosses the acceptable threshold, the rule will not clear.

Sometimes math helps. Here is the pseudo-code for baselineAvailability:

    baselineAvailability = average(availability over the last hour), expressed as a percentage

Here is the logic for ServiceLevelEvaluation - FMSServiceSLP:

    Fatal: baselineAvailability < AvailabilityFatal (default 70%)
    Critical: baselineAvailability < AvailabilityCritical (default 85%)
    Warning: baselineAvailability < AvailabilityWarning (default 95%)
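The same pseudo-code, sketched in Python. This is illustrative, not Foglight rule syntax; the threshold defaults mirror the values above.

```python
def baseline_availability(samples):
    """Moving average of 0/1 availability samples, as a percentage."""
    return 100.0 * sum(samples) / len(samples)

def service_level_evaluation(baseline, fatal=70, critical=85, warning=95):
    """Severity fired by the 'ServiceLevelEvaluation - FMSServiceSLP' rule."""
    if baseline < fatal:
        return "Fatal"
    if baseline < critical:
        return "Critical"
    if baseline < warning:
        return "Warning"
    return "Normal"
```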

Here is a credible sequence of events that demonstrates the lag.

Time Frame | Event | availability | baselineAvailability | ServiceLevelEvaluation Rule
---------- | ----- | ------------ | -------------------- | ---------------------------
Start of week until 8am Wednesday | No Fatal alarms | 1 | 100% | Normal
8am | Fatal alarm | 0 | Still 100%, but about to decrease | Normal
8:03am | Fatal alarm still enabled | 0 | Creeps down to 95% | Warning (less than the 95% AvailabilityWarning value)
8:09am | Fatal alarm still enabled | 0 | Creeps down to 85% | Critical (less than the 85% AvailabilityCritical value)
8:18am | Fatal alarm still enabled | 0 | Creeps down to 70% | Fatal (less than the 70% AvailabilityFatal value)
8:30am | Fatal alarm clears | 1 | Hits bottom at 50%, starts to creep back up | Fatal
8:42am | No event | 1 | Creeps up to 70% | Fatal alarm replaced by Critical alarm
8:51am | No event | 1 | Creeps up to 85% | Critical alarm replaced by Warning alarm
8:57am | No event | 1 | Creeps up to 95% | SLA alarms clear, SLA state Normal
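The decay-and-recovery lag in this sequence can be reproduced with a toy simulation: one availability sample per minute, a fatal outage from minute 60 to minute 90, and baselineAvailability computed as a plain trailing-60-minute average. Foglight's actual moving average may weight samples differently, so the exact recovery times above won't reproduce; the shape of the lag is the point.

```python
def simulate(outage_start, outage_end, total_minutes, window=60):
    """Return per-minute baselineAvailability for a single outage.

    availability is 0 during [outage_start, outage_end) and 1 otherwise;
    baselineAvailability is the average over the trailing `window` samples.
    """
    availability = [0 if outage_start <= m < outage_end else 1
                    for m in range(total_minutes)]
    baseline = []
    for m in range(total_minutes):
        lo = max(0, m - window + 1)
        win = availability[lo:m + 1]
        baseline.append(100.0 * sum(win) / len(win))
    return baseline

# A 30-minute outage starting one hour in (minute 60 = "8am"):
b = simulate(60, 90, 180)
```

Eighteen minutes into the outage the baseline has decayed to 70% (Fatal territory), it bottoms out at 50% when the alarm clears, and it only returns to 100% once the whole outage has aged out of the trailing window.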

Why Should Service Levels Work This Way?

Service levels are intended to be a measurement of the level of service you are providing to your customers. They take a longer view than current availability. Both availability and SLA measurements are important.

  • Use instantaneous availability to find and address problems quickly.
  • Use SLA measurements to determine whether you're meeting customers' expectations.

Customizing Service Levels

There are lots of options for customizing how service levels work.

Changing State Sensitivity with FMSServiceSLP_StateThreshold

By default, a service is considered unavailable if it has a state of Fatal. This setting is controlled by the FMSServiceSLP_StateThreshold variable. Try setting it to Critical (3) or Warning (2) for different results.

Since registry variables can be scoped to different objects, it is possible to have different thresholds for different services. To do this, you need to scope your value to the FSMServiceLevelPolicy instance, as shown below:

In this example we're setting the availability threshold for the Baboon service to Critical. All other services will still consider themselves available as long as they don't have Fatal alarms, but Baboon will be unavailable if it has Critical or Fatal alarms.

One other thing to note is the unfortunate name choice: FMS (as in FMSServiceSLP_StateThreshold). Most people assume FSM, since that is the name used everywhere else. Many apologies for this; I hope it doesn't trip you up.
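The scoped-lookup behaviour can be sketched as a simple fallback: prefer a value scoped to the specific FSMServiceLevelPolicy instance, otherwise use the global default. The dict here stands in for the registry; in Foglight this is configured in the UI, not in code.

```python
GLOBAL_DEFAULT = 4  # Fatal, the default FMSServiceSLP_StateThreshold

def resolve_state_threshold(service_name, scoped_values,
                            default=GLOBAL_DEFAULT):
    """Resolve FMSServiceSLP_StateThreshold for one service."""
    return scoped_values.get(service_name, default)

# The Baboon example: Critical (3) for Baboon, Fatal (4) for everyone else.
scoped = {"Baboon": 3}
```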

Customizing SLA Rules with AvailabilityWarning, AvailabilityCritical, AvailabilityFatal

The rule that evaluates whether an SLA has been violated is called "ServiceLevelEvaluation - FMSServiceSLP". Recall that the rule conditions are:

    Fatal: baselineAvailability < AvailabilityFatal (default 70%)
    Critical: baselineAvailability < AvailabilityCritical (default 85%)
    Warning: baselineAvailability < AvailabilityWarning (default 95%)

As with any registry variable, it is possible to customize these values globally or locally for a particular service. So if your service must be 80% available or it is considered a Fatal violation, you can make the setting shown below:

Customizing the time range for baselineAvailability

Recall that baselineAvailability measures the average availability over a time period. By default that time period is an hour. The expression looks like this:

    avg( #availability for 1h#) * 100

Unfortunately there is no way to override this value using a registry variable. This means the easiest change to make is to change the behaviour of baselineAvailability for the entire server. The screen snapshot below shows how this can be done to measure for 4 hours:

Note that the scope for baselineAvailability is ServiceLevelPolicy, which is a base class of FSMServiceLevelPolicy.

Customizing availability

Earlier in this blog I described availability as being the worst state of the objects grouped inside the service. That's true, but it can be customized. The availability metric doesn't simply look at the aggregateState value of FSMService. Instead, it executes the following pseudo-code:

    Count the number of objects in FSMService.definitions that have a state below the threshold defined in FMSServiceSLP_StateThreshold
    Calculate the percentage of objects that are available
    If the percentage of available objects meets FMSServiceSLP_PercentageAvailableThreshold, return 1 (available); otherwise return 0 (unavailable)

In other words, availability can be used to ensure that the right number of children are available. This is ideal for clustering scenarios. Suppose a service has four children L, M, N, and O that represent a cluster of WebLogic applications. If three of them need to be available, then we can set FMSServiceSLP_PercentageAvailableThreshold to 75% and measure availability. That's cool.
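The customized availability pseudo-code can be sketched as follows. The names mirror the article's registry variables; the code itself is illustrative, not Foglight's implementation. Note that the cluster example (3 of 4 = 75% passing a 75% threshold) implies the comparison is "meets or exceeds", so `>=` is used here.

```python
FATAL = 4  # default FMSServiceSLP_StateThreshold

def service_availability(member_states, state_threshold=FATAL,
                         percentage_available_threshold=100.0):
    """Return 1 (available) or 0 (unavailable) for a service's members."""
    if not member_states:
        return 1
    # Members below the state threshold are considered available.
    available = sum(1 for s in member_states if s < state_threshold)
    percent_available = 100.0 * available / len(member_states)
    return 1 if percent_available >= percentage_available_threshold else 0
```

With the 100% default, one Fatal member takes the service down (the original worst-state behaviour); relaxing the threshold to 75% tolerates one Fatal member out of four.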

Service Level Management

Service Level Management is an underutilized part of the Foglight operations console. There are two places service levels can be managed: the Service Operations Console (SOC) and the Service Level Management page.

In the SOC we try to provide a quick summary of the state of the service level. Click on the Service Level Agreement(s) tab to see the summary. Since there is currently only one, the list is short.

If you select the SLA name, you can drill down to a more detailed summary of the service level performance. This dashboard shows the same summary, but adds information on SLA performance over the last 7 days and the last month. This allows for a current vs. historical comparison. Note that the list of alarms that might be contributing to an outage is also shown. This allows for quick remediation - clear the alarms, or call your peers and yell at them until things get better.