5 Steps to Avoid Network Alert Overload

Are you suffering from information overload?

Today’s corporate networks are increasingly dispersed. Networks may include remote office locations, mobile devices and third-party cloud services. Network users may include telecommuters, mobile users, and business partners. Each location and type of user access requires a different network control policy and quite often a different network management tool. As a result, the IT operational staff receives data from a proliferation of technology platforms, event logs, network monitoring tools and surveillance equipment.

The volume of data coming into a network operations team can be huge. Sorting through this data to enable actionable decisions is one of the most significant challenges facing IT operations today. This article provides five steps to ensure you are looking at the right data in a way that enables you to take the right action.

Step 1: Know what you are trying to protect

It is important to ensure that network resources are monitored for errors and network failures as well as possible security breaches. However, you first need to ensure that the essential parts of your network operations are healthy. So the first step in making sense of your operational data is to identify what is important to your business. To do this you need to answer three directional questions.

Question 1: What parts of the system and network are critical to the day-to-day business operations? If you are a manufacturing company this might be the factory processes; if you are a hospital it may be access to patient care information. The purpose of this question is to make sure that the lifeblood of the organization is identified. Day-to-day operations are generally concerned with availability and reliability. You need to look at this data first and make sure that the mission-critical parts of your business are effectively monitored.

Question 2: What parts of the system are essential to the long-term survivability of the company? If you are a software company this could be where the source code is managed; if you are a law firm it could be where the customer files are maintained. Security and protection may be your primary concern for these network resources. Make sure these critical areas are safe before investigating other areas, and determine whether there are any actions you need to take.

Question 3: What are the two or three things that will have the most impact on improving business operations and increasing customer satisfaction? Having ensured that the critical components of the network and system are healthy, you can now focus on collecting and analyzing data so you can make decisions that will improve the safety and performance of the network.

You now know the highest-priority areas for data collection and analysis. You can now move to Step 2: selecting your operational data collection and analysis tools.

Step 2: Choose your network monitoring tools

There are two primary reasons why you should monitor network resources and record what is happening. The first is to ensure the reliability, availability and integrity of the network. The second is to provide the data necessary for you to analyze, troubleshoot and, if necessary, recreate specific network conditions. Distinguishing between these two distinct roles is essential for ensuring that you generate meaningful, relevant and actionable network data.

Along with event logs, network monitors and protocol analyzers are the primary sources of information about what is happening on the network. Ideally these tools continually collect data on each layer of the OSI model, capture traffic on multiple network segments simultaneously, collect frames sent to or from individual network nodes, monitor the status of network nodes, and capture data on multiple protocols and topologies. Capturing this information provides both a snapshot of the network at a moment in time and a historical record of network activity over a period of time.
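As a rough illustration of what continuous capture looks like in practice, here is a minimal sketch using the Python scapy library. The library choice, the interface name eth0 and the 60-second window are assumptions for illustration; any capture library or dedicated monitoring tool plays the same role.

    # Minimal continuous-capture sketch using scapy (pip install scapy).
    # Requires packet-capture privileges (typically root/administrator).
    from scapy.all import sniff, wrpcap

    def handle_packet(pkt):
        # Print a one-line summary of each frame as it arrives (real-time view).
        print(pkt.summary())

    # Capture one segment for 60 seconds; a real deployment would run one
    # capture per monitored segment and rotate the output files.
    packets = sniff(iface="eth0", timeout=60, prn=handle_packet)
    wrpcap("segment_snapshot.pcap", packets)  # the historical record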

In addition to collecting your network data, your tools should help you manage and interpret it. Four capabilities to look for are:

First, you should be able to see real-time or near-real-time data. The tool should push the information to you rather than requiring you to request reports or sign on to a web page. It should also allow you to select different profiles so that only the operational data relevant to what you are monitoring is displayed.

Second, you should be able to conduct a detailed investigation. The tool should allow you to drill down to the lowest level of detail so you can track down the root causes of network problems. The capability to perform deep packet inspection, access remote equipment and identify the root source of a problem is essential for troubleshooting.

Third, you will need analytical and statistical capabilities that you can use to dissect and analyze the data without loss of accuracy. Identifying trends, utilization rates, pattern matching and behavioral analysis are all important for understanding what is happening on the network. However, be warned that statistical representations of data are often misunderstood and can result in critical problems being overlooked (see the short illustration after Figure 1). For vital statistics it is absolutely essential to have a clear definition of how the data is collected, assembled, aggregated, analyzed and presented.

Finally, look for an effective graphical user interface (GUI) and good use of color. One can never forget the human factor in network operations. GUIs are essential for analyzing traffic flows across a network and displaying large amounts of data in a recognizable and actionable format. The figure below shows the graphical capability and colors used in the Microsoft Network Monitor tool. In this illustration green is used to highlight wlcomm.exe frames. This can be useful in tracking security problems, as some malware disguises itself as wlcomm.exe.


Figure 1: Example of a network monitoring tool GUI
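As a concrete example of the earlier warning about statistical representations: the same link-utilization samples tell two very different stories depending on how they are aggregated. The numbers below are purely illustrative.

    # Illustrative utilization samples (percent) from one polling interval.
    samples = [20, 22, 19, 98, 21, 23]

    # The average looks healthy and hides the spike entirely...
    print(f"average: {sum(samples) / len(samples):.1f}%")  # ~33.8%

    # ...while the peak reveals a saturation event worth investigating.
    print(f"peak: {max(samples)}%")  # 98%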

Now that you have the tools to capture your network data, you are almost ready to assess what is happening on the network. But first you have to establish a baseline for expected network behavior.

Step 3: Know your network baseline behavior

Step three is developing an understanding of your normal network profile. By monitoring the network during normal business operations over an extended period, you can form a baseline of normal network operations and typical variance. Network baseline measurements typically include network utilization, average and peak throughput, protocol usage, latency, packet errors and packet loss. Traffic flow data is an important source for detecting network anomalies.

Baseline measurements enable you to measure variance so you can detect abnormal network behavior. They are particularly valuable for detecting network security incidents such as denial-of-service and worm attacks. They also enable you to set meaningful threshold values that can trigger warnings and alarms.
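As a minimal sketch of how a baseline and its variance can drive anomaly detection, consider the following standard-library Python; the sample values and the three-sigma rule are illustrative assumptions, not recommendations.

    # Baseline-and-variance alerting sketch (standard library only).
    import statistics

    baseline_samples = [34.2, 36.8, 31.5, 40.1, 38.7, 35.9]  # illustrative data
    mean = statistics.mean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)

    def is_abnormal(sample, sigmas=3.0):
        # Flag readings outside mean +/- sigmas standard deviations.
        return abs(sample - mean) > sigmas * stdev

    print(is_abnormal(37.0))  # False: within normal variance
    print(is_abnormal(92.0))  # True: e.g. a possible DoS or worm outbreak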

Thresholds can be set for monitoring parameters on individual devices, such as memory utilization and server temperature, and for network traffic, such as collision rates and the number of malformed TCP/IP packets. Triggers can warn you of impending problems and allow you to take preventative action. Determining the threshold values and triggers is quite difficult: set a threshold too high and a problem may go undetected; set it too low and it will cause a high false-alarm rate. One approach that helps ensure you have meaningful warnings is to tie each warning to a specific action that must be taken when the threshold is exceeded.
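One hedged sketch of that tie-in is a table pairing each monitored metric with a threshold and the action to take when it is exceeded. All metric names, limits and actions below are invented for illustration.

    # Map each metric to (warning threshold, action to take when exceeded).
    THRESHOLDS = {
        "memory_utilization_pct": (90, "restart the leaking service; page on-call"),
        "server_temperature_c": (70, "check cooling; plan graceful shutdown"),
        "collision_rate_pct": (5, "inspect cabling and duplex settings"),
        "malformed_tcp_packets_per_min": (100, "review firewall/IDS logs for scans"),
    }

    def check(metric, value):
        limit, action = THRESHOLDS[metric]
        if value > limit:
            print(f"WARNING {metric}={value} exceeds {limit}: {action}")

    check("server_temperature_c", 75)  # triggers the cooling warning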

Baseline measurements also provide a benchmark against which improvements can be measured. Earlier, this article discussed the importance of defining the top two or three improvements that you are focused on delivering. Baseline measurements provide a basis for measuring the impact of those improvements. For example, if your goal is to improve performance at peak load, then the baseline measurements need to include peak loading.

How long you should collect measurements for depends on the type of business. Normally two to four days is sufficient to get an accurate baseline. A baseline measurement should not include atypical business loading. For example, retail and other consumer-driven industries experience seasonal demands, and it is not uncommon for network traffic to spike during national holidays. You should not take your baseline measurements during these periods.

Now that you have the data and know what normal behavior is, you need to make sure your operational staff can use it effectively.

Step 4: Make your data manageable

It is important to be familiar with the data flowing on your network. There is no need to capture all of it, but it is important to know the types of protocols being used. This gives you an understanding of how the bandwidth is being used and highlights potential problems, such as excessive broadcast data or a Trojan transmitting in the background.
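A minimal sketch of such a protocol breakdown, assuming the scapy library and a previously saved capture file named capture.pcap:

    # Count the highest protocol layer of each packet in a saved capture.
    from collections import Counter
    from scapy.all import rdpcap

    packets = rdpcap("capture.pcap")
    mix = Counter(pkt.lastlayer().name for pkt in packets)

    # Excessive broadcast traffic or unexpected protocols stand out here.
    for proto, count in mix.most_common():
        print(f"{proto}: {count} packets ({100 * count / len(packets):.1f}%)")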

Protocol analyzers and network monitoring tools, however, can generate voluminous amounts of data, far more than most IT staff can analyze. The effectiveness of communicating adverse events to operational staff is diminished by the problem of over-warning. To prevent information overload you can apply filters to the data being gathered and displayed.

Filters can be applied so that you only capture the data you are interested in, or you can capture all the data and then apply the filter. The former approach is recommended if you are planning to capture data over an extended period, as it reduces the amount of data you need to store. The latter, however, allows you to look at other traffic if you subsequently find you need it.
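Both approaches might look something like the following scapy sketch; the interface and file names are placeholders for illustration.

    from scapy.all import sniff, rdpcap, TCP

    # Approach 1: filter at capture time (BPF syntax) -- far less to store.
    web_only = sniff(iface="eth0", filter="tcp port 80", timeout=60)

    # Approach 2: capture everything, filter afterwards -- keeps the option
    # of examining other traffic later, at the cost of storage.
    everything = rdpcap("full_capture.pcap")
    web_later = [pkt for pkt in everything
                 if pkt.haslayer(TCP) and 80 in (pkt[TCP].sport, pkt[TCP].dport)]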

The figure below shows an HTTP filter being applied, using the Microsoft Network Monitor tool, to the data flowing in and out of a server. In this example the HTTP filter is parsing just the packets that contain the HTTP POST command. Note that network monitoring tools capture packets on a specific port. Switches present an issue for monitoring tools because they segment traffic onto different ports; when you are monitoring a switched network you will need to set up port mirroring, which, as the name suggests, copies the traffic onto another port.


Figure 2. Setting up filters
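In the same spirit as the filter shown in Figure 2, here is a minimal sketch that isolates HTTP POST packets from a capture taken on a mirrored switch port; scapy and the file name are assumptions for illustration.

    from scapy.all import rdpcap, Raw

    # Print a summary of every packet whose payload starts with an HTTP POST.
    for pkt in rdpcap("mirror_port.pcap"):
        if pkt.haslayer(Raw) and pkt[Raw].load.startswith(b"POST "):
            print(pkt.summary())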

A single network fault can generate a large number of events and warnings, and many of these do not need immediate attention. Good practice is to define a set of policies for warning notification and action. Ideally, escalation should happen when a problem is not resolved in the intended timeframe. It would be tedious, however, to define resolution periods for every possible event and warning, so focus on those that are vital to your business, such as critical application servers.
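As a rough illustration, such an escalation policy might look like the following; the event classes and deadlines are purely illustrative assumptions.

    # Escalate any unresolved event that is past its class deadline.
    from datetime import datetime, timedelta

    POLICY = {
        "critical_app_server": timedelta(minutes=15),
        "edge_switch": timedelta(hours=2),
        "printer": timedelta(days=1),
    }

    def needs_escalation(event_class, raised_at, resolved):
        deadline = POLICY.get(event_class, timedelta(hours=4))  # default window
        return not resolved and datetime.now() - raised_at > deadline

    print(needs_escalation("critical_app_server",
                           datetime.now() - timedelta(minutes=30), resolved=False))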

Step 5: Decide how much data you should store

The temptation may be to collect every bit of network monitoring and packet capture data and store it all on numerous disks and tapes. Even though the probability is slim that you will ever need it, you could, if required, conduct an in-depth investigation of a historical event. This “keep it all” approach may make you feel safe: no matter what happens, you will have huge amounts of data you can sift through to find the culprit.

The question you need to consider is: is it more efficient to spend your time finding the culprit, or to spend it improving how you detect and resolve problems before damage is done to the network?

Storage prices have certainly dropped considerably in the past few years. However, developing an enterprise storage solution requires a significant investment of capital and time. Unless a regulatory or compliance requirement obliges you to keep network data, keeping everything is not the most efficient approach.

So how do you decide what data to collect, analyze and then discard versus what data to collect, analyze and store? The answer goes back to understanding what you are trying to protect. If you are concerned about security vulnerabilities and attacks, then you could scale back your data capture to the area of most concern, such as the traffic between the enterprise network and the Internet, and consider investing your time and money in a network anomaly and intrusion detection system.

Can’t sleep at night?

If you cannot sleep unless you know you have every bit of data that crosses the network, and you have heaps of money and a team of people who can sift through reams of data, then by all means buy heaps of storage and keep all your network monitoring and packet capture data. A more proactive and efficient approach is to focus data capture and analysis on the areas that are vital to your company’s success, set up filters and escalation policies to manage your network monitoring data effectively, and invest in intrusion and anomaly detection systems.

Remember that you will sleep better knowing that the burglar cannot get in, rather than knowing you can find the burglar after he has left.

Anonymous