Tracking Down Network Storms

Most network administrators have experienced unexpected network conditions. Whether it was a misconfigured router, a broken cable, a power outage, or one of a million other causes, network problems crop up. Part of a network administrator’s job is to understand how to identify and remediate the issue quickly while also understanding the root cause. The root cause is often just as important as resolving the symptoms because, unless the core problem is identified, the symptoms will come back.

Network storms are one of those symptoms that cry out to be treated. They impact user access to online systems, inter-system communication, and even the network paths that vendors use to identify and repair the very issues that cause the storms. They’re a huge mess to address.

Most of the network training that I’ve seen instructs IT personnel to isolate and shut down the cause of a network storm as quickly as possible. That’s not a bad default reaction. After all, when someone is poking you, your first reaction is to stop their poking. When someone points a weapon in our direction, we duck first and ask questions later. It’s our nature to stop the discomfort first and then reason out the cause after the immediate danger has passed.

Network storms are a bit different. By definition, they are large amounts of uncontrolled network traffic that negatively impact business systems and processes. They can be caused by attacks, but they can also result from simple mistakes or overreactions to normal conditions. How an IT department reacts to a network storm should address two important questions:

  • Why did the network storm occur in the first place?
  • How can I stop this storm and prevent future storms?

Let’s look at both of these in the context of a real scenario.

Why Did the Network Storm Occur?

The network is flooded with SYN packets. Most internal network routers are reporting, when they can reach the management system, that they are near or at traffic capacity. Users are calling the help desk to report slow and failed connections to servers. All levels of IT support have been alerted by the network reporting system that the networks are falling below required service levels. The business is beginning to lose money as productivity slows and customer orders are impacted.

Where do you begin to address this issue? If you’re like many network administrators, you’re not fully prepared for this kind of near-complete network meltdown. You want to simply stop the SYN packets, and you can do that by telling your routers not to pass them between each other or between VLANs. You know that measure will stop the network storm in its tracks. But you also know that it will shut down most TCP/IP communications, because every new TCP connection begins with a SYN packet. Most of your business applications will fail until SYN traffic is re-enabled across the routers. And you don’t know whether re-enabling it will recreate the network storm.

That last point is the critical one. You don’t know what caused the problem yet. But you need to know before taking any significant corrective action. Otherwise, the problem could come right back. Or it could resurface in an hour, a day, a week… there’s simply no telling.

So how do you find out the root cause of the network storm? This question is actually more easily answered than you might think. And an important point to remember is that you don’t necessarily need to stop the remediation efforts in order to identify the root cause.

Take a Sniff!

No pun intended. Experience shows that the most helpful thing you can do during a network storm is capture and save as much network traffic as possible.

Network monitoring hardware and software (often called sniffers, a derivative of the name of the first dedicated network monitor devices) are common tools for most network administrators. Software tools for network monitoring are actually built into many versions of Windows and Linux. More advanced tools are available from a variety of vendors and often scale in functionality based on cost (read: pay more money, get more features and automation). Most network cards, including inexpensive built-in cards, can be used as network capture devices. That means that businesses of any size can usually afford an effective monitoring solution.

During a network storm, the best thing you can do is set up one or more network capture stations, preferably one per network segment or at least one per physical location. Have them listen to all traffic and save it for analysis. You can use this setup to accomplish several goals:

  • Perform early analysis on the storm, such as what protocol the packets are using
  • Identify the most and least impacted segments by analyzing the network traffic volume
  • Capture traffic for further pattern analysis and problem resolution
  • In the case of an attack, capture traffic for criminal prosecution
  • Monitor the network state to determine the impact of your remediation efforts
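
As a rough illustration of that early analysis, here is a minimal Python sketch that counts bare SYN packets per source MAC address in a set of captured frames. It assumes each frame is available as the raw bytes of an untagged Ethernet II frame carrying IPv4, which your capture tool may or may not hand you directly. The function names syn_source_mac and count_syn_sources are made up for this example, and the sketch deliberately ignores VLAN tags, IPv6, and IP fragments.

```python
from collections import Counter

ETHERTYPE_IPV4 = 0x0800
IPPROTO_TCP = 6
TCP_SYN = 0x02
TCP_ACK = 0x10


def syn_source_mac(frame: bytes):
    """Return the source MAC if the frame is a bare TCP SYN over IPv4, else None.

    Assumption: untagged Ethernet II framing (no 802.1Q VLAN tag).
    """
    if len(frame) < 14 + 20:  # Ethernet header + minimum IPv4 header
        return None
    ethertype = int.from_bytes(frame[12:14], "big")
    if ethertype != ETHERTYPE_IPV4:
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or ip[9] != IPPROTO_TCP or len(ip) < ihl + 14:
        return None
    flags = ip[ihl + 13]  # TCP flags byte sits at offset 13 of the TCP header
    if flags & TCP_SYN and not flags & TCP_ACK:
        return frame[6:12].hex(":")  # source MAC is bytes 6..11 of the frame
    return None


def count_syn_sources(frames):
    """Count bare SYN packets per source MAC across an iterable of raw frames."""
    counts = Counter()
    for frame in frames:
        mac = syn_source_mac(frame)
        if mac is not None:
            counts[mac] += 1
    return counts
```

Calling count_syn_sources(frames).most_common(5) on a capture lists the busiest SYN senders. A single address dominating that list is a strong hint about where to look next.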

Once the network monitoring is set up, you can move to the “stop the pain” phase of network storm response.

In our scenario, the network traffic is all reporting one MAC address as the traffic source. The MAC address seems, somehow, to be connected to three ports of the same router. The SYN packets are flowing out of the router to all other parts of your network at an alarming rate. It doesn’t appear that the outgoing SYN packets are tied to any authorized network application or process.
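
A quick way to spot that kind of oddity is to group the router’s learned addresses by MAC and look for any that show up on more than one port. The sketch below assumes you have already exported the address table as (port, MAC) pairs. The export format, the port names, and the function name macs_on_multiple_ports are all hypothetical.

```python
from collections import defaultdict


def macs_on_multiple_ports(observations):
    """Map each MAC seen on more than one port to the sorted list of those ports.

    observations: iterable of (port, mac) pairs, e.g. from an exported
    address table (hypothetical format for this example).
    """
    ports_by_mac = defaultdict(set)
    for port, mac in observations:
        ports_by_mac[mac.lower()].add(port)  # normalize case before comparing
    return {
        mac: sorted(ports)
        for mac, ports in ports_by_mac.items()
        if len(ports) > 1
    }


# Made-up sample data: one address appears on three ports.
table = [
    ("port3", "02:AA:BB:CC:DD:01"),
    ("port7", "02:aa:bb:cc:dd:01"),
    ("port9", "02:aa:bb:cc:dd:01"),
    ("port4", "02:aa:bb:cc:dd:22"),
]
print(macs_on_multiple_ports(table))
```

A MAC address should normally map to one port at a time, so any entry in the result deserves a closer look.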

So how do we deal with this?

How Can I Stop and Prevent Network Storms?

The process to stop and prevent network storms is quite simple:

  • Stop the storm
  • Prevent future storms

Using our scenario, stopping the storm is as easy as shutting down the ports on the router that all the network storm traffic is flowing from. If the router itself is broken instead of the ports, it may require unplugging the router. That’s even easier: pull power from the router. It will impact more users than disabling individual ports, but getting the business running again is probably worth the time it will take to re-cable the authorized users to another router.

A network storm is frequently mitigated by shutting down segments of a network or the connections between them (often called bridges, backbones, etc.). This is a great technique because it does let you isolate the cause of the storm. For example, if you have five buildings connected to each other, disabling the connections between buildings may show that four of the buildings return to normal patterns rapidly, while the fifth building continues to exhibit network storm symptoms. IT response at this stage is usually a matter of management and resource trade-offs – whether to focus on restoring the four buildings to full functionality first, to continue isolating the problem in the fifth building, or to split the IT staff and work both problems.

There is, of course, no perfect answer, and it depends heavily on resource availability and business need.
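
Deciding which segments have recovered can be as simple as comparing each one’s current traffic rate against its normal baseline. Here is a minimal sketch of the five-building example, assuming you have packets-per-second figures for each building from your capture stations and a known baseline for each. The 3x threshold, the function name, and the numbers are illustrative assumptions, not recommendations.

```python
def segments_still_storming(current_pps, baseline_pps, factor=3.0):
    """Return the segments whose current rate exceeds factor x their baseline.

    Both arguments map a segment name to packets per second. The factor is
    an arbitrary illustrative threshold; tune it to your own network.
    """
    storming = []
    for segment, rate in current_pps.items():
        baseline = baseline_pps.get(segment)
        if baseline is None:
            # No baseline on record: flag it so a person takes a look.
            storming.append(segment)
        elif rate > baseline * factor:
            storming.append(segment)
    return sorted(storming)


# Made-up figures for five isolated buildings.
baseline = {"A": 1200, "B": 900, "C": 1500, "D": 800, "E": 1100}
current = {"A": 1300, "B": 950, "C": 1400, "D": 820, "E": 48000}
print(segments_still_storming(current, baseline))  # ['E']
```

Running the same comparison every few minutes during the incident also shows whether a remediation step actually helped.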

Once the problem is isolated down to the smallest impacted piece possible, further monitoring can continue a bit longer. In our scenario, we’ve reduced the problem to one router that is now completely isolated. The rest of the infrastructure has returned to normal operation within an hour. But this one router is still experiencing a storm of SYN packets on several ports. In this case, placing a network monitoring device on that router to capture as much as possible is the best way to identify the current issue.

Identifying the root cause of this network storm is the key to preventing future storms. This is much more easily accomplished with a few key pieces of data including:

  • Network maps and inventories
  • Network monitor captures
  • Change log from the network storm incident response

In our scenario, we find that several unapproved network cards were used to replace cards in one department. A recent driver update for those cards was downloaded and applied to the computers with those cards. The driver update seems to have reprogrammed the MAC address for all the cards to be the exact same address and has, among other flaws, a behavior that occasionally causes it to continuously send SYN packets when certain network criteria occur. These criteria occurred and triggered the driver bug that then cascaded into a network failure. The remediation for this scenario is two-fold:

  • Replace the faulty network cards with approved cards
  • Review and correct the process that allowed the faulty cards to be introduced in the first place
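
Both remediation steps are easier when the inventory can be checked automatically. The sketch below assumes an inventory exported as records with host, mac, and nic_model fields, plus a set of approved card models. Those field names, the model names, and the audit_inventory function are hypothetical.

```python
from collections import defaultdict


def audit_inventory(inventory, approved_models):
    """Return (duplicate_macs, unapproved_hosts) for a list of inventory records.

    duplicate_macs maps each MAC claimed by more than one host to those hosts.
    unapproved_hosts lists hosts whose card model is not in approved_models.
    """
    hosts_by_mac = defaultdict(list)
    unapproved = []
    for record in inventory:
        hosts_by_mac[record["mac"].lower()].append(record["host"])
        if record["nic_model"] not in approved_models:
            unapproved.append(record["host"])
    duplicates = {
        mac: hosts for mac, hosts in hosts_by_mac.items() if len(hosts) > 1
    }
    return duplicates, sorted(unapproved)


# Made-up inventory: two hosts share a MAC and use an unapproved card.
inventory = [
    {"host": "acct-01", "mac": "02:aa:bb:cc:dd:01", "nic_model": "BargainNIC"},
    {"host": "acct-02", "mac": "02:AA:BB:CC:DD:01", "nic_model": "BargainNIC"},
    {"host": "hr-01", "mac": "02:aa:bb:cc:dd:10", "nic_model": "ApprovedNIC"},
]
duplicates, unapproved = audit_inventory(inventory, {"ApprovedNIC"})
print(duplicates)
print(unapproved)
```

Run regularly, a check like this could catch duplicate addresses or unapproved hardware before they cause a storm, provided the inventory itself is kept current.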

The network traffic captures are important because they provide detailed evidence of the fault happening. They prove what happened, when, and where. They usually indicate the why as well. This becomes important when legal processes happen, such as terminating an employee for installing unauthorized network cards or filing a civil lawsuit for damages caused by the faulty network drivers.

Just in case you’re wondering, this is a real scenario that happened not very long ago to several small and medium-sized businesses in the United States. No names are used to avoid embarrassment.

Summary

Network storms are a nasty business. They are rarely expected and seldom exhibit the same symptoms twice. They often cannot be predicted – if they could, we’d prevent them. They can, however, be responded to in very different ways. The right way to respond is with a combination of analysis and repair activities – understanding the root cause of the problem (or capturing enough data to later analyze and identify it) and stopping the problem from impacting the business.

An effective network monitoring solution is central to both of these phases. The tool will capture data for later analysis and provide real-time feedback about how the network is responding to the remediation efforts. Even low-cost or free network monitoring solutions are powerful tools in this effort. Don’t discount the free tools – many vendors provide them as a service and they can be exceptionally effective. In fact, some of the best network monitoring software available today is free. You should select and use the one you’re most comfortable with, the one your staff is trained on, or the one recommended by your hardware vendor, or try some out for yourself.

Anonymous