Network Maintenance Notes: Analyzing and Eliminating Broadcast Storms
2005-12-31    Anheng (安恒公司) Technical Department
Printed from: Anheng (安恒公司)
URL: HTTP://es-lan.anheng.com.cn/news/article.php?articleid=760

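Views differ on the broadcast storm, a fault commonly met in network maintenance. Early definitions described it as a snowball effect that ends in network congestion; today most people use the term for any network congestion or outage caused by a sustained, large volume of broadcasts.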


broadcast storm

A state in which a message that has been broadcast across a network results in even more responses, and each response results in still more responses in a snowball effect. A severe broadcast storm can block all other network traffic, resulting in a network meltdown. Broadcast storms can usually be prevented by carefully configuring a network to block illegal broadcast messages.

broadcast storm

<networking> A broadcast on a network that causes multiple hosts to respond by broadcasting themselves, causing the storm to grow exponentially in severity.

broadcast storm

n. [common] An incorrect packet broadcast on a network that causes most hosts to respond all at once, typically with wrong answers that start the process over again. See network meltdown; compare mail storm.

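Broadcast storm

A broadcast storm occurs when host systems respond to a packet that circulates endlessly on the network, or keep trying to respond to a system that never answers. In an attempt to correct the situation, request or response packets are typically generated without end, which often makes things worse. As the number of packets on the network grows, congestion sets in, degrading network performance until the network is paralyzed.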


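Causes of Network Broadcast Storms, Seen from How a Switch Works

As competition among Internet cafés intensifies, fairly large cafés have appeared, and cafés with more than a hundred machines are now everywhere. Because cafés rarely have professional networking support when they build their networks, network faults are frequent. Of these faults, those caused by broadcast storms account for more than 90 percent. So how exactly does a broadcast storm form?

To understand correctly what a broadcast storm is, we must first understand how the devices in the network work. Today practically all of the network devices in an Internet café are switches, yet few people really understand how a switch works.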

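I. Switch Basics

  1. Definition of a switch: a switch is a network device that identifies stations by MAC address (the hardware address of the network card) and forwards encapsulated data frames. A switch can "learn" MAC addresses and store them in an internal address table; by setting up a temporary switched path between the sender and the intended receiver of a frame, it delivers the frame directly from the source address to the destination address.
    Switches have now replaced the hub, the network device we used to know well. That does not mean we no longer need to understand the basics of the hub.
  2. Definition of a hub: a hub belongs to the basic equipment of a data communication system. Like twisted-pair cable and other transmission media, it is a hardware device that needs no software support, or only a little management software, and it is used in a wide range of settings. A hub works in a local area network (LAN) environment at Layer 1 of the OSI reference model, as the physical interface of a network card does, and is therefore also called a physical-layer device. Internally a hub is electrically interconnected, so when the LAN is logically a bus or a ring, a hub can be used to build a physical star or tree network. In this role the hub acts as a multiport repeater. In fact a hub is a kind of repeater; the only difference is that it offers more ports, which is why a hub is also called a multiport repeater.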
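The two definitions above can be made concrete with a small simulation. The following Python sketch is illustrative only: the class name `LearningSwitch`, the function `hub_forward`, the port numbering, and the MAC addresses are all assumptions, and real switches add aging timers, VLANs, and much more. It shows that a hub repeats every frame to all other ports, while a switch forwards a unicast frame to one port once it has learned the destination, yet still floods broadcasts.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


def hub_forward(in_port: int, port_count: int) -> List[int]:
    """A hub repeats every frame out of all ports except the one it arrived on."""
    return [p for p in range(port_count) if p != in_port]


class LearningSwitch:
    """Hypothetical, simplified model of MAC learning and forwarding."""

    def __init__(self, port_count: int) -> None:
        self.port_count = port_count
        self.mac_table: Dict[str, int] = {}  # MAC address -> port

    def forward(self, in_port: int, src: str, dst: str) -> List[int]:
        # Learn: remember which port the source address was seen on.
        self.mac_table[src] = in_port
        # Broadcasts and unknown destinations are flooded, just as on a hub.
        if dst == BROADCAST or dst not in self.mac_table:
            return [p for p in range(self.port_count) if p != in_port]
        out_port = self.mac_table[dst]
        # A frame whose destination sits on the arrival port is not forwarded.
        return [] if out_port == in_port else [out_port]


switch = LearningSwitch(port_count=4)
first = switch.forward(0, "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02")  # unknown: flood
reply = switch.forward(1, "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01")  # learned: one port
bcast = switch.forward(1, "aa:aa:aa:aa:aa:02", BROADCAST)            # always flooded
print(first, reply, bcast)
```

The last call is the important one for this article: learning never helps with the broadcast address, so a broadcast frame reaches every port of a switch just as it would on a hub.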

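II. Differences Between a Switch and a Hub

A common technical misconception goes like this: we use switches, and all data is forwarded point to point, so how can a broadcast storm occur? Once the functional differences between a switch and a hub are clear, it becomes obvious why a network built on switches can still suffer broadcast storms.

  1. The essential difference between a switch and a hub: a network built with hubs is called a shared network, while a network built with switches is called a switched network. The main problem with shared Ethernet is that all users share the bandwidth, so the bandwidth actually available to each user shrinks as the number of users grows. When traffic is heavy, several users may contend for the channel at the same time, yet the channel can carry only one user at any instant. Many users therefore spend much of their time sensing the medium and waiting, and transmissions suffer jitter, stalls, or corruption, which seriously degrades network performance.
  2. In switched Ethernet, the switch gives each user a dedicated channel. Unless two source ports try to send to the same destination port at the same time, multiple pairs of source and destination ports can communicate simultaneously without collisions. Experiments have measured that, in a LAN with multiple servers, the actual maximum throughput of switched Ethernet in half-duplex mode is 1.7 times that of a shared network, and in full-duplex mode it can reach 3.8 times. A switch differs from a hub only in how it works; in other respects, such as cabling and speed selection, the two are much the same. Current switches likewise come in 10M, 100M, and 1000M speeds, most often with 8, 16, or 24 ports. In a LAN, switches are mainly used to connect workstations, hubs, and servers, or to build a distributed backbone.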

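III. Causes of Broadcast Storms

With this understanding of the network devices, we can work out in simple terms why broadcast storms arise. In general, the main causes are the following:

  1. Network equipment: we often fall into the misconception that a switch forwards point to point and therefore cannot produce a broadcast storm. When buying network equipment, what we purchase as a switch is often really an intelligent hub passed off as a switch by an unscrupulous vendor. In that case broadcast storms are bound to appear as soon as the network gets a little busy. (Even a genuine switch still floods broadcast frames to every port.)
  2. Damaged network card: a damaged network card in one of the machines can also cause a broadcast storm. The damaged card keeps sending large numbers of useless packets to the switch, and a broadcast storm results. Storms caused by a physically damaged card are hard to troubleshoot because the damaged card can usually still get online; we generally use LAN management software such as Sniffer to examine the traffic and pinpoint the faulty station.
  3. Network loop: during one troubleshooting job we found a rather laughable mistake: a twisted-pair cable with both ends plugged into different ports of the same switch. Network performance dropped sharply, and even opening a web page became difficult. This is a typical network loop. A loop usually arises when both ends of one physical link are connected to the same network device.
  4. Network viruses: some currently widespread network viruses, such as Funlove, 震荡波 (Sasser), and RPC-based viruses, spread over the network as soon as a machine is infected. Their propagation consumes a great deal of bandwidth, congests the network, and triggers a broadcast storm.
  5. Use of hacking tools: some users run hacking tools such as 网络执法官 and 网络剪刀手 against an Internet café's internal network, and the use of this software can also cause broadcast storms.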
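Item 2 above suggests examining traffic with a sniffer to locate the faulty station. As a rough illustration of that idea, the Python sketch below counts broadcast frames per source MAC address and lists the heaviest senders. It assumes the capture has already been exported as (source MAC, destination MAC) pairs; the function name `top_broadcasters` and the sample addresses are hypothetical, and no particular sniffer's export format is implied.

```python
from collections import Counter
from typing import Iterable, List, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"


def top_broadcasters(frames: Iterable[Tuple[str, str]], limit: int = 3) -> List[Tuple[str, int]]:
    """Return the source MACs that sent the most broadcast frames."""
    counts = Counter(src for src, dst in frames if dst.lower() == BROADCAST)
    return counts.most_common(limit)


# Hypothetical capture: one (source MAC, destination MAC) pair per frame.
capture = (
    [("00:11:22:33:44:55", BROADCAST)] * 900               # suspect station
    + [("00:11:22:33:44:66", BROADCAST)] * 12              # ordinary broadcast traffic
    + [("00:11:22:33:44:66", "00:11:22:33:44:77")] * 300   # unicast, ignored
)
suspects = top_broadcasters(capture)
print(suspects)
```

A station that dominates this list is only a candidate; the IP-to-MAC table and cabling records recommended later in this article are what turn a MAC address into a physical machine.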

Fast fault diagnosis depends on solid fundamentals, so do not neglect the basics in your future studies!


A Must for Network Administrators: Calming a Broadcast Storm Caused by a Loop

黄迎 

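I am a network administrator. Our LAN was recently rebuilt and all the cabling rerun. The network originally had 70 points, all in a star topology; more than 30 users have now been added, which in principle meant adding just one switch. The network now looks like this: the whole network uses a star topology; the central equipment room has a TCL Layer 3 switch, and the servers provide FTP, file, Web, and other services. Each floor uses TCL switches, all with VLAN support, and each terminal connects over 100 Mbps Avaya cabling. The network worked well at first, but lately it kept dropping connections and sometimes simply went on strike. The boss gave me three days to fix it or say goodbye. Boo hoo... Fortunately, hard work pays off: after three days and nights of racking my brain, I finally cracked it. Now my colleagues all call me 黄工 (Engineer Huang), and a big raise is probably coming, ha ha ha... Back to the story:

One day, N users reported that their network connections came and went and that machines in Network Neighborhood sometimes could not reach each other. Because the affected users were spread over several floors, with no concentrated fault point, the impact was especially bad. At first we assumed heavy traffic was clogging the switch ports, and we rebooted the switches and servers N times, to no avail. So we quickly agreed to attack the problem from the software side: first scanning the servers for viruses, then shutting down each switch and scanning every machine. The fault remained. Pinging some of the servers or computers on the network still showed packet loss, and the network kept cutting in and out.

I was going crazy! At home that evening I went back over the day's events and realised that while I was scanning for viruses and had disconnected particular switches, the network returned to normal, and when I plugged them back in, it dropped again. That gave me a lead. At work the next day I disconnected all the switches and reconnected them one by one until I found the one causing trouble. After careful inspection I finally found the problem. When the network was built, two uplink cables were left between the core switch and each floor switch; normally only one is plugged in and the other is kept as a spare. On this switch both were plugged in. When a user on that switch sent data to another user, the packets travelled along the spare cable and looped through the core switch over and over. Under heavy traffic this wasted the core switch's resources, slowing the network until connections dropped.

The fault was caused by a loop in the network, which made every frame get rebroadcast around the network again and again and set off a broadcast storm. Broadcast storms caused by such looped connections can be eliminated with STP (Spanning Tree Protocol); readers can consult books from Cisco and others, so I will not go into it here. Also, when building larger networks in future, we must keep detailed records, including the cabling diagram and an IP-to-MAC mapping table, and put numbered sleeves on the cables. Investigation showed that this fault was caused by the carelessness of a newly arrived colleague.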
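The effect of the double uplink can be reproduced in a few lines. The following Python sketch is a deliberately simplified model, not the author's network: the switch names, port counts, and the function `simulate_broadcast` are assumptions, every switch simply floods a broadcast out of all ports except the arrival port, and there is no MAC learning or STP. It counts how many copies of a single broadcast frame are still travelling between switches after each hop.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # (switch name, port number)


def simulate_broadcast(links: Dict[Port, Port], ports: Dict[str, int],
                       entry: Port, max_hops: int = 10) -> List[int]:
    """Flood one broadcast frame; return the number of copies in flight after each hop.

    links maps a switch port to the peer switch port it is cabled to (both directions).
    Ports that do not appear in links are assumed to lead to end hosts.
    """
    in_flight = [entry]  # frame copies waiting to be processed: (switch, arrival port)
    history = []
    for _ in range(max_hops):
        next_hop = []
        for switch, in_port in in_flight:
            for out_port in range(ports[switch]):
                if out_port == in_port:
                    continue
                peer = links.get((switch, out_port))
                if peer is not None:  # the copy travels on to the neighbouring switch
                    next_hop.append(peer)
        in_flight = next_hop
        history.append(len(in_flight))
        if not in_flight:
            break
    return history


ports = {"core": 4, "floor": 4}
primary = {("floor", 0): ("core", 0), ("core", 0): ("floor", 0)}
spare = {("floor", 1): ("core", 1), ("core", 1): ("floor", 1)}

single = simulate_broadcast(primary, ports, entry=("floor", 2))
looped = simulate_broadcast({**primary, **spare}, ports, entry=("floor", 2))
print("one uplink :", single)
print("two uplinks:", looped)
```

With one uplink the frame crosses to the core switch once and then dies out. With both uplinks plugged in, two copies keep circulating for as long as the simulation runs, and every pass is delivered again to all host ports. Removing the spare link by hand, as in the story, has the same effect as STP blocking one of the redundant ports.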


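Why Does the Network Keep Going Down?

Q: Our Internet café has more than 70 computers, and the network goes down one to three times a day. Usually, unplugging all the cables from the first-tier switch and plugging them back in restores service; sometimes the switch has to be rebooted. After we replaced the original 10 Mbps network cards with 10/100 Mbps cards, the network stayed up for nearly a week. In the past few days, however, it has started misbehaving again. The connecting devices are 16-port and 24-port 10/100 Mbps switches, and the proxy server uses ICS (Internet Connection Sharing) on Windows 2000. What causes this?

A: Once a virus flooding the network with packets has been ruled out, this can be regarded as a typical network outage caused by a broadcast storm. After a storm breaks out, virtually all the traffic on the network is broadcast packets and the computers spend their time processing broadcasts, so normal packets cannot be forwarded or processed. Unplugging the cables or powering off the switch stops the storm, and normal communication resumes.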

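Broadcasting can be thought of as one person speaking to everyone present. The advantage is efficiency: a message reaches every computer on the network at once. Even when no user deliberately sends broadcast frames, a certain number of them appear on the network. Note that broadcasts not only consume a large amount of bandwidth but also take up a large amount of CPU time on the computers. A broadcast storm is the condition in which the network is occupied by large numbers of broadcast packets for a long time, so that normal point-to-point communication cannot proceed; outwardly, the network becomes extremely slow or even collapses.

Broadcast storms have many possible causes; a single faulty network card or a single faulty port can trigger one.

Note that a switch can only separate collision domains; it cannot separate broadcast domains. In fact, once broadcast packets make up 30% of total traffic, the transmission efficiency of the network drops noticeably.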
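The 30% figure can be checked against counters read from a switch or an analyzer. The short Python sketch below is only an illustration: it assumes the share is measured by frame count (the text does not say whether frames or bytes are meant), and the function names are hypothetical.

```python
def broadcast_share(broadcast_frames: int, total_frames: int) -> float:
    """Fraction of all observed frames that were broadcasts."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return broadcast_frames / total_frames


def is_degraded(broadcast_frames: int, total_frames: int, threshold: float = 0.30) -> bool:
    """True when the broadcast share reaches the 30% level quoted above."""
    return broadcast_share(broadcast_frames, total_frames) >= threshold


busy = is_degraded(3_500, 10_000)    # 35% of frames are broadcasts
quiet = is_degraded(500, 10_000)     # 5% of frames are broadcasts
print(busy, quiet)
```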

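As a rule, a network that uses multiple communication protocols should have no more than 100 computers, and a network that uses a single protocol no more than 150. If there are more computers, divide the network with VLANs, splitting the large broadcast domain into several smaller ones to reduce the damage a broadcast storm can do.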
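The sizing rule above reduces to a ceiling division. The Python sketch below applies the two limits quoted in the text (100 and 150 hosts) to estimate the minimum number of VLANs; the function name `vlans_needed` is hypothetical, and a real design would also weigh departments, floors, and traffic patterns.

```python
import math


def vlans_needed(hosts: int, multi_protocol: bool) -> int:
    """Minimum number of broadcast domains under the rule of thumb above."""
    limit = 100 if multi_protocol else 150  # limits taken from the text
    return max(1, math.ceil(hosts / limit))


print(vlans_needed(70, multi_protocol=False))    # fits in one broadcast domain
print(vlans_needed(400, multi_protocol=True))
print(vlans_needed(400, multi_protocol=False))
```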


Broadcast Storm Analysis

When encountering a broadcast storm in a network baseline session, an analyst can apply a specific technique to isolate the cause of the storm and the possible effect of the broadcast event on the internetwork.

A broadcast storm is a sequence of broadcast operations from a specific device or group of devices that occurs at a rapid frame-per-second rate that could cause network problems.

Network architecture, topology design, and layout configurations determine the network's tolerance level as it relates to frame-per-second broadcasts.

Consider, for example, a frame-per-second rate related to a broadcast storm generation of a specific protocol (Address Resolution Protocol [ARP], for example). Such generation, at more than 500 frames per second and on a continuing basis, is considered an abnormal protocol-sequencing event and can be extremely problematic.

The key here is to understand the difference between a normal broadcast event and an actual broadcast storm. When a normal broadcast event occurs, the broadcast is engaged from a specific physical device on a network for the express purpose of achieving a network communication cycle. There are conditions when a device, such as a router, broadcasts information to update other routers on the network to ensure that routing tables are maintained as consecutive and consistent related to internal route table information. Another standard broadcast event is when a device attempts to locate another device and requires the physical address or IP address of another device.

When a specific workstation device has a default gateway assigned, a "normal" broadcast event can occur. The device knows, for example, the target IP address of a device on the internetwork. It is common for this device to broadcast an ARP sequence to attempt to locate the target hardware address. ARP broadcasting is discussed in detail later in this book.

A workstation that broadcasts an ARP sequence to locate a target server but doesn't establish a broadcast resolve and doesn't receive a target hardware address for the server provides an example of an "abnormal" broadcast event. If the target device fails or the source broadcast operation mechanism or protocol-sequencing mechanism of the device fails, the source workstation device could start performing a loop ARP sequence that could be interpreted as a broadcast storm. Such an event in itself could cause a broadcast storm.

[Figure: Broadcast storm analysis.]

The point to be made here is that the frame-per-second rate of the broadcast sequence and the frequency of the broadcast sequence event occurrence can constitute an abnormal event.

Another example can be found in a Novell environment, when the Service Advertising Protocol (SAP) sequencing is engaged by specific servers. If the servers are broadcasting an SAP on standard NetWare sequence timing, the occurrence may take place on 60-second intervals. If there are hundreds or thousands of servers, the SAP sequence packets generated may become highly cumulative and affect areas of the enterprise internetwork that are not utilizing Novell processes.

In large internetworks, many of these concerns are addressed through protocol filtering within routers and switches in the network Layer 3 routing design. When a problem does occur because of an anomaly or possible misconfiguration of an internetwork, it is important to capture the information upon occurrence.

By applying an exact technique with a protocol analyzer, an analyst can very quickly capture a broadcast storm and identify the cause of the broadcast storm and develop a method to resolve the storm. Many different tools enable an analyst to achieve this. Almost all management systems for internetwork hubs, routers, and switches facilitate broadcast storm identification. The threshold that determines what is an actual broadcast occurrence versus an actual broadcast storm is usually set by the network manager or the configuring analyst of the network management platform.

The following discussion details the use of a protocol analyzer for broadcast storm analysis. When performing a data-analysis capture, a protocol analyzer is a useful tool for capturing a broadcast storm. Many protocol analyzers have thresholds that allow an artificial intelligence-based Expert system to identify a broadcast storm. A storm can be identified by preconfiguring and studying a trigger or threshold for determining what would constitute a storm occurrence. When performing a network baseline, an analyst should always engage the threshold setting on the protocol analyzer prior to a baseline session.

Using a Protocol Analyzer for a Broadcast Storm

Based on the network architecture, the protocols, and the node count on a site being studied, an analyst must determine what constitutes a broadcast storm. This requires the analyst to be quite familiar with the topology and types of protocols and applications being deployed. A general benchmark is that a broadcast sequence occurring from a single device or a group of devices, either rapidly or on an intermittent cycle at more than 500 frames per second, is a storm event. At the very least, the sequence should be investigated if it is occurring at 500 frames per second (relative to just a few devices and a specific protocol operation).
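The 500 frames-per-second benchmark can be applied to an exported trace with a simple per-second count. The Python sketch below is an assumption-laden illustration, not a feature of any particular analyzer: it expects (timestamp in seconds, source MAC, destination MAC) tuples, buckets frames by whole second, and the name `storm_seconds` and the sample addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BROADCAST = "ff:ff:ff:ff:ff:ff"


def storm_seconds(frames: Iterable[Tuple[float, str, str]],
                  threshold_fps: int = 500) -> List[Tuple[int, str, int]]:
    """Return (second, source MAC, frame count) for every source that sent more
    than threshold_fps broadcast frames within one whole second of the capture."""
    per_second: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, src, dst in frames:
        if dst.lower() == BROADCAST:
            per_second[(int(timestamp), src)] += 1
    return sorted((sec, src, n) for (sec, src), n in per_second.items()
                  if n > threshold_fps)


# Hypothetical trace: one station sends 600 broadcasts during second 3,
# another sends 20 broadcasts in the same second.
capture = [(3 + i / 600, "00:aa:00:00:00:01", BROADCAST) for i in range(600)]
capture += [(3 + i / 20, "00:aa:00:00:00:02", BROADCAST) for i in range(20)]
alerts = storm_seconds(capture)
print(alerts)
```

The threshold is a parameter on purpose: as the text says, what counts as a storm depends on the architecture, protocols, and node count of the site being studied.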

After the threshold has been set on the protocol analyzer, a data-trace capture should be started. After the capture has been invoked, and a broadcast storm event has occurred in the Expert system with notification or in the statistics screen, the time of the storm and the devices related to the storm should be carefully noted. The addresses should be noted in a log along with the time of the storm and the frame-per-second count. Most protocol analyzers provide this information before the capture is even stopped. As soon as the broadcast storm occurrence takes place, the analyzer should be immediately stopped to ensure that the internal data-trace information is still within the memory buffer of the protocol analyzer. The data trace should then be saved to a disk drive or printed to a file to ensure that the information can be reviewed. The data-trace capture should then be opened and the actual absolute storm time noted from the Expert system or the statistical screen. Based on the absolute time, it may be possible on the protocol analyzer to turn on an absolute time feature. When turned on in the data trace, the absolute time feature enables an analyst to search on the actual storm for the absolute time event. This may immediately isolate and identify the cause of the broadcast storm.

Certain protocol analyzers offer hotkey filtering to move directly within the data-trace analysis results of the storm event. Either way, by using absolute time or hotkey filtering, the broadcast storm should be located within the data-trace capture.

Other metrics can be turned on in a protocol analysis display view when examining a broadcast storm, such as relative time and packet size. After the start of the storm has been located, the key devices starting and invoking the storm should be logged. Sometimes only one or two devices cause a cyclical broadcast storm occurrence throughout an internetwork, resulting in a broadcast storm event across many different network areas. The devices communicating at the time closest to the start of the storm inside the data-trace analysis results may be the devices causing the event.

After the storm has been located, the Relative Time field should be zeroed out and the storm should be closely reviewed by examining all packets or frames involved in the storm. If 500 or 1,000 frames are involved, all frames should be closely examined by paging through the trace. After the end of the storm has been located, the time between the start of the storm and the end of the storm should be measured by using a relative time process. This is achieved by just zeroing out the relative time at the beginning of the storm occurrence and examining the cumulative relative time at the end of the sequence. This provides a clear picture of the storm device participation and processes, the packet-size generation during the storm, and the source of the storm location. The initial several packets located for the broadcast storm should be investigated for the physical, network, and transport layer addressing schemes that may relate to the storm occurrence. This helps an analyst to understand the sequence of the storm event.
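Zeroing the relative time is just a subtraction against the timestamp of the first storm frame. The Python sketch below shows the arithmetic on a list of absolute timestamps; the function name `storm_duration`, the sample values, and the assumption that the analyst has already identified the first and last storm frames are all illustrative.

```python
from typing import List, Tuple


def storm_duration(timestamps: List[float], start_index: int,
                   end_index: int) -> Tuple[List[float], float]:
    """Zero the relative time at the first storm frame and return
    (relative times of the storm frames, cumulative time at the last one)."""
    zero = timestamps[start_index]
    relative = [t - zero for t in timestamps[start_index:end_index + 1]]
    return relative, relative[-1]


# Hypothetical absolute timestamps in seconds; frames 1..3 belong to the storm.
trace = [100.0, 100.5, 101.0, 101.25, 103.0]
relative, duration = storm_duration(trace, start_index=1, end_index=3)
print(relative, duration)
```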

This is an extremely important process in network baselining and should be engaged in proactive and reactive analysis. In proactive baselining, an analyst must configure the proper broadcast storm thresholds on the protocol analyzer. This way, the storm events will show during the network baseline session. In a troubleshooting (reactive) event, it is important to know whether certain failure occurrences or site network failures are also being reported by the users; these may relate to the time of the storm occurrence. If this is the case, just isolating and identifying the broadcast storm may make it possible to isolate the devices causing the storm or the protocol operations involved. It may then be possible to stop the storm occurrence. This will increase performance levels and optimize the network.


Detecting a Broadcast Storm - Link Analyzer

Fluke Networks served as the Distributed Analysis and Troubleshooting sponsor for the Event Network (eNet) at Interop 2005 in Las Vegas.

Various analysis products were deployed throughout the eNet to discover end-user devices, analyze traffic statistics on key links, and monitor for problems.

In an unusual case, the OptiView Link Analyzer and Protocol Expert were used to monitor the VoIP VLAN for unexplained activity.

During testing on the VoIP VLAN, phones stopped connecting to the Call Management servers. Also, calls in process suffered poor quality. Using the Local Statistics test on the OptiView Integrated Network Analyzer, broadcast traffic was discovered to be using 100% of the utilization on the VoIP VLAN.

A loop introduced into the network created a broadcast storm. An exhibitor at the show accidentally looped a cable in his booth, causing the broadcast event.

To be more proactive in monitoring these types of events, Link Analyzer was configured to watch for broadcast storms. Network Management Applications now are alerted when a broadcast storm occurs.
 

Configuring the Link Analyzer to Monitor Traffic

  1. Connect the Link Analyzer (LA) to the VoIP VLAN on a non-mirrored port. 

    Broadcast traffic on the voice network also would be forwarded to the Link Analyzer as part of the VLAN.  

  2. De-select the capture feature
    There was no need to continually capture and save potential broadcast-storm traffic, so the capture function of the LA was disabled. Only the monitor feature is used on the analyzer.
    ·Click the capture button on Protocol Expert to de-select the capture feature

 

Configuring an Excessive Broadcast Alarm

Create an alarm to alert if Broadcast levels exceed 5% of utilization: 

  1. Right-click the Link Analyzer IMM module in the PE resource browser
  2. Select Alarms to configure the Alarm variables
  3. Select a New Alarm
  4. In the Broadcast Frames settings, set:
    ·the count to Delta
    ·the number of occurrences to 10,000
    ·the action to SNMP Trap
    ·the Interval to 1.
    This means if 10,000 broadcast packets are observed (a good amount for a Gigabit connection) during a one second time period, then Link Analyzer will send an SNMP Trap alerting that a storm is in progress.

  5. Enable the alarm by checking the enabled box on the right
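The Delta setting in step 4 compares the increase of the broadcast counter over each interval with the configured number of occurrences. The Python sketch below imitates that logic on a list of counter readings; it is not Link Analyzer code. The function names are hypothetical, the "greater than or equal" comparison is an assumption, and `wire_utilization` assumes standard Ethernet framing with 20 bytes of preamble and inter-frame gap per frame.

```python
from typing import Iterable, List


def delta_alarm(counter_samples: Iterable[int], occurrences: int = 10_000) -> List[int]:
    """Given cumulative broadcast-frame counter readings taken once per interval,
    return the indexes of the readings whose increase reached `occurrences`."""
    fired = []
    previous = None
    for index, value in enumerate(counter_samples):
        if previous is not None and value - previous >= occurrences:
            fired.append(index)
        previous = value
    return fired


def wire_utilization(frames_per_second: int, frame_bytes: int,
                     link_bps: float = 1e9) -> float:
    """Share of the link used, counting 20 extra bytes per frame on the wire
    (8 bytes preamble/SFD plus 12 bytes inter-frame gap)."""
    return frames_per_second * (frame_bytes + 20) * 8 / link_bps


samples = [0, 120, 260, 15_260, 31_000, 31_090]  # hypothetical one-second readings
fired = delta_alarm(samples)
print(fired)
print(wire_utilization(10_000, 64), wire_utilization(10_000, 1518))
```

Under these assumptions, 10,000 minimum-size (64-byte) broadcasts per second occupy about 0.67% of a Gigabit link and 10,000 maximum-size (1518-byte) frames about 12.3%, so how close the frame-count trigger comes to the 5% utilization goal depends on the frame sizes involved.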

 

Set The Destination For The SNMP Trap

To configure trap destinations for a remote Link Analyzer device:

  1. Select the device in the Resource Browser
  2. Select Host | Alarms Settings | SNMP Trap settings from the menu bar

    The SNMP Traps dialog box will appear.
  3. Use the Community Settings area to add or delete communities
  4. List all IP addresses for the management systems in the Trap Destinations area.

A maximum of 15 trap destinations can be assigned to each community. All alarms are sent to all specified trap destinations.

At Interop 2005, this method was used to send SNMP Traps to the Computer Associates UniCenter Network Management Platform. These traps alerted eNet engineers of unusual broadcast traffic on the VoIP network within 1 second of the event.

 

Editor: admin