T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
2268.1 | The cluster from hell. 8^) | CGOS01::DMARLOWE | Wow! Reality, what a concept! | Wed May 10 1995 15:31 | 37 |
|
>> They moved the 2100 to
>> another 900 port but the problem remained relatively the same. The 2100 was
>> then moved to another floor on another 900EF port. This seemed to resolve the
>> problem.
How many nodes are on the port that the 2100 is now on?
How many on the ports before?
>> Is there any method to find out how many packets are going thru the 900EF?
Double-click on a 900EF. The first view includes the 7 ports, with
packet in/out counts for each port.
>> Checking with
>> HubWatch shows the deferred packet count, on various ports, is increasing
>> rapidly. The nodes have the same problem in their "line" counters. Lots of
>> single and multiple collisions occur too.
Deferred means traffic was already on the wire, so the packet was held
back. Collisions are another matter, especially multiple collisions.
Sounds like you have too much traffic. Also, are there any nodes that
may not be fully complying with the 9.6 us IPG (inter-packet gap) time?
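A quick way to see the same counters from the VMS side, assuming the nodes are
running DECnet Phase IV (the counter names below are the standard NCP Ethernet
line counters):

    $ MCR NCP
    NCP> SHOW KNOWN LINES COUNTERS
    NCP> EXIT

In that display, watch "Blocks sent, initially deferred", "Blocks sent, single
collision", and "Blocks sent, multiple collisions" relative to "Blocks sent" to
see how hard each node is fighting for the wire.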
>> If the FDDI ring was to break (DB900EF A/B or DC900MX A/B port failure or
>> fiber failure), will the cluster nodes exhibit any problems? What is the
>> FDDI wrap time?
Wrap time is quite small. It can take as little as 10 ms to beacon and wrap
(heal), though maybe someone can provide a closer answer. That is orders of
magnitude faster than an STP reconfiguration, however. The only thing that will
happen is that any packets on the ring at the time of the wrap will be
lost. Those packets will have to be retransmitted based on timers in the
upper protocol stack.
dave
|
2268.2 | | NPSS::WADE | Network Systems Support | Wed May 10 1995 16:30 | 12 |
|
Are you seeing any resets on the 900EF?
Any other bridges on the net?
Any STP topology changes on the E-LAN?
I assume you have RECNXINTERVAL set to something greater than the
20-second default on all cluster nodes?
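A quick way to check this on each node, assuming the standard OpenVMS SYSGEN
utility (the value is in seconds):

    $ MCR SYSGEN
    SYSGEN> SHOW RECNXINTERVAL
    SYSGEN> EXIT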
Bill
|
2268.3 | | NETCAD::ANIL | | Wed May 10 1995 21:55 | 10 |
| The sum of out packets on all ports is the total number of packets
forwarded -- this includes spanning tree hellos and SNMP management
packets, but these will be a negligible percentage of actual
traffic.
Deferred frames and single and multiple collisions are normal in
an active network. The thing to look at is excessive collisions,
which indicate high levels of congestion on the Ethernet.
Anil
|
2268.4 | Update on Cluster From Hell :-( | TROFS::WEBSTER | NIS, London, Canada | Thu May 11 1995 16:28 | 38 |
| I put a sniffer on their net today. Multicast/broadcast storms seem to be
common on all segments that have more than 1 LAVC node. More than 3 LAVC
nodes causes LAN overload conditions. One segment I measured had sustained
averages of >50% utilization, lasting >10 seconds.
There are no resets on the bridges (uptime exceeds 63 days on all bridges...
about the time of the last power failure we had in the building (Digital is
actually in the same building as the customer... our downsizing has given them
more floor space...)).
There are only the 4 DB900EFs on the net and 1 Cisco 2500 router. There are
many IP nodes, but not a lot of traffic from them. The PC count is now over 50,
most running Workgroups, plus 1 NetWare server, 1 remote mail server,
1 internal mail server, and 1 FAXserver (all PC based).
RECNXINTERVAL is the default 20. They have not adjusted this, so I assume it
is the same on all nodes.
I have some concerns about the performance of the DETTR, which is an Allied
Telesis 10BaseT-to-10Base2 repeater. The customer's wiring is all coax. We
replaced their DECrepeater 90Cs with the DB900EFs, so DETTRs and DECXMs were
supplied to connect the coax. The sniffer found CRC errors and runt packets on
several segments, even ones that were not excessively busy. On one segment, we
started shutting down the nodes one by one to isolate the source, but the
errors persisted. Segments measured off the DECXMs did not have these errors.
>Any STP topology changes on the E-LAN?
What do you mean by this Bill?
MCS did get the DEFPA working today on the 2100 server running OpenVMS 6.1,
so the LAVC traffic is now on the FDDI for this one node. The other 2100 will
be upgraded next week (MCS has to make sure all the console code and hardware
are at the proper revs), and 4 other nodes will be added shortly after (a mix
of VAX 3100s and Alpha 3000s). That should help reduce Ethernet traffic.
-Larry
|
2268.5 | | NETCAD::ANIL | | Thu May 11 1995 20:47 | 11 |
| If a large percentage of the traffic is broadcast or multicast, you can
turn on "rate limiting" for the specific addresses in the switches to
stop them from propagating; in any case the reason still needs to be
figured out. Note that the Sniffer tends to report "storms" falsely
if its trigger for such detection is set low. If a large percentage
is error traffic, that would indicate a physical (MAU/repeater) problem.
I would also check to make sure that no configuration rules, such as
the limit on repeaters in series, are being violated. I've seen these
cause the kind of network slowdown you're describing.
Anil
|
2268.6 | | NPSS::WADE | Network Systems Support | Fri May 12 1995 10:06 | 8 |
| Off the track for this conference, but RECNXINTERVAL = 20 seconds is
the default for a CI cluster. Increasing this to 60-90 (on all nodes)
should stop the PEDRIVER errors while you fix the problem on the E-LAN.
It is also advisable to leave it at 60-90 for a cluster that includes NI
nodes.
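As a sketch of how that change is usually made, assuming the standard OpenVMS
SYSGEN utility (RECNXINTERVAL is a dynamic parameter, so it can be raised on a
running node):

    $ MCR SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SET RECNXINTERVAL 90
    SYSGEN> WRITE ACTIVE
    SYSGEN> WRITE CURRENT
    SYSGEN> EXIT

WRITE ACTIVE changes the running system and WRITE CURRENT makes the value
persist across reboots; adding RECNXINTERVAL = 90 to SYS$SYSTEM:MODPARAMS.DAT
keeps AUTOGEN from undoing it later. Repeat on every cluster node.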
Bill
|
2268.7 | Cluster transitions when DC900MX removed. | TROFS::WEBSTER | NIS, London, Canada | Mon Aug 21 1995 17:53 | 23 |
| Back to one of the original questions regarding FDDI wrap time and
cluster state transitions.
Last week we had the privilege of removing a concentrator 900 from the
ring to change a PMD. This unit had no SASes connected, as the nodes
were just being installed (the old problem of DEFTA-UAs not being supported
by VMS before version 6.2, so the UTP PMDs were changed to MMF PMDs).
The DC900MX was in the middle of the ring created in the backplane.
(see diagram in note .0)
Using HUBwatch, we pulled the B port off the channel. As soon as we
did this, the cluster went into state transition. Any node that was
talking to the 2 FDDI nodes on the other DC900MX in the backplane was
affected.
Will raising the RECNXINTERVAL timer to the 60-90 second range, as
mentioned in -.1, resolve this problem?
The customer was not too happy about this, and I would like to go back
and say, "We told you to increase your timers and you didn't, so
that's why there were transitions!"
-Larry
|
2268.8 | | NETCAD::DOODY | Michael Doody | Tue Aug 22 1995 10:29 | 15 |
| I think you are seeing a problem not with FDDI wrap time but rather
with the bridge ports going into the pre-forwarding state.
When you removed the concentrator from the ring, the adjacent modules'
ports were connected to each other to heal the ring. When this
disconnect/reconnect happens, one or more of the bridge FDDI ports go
into the pre-forwarding state as they learn the new topology (Anil could
give a better answer). During the 30-second pre-forwarding period, no packets
are forwarded between FDDI and Ethernet.
So clearly a RECNXINTERVAL of 20 seconds will expire before the bridge ports
come back.
-Mike
|