|
An explanation:
Subj: DECbridge-90 "1801" messages
18 means "Beginning of self test."
01 means "End of self test".
That no other numbers appear means that no tests were performed. The
DECbridge-90 will do this whenever it thinks that it may have lost track
of some memory.
The DECbridge-90 does this under any of the following conditions:
- An overrun error on the Intel 82590 Ethernet controller. This isn't
supposed to ever happen, because the bus is allocated in fixed time
slots, and there is always enough time for the controller. However, if
the controller reports this error, the bridge will respond with an 1801.
If this is what is causing the problem, it is either bad hardware
(something stuck accessing the memory bus), or the network "traffic" is
some kind of noise that is badly confusing the controller. These aren't
counted, so neither SHOW DISPLAY nor SHOW PORT will help you.
- Two consecutive lifetime exceeded errors (indicating substantial
outbound congestion failure, or a bad transceiver). Normally, a port's
"lifetime exceeded" counter never increments above 0. If this counter
is incrementing, then this could be the problem. It would indicate
some kind of problem either with excessive traffic on the network (sum
of traffic on both wires exceeding 14,000 packets/second), or some kind
of wiring problem that makes transmitting packets difficult and
requires many retries. This could also result from using some
non-conforming Ethernet devices which are too aggressive in their
back-off and retry algorithm. SUN was famous for shipping systems like
that a few years back, but I assume that problem is gone by now.
- Any system buffer unavailable event will trigger it. Running phases
18 and 01 of the self test recovers any memory that may have been lost
due to excessive numbers of runt frames or lifetime exceeded errors.
However, this trigger is protected by a timer that will prevent it from
happening any more often than once every 10 minutes. Normally, the
system buffer unavailable counter will never increment above 0. If this
counter is moving, then this may be the problem. It would indicate an
unusually high number of runt frames and collision fragments, which is
indicative of improperly configured wiring. Check the "repeater count"
limits, check wires for proper termination, and check to be sure the 180
meter limit on the length of a thinwire segment is not exceeded.
- A ^C received on the manufacturing diagnostic console triggers an 1801.
However, this requires power-on with the password reset button depressed to
enter manufacturing diagnostic mode (which turns the backplane
management port into a diagnostic console). This is also possible if
the ASO_L backplane pin is shorted to ground. DEChub 90 normally
doesn't use control characters in the hub management protocol. However,
if this is in a -900, it might be possible that the ASO_L pin is
asserted due to some hub problem, and it might be polling the devices
that are responding in some binary protocol or at a different baud
rate, such that the bridge is interpreting repeater responses as ^C.
That it happens every 30 seconds is another hint. 30 seconds is
related to the interval of spanning tree hello messages. However, bad
hellos cause a port loopback test to be scheduled, which appears as
180501 or 180201, not as just 1801.
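To make the reading of these codes concrete, here is a minimal sketch in
Python of how one might split a displayed code into its two-digit phases.
It assumes only what is described above: 18 opens the self test, 01 closes
it, and any pairs in between (such as 05 or 02) are test phases that
actually ran. The function name and the returned field names are made up
for illustration; nothing here comes from the bridge firmware.

```python
def decode_selftest(code: str) -> dict:
    """Split a self-test code such as "1801" or "180501" into phases.

    Assumption (from the explanation above): codes are a run of
    two-digit phase numbers, starting with 18 and ending with 01.
    """
    if not code.isdigit() or len(code) % 2 != 0:
        raise ValueError(f"not a run of two-digit phases: {code!r}")

    # Break the string into consecutive two-digit phase numbers.
    phases = [code[i:i + 2] for i in range(0, len(code), 2)]

    if len(phases) < 2 or phases[0] != "18" or phases[-1] != "01":
        raise ValueError(f"expected 18 ... 01, got {code!r}")

    # Everything between the opening 18 and closing 01 is a test that ran.
    tests_run = phases[1:-1]
    return {
        "phases": phases,
        "tests_run": tests_run,
        # A bare 1801 means no tests ran: the bridge only recovered memory.
        "memory_recovery_only": not tests_run,
    }


if __name__ == "__main__":
    for shown in ("1801", "180501", "180201"):
        print(shown, decode_selftest(shown))
```

With this, a bare 1801 comes back with no tests run, while 180501 and
180201 each report one intermediate phase, which is the pattern to expect
when a port loopback test was scheduled.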
Please post this as a reply to your note, and let me know if this helps.
Is this a bridge under test, or in production use?
|
| I have a customer that is seeing this problem also.
He has the following configuration.
Two DEChub 90's connected via 20M of thin wire and a management cable.
In the top hub he has a DECbridge 90FL rev 3.0, a DECrepeater 90C,
a DECserver 90L, and a DS90L+. The DECbridge 90FL is connected via
the AUI port to a thick-wire transceiver with heartbeat disabled.
In the bottom hub he has a DECrepeater 90C and two DECserver 90TL's.
He has a 5-node cluster connected to the two repeaters on this hub, and a
couple of X Window terminals.
This has been running ok for a few months. On Sep 22 he lost connection with
nodes on the backbone. The cluster on the hub was not affected. He tried to
do a connect node to the DECbridge 90FL from one of the cluster members and
would get the 1801 a few times, then a "target does not respond".
He had a spare DB90FL so he did a quick swap, at 19:30. By 07:30 the next
morning the spare DECbridge was in the same condition. Next he swapped the
DECbridge 90FL with a DECbridge 90 and has not had any problems since.
I have asked him to replace the DECbridge 90FL and, if the problem recurs, to
remove the thin-wire cable from the hub and terminate the hub, to see if the
bridge can recover.
I have also asked him to try breaking the two hubs apart, each with its own
DECbridge 90FL.
Can a five-node cluster across these two hubs cause enough traffic to be a
problem for a DB90FL?
The cluster members are not logging any send or receive failures. And with the
exception of losing communication with nodes outside of the hub, these cluster
members were not affected in any way by the outage. This tells me that the
workgroup side was not having a problem.
He has the 3.1 update on order.
Thanks for any input
Alan S. Anderson
Network Support CSC CS
|