Title: DEC Network Integration Server (DECNIS)
Notice: Please read note 1 to use this conference effectively
Moderator: MARVIN::WELCH
Created: Wed Sep 18 1991
Last Modified: Thu Jun 05 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 3660
Total number of notes: 15082
3646.0. "self-test error "line card failed" when all cards ok" by CSC32::J_RYER (MCI Mission Critical Support Team) Thu May 22 1997 18:59
After a DECnis 600 router rebooted unexpectedly, my customer noticed
that there was a "1" in the top LED display of that router. The last
reboot reason showed "Unknown". An NCL "sho hardware all" command
to the router reported a self-test error with a reason of "Line Card
Failed" for the latest boot; however, none of the line cards
showed any sort of fault indication, and all routing circuits were
passing traffic successfully.
In checking out the router's history, we found that there had been
self-test errors on the last eight boots (numbers 15 through 22).
Some showed "Line Card Failed" and others showed "System Unusable"
as the reason. Those boots spanned a period of over six months;
none of them were less than four or five days apart.
That night, customer powered the router down and back up, and
it came up cleanly (no self-test failure). Note that this was
the first boot in quite a long time which had not resulted in a
self-test error.
The router has been up for about three weeks now with no further
indication of any problem. However, customer is saying "yes, but
it went as long as four months without problems, and still failed
self-test on the next load". He wants an explanation of how the
DECnis could have reported a self-test failure reason of "line card
failed" without any indication of which line card was bad.
Comments?
Jane Ryer
MCI Mission Critical Support Team
ncl> sho node scm001 last reboot reason
Node scm001
AT 1997-05-22-15:02:02.130+00:00I-----
Status
Last Reboot Reason = Power Down
ncl> sho node scm001 hard all
Node scm001 Hardware
AT 1997-05-22-15:23:46.630+00:00I-----
Status
UID =
CCCAE792-5158-11CF-8000-000000000000
Type = DEC Network Integration Server
600
Temperature Level = Normal
Boot Number = 23
Self Test Errors =
(
[
Boot Number = 22 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
] ,
[
Boot Number = 21 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 20 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 19 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
] ,
[
Boot Number = 18 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 17 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
] ,
[
Boot Number = 16 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
] ,
[
Boot Number = 15 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 2 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 1 ,
Reason = System Unusable ,
Device Slot = <Default value>
]
)
Characteristics
Temperature Alarm Holddown Interval = 5 MINUTES
Dump Control = Full Dump
Self Test Control = Full Test
Debug Flags = 0
Counters
Last Reboot Time =
1997-04-30-00:47:58.020+00:00I-----
Times Temperature Critical = 0
Total Duration Ambient Over Temperature = 0 SECONDS
Total Duration System Over Temperature = 0 SECONDS
Duration Ambient Over Temperature Since Reboot = 0 SECONDS
Duration System Over Temperature Since Reboot = 0 SECONDS
Times Correctable Memory Error = 0
Creation Time =
1996-01-18-05:26:45.186+00:00I-----
ncl>
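For anyone wanting to tally a history like the one above, here is a minimal Python sketch that pulls the Self Test Errors entries out of captured NCL text and counts the reasons. It is only an illustration: the function names (parse_self_test_errors, summarize) are made up for this example, the sample is a three-entry excerpt of the display above, and treating "<Default value>" as "no slot was recorded" is an assumption, not documented NCL behaviour.

```python
import re
from collections import Counter

# Excerpt of the "Self Test Errors" attribute as displayed above
# (boots 22, 21 and 19 only), used here as sample input.
SAMPLE = """\
Self Test Errors =
(
[
Boot Number = 22 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
] ,
[
Boot Number = 21 ,
Reason = System Unusable ,
Device Slot = <Default value>
] ,
[
Boot Number = 19 ,
Reason = Line Card Failed ,
Device Slot = <Default value>
]
)
"""

# One entry is three "name = value" fields in a fixed order.
ENTRY_RE = re.compile(
    r"Boot Number\s*=\s*(?P<boot>\d+)\s*,\s*"
    r"Reason\s*=\s*(?P<reason>[^,\n]+?)\s*,\s*"
    r"Device Slot\s*=\s*(?P<slot>[^\n\]]+)"
)


def parse_self_test_errors(text):
    """Return a list of dicts, one per self-test error entry found in text."""
    entries = []
    for m in ENTRY_RE.finditer(text):
        entries.append({
            "boot": int(m.group("boot")),
            "reason": m.group("reason").strip(),
            "slot": m.group("slot").strip(),
        })
    return entries


def summarize(entries):
    """Count reasons and list boots whose slot is still the default value."""
    reasons = Counter(e["reason"] for e in entries)
    # Assumption: "<Default value>" means no slot number was recorded.
    no_slot = [e["boot"] for e in entries if e["slot"] == "<Default value>"]
    return reasons, no_slot


if __name__ == "__main__":
    reasons, no_slot = summarize(parse_self_test_errors(SAMPLE))
    for reason, count in reasons.items():
        print(f"{reason}: {count}")
    print("Boots with no slot recorded:", no_slot)
```

Run against the full display, this would show at a glance that every recorded error, whatever its reason, carries the default Device Slot value.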
3646.1. "What's turning the light out!!?" by MARVIN::WELCH Fri May 30 1997 09:27 (21 lines)
Hi Jane,
The '1' in the top display remains until the box is power-cycled. I
believed the line-card fault lights did the same, but from what you say it
seems they don't. The scenario I would propose is that a reload is done,
where system self-test runs on the line-cards and a card fails. The
fault light comes on and the system continues to try and load the DECNIS
image file.
If the system fails to load an image, say because the line-card it needs to
use is the failed one, it records the 'system unusable' reason and resets.
During this or some other reset the failed line-card is re-loaded/booted and
the fault light goes out. The '1' in the top display still records that an error
took place, but now the system can load the image and comes up successfully.
Obviously it's not very easy to reproduce this and therefore test the
scenario. Please keep monitoring this box and let me know of any further
self-test failures. The ideal would be to look at the state of the box after
a known failed reboot.
Steve.
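
The scenario proposed in this reply can be written down as a toy state model, which at least makes the sequence of events easy to follow. This is a sketch of the hypothesis only, not of DECNIS firmware: the class ToyDecnis, its fields and its method names are invented for illustration, and the rule for which reason gets logged on which reload is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDecnis:
    """Toy model of the hypothesized reload behaviour (not real firmware)."""
    top_display_one: bool = False    # the '1'; assumed latched until power cycle
    card_fault_light: bool = False   # assumed cleared whenever the card is re-loaded
    boot_number: int = 0
    self_test_errors: list = field(default_factory=list)

    def reload(self, card_fails_self_test=False, image_load_needs_that_card=False):
        """Run one reload; return 'up' if the image loads, 'reset' otherwise."""
        self.boot_number += 1
        # Any reset re-loads the line cards, so an earlier fault light goes out.
        self.card_fault_light = False
        if not card_fails_self_test:
            return "up"
        # A card failed self-test: fault light on, and the '1' is latched.
        self.card_fault_light = True
        self.top_display_one = True
        if image_load_needs_that_card:
            # Assumption: the image cannot be loaded through the failed card.
            self.self_test_errors.append((self.boot_number, "System Unusable"))
            return "reset"
        self.self_test_errors.append((self.boot_number, "Line Card Failed"))
        return "up"

    def power_cycle(self):
        """Power off and on: clears the latched '1', then reloads normally."""
        self.top_display_one = False
        return self.reload()


if __name__ == "__main__":
    box = ToyDecnis()
    print(box.reload(card_fails_self_test=True, image_load_needs_that_card=True))
    print(box.reload())  # card passes this time, image loads
    print(box.top_display_one, box.card_fault_light, box.self_test_errors)
    print(box.power_cycle(), box.top_display_one)
```

In this model the box ends up running with the '1' still showing and no card fault light lit, which matches what was reported in .0; only a power cycle clears the '1'.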