| Details about this problem. Console version V4.8-5 and beyond clear
MCES under certain circumstances, but the solution is not foolproof. I
believe VMS has stated they will be clearing MCES much earlier in crash
scenarios on 4100/4000 as of VMS 7.2.
<<< MVBLAB::SYS$SYSDEVICE:[NOTES$LIBRARY]ALPHASERVER_4100.NOTE;1 >>>
-< AlphaServer 4100 >-
================================================================================
Note 423.12 XDELTA breakpoint with fan failure? 12 of 12
LANDO::CUMMINS 78 lines 3-FEB-1997 17:42
-< Problem found; fix in works; Steps to diagnose down systems.. >-
--------------------------------------------------------------------------------
We have duplicated the console dropping into XDELTA breakpoint and
have isolated the problem. It should only occur on VMS machines and
typically only on systems that have a VGA card present in the PCI
backplane (even if CONSOLE is set SERIAL).
Until this is fixed via console, VMS, or a combination of both, there
are things you can do to gather more information about a particular
crash:
1. If possible, remove the VGA card and wait for the next crash.
2. Use the XDELTA instruction sequence described in the previous
reply note and either post it here or send it to me.
3. If the crash follows an environmental MCHK event (fan, temp failure),
use the SRM console SHOW POWER command to determine what failed.
We had not reproduced the fan failure MCHK drop into XDELTA because all
of our testing, both early on (back in Spring '95) and recently, was
performed by holding a finger to the fan for five seconds or so. This
always results in a successful crash dump. It wasn't until we held it
for a longer period of time that we reproduced the problem.
A recent memo on the problem description is attached for reference.
From: LANDO::CUMMINS "Bill Cummins, PKO3-2/Q21, 223-4641" 3-FEB-1997 17:05:16.60
To: STAR::JHUBER,STAR::KFOLLIEN,STAR::JANETOS
CC: MAYO,LEMIEUX,CUMMINS
Subj: VMS not clearing MCES before bugcheck on Rawhide (and TurboLaser)..
Hi Jeff, Ken, Jim,
We've recently been working some customer problems where the Rawhide console is
unable to successfully complete a crash dump following a 670 or 660 MCHK. We've
tracked the problem down to the following sequence of events:
1. A 660 or 670 occurs (e.g. fan failure, DTAG PE, fill error, etc.)
2. VMS eventually calls CB_OPEN to begin the process of writing the crash.
3. To regain control of the I/O, we reset the PCIs, including a VGA re-init,
if VGA is present.
4. Since we use the BIOS emulator for VGA init, we take MCHKs while the BIOS
option ROM code probes the PCI. We normally handle/dismiss these MCHKs via
a handler we install. On a UNIX dump, we handle the MCHKs gracefully since
MCES is clear, and then go on to successfully dump. However, during a VMS
dump, PAL sees MCES set and throws up its hands with a double error halt.
5. Console eventually regains control, but can find no valid context since it
has been "consumed" as part of the callback entry.
6. Console has nowhere to go so it restarts its krn$_idle flow (basically a
console restart) versus its halt entry flow. Upon realizing that there's
no valid context, console throws up its hands and breakpoints in XDELTA.
7. Meanwhile, secondaries are coming back through krn$_idle and printing
messages like "console starting on CPU n" which only further adds to the
confusion (and can hang the console terminal since these messages are
issued via pprintfs versus printfs and are normally staggered by node ID).
Basically, it's a mess. And it's a rapidly growing problem in the Field..
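
The sequence above can be sketched as a toy model. This is only an illustration of steps 1 through 4, not console or PAL code: the class and function names are hypothetical, MCES<0> is modeled as a plain boolean, and the assumption that the console's handler clears MCES after dismissing each probe MCHK is mine, not something stated in the memo.

```python
class DoubleErrorHalt(Exception):
    """Raised when a machine check arrives while MCES<0> is still set."""


class ToyCpu:
    """Hypothetical stand-in for one CPU's machine check state."""

    def __init__(self):
        self.mces_mip = False  # models MCES<0>
        self.dismissed = 0     # probe MCHKs dismissed by the console handler

    def machine_check(self):
        # PAL's view: a MCHK taken with MCES<0> already set is a double error.
        if self.mces_mip:
            raise DoubleErrorHalt("MCHK taken with MCES<0> set")
        self.mces_mip = True

    def clear_mces(self):
        self.mces_mip = False


def crash_dump(cpu, os_clears_mces, probe_mchks=3):
    """Walk steps 1-4: original MCHK, CB_OPEN, then VGA re-init probe MCHKs."""
    cpu.machine_check()            # step 1: the original 660/670
    if os_clears_mces:
        cpu.clear_mces()           # OS clears MCES before calling CB_OPEN
    for _ in range(probe_mchks):   # step 4: BIOS option ROM probes the PCI
        cpu.machine_check()
        cpu.dismissed += 1
        cpu.clear_mces()           # assumed: handler clears MCES on dismissal
    return "dump written"
```

With `os_clears_mces=True` (the UNIX case in step 4) the dump completes; with `os_clears_mces=False` (the VMS case) the first probe MCHK raises `DoubleErrorHalt`, which is where the real console loses its context.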
Stephen Shirron checked the TurboLaser VMS error routines and found that MCES
is not cleared on a TurboLaser bugcheck either. It is cleared by VMS on Sable.
My reasons for sending this mail are two-fold:
1. To alert you to the problem, in the event you have seen or will see QARs
logged against Rawhide/VMS.
2. To ask for your opinion on the matter. In a quick perusal of the Alpha
SRM, I couldn't find any detailed discussion re: MCES usage, specifically
in the area of PAL/console entries and MCES settings. Is there a technical
reason why VMS' Rawhide (and TurboLaser) error routine flows don't clear
MCES, but Sable's VMS routines do?
As an FYI, we're leaning heavily toward changing Rawhide (and TurboLaser,
Sable, etc.) console to clear MCES<0> on primary and secondary CPUs prior to
entering our PCI reset / VGA re-init flow (which we call on a CB_OPEN); even
if you were to issue a patch to change VMS' treatment of MCES during MCHK
handling.
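
As a rough sketch of that console-side change, assuming the per-CPU MCES values are available as plain integers: the helper names below are hypothetical, and the code models only the resulting register values, not how the console actually writes MCES on each CPU.

```python
MCES_MIP = 0x1  # bit 0 of MCES, written MCES<0> in this memo


def clear_mip_on_all(mces_by_cpu):
    """Return a copy of the per-CPU MCES values with bit 0 cleared."""
    return {cpu: value & ~MCES_MIP for cpu, value in mces_by_cpu.items()}


def cb_open(mces_by_cpu, reinit):
    """Hypothetical CB_OPEN path: clear MCES<0> everywhere, then re-init.

    `reinit` stands in for the PCI reset / VGA re-init flow and receives
    the already-cleared MCES values.
    """
    cleared = clear_mip_on_all(mces_by_cpu)
    reinit(cleared)
    return cleared
```

The point of the ordering is that the clear happens on the primary and on every secondary before the re-init flow can take its first probe MCHK, regardless of what the operating system left in MCES.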
What are your thoughts/comments on this matter?
Thanks,
BC
|
|
reply .*
Thanks for the help.
Check the log sent by the user, which I found after I posted the note.
There looks to be a problem with the fan. I will send firmware V3.9
out to the user and let the user know that VMS V7.2 will fix the
660 problem.
rgds
/stanley
Your system has previously experienced and logged a Temperature, Fan,
or Power environmental error event. Type SHOW POWER for more details.

AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

Halt Button is IN, BOOT NOT POSSIBLE

P00>>>
P00>>>
P00>>>show power

                    Status
Power Supply 0      good
Power Supply 1      not present
Power Supply 2      good
System Fans         good
CPU Fans            good
Temperature         good

The system was last reset via a front-panel (OCP) reset

7 Environmental events are logged in nvram
Do you want to view the events? (Y/<N>) y

Total Environmental Events: 7 (7 logged)

1  DEC 31 13:14  Temperature, Fans, Power Supplies Normal
2  JAN  3 10:14  Temperature, Fans, Power Supplies Normal
3  JAN  3 10:17  Temperature, Fans, Power Supplies Normal
4  JAN  6 12:36  Temperature, Fans, Power Supplies Normal
5  JAN  8 12:59  Temperature, Fans, Power Supplies Normal
6  JAN 20 17:50  Power Supply 2 Failure
7  JAN 20 17:50  Temperature, Fans, Power Supplies Normal

Do you want to clear all events from nvram? (Y/<N>) n
P00>>>
SROM V1.1 on cpu0
XSROM V3.0 on cpu0
BCache testing complete on cpu0
mem_pair0 - 128 MB
mem_pair1 - 128 MB
20..21..23..
please wait 9 seconds for T24 to complete
24..
Memory testing complete on cpu0
starting console on CPU 0
sizing memory
0 128 MB SYNC
1 128 MB SYNC
probing IOD1 hose 1
bus 0 slot 1 - NCR 53C810
bus 0 slot 2 - NCR 53C825
bus 0 slot 4 - NCR 53C825
probing IOD0 hose 0
bus 0 slot 1 - PCEB
probing EISA Bridge, bus 1
bus 0 slot 2 - DECchip 21041-AA
bus 0 slot 3 - DECchip 21041-AA
bus 0 slot 4 - PCI-PCI Bridge
probing PCI-PCI Bridge, bus 2
bus 2 slot 0 - ISP1020
configuring I/O adapters...
ncr0, hose 1, bus 0, slot 1
kfpsa0, hose 1, bus 0, slot 2
kfpsa1, hose 1, bus 0, slot 4
floppy0, hose 0, bus 1, slot 0
tulip0, hose 0, bus 0, slot 2
tulip1, hose 0, bus 0, slot 3
isp0, hose 0, bus 2, slot 0
System temperature is 26 degrees C

Your system has previously experienced and logged a Temperature, Fan,
or Power environmental error event. Type SHOW POWER for more details.

AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

CPU 0 booting

(boot dua111.1.0.2.1 -flags 0,0)
FRU table creation disabled
block 0 of dua111.1.0.2.1 is a valid boot block
reading 904 blocks from dua111.1.0.2.1
bootstrap code read in
base = 200000, image_start = 0, image_bytes = 71000
initializing HWRPB at 2000
initializing page table at 1f2000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code


OpenVMS (TM) Alpha Operating System, Version V7.1

%DECnet-I-LOADED, network base image loaded, version = 05.0C.00

$! Copyright (c) 1996 Digital Equipment Corporation. All rights reserved.
%STDRV-I-STARTUP, OpenVMS startup begun at 5-MAY-1997 09:06:15.85
%EWA0, Twisted-Pair(10baseT) mode set by console
%RUN-S-PROC_ID, identification of created process is 00000085
%RUN-S-PROC_ID, identification of created process is 00000086
%SET-I-NEWAUDSRV, identification of new audit server process is 0000008A
%%%%%%%%%%% OPCOM 5-MAY-1997 09:06:38.14 %%%%%%%%%%%
Operator _STP002$OPA0: has been enabled, username SYSTEM

%%%%%%%%%%% OPCOM 5-MAY-1997 09:06:38.29 %%%%%%%%%%%
Operator status for operator _STP002$OPA0:
CENTRAL, PRINTER, TAPES, DISKS, DEVICES, CARDS, NETWORK, CLUSTER, SECURITY,
LICENSE, OPER1, OPER2, OPER3, OPER4, OPER5, OPER6, OPER7, OPER8, OPER9, OPER10,
OPER11, OPER12
|
| Hi Stanley,
Just in case: it looks from the SHOW POWER log as though power supply 2
had a problem. This could be caused by a bad power supply, its
connections, or someone pulling out its AC power plug.
Total Environmental Events: 7 (7 logged)

         1  DEC 31 13:14  Temperature, Fans, Power Supplies Normal
         2  JAN  3 10:14  Temperature, Fans, Power Supplies Normal
         3  JAN  3 10:17  Temperature, Fans, Power Supplies Normal
         4  JAN  6 12:36  Temperature, Fans, Power Supplies Normal
         5  JAN  8 12:59  Temperature, Fans, Power Supplies Normal
bad***** 6  JAN 20 17:50  Power Supply 2 Failure
now okay 7  JAN 20 17:50  Temperature, Fans, Power Supplies Normal
Weird date though; it doesn't match the 95.0 timestamps??
Power supplies are labeled 0, 1, 2 looking from the front of the system.
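
For longer event logs, picking out the non-"Normal" entries can be scripted. The following is a minimal sketch, assuming each event line has the shape shown in the SHOW POWER output (number, month, day, hh:mm, description). The regular expression and function name are my own, and lines carrying hand-added annotations in front of the event number are deliberately not matched.

```python
import re

# One event line: "<n> <MON> <day> <hh:mm> <description>"
EVENT = re.compile(
    r"^\s*(\d+)\s+([A-Z]{3})\s+(\d{1,2})\s+(\d{2}:\d{2})\s+(.*\S)\s*$"
)


def failures(log_text):
    """Return (number, timestamp, description) for events not ending in 'Normal'."""
    found = []
    for line in log_text.splitlines():
        match = EVENT.match(line)
        if not match:
            continue  # banner lines, prompts, blank lines
        number, month, day, time, text = match.groups()
        if not text.endswith("Normal"):
            found.append((int(number), f"{month} {day} {time}", text))
    return found


sample = """\
Total Environmental Events: 7 (7 logged)
5 JAN 8 12:59 Temperature, Fans, Power Supplies Normal
6 JAN 20 17:50 Power Supply 2 Failure
7 JAN 20 17:50 Temperature, Fans, Power Supplies Normal
"""
print(failures(sample))
```

On the sample lines taken from the log above, only event 6 (the Power Supply 2 failure) is reported.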
jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
|