
Conference vaxaxp::alphanotes

Title:Alpha Support Conference
Notice:This is a new Alphanotes, please read note 2.2
Moderator:VAXAXP::BERNARDO
Created:Thu Jan 02 1997
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:128
Total number of notes:617

95.0. "alphaserver 4000 machinecheck 660 " by GIDDAY::SHCHIU () Tue May 06 1997 01:41

ALP/VMS V7.1
Standalone AlphaServer 4000 5/400, console version V3.0-10

The system took a machine check 660 and left no entry in errlog.sys, nor
was a dump written.

Where could the problem be?

Could someone point me to where to look?


rgds

/stanley 



System Information:
System Type    AlphaServer 4000 5/400 4MB             Primary CPU ID 00
Cycle Time     2.5 nsec (400 MHz)                     Pagesize       8192 Byte

Memory Configuration:
Cluster    PFN Start    PFN Count         Range (MByte)        Usage
 #03             0          256         0.0 MB -     2.0 MB    Console
 #04           256        32511         2.0 MB -   255.9 MB    System
 #05         32767            1       255.9 MB -   256.0 MB    Console

CPU ID         00                        CPU State    rc,pa,pp,cv,pv,pmv,pl
CPU Type       EV56  Pass 1 (21164A)     Halt PC      00000000.20000000
  PAL Code       1.19-2                    Halt PS      00000000.00001F00
CPU Revision   ....                      Halt Code    00000000.00000000
Serial Number  ..........                "Bootstrap or Powerfail"
Console Vers   V3.0-10

Adapter Configuration:

TR Adapter     ADP      Hose Bus   BusArrayEntry  Node Device Name / HW-Id
-- ----------- -------- ---- -------------------- ---- -------------------------
 1 KA1605      81048E40    0 GLOBAL_BUS
 2 MC_BUS      81049200    7 MC_BUS
                                   81049418          5 KA1605_PCI
                                   81049450          4 KA1605_PCI
                                   81049568          1 KA1605_MEMORY
 3 PCI         81049600   61 PCI
                                   81049850  PKA:    1 NCR 53C810 SCSI
                                   81049888  PIA:    2 KFPSA DSSI
                                   810498F8  PIB:    4 KFPSA DSSI
 4 PCI         81049B80   60 PCI
                                   81049DD0          1 MERCURY
                                   81049E08  EWA:    2 DC21041 - 10 mbit NI (Tulip)
                                   81049E40  EWB:    3 DC21041 - 10 mbit NI (Tulip)
                                   81049E78          4 PBB
 5 EISA        8104A080   60 EISA
                                   8104A298          0 System Board
 6 XBUS        8104A640   60 XBUS
                                   8104A818          0 EISA_SYSTEM_BOARD
                                   8104A850  DVA:    1 Floppy
                                   8104A888  LRA:    2 Line Printer (parallel port)
                                   8104A8C0  TTA:    3 NS16450 Serial Port
 7 PCI         8104AB00   60 PCI
                                   8104AD18  PKB:    0 Qlogic ISP1020 SCSI-2







 **ERROR_ROUTINES_1605**MACHINECHECK 660**CPU= 0
 ***** About to MachineCheck!!!! Console entry context is not valid -
 Reset the system !!!

 Brk 0 at 0006133C

 0006133C ! BPT

 Eh?

  SROM V1.1 on cpu0
 XSROM V3.0 on cpu0
 BCache testing complete on cpu0
 mem_pair0 - 128 MB
 mem_pair1 - 128 MB
 20..21..23..
 please wait 9 seconds for T24 to complete
 24..
 Memory testing complete on cpu0
 starting console on CPU 0
 sizing memory
   0    128 MB SYNC
   1    128 MB SYNC
 probing IOD1 hose 1
   bus 0 slot 1 - NCR 53C810
   bus 0 slot 2 - NCR 53C825
   bus 0 slot 4 - NCR 53C825
 probing IOD0 hose 0
   bus 0 slot 1 - PCEB
     probing EISA Bridge, bus 1
   bus 0 slot 2 - DECchip 21041-AA
   bus 0 slot 3 - DECchip 21041-AA
   bus 0 slot 4 - PCI-PCI Bridge
     probing PCI-PCI Bridge, bus 2
       bus 2 slot 0 - ISP1020
 configuring I/O adapters...
   ncr0, hose 1, bus 0, slot 1
   kfpsa0, hose 1, bus 0, slot 2
   kfpsa1, hose 1, bus 0, slot 4
   floppy0, hose 0, bus 1, slot 0
   tulip0, hose 0, bus 0, slot 2
   tulip1, hose 0, bus 0, slot 3
   isp0, hose 0, bus 2, slot 0
 System temperature is 26 degrees C

 Your system has previously experienced and logged a Temperature, Fan,
 or Power environmental error event. Type SHOW POWER for more details.

 AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

 CPU 0 booting

 (boot dua111.1.0.2.1 -flags 0,0)
 FRU table creation disabled
 block 0 of dua111.1.0.2.1 is a valid boot block
 reading 904 blocks from dua111.1.0.2.1
 bootstrap code read in
 base = 200000, image_start = 0, image_bytes = 71000
 initializing HWRPB at 2000
 initializing page table at 1f2000
 initializing machine state
 setting affinity to the primary CPU
 jumping to bootstrap code


     OpenVMS (TM) Alpha Operating System, Version V7.1

 %DECnet-I-LOADED, network base image loaded, version = 05.0C.00

 $!  Copyright (c) 1996 Digital Equipment Corporation.  All rights reserved.
 %STDRV-I-STARTUP, OpenVMS startup begun at  5-MAY-1997 09:03:44.31
 halted CPU 0

 halt code = 1
 operator initiated halt
 PC = ffffffff800a560c
 P00>>>
  SROM V1.1 on cpu0
 XSROM V3.0 on cpu0
 BCache testing complete on cpu0
 mem_pair0 - 128 MB
 mem_pair1 - 128 MB
 20..21..23..
 please wait 9 seconds for T24 to complete
 24..
 Memory testing complete on cpu0
 starting console on CPU 0
 sizing memory
   0    128 MB SYNC
   1    128 MB SYNC
 probing IOD1 hose 1
   bus 0 slot 1 - NCR 53C810
   bus 0 slot 2 - NCR 53C825
   bus 0 slot 4 - NCR 53C825
 probing IOD0 hose 0
   bus 0 slot 1 - PCEB
     probing EISA Bridge, bus 1
   bus 0 slot 2 - DECchip 21041-AA
   bus 0 slot 3 - DECchip 21041-AA
   bus 0 slot 4 - PCI-PCI Bridge
     probing PCI-PCI Bridge, bus 2
       bus 2 slot 0 - ISP1020
 configuring I/O adapters...
   ncr0, hose 1, bus 0, slot 1
   kfpsa0, hose 1, bus 0, slot 2
   kfpsa1, hose 1, bus 0, slot 4
   floppy0, hose 0, bus 1, slot 0
   tulip0, hose 0, bus 0, slot 2
   tulip1, hose 0, bus 0, slot 3
   isp0, hose 0, bus 2, slot 0
 System temperature is 26 degrees C

 Your system has previously experienced and logged a Temperature, Fan,
 or Power environmental error event. Type SHOW POWER for more details.

 AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

 Halt Button is IN, BOOT NOT POSSIBLE


    
T.R   Title                                       User              Personal Name  Date                   Lines
95.1  need higher rev srm and enviromental prblm  CSC32::HUTMACHER                 Tue May 06 1997 09:57  32
    Hi Stanley
    
    There's a console issue with the AlphaServer 4100/4000 running VMS where
    what should be a CPU/memory/environmental error ends up crashing the
    system to XDELTA console mode instead of writing a dump file or errlog
    entries.
    
    Your system is running SRM console version V3.0-10. You need to get to
    SRM V4.8-5, either off firmware CD V3.9 (AG-PTMWW-BS)
    or off the internet:
    ftp://ftp.digital.com/pub/Digital/Alpha/firmware/v3.9/as4x00/
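
As a rough illustration of the version check described above, the following Python sketch compares an SRM console version string against the V4.8-5 level mentioned in this reply. The function names, and the assumption that console versions always look like "Vmajor.minor-build", are illustrative only and not taken from any firmware documentation.

```python
import re

# Minimum console level named in this reply.
MINIMUM_SRM = "V4.8-5"


def parse_srm_version(text):
    """Turn a string such as 'V3.0-10' into a comparable tuple (3, 0, 10).

    Assumes the 'Vmajor.minor-build' layout seen in this thread.
    """
    match = re.fullmatch(r"[Vv](\d+)\.(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised console version: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(installed, minimum=MINIMUM_SRM):
    """Return True when the installed console is older than the minimum."""
    return parse_srm_version(installed) < parse_srm_version(minimum)


if __name__ == "__main__":
    print(needs_upgrade("V3.0-10"))
```

Comparing parsed tuples rather than raw strings avoids treating a build number like 10 as smaller than 5.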
    
    This part of the crash "message" is a clue that you had either a power
    supply problem or a fan failure (either the CPU muffin fan or one of the
    large fans behind the power supplies):
    
    "Your system has previously experienced and logged a Temperature,
    Fan, or Power environmental error event."
    
    >>>show power     --- this will print the event log for power/fan faults
    
    Looking from the front of the system, the power supplies and the fans
    behind them are numbered
    
        0  1  2
              *
    
    so a sys fan 2 fault would mean the fan marked * is bad.
    
    There is also an AlphaServer 4100/4000 notes file, MVBLAB::ALPHASERVER_4100;
    note 423 there has a similar problem.
    
    take care
    
    jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
    
95.2                                              HARMNY::CUMMINS                  Tue May 06 1997 10:58  90
    Details about this problem. Console version V4.8-5 and beyond clear
    MCES under certain circumstances, but the solution is not foolproof. I
    believe VMS has stated they will be clearing MCES much earlier in crash
    scenarios on 4100/4000 as of VMS 7.2.
    
      <<< MVBLAB::SYS$SYSDEVICE:[NOTES$LIBRARY]ALPHASERVER_4100.NOTE;1 >>>
                             -< AlphaServer 4100 >-
================================================================================
Note 423.12            XDELTA breakpoint with fan failure?              12 of 12
LANDO::CUMMINS                                       78 lines   3-FEB-1997 17:42
       -< Problem found; fix in works; Steps to diagnose down systems.. >-
--------------------------------------------------------------------------------
    We have duplicated the console dropping into XDELTA breakpoint and
    have isolated the problem. It should only occur on VMS machines and
    typically only on systems that have a VGA card present in the PCI
    backplane (even if CONSOLE is set SERIAL).
    
    Until this is fixed via console, VMS, or a combination of both, there
    are things you can do to gather more information about a particular
    crash:
    
      1. If possible, remove the VGA card and wait for the next crash.
      2. Use the XDELTA instruction sequence described in the previous
         reply note and either post it here or send it to me.
      3. If an environmental MCHK event (fan, temp failure), use the SRM
         console SHOW POWER command to determine what failed.
    
    We did not reproduce the fan failure MCHK drop into XDELTA because all
    of our early on (back in Spring '95) and recent testing was performed
    by holding a finger to the fan for five seconds or so - this always
    results in a successful crash dump. It wasn't until we held it for a
    longer period of time that we reproduced the problem.
    
    A recent memo on the problem description is attached for reference.
    
From:	LANDO::CUMMINS      "Bill Cummins, PKO3-2/Q21, 223-4641"  3-FEB-1997 17:05:16.60
To:	STAR::JHUBER,STAR::KFOLLIEN,STAR::JANETOS
CC:	MAYO,LEMIEUX,CUMMINS
Subj:	VMS not clearing MCES before bugcheck on Rawhide (and TurboLaser)..

Hi Jeff, Ken, Jim,

We've recently been working some customer problems where the Rawhide console is
unable to successfully complete a crash dump following a 670 or 660 MCHK. We've
tracked the problem down to the following sequence of events:

  1. A 660 or 670 occurs (e.g. fan failure, DTAG PE, fill error, etc.)
  2. VMS eventually calls CB_OPEN to begin the process of writing the crash.
  3. To regain control of the I/O, we reset the PCIs, including a VGA re-init,
     if VGA is present.
  4. Since we use the BIOS emulator for VGA init, we take MCHKs while the BIOS
     option ROM code probes the PCI. We normally handle/dismiss these MCHKs via
     a handler we install. On a UNIX dump, we handle the MCHKs gracefully since
     MCES is clear, and then go on to successfully dump. However, during a VMS
     dump, PAL sees MCES set and throws up its hands with a double error halt.
  5. Console eventually regains control, but can find no valid context since it
     has been "consumed" as part of the callback entry.
  6. Console has nowhere to go so it restarts its krn$_idle flow (basically a
     console restart) versus its halt entry flow. Upon realizing that there's
     no valid context, console throws up its hands and breakpoints in XDELTA.
  7. Meanwhile, secondaries are coming back through krn$_idle and printing
     messages like "console starting on CPU n" which only further adds to the
     confusion (and can hang the console terminal since these messages are 
     issued via pprintfs versus printfs and are normally staggered by node ID).

Basically, it's a mess. And it's a rapidly growing problem in the Field..

Stephen Shirron checked the TurboLaser VMS error routines and found that MCES
is not cleared on a TurboLaser bugcheck either. It is cleared by VMS on Sable.

My reasons for sending this mail are two-fold:

  1. To alert you to the problem, in the event you have seen or will see QARs
     logged against Rawhide/VMS.
  2. To ask for your opinion on the matter. In a quick perusal of the Alpha
     SRM, I couldn't find any detailed discussion re: MCES usage, specifically
     in the area of PAL/console entries and MCES settings. Is there a technical
     reason why VMS' Rawhide (and TurboLaser) error routine flows don't clear
     MCES, but Sable's VMS routines do?

As an FYI, we're leaning heavily toward changing Rawhide (and TurboLaser,
Sable, etc.) console to clear MCES<0> on primary and secondary CPUs prior to
entering our PCI reset / VGA re-init flow (which we call on a CB_OPEN); even
if you were to issue a patch to change VMS' treatment of MCES during MCHK
handling.

What are your thoughts/comments on this matter?

Thanks,
BC
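
The sequence of events in the memo above can be sketched as a small toy model. This is only an illustration of the reasoning, written in Python: the class, function names, and return strings are hypothetical, and the model reduces PALcode, the console, and VMS to a single flag standing in for MCES<0>. It does not reflect actual console or PALcode source.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    # Stands in for MCES<0>, the "machine check in progress" bit.
    mces_mip: bool = False


def machine_check(cpu):
    """Deliver a machine check; return False if it is a double error."""
    if cpu.mces_mip:
        return False           # second MCHK while MCES<0> is still set
    cpu.mces_mip = True        # bit is set when the MCHK is taken
    return True


def crash_dump(os_clears_mces, vga_present, console_clears_mces=False):
    """Walk the memo's steps and report how the crash ends (toy model)."""
    cpu = Cpu()
    machine_check(cpu)                 # step 1: the original 660/670
    if os_clears_mces:
        cpu.mces_mip = False           # OS error routine clears MCES first
    # steps 2-3: OS calls CB_OPEN, console resets the PCIs
    if console_clears_mces:
        cpu.mces_mip = False           # the console change proposed in the memo
    if vga_present:
        # step 4: VGA BIOS probing provokes an expected MCHK
        if not machine_check(cpu):
            return "double error halt, XDELTA breakpoint"   # steps 4-6
        cpu.mces_mip = False           # console handler dismisses the MCHK
    return "crash dump written"


if __name__ == "__main__":
    print(crash_dump(os_clears_mces=False, vga_present=True))
```

In this model the dump only fails when MCES is still set at the time the VGA re-init provokes further machine checks, which is why either removing the VGA card or clearing MCES earlier changes the outcome.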
95.3  thanks for the help.                        GIDDAY::SHCHIU                   Tue May 06 1997 21:49  145
    
    
    reply .*
    
    thanks for the help.
    
    I checked the log the user sent after I posted the note.
    
    It looks like a problem with the fan. I will send firmware V3.9 to the
    user and let the user know that VMS V7.2 should fix the 660 problem.
    
    
    rgds
    
    
    /stanley 
    
    
     Your system has previously experienced and logged a Temperature, Fan,
     or Power environmental error event. Type SHOW POWER for more details.

     AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

     Halt Button is IN, BOOT NOT POSSIBLE

     P00>>>
     P00>>>
     P00>>>show power

                         Status
     Power Supply 0       good
     Power Supply 1       not present
     Power Supply 2       good
     System Fans          good
     CPU Fans             good
     Temperature          good

     The system was last reset via a front-panel (OCP) reset

     7 Environmental events are logged in nvram
     Do you want to view the events? (Y/<N>) y

     Total Environmental Events: 7  (7 logged)

     1  DEC 31 13:14  Temperature, Fans, Power Supplies Normal
     2  JAN  3 10:14  Temperature, Fans, Power Supplies Normal
     3  JAN  3 10:17  Temperature, Fans, Power Supplies Normal
     4  JAN  6 12:36  Temperature, Fans, Power Supplies Normal
     5  JAN  8 12:59  Temperature, Fans, Power Supplies Normal
     6  JAN 20 17:50  Power Supply 2 Failure
     7  JAN 20 17:50  Temperature, Fans, Power Supplies Normal

     Do you want to clear all events from nvram? (Y/<N>) n
     P00>>>
      SROM V1.1 on cpu0
     XSROM V3.0 on cpu0
     BCache testing complete on cpu0
     mem_pair0 - 128 MB
     mem_pair1 - 128 MB
     20..21..23..
     please wait 9 seconds for T24 to complete
     24..
     Memory testing complete on cpu0
     starting console on CPU 0
     sizing memory
       0    128 MB SYNC
       1    128 MB SYNC
     probing IOD1 hose 1
       bus 0 slot 1 - NCR 53C810
       bus 0 slot 2 - NCR 53C825
       bus 0 slot 4 - NCR 53C825
     probing IOD0 hose 0
       bus 0 slot 1 - PCEB
         probing EISA Bridge, bus 1
       bus 0 slot 2 - DECchip 21041-AA
       bus 0 slot 3 - DECchip 21041-AA
       bus 0 slot 4 - PCI-PCI Bridge
         probing PCI-PCI Bridge, bus 2
           bus 2 slot 0 - ISP1020
     configuring I/O adapters...
       ncr0, hose 1, bus 0, slot 1
       kfpsa0, hose 1, bus 0, slot 2
       kfpsa1, hose 1, bus 0, slot 4
       floppy0, hose 0, bus 1, slot 0
       tulip0, hose 0, bus 0, slot 2
       tulip1, hose 0, bus 0, slot 3
       isp0, hose 0, bus 2, slot 0
     System temperature is 26 degrees C

     Your system has previously experienced and logged a Temperature, Fan,
     or Power environmental error event. Type SHOW POWER for more details.

     AlphaServer 4000 Console V3.0-10, 19-NOV-1996 13:57:07

     CPU 0 booting

     (boot dua111.1.0.2.1 -flags 0,0)
     FRU table creation disabled
     block 0 of dua111.1.0.2.1 is a valid boot block
     reading 904 blocks from dua111.1.0.2.1
     bootstrap code read in
     base = 200000, image_start = 0, image_bytes = 71000
     initializing HWRPB at 2000
     initializing page table at 1f2000
     initializing machine state
     setting affinity to the primary CPU
     jumping to bootstrap code


         OpenVMS (TM) Alpha Operating System, Version V7.1

     %DECnet-I-LOADED, network base image loaded, version = 05.0C.00

     $!  Copyright (c) 1996 Digital Equipment Corporation.  All rights reserved.
     %STDRV-I-STARTUP, OpenVMS startup begun at  5-MAY-1997 09:06:15.85
     %EWA0, Twisted-Pair(10baseT) mode set by console

     %RUN-S-PROC_ID, identification of created process is 00000085
     %RUN-S-PROC_ID, identification of created process is 00000086
     %SET-I-NEWAUDSRV, identification of new audit server process is 0000008A
     %%%%%%%%%%%  OPCOM   5-MAY-1997 09:06:38.14  %%%%%%%%%%%
     Operator _STP002$OPA0: has been enabled, username SYSTEM

     %%%%%%%%%%%  OPCOM   5-MAY-1997 09:06:38.29  %%%%%%%%%%%
     Operator status for operator _STP002$OPA0:
     CENTRAL, PRINTER, TAPES, DISKS, DEVICES, CARDS, NETWORK, CLUSTER, SECURITY,
     LICENSE, OPER1, OPER2, OPER3, OPER4, OPER5, OPER6, OPER7, OPER8, OPER9, OPER10,
     OPER11, OPER12
    
    
    
95.4  alphaserver 4000 machinecheck 660           CSC32::HUTMACHER                 Wed May 07 1997 10:25  24
    Hi Stanley,
    
    Just in case: it looks from the SHOW POWER log like power supply 2 had a
    problem. This could be caused by a bad power supply, its connections,
    or someone pulling off its AC power plug.
    
    
         Total Environmental Events: 7  (7 logged)

          1  DEC 31 13:14  Temperature, Fans, Power Supplies Normal
          2  JAN  3 10:14  Temperature, Fans, Power Supplies Normal
          3  JAN  3 10:17  Temperature, Fans, Power Supplies Normal
          4  JAN  6 12:36  Temperature, Fans, Power Supplies Normal
          5  JAN  8 12:59  Temperature, Fans, Power Supplies Normal
 bad***** 6  JAN 20 17:50  Power Supply 2 Failure
 now okay 7  JAN 20 17:50  Temperature, Fans, Power Supplies Normal
    
    Weird date though: it doesn't match the 95.0 timestamps.
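
For anyone reading a longer SHOW POWER log, the check done by eye above can be sketched in Python. This is a minimal sketch that assumes each event line has the "number, month, day, hh:mm, text" layout shown in this thread; the regular expression and the function names are illustrative, not part of any DEC tool.

```python
import re
from typing import NamedTuple

# Event lines look like: "6  JAN 20 17:50  Power Supply 2 Failure"
EVENT_RE = re.compile(r"(\d+)\s+([A-Z]{3})\s+(\d{1,2})\s+(\d{2}):(\d{2})\s+(.+)")


class Event(NamedTuple):
    number: int
    month: str
    day: int
    time: str
    text: str


def parse_events(log_text):
    """Extract environmental events from pasted SHOW POWER output."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            num, month, day, hh, mm, text = match.groups()
            events.append(Event(int(num), month, int(day), f"{hh}:{mm}", text.rstrip()))
    return events


def failures(events):
    """Keep only the entries that do not report everything as Normal."""
    return [e for e in events if not e.text.endswith("Normal")]


SAMPLE = """\
 Total Environmental Events: 7  (7 logged)

 5  JAN  8 12:59  Temperature, Fans, Power Supplies Normal
 6  JAN 20 17:50  Power Supply 2 Failure
 7  JAN 20 17:50  Temperature, Fans, Power Supplies Normal
"""

if __name__ == "__main__":
    for event in failures(parse_events(SAMPLE)):
        print(event.number, event.month, event.day, event.time, event.text)
```

The logged entries carry no year, so comparing them against the crash timestamps in 95.0 still has to be done by hand.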
    
    
    Power supplies are labeled 0, 1, 2, looking from the front of the system.
    
    jim hutmacher mvhs colorado csc 800-354-9000 ext 25561