Conference mvblab::alphaserver_4100

Title: AlphaServer 4100
Moderator: MOVMON::DAVISS
Created: Tue Apr 16 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 648
Total number of notes: 3158

423.0. "XDELTA breakpoint with fan failure?" by VIVIAN::D_BONO () Tue Jan 21 1997 06:14

T.R  Title  User  Personal Name  Date  Lines
423.1  Cross posted to VMSNOTES  VIVIAN::D_BONO  Wed Jan 22 1997 04:45  4
423.2  LANDO::CUMMINS  Wed Jan 22 1997 09:42  11
423.3  LANDO::CUMMINS  Thu Jan 23 1997 12:19  12
423.4  LANDO::CUMMINS  Thu Jan 23 1997 15:28  1
423.5  F03 PCM  MOVMON::DAVIS  Fri Jan 24 1997 12:19  3
    The most recent version of the PCM is actually F03.
    
    Todd
423.6  F02/F03 still the same.  VIVIAN::D_BONO  Mon Jan 27 1997 10:15  21
    
    Hi,
    
    We ordered up a new PCM from Logistics.
    
    We reproduced a fan failure with the original module and experienced
    the same symptoms, i.e. "Brk 0 at 0005FD0C" and then "0005FD0C ! BPT".
    
    We then removed the original PCM and found it to be REV F03.
    
    We put in the module from Logistics, which was REV F02; however, this
    also does the same.
    
    Any ideas?
    
    We are running firmware V2.0-3 PAL code V1.18-8.
    
    Thanks,
    
    Dave Bono (MCS London)
                                                    
423.7  LANDO::CUMMINS  Mon Jan 27 1997 14:04  21
    Our proto experienced the problem you see with an older rev PCM. Once
    we upgraded our PCM, the fan-failure crash dump completed normally. Our
    testing had been with V4.8 console (soon to be released on the V3.9 CD)
    and VMS V7.1. We had done the same experiment at FRS using V1.2-4
    console and OpenVMS V6.2-1H3 (and Digital UNIX V3.2F).
    
    After you posted reply .5, we re-tested yet again with V2.0-3 console.
    Worked fine.
    
    Bottom line: we cannot reproduce the behavior you see with the new PCM.
    
    Is there anyone out there that'd be willing to try the "finger in the
    fan" experiment on another proto (with an up-to-rev PCM)?
    
From:	DANGER::LEMIEUX      "Relax, you've been erased" 27-JAN-1997 13:08:03.26
To:	LANDO::CUMMINS
CC:	LEMIEUX
Subj:	fan

V2 works fine. Tried stopping primary's fan and also secondary's fan.
Dumped and rebooted no problem.
423.8  Different Machine check gives XDELTA  VIVIAN::D_BONO  Wed Jan 29 1997 07:07  16
    
    I don't know if this is relevant, but I went to another site last night
    that had a failing 4100 that was Machine checking (660) and dropping to
    XDELTA. This problem was not fan/power related as >>> show power was
    clear. This machine was running V6.2-1H3 and V2.0 console firmware.
    
    The problem on this machine appeared to be an intermittent PCI bridge 
    module?
    
    Hence, is it possible that any machine check 660 causes the drop to
    XDELTA, even one that is not PCM related?
    
    Thanks,
    
    Dave Bono.
              
423.9  XDELTA 0006133c ! BPT  CSC32::HUTMACHER  Fri Jan 31 1997 11:56  20
    Hi, we have another site reporting this at the support center.
    The problem does not seem to be fan/power related; >>>show power
    is clean.
    
    **error_routines_1605**machinecheck 670* o! ! cpu1
    
    **error_routines_1605**machinecheck 670e!o n  cpu3
    
    Brk 0 at 0006133c
    
    0006133c ! BPT
    
    Field service has swapped out 4 CPUs and all of memory at this point.
    >>>show power is not capturing an environmental issue.
    We are having field service check that the PCM is rev F03, and I guess
    we should try the B3040 horse module?
    Is there a better way to troubleshoot these BPTs?
    Does the 0006133c address give us any clue?
    
    Any input appreciated. Jim Hutmacher, mvhs Colorado CSC, 592-5561
423.10  LANDO::CUMMINS  Fri Jan 31 1997 18:28  2
    We have further information on at least some of these mysterious
    failure behaviors. Watch this space.. Monday?
423.11  Something to try when sitting at XDELTA breakpoint..  LANDO::CUMMINS  Fri Jan 31 1997 18:31  10
    At XDELTA breakpoint, do the following and capture the output:
    
      [Q			# quadword mode
      4400/			# display contents of address 4400
    
      then lean on ^J key until you're at address 7FFC..
    
    Send me the results or post them here. This data is PALcode's
    impure/MCHK logout area which will have a snapshot of entry state
    and MCHK/CRD error state, etc. 
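
    For anyone capturing this range to a terminal log, the following is a
    minimal Python sketch for sizing and sanity-checking the capture. It
    assumes each captured line looks like "<hex address>/ <hex value>";
    that layout is an assumption about the log, not something confirmed in
    this note, and the function names are made up for illustration.

```python
import re

START = 0x4400   # first address examined, per the note above
END = 0x7FFC     # address the note says to stop at
QUADWORD = 8     # bytes per location in [Q (quadword) mode

def quadwords_to_capture(start=START, end=END, width=QUADWORD):
    """Count quadword locations from start through the one containing end."""
    last_aligned = end - (end % width)
    return (last_aligned - start) // width + 1

# Assumed log line format: "<hex address>/ <hex value>" (hypothetical).
LINE_RE = re.compile(r"^\s*([0-9A-Fa-f]+)\s*/\s*([0-9A-Fa-f]+)\s*$")

def parse_capture(text):
    """Return {address: value} for every line matching the assumed format."""
    dump = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            dump[int(match.group(1), 16)] = int(match.group(2), 16)
    return dump

def missing_addresses(dump, start=START, end=END, width=QUADWORD):
    """List quadword addresses in the range that the capture does not contain."""
    last_aligned = end - (end % width)
    return [a for a in range(start, last_aligned + width, width) if a not in dump]
```

    Under those assumptions the range covers 1920 quadword locations, so a
    complete capture should leave missing_addresses() empty.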
423.12  Problem found; fix in works; Steps to diagnose down systems..  LANDO::CUMMINS  Mon Feb 03 1997 17:42  78
    We have duplicated the console dropping into XDELTA breakpoint and
    have isolated the problem. It should only occur on VMS machines and
    typically only on systems that have a VGA card present in the PCI
    backplane (even if CONSOLE is set SERIAL).
    
    Until this is fixed via console, VMS, or a combination of both, there
    are things you can do to gather more information about a particular
    crash:
    
      1. If possible, remove the VGA card and wait for the next crash.
      2. Use the XDELTA instruction sequence described in the previous
         reply note and either post it here or send it to me.
      3. If the crash follows an environmental MCHK event (fan or temp
         failure), use the SRM console SHOW POWER command to determine
         what failed.
    
    We had not reproduced the fan failure MCHK drop into XDELTA earlier
    because all of our early (back in Spring '95) and recent testing was
    performed by holding a finger to the fan for five seconds or so, which
    always results in a successful crash dump. It wasn't until we held it
    for a longer period of time that we reproduced the problem.
    
    A recent memo on the problem description is attached for reference.
    
From:	LANDO::CUMMINS      "Bill Cummins, PKO3-2/Q21, 223-4641"  3-FEB-1997 17:05:16.60
To:	STAR::JHUBER,STAR::KFOLLIEN,STAR::JANETOS
CC:	MAYO,LEMIEUX,CUMMINS
Subj:	VMS not clearing MCES before bugcheck on Rawhide (and TurboLaser)..

Hi Jeff, Ken, Jim,

We've recently been working some customer problems where the Rawhide console is
unable to successfully complete a crash dump following a 670 or 660 MCHK. We've
tracked the problem down to the following sequence of events:

  1. A 660 or 670 occurs (e.g. fan failure, DTAG PE, fill error, etc.)
  2. VMS eventually calls CB_OPEN to begin the process of writing the crash.
  3. To regain control of the I/O, we reset the PCIs, including a VGA re-init,
     if VGA is present.
  4. Since we use the BIOS emulator for VGA init, we take MCHKs while the BIOS
     option ROM code probes the PCI. We normally handle/dismiss these MCHKs via
     a handler we install. On a UNIX dump, we handle the MCHKs gracefully since
     MCES is clear, and then go on to successfully dump. However, during a VMS
     dump, PAL sees MCES set and throws up its hands with a double error halt.
  5. Console eventually regains control, but can find no valid context since it
     has been "consumed" as part of the callback entry.
  6. Console has nowhere to go so it restarts its krn$_idle flow (basically a
     console restart) versus its halt entry flow. Upon realizing that there's
     no valid context, console throws up its hands and breakpoints in XDELTA.
  7. Meanwhile, secondaries are coming back through krn$_idle and printing
     messages like "console starting on CPU n" which only further adds to the
     confusion (and can hang the console terminal since these messages are 
     issued via pprintfs versus printfs and are normally staggered by node ID).
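
The sequence above can be sketched as a toy model in Python. This is only an illustration of the reasoning in the list, not PALcode or console source: the class and function names are hypothetical, and a single boolean stands in for MCES<0> (assumed here to act as a "machine check in progress" flag).

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    mces_bit0: bool = False  # stand-in for MCES<0>; assumed "MCHK in progress"

def take_mchk(cpu):
    """Model PAL's reaction to a machine check as described in the list."""
    if cpu.mces_bit0:
        return "double error halt"   # a MCHK arrives while MCES is still set
    cpu.mces_bit0 = True
    return "delivered"

def crash_dump(os_clears_mces, vga_present):
    """Walk steps 1-6 and report how the dump attempt ends."""
    cpu = Cpu()
    take_mchk(cpu)                   # step 1: the original 660/670
    if os_clears_mces:
        cpu.mces_bit0 = False        # OS error routine clears MCES before bugcheck
    # Steps 2-3: CB_OPEN resets the PCIs; VGA re-init happens only if VGA is present.
    if vga_present:
        # Step 4: BIOS option ROM probing raises a further MCHK.
        if take_mchk(cpu) == "double error halt":
            return "XDELTA breakpoint"   # steps 5-6: no valid context remains
        cpu.mces_bit0 = False        # console's installed handler dismisses it
    return "dump written"

if __name__ == "__main__":
    for os_clears in (True, False):
        for vga in (True, False):
            print(os_clears, vga, crash_dump(os_clears, vga))
```

In this model only the combination "MCES left set" plus "VGA present" ends at the XDELTA breakpoint, which matches the UNIX versus VMS difference described in step 4.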

Basically, it's a mess. And it's a rapidly growing problem in the Field..

Stephen Shirron checked the TurboLaser VMS error routines and found that MCES
is not cleared on a TurboLaser bugcheck either. It is cleared by VMS on Sable.

My reasons for sending this mail are two-fold:

  1. To alert you to the problem, in the event you have seen or will see QARs
     logged against Rawhide/VMS.
  2. To ask for your opinion on the matter. In a quick perusal of the Alpha
     SRM, I couldn't find any detailed discussion re: MCES usage, specifically
     in the area of PAL/console entries and MCES settings. Is there a technical
     reason why VMS' Rawhide (and TurboLaser) error routine flows don't clear
     MCES, but Sable's VMS routines do?

As an FYI, we're leaning heavily toward changing Rawhide (and TurboLaser,
Sable, etc.) console to clear MCES<0> on primary and secondary CPUs prior to
entering our PCI reset / VGA re-init flow (which we call on a CB_OPEN); even
if you were to issue a patch to change VMS' treatment of MCES during MCHK
handling.
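
As a rough sketch of the console-side change being considered, the Python below clears bit 0 of a modeled MCES value for every CPU before the PCI reset flow would run. The names are hypothetical and the register is modeled as a plain integer; how the real register is written on hardware is not modeled here.

```python
MCES_BIT0 = 1 << 0  # the MCES<0> bit named in the memo

def clear_mces_bit0(mces):
    """Return the modeled MCES value with bit 0 cleared, other bits untouched."""
    return mces & ~MCES_BIT0

def prepare_for_pci_reset(mces_by_cpu):
    """Clear MCES<0> on primary and secondary CPUs before PCI reset / VGA re-init."""
    return {cpu: clear_mces_bit0(value) for cpu, value in mces_by_cpu.items()}

if __name__ == "__main__":
    # Hypothetical snapshot: CPU 0 and CPU 1 still have bit 0 set.
    print(prepare_for_pci_reset({0: 0b001, 1: 0b101, 2: 0b000}))
```

Doing this in the console, rather than relying on the operating system, is what would let the workaround apply even if the OS error routines are never patched.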

What are your thoughts/comments on this matter?

Thanks,
BC
423.13  Is there a fix for this problem?  VIVIAN::D_BONO  Fri May 16 1997 06:41  8
    
    Hi,
    
    Is there a fix for this problem? New console firmware perhaps?
    
    Thanks,
    
    Dave Bono.
423.14  HARMNY::CUMMINS  Fri May 16 1997 10:33  7
    VMS will be fixing this in the V7.2 timeframe. Console versions V4.8-5
    and beyond have added a "fix" for this problem that will allow us to
    work around many cases, but possibly not all.. V4.8-5 is on the V3.9
    CD. V4.8-7 is the latest and that's in the /interim/as4x00 area on the
    Web. [V4.8-7 fixes a PALcode register (R1) corruption problem that
    occurs during recoverable and non-recoverable environmental error event
    handling..]
423.15  MAY21::CUMMINS  Fri May 16 1997 11:14  9
    Note 11.* contains FW release readme files, etc.
    
    Snippet from V4.8-5 readme file posted in note 11.14.
    
    --> Fix for the problem on systems running OpenVMS where a machine check
        660 or 670 results in the system stopping at an XDELTA breakpoint and
        not continuing with a crash dump.  This problem occurs on OpenVMS
        systems configured with a VGA adapter in the PCI backplane and is
        independent of the CONSOLE setting (SERIAL or GRAPHICS).