
Conference mvblab::alphaserver_4100

Title:AlphaServer 4100
Moderator:MOVMON::DAVISS
Created:Tue Apr 16 1996
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:648
Total number of notes:3158

589.0. "Memory > 1GB problem" by GIDDAY::HO (sunny Melbourne by the c) Tue May 06 1997 09:56

--------------------------------------------------------------------------------


We have done an upgrade on two EV5/400 systems by adding 1 GB memory about ten 
days ago. Both machines since then have been hanging a couple of times a day.
Most of the time we couldn't even halt the system by pushing the "HALT" button
on the OCP. The only way to recover is by power cycling.

Both systems were then backed out of the memory upgrade. They have been running
fine for six days now.

It appears that we cannot put in more than 1 GB of memory in these boxes. 
Both systems are running Alpha OpenVMS ver 6.2. The customer could halt one of
the systems on only one occasion. We managed to get them to do an "info 5" on
the console.

The information below is from "info 5" at the console; unfortunately, a 
crash dump was not taken.


halted CPU 0

halt code = 1
operator initiated halt
PC = ffffffff8007aa94
P00>>>info 5
                      cpu00
per_cpu logout area  00004838
mchk$crd_flag        00000320 : 0000
mchk$crd_flag+4      00000000 : 0004
mchk$crd_offsets     00000118 : 0008
mchk$crd_offsets+4   00001328 : 000c
mchk$crd_mchk_code   00980000 : 0010
mchk$crd_mchk_code+4 00000000 : 0014
mchk$crd_ei_stat     00000000 : 0018
mchk$crd_ei_stat+4   00000000 : 001c
mchk$crd_ei_addr     00000000 : 0020
mchk$crd_ei_addr+4   00000000 : 0024
mchk$crd_fill_syn    00000000 : 0028
mchk$crd_fill_syn+4  00000000 : 002c
mchk$crd_isr         00000000 : 0030
mchk$crd_isr+4       00000000 : 0034
mchk$crd_whoami      00000000 : 0038
mchk$crd_iic_environ 00004b36 : 003c
mchk$crd_base_addr   00000000 : 0040
mchk$crd_base_addr+4 00000000 : 0044
mchk$crd_pci_rev     00000000 : 0048
mchk$crd_mc_err0     00000000 : 004c
mchk$crd_mc_err1     00000000 : 0050
mchk$crd_cap_err     00000000 : 0054
mchk$crd_mdpa_stat   00000000 : 0058
mchk$crd_mdpa_syn    00000000 : 005c
mchk$crd_mdpb_stat   00000000 : 0060
mchk$crd_mdpb_syn    00000000 : 0064
mchk$flag            00000320 : 0000
mchk$flag+4          00000000 : 0004
mchk$offsets         00000118 : 0008
mchk$offsets+4       00001328 : 000c
mchk$mchk_code       00980000 : 0010
mchk$mchk_code+4     00000000 : 0014
mchk$shadow[0]       00000000 : 0018
mchk$shadow[0]+4     00000000 : 001c
mchk$shadow[1]       00000000 : 0020
mchk$shadow[1]+4     00000000 : 0024
mchk$shadow[2]       00000000 : 0028
mchk$shadow[2]+4     00000000 : 002c
mchk$shadow[3]       00000000 : 0030
mchk$shadow[3]+4     00000000 : 0034
mchk$shadow[4]       00000000 : 0038
mchk$shadow[4]+4     00004b36 : 003c
mchk$shadow[5]       00000000 : 0040
mchk$shadow[5]+4     00000000 : 0044
mchk$shadow[6]       00000000 : 0048
mchk$shadow[6]+4     00000000 : 004c
mchk$shadow[7]       00000000 : 0050
mchk$shadow[7]+4     00000000 : 0054
mchk$pt[0]           00000000 : 0058
mchk$pt[0]+4         00000000 : 005c
mchk$pt[1]           00000000 : 0060
mchk$pt[1]+4         00000000 : 0064
mchk$pt[2]           f22f6000 : 0068
mchk$pt[2]+4         ffffffff : 006c
mchk$pt[3]           00004400 : 0070
mchk$pt[3]+4         00000000 : 0074
mchk$pt[4]           87884d98 : 0078
mchk$pt[4]+4         ffffffff : 007c
mchk$pt[5]           00000007 : 0080
mchk$pt[5]+4         00000000 : 0084
mchk$pt[6]           81caa868 : 0088
mchk$pt[6]+4         ffffffff : 008c
mchk$pt[7]           0000001f : 0090
mchk$pt[7]+4         00000000 : 0094
mchk$pt[8]           00000000 : 0098
mchk$pt[8]+4         00000000 : 009c
mchk$pt[9]           00000000 : 00a0
mchk$pt[9]+4         00000000 : 00a4
mchk$pt[10]          8003a008 : 00a8
mchk$pt[10]+4        ffffffff : 00ac
mchk$pt[11]          00000000 : 00b0
mchk$pt[11]+4        00000000 : 00b4
mchk$pt[12]          878af360 : 00b8
mchk$pt[12]+4        ffffffff : 00bc
mchk$pt[13]          00006e80 : 00c0
mchk$pt[13]+4        00000000 : 00c4
mchk$pt[14]          00000000 : 00c8
mchk$pt[14]+4        00000000 : 00cc
mchk$pt[15]          000f0000 : 00d0
mchk$pt[15]+4        00000000 : 00d4
mchk$pt[16]          06700009 : 00d8
mchk$pt[16]+4        00000098 : 00dc
mchk$pt[17]          c39b2d85 : 00e0
mchk$pt[17]+4        00000000 : 00e4
mchk$pt[18]          81c28000 : 00e8
mchk$pt[18]+4        ffffffff : 00ec
mchk$pt[19]          00000000 : 00f0
mchk$pt[19]+4        00000000 : 00f4
mchk$pt[20]          001f2000 : 00f8
mchk$pt[20]+4        00000000 : 00fc
mchk$pt[21]          00000000 : 0100
mchk$pt[21]+4        00000002 : 0104
mchk$pt[22]          00374000 : 0108
mchk$pt[22]+4        00000000 : 010c
mchk$pt[23]          01c28080 : 0110
mchk$pt[23]+4        00000000 : 0114
mchk$exc_addr        8003a008 : 0118
mchk$exc_addr+4      ffffffff : 011c
mchk$exc_sum         00000000 : 0120
mchk$exc_sum+4       00000000 : 0124
mchk$exc_mask        00000000 : 0128
mchk$exc_mask+4      00000000 : 012c
mchk$pal_base        00008000 : 0130
mchk$pal_base+4      00000000 : 0134
mchk$isr             00400800 : 0138
mchk$isr+4           00000000 : 013c
mchk$icsr            40020000 : 0140
mchk$icsr+4          000000c1 : 0144
mchk$ic_perr_stat    00000000 : 0148
mchk$ic_perr_stat+4  00000000 : 014c
mchk$dc_perr_stat    00000000 : 0150
mchk$dc_perr_stat+4  00000000 : 0154
mchk$va              f2524000 : 0158
mchk$va+4            ffffffff : 015c
mchk$mm_stat         00014410 : 0160
mchk$mm_stat+4       00000000 : 0164
mchk$sc_addr         0000f42f : 0168
mchk$sc_addr+4       ffffff00 : 016c
mchk$sc_stat         00000000 : 0170
mchk$sc_stat+4       00000000 : 0174
mchk$bc_tag_addr     004d1fff : 0178
mchk$bc_tag_addr+4   ffffff80 : 017c
mchk$ei_addr         be00809f : 0180
mchk$ei_addr+4       ffffff00 : 0184
mchk$fill_syn        0000f961 : 0188
mchk$fill_syn+4      00000000 : 018c
mchk$ei_stat         01ffffff : 0190
mchk$ei_stat+4       fffffff0 : 0194
mchk$ld_lock         00005b6f : 0198
mchk$ld_lock+4       ffffff00 : 019c

IOD: 0 base address: f9e0000000
  WHOAMI:     00002e3a PCI_REV:    06008221 ENVIRON:    00000000
  CAP_CTL:    46490fb1 HAE_MEM:    00000000 HAE_IO:     00000000
  INT_CTL:    00000003 INT_REQ:    00800000 INT_MASK0:  00251000
  INT_MASK1:  00000000 MC_ERR0:    e0000000 MC_ERR1:    800e88f1
  CAP_ERR:    84000000 PCI_ERR:    00000000 MDPA_STAT:  00000000
  MDPA_SYN:   00000000 MDPB_STAT:  00000000 MDPB_SYN:   00000000

IOD: 1 base address: fbe0000000
  WHOAMI:     000004fa PCI_REV:    06000221 ENVIRON:    00000000
  CAP_CTL:    46490fb1 HAE_MEM:    00000000 HAE_IO:     00000000
  INT_CTL:    00000003 INT_REQ:    00800000 INT_MASK0:  00000000
  INT_MASK1:  00000000 MC_ERR0:    e0000000 MC_ERR1:    800e88f1
  CAP_ERR:    84000000 PCI_ERR:    00000000 MDPA_STAT:  00000000
  MDPA_SYN:   00000000 MDPB_STAT:  00000000 MDPB_SYN:   00000000
P00>>
 SROM V1.1 on cpu0
XSROM V3.0 on cpu0
BCache testing complete on cpu0
mem_pair0 - 1024 MB
mem_pair1 - 1024 MB
20..21..23..
please wait 72 seconds for T24 to complete
24..
Memory testing complete on cpu0
starting console on CPU 0
sizing memory
  0   1024 MB EDO
  1   1024 MB EDO
probing IOD1 hose 1
  bus 0 slot 1 - NCR 53C810
  bus 0 slot 2 - DECchip 21040-AA
  bus 0 slot 3 - PCI-PCI Bridge
    probing PCI-PCI Bridge, bus 2
      bus 2 slot 0 - ISP1020
  bus 0 slot 4 - NCR 53C810
probing IOD0 hose 0
  bus 0 slot 1 - PCEB
    probing EISA Bridge, bus 1
  bus 0 slot 2 - PCI-PCI Bridge
    probing PCI-PCI Bridge, bus 2
      bus 2 slot 0 - ISP1020
  bus 0 slot 3 - DEC PCI FDDI
  bus 0 slot 5 - CIPCA
configuring I/O adapters...
  ncr0, hose 1, bus 0, slot 1
  tulip0, hose 1, bus 0, slot 2
  isp0, hose 1, bus 2, slot 0
  ncr1, hose 1, bus 0, slot 4
  floppy0, hose 0, bus 1, slot 0
  isp1, hose 0, bus 2, slot 0
  pfi0, hose 0, bus 0, slot 3
fwa0.0.0.3.0 pdq_state_k_link_unavail
DEFPA Error: fwa0.0.0.3.0 can not be started
DEFPA Error: please check FDDI connection
  cipca0, hose 0, bus 0, slot 5
System temperature is 23 degrees C
AlphaServer 4100 Console V3.0-10, 19-NOV-1996 13:57:07

CPU 0 booting

(boot dua360.0.0.5.0 -flags 0)
FRU table creation disabled
block 0 of dua360.0.0.5.0 is a valid boot block
reading 1004 blocks from dua360.0.0.5.0
P00>>>^C
P00>>>
P00>>>sho conf
                           Digital Equipment Corporation
                                 AlphaServer 4100

 Console V3.0-10  OpenVMS PALcode V1.19-2, Digital UNIX PALcode V1.21-14

 Module                          Type     Rev    Name
 System Motherboard              0        0000   mthrbrd0
 Memory 1024 MB EDO              0        0000   mem0
 Memory 1024 MB EDO              0        0000   mem1
 CPU (4MB Cache)                 3        0002   cpu0
 Bridge (IOD0/IOD1)              600      0021   iod0/iod1
 PCI Motherboard                 8        0002   saddle0

 Bus 0  iod0 (PCI0)
 Slot   Option Name              Type     Rev    Name
 1      PCEB                     4828086  0005   pceb0
 2      PCI-PCI Bridge           11011    0002   pcb1
 3      DEC PCI FDDI             f1011    0000   pfi0
 5      CIPCA                    6601095  0001   cipca0

 Bus 1  pceb0 (EISA Bridge connected to iod0, slot 1)
 Slot   Option Name              Type     Rev    Name
 Bus 2  pcb1 (PCI-PCI Bridge connected to iod0, slot 2)
 Slot   Option Name              Type     Rev    Name
 0      ISP1020                  10201077 0002   isp1

 Bus 0  iod1 (PCI1)
 Slot   Option Name              Type     Rev    Name
 1      NCR 53C810               11000    0002   ncr0
 2      DECchip 21040-AA         21011    0024   tulip0
 3      PCI-PCI Bridge           11011    0002   pcb0
 4      NCR 53C810               11000    0002   ncr1

 Bus 2  pcb0 (PCI-PCI Bridge connected to iod1, slot 3)
 Slot   Option Name              Type     Rev    Name
 0      ISP1020                  10201077 0002   isp0
P00>>>b -fl 0,1
(boot dua360.0.0.5.0 -flags 0,1)
FRU table creation disabled
block 0 of dua360.0.0.5.0 is a valid boot block
reading 1004 blocks from dua360.0.0.5.0
bootstrap code read in
base = 200000, image_start = 0, image_bytes = 7d800
initializing HWRPB at 2000
initializing page table at 1f2000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code


Has anyone seen a problem like this before? We appreciate any help we can get.
Regards,

Peter
    
                          
T.R  Title  User  Personal Name  Date  Lines
589.1  MAY21::CUMMINS  Tue May 06 1997 15:06  30
    Nothing stands out.. The 84000000 in CAP_ERR from the INFO 5 output is
    from the last MCHK taken (VMS takes NXM when probing the system bus
    (empty slot)). Not a real error. Just a stale sizing error. There's
    also soft error environment data in the frame, but that's from the 
    Power, Fan, Temp Status Normal message one gets across a reset/boot.
    
    Some thoughts/comments:
    
      1. Have you tried taking the newly-added memory pair and swapping it
         for the original pair (and running with only 1GB) to see whether
         the problem is software versus hardware? Problem could possibly be
         the motherboard's second memory slot pair, though unlikely, so the
         results of said experiment would not be foolproof / 100% obvious.
    
      2. Next time you boot, do the following:
    
          P00>>> b -h <device,flag,file_list>
          .
          .
          P00>>> info 1
    
          Are there any bad pages marked out of the bitmap passed to VMS?
    
       3. Nothing in the VMS system error log? No recoverable errors, etc.?
          Have you tried running V2.4 or V2.3 (with the latest KNL updates)
          DECevent on this system?
    
    Other than the above, I'm fresh out of ideas.. Without more data..
    
    BC
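
The INFO 5 listing in .0 prints each 64-bit logout-frame field as two 32-bit
rows (the field itself and its "+4" row), which makes values such as CAP_ERR
or exc_addr awkward to read at a glance. As a rough reading aid, the Python
sketch below joins each pair into one quadword. The function name
join_quadwords and the regular expression are illustrative assumptions, not
part of any console or DECevent tool; the row layout is inferred only from the
listing in .0, and fields with no "+4" row are treated as 32-bit values.

```python
import re

# One row of the console `info 5` logout-area listing, for example:
#   mchk$exc_addr        8003a008 : 0118
#   mchk$exc_addr+4      ffffffff : 011c
ROW = re.compile(r"^(\S+?)(\+4)?\s+([0-9a-fA-F]{8})\s*:\s*([0-9a-fA-F]{4})$")


def join_quadwords(text):
    """Return {field: int}, pairing each field with its '+4' (high) row."""
    low, high = {}, {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # headers, prompts and blank lines are ignored
        name, plus4, value, _offset = m.groups()
        (high if plus4 else low)[name] = int(value, 16)
    # Fields without a '+4' row keep a zero upper half.
    return {name: (high.get(name, 0) << 32) | lo for name, lo in low.items()}


if __name__ == "__main__":
    sample = """\
mchk$exc_addr        8003a008 : 0118
mchk$exc_addr+4      ffffffff : 011c
mchk$va              f2524000 : 0158
mchk$va+4            ffffffff : 015c
"""
    for field, value in join_quadwords(sample).items():
        print(f"{field:<16} {value:016x}")
```

Fed the whole listing from .0, this gives one line per field, which is easier
to compare against a later dump taken with the same command.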
589.2  may be graphics card  WRKSYS::RICHARDSON  Tue May 06 1997 15:10  5
    What graphics card is in this system?  Several of them won't work with
    >1G memory (not sure which are even supported on this particular system
    anyhow).
    
    /Charlotte
589.3  could PCI0 problems cause this ?  GIDDAY::FLAWN  Tue May 06 1997 15:39  36
    Thanks,

    I'm waiting for detailed config information from the system - it looks
    like there's no graphics card in it from what's shown, though if there
    was it would be a reasonable idea to go back to the old S3 TRIO. 
                                                   
    My inclination is to look at the hardware revs on parts and see if
    anything shows.

    I'm not sure if DECevent is on the system (it should be !) but because
    very recent errors are going to be still held in the in-memory errorlog
    buffers we may luck out there, but it's worth making sure it's been
    checked - thanks.

    Something I'm not fully clear on .... with the system not even
    resetting would that imply that either :

    - we're hung at hardware IPL
    - PCI bus 0 is potentially having a problem

    I'm not sure what happens with a reset if we're at hardware IPL here.

    The other thing I was thinking of was to change the bus 0 configuration
    by moving something like the FDDI and ethernet adapters away (assuming 
    both machines have the same - which is why I need the info). Or
    removing the FDDI adapter altogether if it's not connected (the
    link_unavail makes it look maybe unconnected). This would more or less
    presuppose some kind of weirdo problem with the configuration, but it's
    probably not all that common (to include CIPCA and multiple of what seem
    to be KZPDA's etc).

    If this is a path not worth pursuing please let us know !
      
    Regards and thanks,
    Dave Flawn
    CSC Sydney
589.4  MAY21::CUMMINS  Tue May 06 1997 16:16  15
    The only problem I'm aware of is that certain older rev DEFPAs
    were causing problems when in the same PCI segment as certain
    other devices. And I can't remember any more details than this..
    I believe the symptom was hangs, though. Will try to get more
    details and report back..
    
    You are correct that halts come in over PCI bus zero logic. Halts
    are unmaskable. Unless IOD0 is wedged, the halt should occur. It's
    possible PAL/console got the halt interrupt (IPL 31), but was unable
    to restart - though this would be an extremely unlikely scenario.
    
    More likely, one of the PCI options is wedging the bus. The DEFPA
    would be a good first candidate for removal.
    
    BC
589.5  becoming clearer now....  GIDDAY::FLAWN  Thu May 08 1997 11:17  47
Hi,

Happened again, but more info now.

What I was thinking was not correct. I was working off the same info as in .0
which I took to mean the FDDI card was in PCI0 - it was in PCI1 (it seems the
PCI-PCI bridges make it really weird, though I've yet to physically see this).
So I don't think I understand the slot layout when the KZPDAs are there.

Anyway, the more important information I now have is that the system actually 
loops with :

halted CPU 0
 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004d290

 CPU 0 restarting

 halted CPU 0

 halt code = 2
 kernel stack not valid halt
 PC = ffffffff8004bde0

 CPU 0 restarting

etc.

So my take on this is that what I thought about PCI0 being stuck is wrong since
the console output (serial console) is still working. In any case, the unused
VGA card has now also been removed.

It looks like this is actually OpenVMS taking a kernel stack not valid crash,
probably because of a software problem. What I don't understand is why it's
looping like that, rather than taking a crash. My understanding is that the
console has control right at that point and outputs the kernel stack not valid
halt message.... but it should produce a dump.... In order to fix what now 
looks like a software problem we need to get a crash dump out of it so I'm 
open to any ideas on this, particularly as I may be missing something obvious
here.

AUTO_ACTION is set to RESTART.


Regards and thanks,
Dave.
589.6  HARMNY::CUMMINS  Thu May 08 1997 12:04  30
    Set auto_action halt and disable VMS bugcheck reboots.
    
    Then use the console crash command if/when you get another KSNV. You
    might want to type INFO 4, INFO 5, and INFO 8 before forcing the crash
    (in case the crash doesn't work for some reason). The INFO output will
    give you all GPR, FPR, IPR, and CSR state that you'd find in the crash
    dump file. I.e. except the system memory dump..
    
    I looked back at .0 and the FDDI was indeed in PCI0.
    
    The 4100 has two separate PCI buses/hoses (0 and 1). The KZPDA is a
    bridged QLOGIC option; i.e. it had a QLOGIC ISP1040 behind a PCI-PCI
    bridge. The aforementioned hoses 0 and 1 are the top-level PCI buses.
    The PCI-PCI bridge (PPB) spawns a secondary PCI bus, upon which the 
    QLOGIC device sits. Console (and VMS/UNIX) always reserve secondary
    bus 1's for EISA, regardless of whether a given PCI hose spawns an
    EISA bus. Therefore, console assigns the secondary buses associated
    with the KZPDAs as bus 2 off their respective primary PCI buses.
    
    Finally, there is a VMS issue with crash dumps when a VGA is in the
    system. See note 423.12 for details (inability to crash dump on VMS
    when VGA present). VMS will be changing how/when it clears MCES in
    VMS 7.2. Console V4.8 and beyond includes a hack of sorts to help
    resolve the inability to crash problem in most cases. You should
    therefore update the console to V4.8-5 (V3.9 CD) at some point for
    this customer - assuming he/she wants to put back the VGA card at
    some point.
    
    Let us know how things turn out.
    BC
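
The bus numbering described in this reply can be checked against the
"configuring I/O adapters..." lines in .0. The Python sketch below parses
those lines and states where each adapter sits: bus 0 is the top-level PCI bus
of the hose, bus 1 is the secondary bus reserved for EISA, and bus 2 is taken
here to mean a secondary bus behind a PCI-PCI bridge, as explained above. The
function names parse_adapters and describe are illustrative assumptions, and
the line format is inferred only from the console log in .0.

```python
import re

# Matches console lines such as "  pfi0, hose 0, bus 0, slot 3".
ADAPTER = re.compile(r"^\s*(\w+), hose (\d+), bus (\d+), slot (\d+)\s*$")


def parse_adapters(console_text):
    """Return {adapter_name: (hose, bus, slot)} from console boot output."""
    adapters = {}
    for line in console_text.splitlines():
        m = ADAPTER.match(line)
        if m:
            name, hose, bus, slot = m.groups()
            adapters[name] = (int(hose), int(bus), int(slot))
    return adapters


def describe(name, location):
    """Describe an adapter's position using the numbering from this reply."""
    hose, bus, slot = location
    if bus == 0:
        where = f"directly on PCI{hose}"
    elif bus == 1:
        where = f"on the EISA bus behind PCI{hose}"
    else:
        where = f"behind a PCI-PCI bridge on PCI{hose}"
    return f"{name}: {where}, slot {slot}"


if __name__ == "__main__":
    log = """\
  ncr0, hose 1, bus 0, slot 1
  tulip0, hose 1, bus 0, slot 2
  isp0, hose 1, bus 2, slot 0
  ncr1, hose 1, bus 0, slot 4
  floppy0, hose 0, bus 1, slot 0
  isp1, hose 0, bus 2, slot 0
  pfi0, hose 0, bus 0, slot 3
  cipca0, hose 0, bus 0, slot 5
"""
    for name, location in parse_adapters(log).items():
        print(describe(name, location))
```

With the log from .0, pfi0 (the DEFPA) comes out on hose 0, bus 0, slot 3,
which agrees with the reading that the FDDI adapter was in PCI0.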
589.7  thanks, will go take a look myself shortly ...  GIDDAY::FLAWN  Thu May 08 1997 17:40  43
Thanks,

And for explaining the bridge bus numbering. I'm told that the FDDI adapter was
in PCI bus 1 before .... (even though it looks like we both see it as being on
bus 0).

I'm travelling to the site today and will update here with the results. At this
stage I plan on (in only approximate order) :

1. Write a quick program to mungle a process kernel stack pointer and try 
   to reproduce. If reproducible try a different type of crash to see if
   it's just these or all crash types.

2. Rerun ECU in case that does some good

3. Fix something I've seen now on console output showing 4 billion (looks like 
   MAXINT of some reasonably sized bitfield) environmental events - first 
   clear the events, if that doesn't work do the neat save/clear/restore NVRAM 
   thing (console is 4.8-6). If still no work try replacing the XICOR NVRAM
   and finally saddle.
 
4. Increase sysgen parameter KSTACKPAGES to 6.
  
5. Sort out the device naming issue - move parts around till it makes sense
   or indicates a problem.

6. Review software configuration and if nothing else has worked to nail the 
   problem down and the few software ECOs for known Digital caused kernel 
   stack invalid VMS crashes are relevant then apply as appropriate.

If I  can reproduce this it should be possible to fix, even if we had to try
different console versions or a 300Mhz CPU to make it dump.

Thanks for taking the interest in this, and to Rawhide engineering in general
for their assistance thru notes. Problems such as this, while formally
warranting escalation due to severity, really need to be narrowed down or
resolved in the field. 

Sorry about not having all the info on this earlier!

Thanks,
Dave.

589.8  Comments on last reply  MAY21::CUMMINS  Thu May 08 1997 17:59  62
Feedback on your most recent reply..
    
And for explaining the bridge bus numbering. I'm told that the FDDI adapter was
in PCI bus 1 before .... (even though it looks like we both see it as being on
bus 0).

    BC> Yes, log from base note definitely shows it hanging off PCI hose 0.
    
1. Write a quick program to mungle a process kernel stack pointer and try 
   to reproduce. If reproducible try a different type of crash to see if
   it's just these or all crash types.

2. Rerun ECU in case that does some good

    BC> I'm 99% sure you'll be wasting your time here.
    
3. Fix something I've seen now on console output showing 4 billion (looks like 
   MAXINT of some reasonably sized bitfield) environmental events - first 
   clear the events, if that doesn't work do the neat save/clear/restore NVRAM 
   thing (console is 4.8-6). If still no work try replacing the XICOR NVRAM
   and finally saddle.
 
    BC> This is a known problem that was caused when some systems slipped
    BC> thru MFG without having their RCM NVRAMs properly initialized. PAL
    BC> stores environmental events in the RCM NVRAM. You could have broken
    BC> HW, but it's more likely you have one of the uninit'd machines.
    BC>
    BC> To check for HW presence/okay, type the following:
    BC>
    BC>   P00>>>ls iic_rcm*
    BC>   iic_rcm_nvram0  iic_rcm_nvram1  iic_rcm_nvram2  iic_rcm_nvram3  iic_rcm_nvram4
    BC>   iic_rcm_nvram5  iic_rcm_nvram6  iic_rcm_nvram7  iic_rcm_temp
    BC>
    BC> You should see all of the above devices. If not, you're either missing
    BC> all or part of the COMBO/RCM logic.
    BC>
    BC> Most likely you simply need to init the NVRAM. Type the following:
    BC> 
    BC>   P00>>> d iic_rcm_nvram6:4 -q 20000010057
    BC>
    BC> Does this make the SHOW POWER problem go away?
    
4. Increase sysgen parameter KSTACKPAGES to 6.
  
5. Sort out the device naming issue - move parts around till it makes sense
   or indicates a problem.

    BC> No problem with what I saw from the base note.. Other than someone
    BC> apparently telling you DEFPA was in PCI1 (hose 1).
    
6. Review software configuration and if nothing else has worked to nail the 
   problem down and the few software ECOs for known Digital caused kernel 
   stack invalid VMS crashes are relevant then apply as appropriate.

If I  can reproduce this it should be possible to fix, even if we had to try
different console versions or a 300Mhz CPU to make it dump.

    BC> Don't understand the 300 MHz CPU comment. Latest V4.8 console does
    BC> work around the VMS MCES and crash dumping issue (when VGA present).
    BC> So, if VGA present, you should update to latest console. Not a bad
    BC> idea to do so anyway, since provides various new features/fixes.
    BC> See LFU release notes for details..
589.9  looks like hw or environment  GIDDAY::FLAWN  Mon May 26 1997 08:58  1249
Hi,

It turns out this should be a hardware problem. We got a similar failure
after returning to the 2GB configuration but this time, with ERLBUFFERPAGES
high enough for good error logging and DECevent installed, we got some 
information and the system started to dump but then hit what looks like the
VMS problem with MCES where it falls into XDELTA. While the customer is running
the latest console it looks like we may have hit an instance where that can
still happen (removing the VGA may help get more info). 
Unfortunately I hadn't given them instructions on dumping the mchk
logout area as I hadn't expected this one....

It looks to me that this new behaviour (getting some error info and starting
to dump, rather than doing the kernel stack not valid halt loop) may be
because the machine check handler was overflowing the kernel stack before...
so we didn't get the machine check crash but instead got the KSTACKNV. Just 
bad luck I suppose....

The error log info makes this look like a memory problem (the IOD and CPU 
were seeing errors at this time) but given the number of memory options 
attempted we can probably rule it out unless we've been unfortunate with
the spares.

Instead, and given that these failures seem to occur at about the same time
each day, but not on weekends, I think we're either seeing environmental
problems or a motherboard problem.... showing up when the higher physical
addresses are hit (though the customer says this is happening a bit after 
their heaviest load time). This happens on two machines (we've not
tried 2GB with the higher KSTACKPAGES in the other machine which was doing
the KSTACKNV, so I can't be  sure it's the same but it appears likely).

The parts in these systems are all fairly early FRS but I can't see any
known issues that would match these symptoms, so I'll try to reproduce 
this with DECVET and/or swap the motherboard, but am open to any suggestions.
We'll also set up a Dranetz with RI monitoring to see if it shows anything.

Regards and thanks,
Dave.
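
One way to test the "same time each day, but not on weekends" observation is
to bucket the error-log timestamps by weekday and hour. The Python sketch
below does that for timestamps in the format DECevent prints
("26-MAY-1997 11:09:01"). The function names parse_vms_time and
failure_profile are illustrative assumptions, and the only timestamps used are
the ones that appear in the entries in this note; a real check would need the
timestamps from all the earlier hangs as well.

```python
from collections import Counter
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def parse_vms_time(stamp):
    """Parse a timestamp such as '26-MAY-1997 11:09:01' into a datetime."""
    date_part, time_part = stamp.split()
    day, month, year = date_part.split("-")
    hour, minute, second = (int(field) for field in time_part.split(":"))
    return datetime(int(year), MONTHS[month.upper()], int(day),
                    hour, minute, second)


def failure_profile(stamps):
    """Count events per (weekday name, hour of day)."""
    times = (parse_vms_time(stamp) for stamp in stamps)
    return Counter((DAYS[t.weekday()], t.hour) for t in times)


if __name__ == "__main__":
    stamps = ["26-MAY-1997 11:09:01", "26-MAY-1997 11:09:02"]
    for (day, hour), count in sorted(failure_profile(stamps).items()):
        print(f"{day} {hour:02d}:00  {count} event(s)")
```

If the buckets cluster on weekday hours only, that would support looking at
the environment (power, load, temperature) rather than at the memory itself.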






******************************** ENTRY 4555 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5790. 
Timestamp of occurrence              26-MAY-1997 11:09:01   
Time since reboot                    2 Day(s) 15:00:19 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
Ext Interface Address Reg xFFFFFF0066C181CF 
Fill Syndrome Reg         x000000000000D900 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000000  CPU0 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4556 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5791. 
Timestamp of occurrence              26-MAY-1997 11:09:01   
Time since reboot                    2 Day(s) 15:00:19 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C181C0 
                                     MC Bus Trans Addr<31:4>: 66C181C0 
MC Error Info Register 1  x800E8900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 
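
The MC Error Info registers in the entry above can be combined into a single
physical address, which helps in judging whether the failing location lies
above the first 1 GB. The Python sketch below is a minimal illustration under
stated assumptions: the bit positions are taken from the DECevent annotations
(MC_ERR0 carries address bits <31:4>; the low byte of MC_ERR1 is assumed to
carry bits <39:32>), the EI_ADDR mask for bits <39:4> is likewise an
assumption, and nothing in this thread confirms how the two memory pairs are
mapped or interleaved. The function names are hypothetical.

```python
GIB = 1 << 30  # 1024 MB


def mc_error_address(mc_err0, mc_err1):
    """Combine MC_ERR0/MC_ERR1 into the latched MC bus physical address."""
    low = mc_err0 & 0xFFFFFFF0   # address bits <31:4>, 16-byte aligned
    high = mc_err1 & 0xFF        # assumed: address bits <39:32>
    return (high << 32) | low


def ei_fill_address(ei_addr):
    """Extract the assumed address bits <39:4> from the CPU's EI_ADDR."""
    return ei_addr & 0xFFFFFFFFF0


def above_first_gib(address):
    """True if the address lies beyond the first 1024 MB of memory."""
    return address >= GIB


if __name__ == "__main__":
    # Values copied from error-log entries 4555 and 4556 in this note.
    iod_address = mc_error_address(0x66C181C0, 0x800E8900)
    cpu_address = ei_fill_address(0xFFFFFF0066C181CF)
    print(f"IOD latched address: {iod_address:#x}")
    print(f"CPU fill address:    {cpu_address:#x}")
    print("same location:", iod_address == cpu_address)
    print("above 1 GB:", above_first_gib(iod_address))
```

Under these assumptions the CPU and both IODs are reporting the same 16-byte
block, so the three entries would describe one correctable error rather than
three separate ones.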


******************************** ENTRY 4557 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5792. 
Timestamp of occurrence              26-MAY-1997 11:09:01   
Time since reboot                    2 Day(s) 15:00:19 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C181C0 
                                     MC Bus Trans Addr<31:4>: 66C181C0 
MC Error Info Register 1  x800E8900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4558 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5793. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
Ext Interface Address Reg xFFFFFF0066C9A1CF 
Fill Syndrome Reg         x000000000000D600 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000000  CPU0 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4559 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5794. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4560 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5795. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4561 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5796. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
Ext Interface Address Reg xFFFFFF0066C9A1CF 
Fill Syndrome Reg         x000000000000D600 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000000  CPU0 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4562 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5797. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4563 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5798. 
Timestamp of occurrence              26-MAY-1997 11:09:02   
Time since reboot                    2 Day(s) 15:00:20 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4564 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5799. 
Timestamp of occurrence              26-MAY-1997 11:09:15   
Time since reboot                    2 Day(s) 15:00:33 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
Ext Interface Address Reg xFFFFFF0066C9A1CF 
Fill Syndrome Reg         x000000000000D600 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000000  CPU0 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4565 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5800. 
Timestamp of occurrence              26-MAY-1997 11:09:15   
Time since reboot                    2 Day(s) 15:00:33 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4566 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5801. 
Timestamp of occurrence              26-MAY-1997 11:09:15   
Time since reboot                    2 Day(s) 15:00:33 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x66C9A1D0 
                                     MC Bus Trans Addr<31:4>: 66C9A1D0 
MC Error Info Register 1  x800E9800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4567 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5802. 
Timestamp of occurrence              26-MAY-1997 11:09:31   
Time since reboot                    2 Day(s) 15:00:49 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                       38. Time Stamp Entry 

SWI Minor class                   7. Timestamp 


******************************** ENTRY 4568 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5803. 
Timestamp of occurrence              26-MAY-1997 11:11:27   
Time since reboot                    2 Day(s) 15:02:45 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF4C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     I-ref fill 
Ext Interface Address Reg xFFFFFF006703B71F 
Fill Syndrome Reg         x000000000000DC00 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000000  CPU0 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4569 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5804. 
Timestamp of occurrence              26-MAY-1997 11:11:27   
Time since reboot                    2 Day(s) 15:02:45 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x6703B700 
                                     MC Bus Trans Addr<31:4>: 6703B700 
MC Error Info Register 1  x800E8800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4570 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5805. 
Timestamp of occurrence              26-MAY-1997 11:11:27   
Time since reboot                    2 Day(s) 15:02:45 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        6. Soft ECC Error 

Memory Minor class                1. Soft ECC error 

Software Flags            x0000000000000000 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x6703B700 
                                     MC Bus Trans Addr<31:4>: 6703B700 
MC Error Info Register 1  x800E8800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        x90000000  Correctable ECC err det by MDPB 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4571 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5806. 
Timestamp of occurrence              26-MAY-1997 11:14:58   
Time since reboot                    2 Day(s) 15:06:16 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                        2. Machine Check  

CPU Minor class                   1. Machine check (670 entry) 

Software Flags            x0000000300000000 
                                     IOD 0 Register Subpkt Pres 
                                     IOD 1 Register Subpkt Pres 
Active CPUs               x00000001 
Hardware Rev              x00000000 
System Serial Number                   
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

* MCHK 670 Regs *                      
Flags:                    x00000000 
PCI Mask                      x0000 
Machine Check Reason          x0098  Fatal Alpha Chip Detected Hard Error 
PAL SHADOW REG 0          x0000000000000000 
PAL SHADOW REG 1          x0000000000000000 
PAL SHADOW REG 2          x0000000000000000 
PAL SHADOW REG 3          x0000000000000000 
PAL SHADOW REG 4          x0000000000000000 
PAL SHADOW REG 5          x000000FBE0000000 
PAL SHADOW REG 6          x6703B70006000221 
PAL SHADOW REG 7          x90000000800E8800 
PALTEMP0                  x0000000000000001 
PALTEMP1                  x0000000000000004 
PALTEMP2                  xFFFFFFFF92850918 
PALTEMP3                  x0000000000004400 
PALTEMP4                  x00000000098BE130 
PALTEMP5                  x0000000000000180 
PALTEMP6                  x0000000000000004 
PALTEMP7                  x0000000000000016 
PALTEMP8                  x0000000000000004 
PALTEMP9                  x0000000000000003 
PALTEMP10                 x000000000012A04C 
PALTEMP11                 x0000000000000000 
PALTEMP12                 xFFFFFFFF83A25C80 
PALTEMP13                 x0000000000006E80 
PALTEMP14                 x0000000000000000 
PALTEMP15                 x00000000000F0000 
PALTEMP16                 x0000009806700001 
PALTEMP17                 x0000529323B0E1EE 
PALTEMP18                 xFFFFFFFF81C20000 
PALTEMP19                 x000000007FF92000 
PALTEMP20                 x0000000042FA2000 
PALTEMP21                 x0000000200000000 
PALTEMP22                 x0000000000CF0000 
PALTEMP23                 x0000000045308080 
Exception Address Reg     x000000000012A04C 
                                     Native-mode Instruction 
                                     Exception PC  x000000000004A813 
Exception Summary Reg     x0000000000000000 
Exception Mask Reg        x0000000000000000 
PAL Base Address Reg      x0000000000008000 
                                     Base Addr for PALcode:  x0000000000000002 
Interrupt Summary Reg     x0000000000200000 
                                     External HW Interrupt at IPL21 
                                     AST Requests 3-0:  x0000000000000000 
IBOX Ctrl and Status Reg  x000000C144020000 
                                     Timeout Counter Bit Clear. 
                                     IBOX Timeout Counter Enabled. 
                                     Floating Point Instr's May be Issued. 
                                     PAL Shadow Registers Enabled. 
                                     Correctable Error Interrupts Enabled. 
                                     ICACHE BIST (Self Test) Was Successful. 
                                     TEST_STATUS_H Pin Asserted 
Icache Par Err Stat Reg   x0000000000000000 
Dcache Par Err Stat Reg   x0000000000000000 
Virtual Address Reg       x00000000009DF9C0 
Memory Mgmt Flt Sts Reg   x0000000000014310 
                                     If Err, Reference Resulted in DTB Miss 
                                     Fault Inst RA Field:  x000000000000000C 

                                     Fault Inst Opcode:  x0000000000000028 
Scache Address Reg        xFFFFFF000000F3AF 
Scache Status Reg         x0000000000000000 
Bcache Tag Address Reg    xFFFFFF80290D0FFF 
                                     Last Bcache Access Resulted in a Miss. 
                                     Value of Parity Bit for Tag Control Status 
                                        Bits Dirty, Shared & Valid is Clear. 
                                     Value of Tag Control Dirty Bit is Clear. 
                                     Value of Tag Control Shared Bit is Clear. 
                                     Value of Tag Control Valid Bit is Set. 
                                     Value of Parity Bit Covering Tag Store 
                                        Address Bits is Clear. 
                                     Tag Address<38:20> Is:  x0000000000000290 
Ext Interface Address Reg xFFFFFF00669221CF 
Fill Syndrome Reg         x0000000000001800 
Ext Interface Status Reg  xFFFFFFF141FFFFFF 
                                     Error Source is Memory or System 
                                     UNCORRECTABLE ECC ERROR 
                                     Error Occurred During D-ref Fill 
LD LOCK                   xFFFFFF0043C5E7CF 

** IOD SUBPACKET -> **               IOD 0 Register Subpacket 

WHOAMI                    x000004FA  Module Revision  1. 
                                     VCTY ASIC Rev = 0 
                                     Bcache Size = 4MB 
                                     CPU = 0 

This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC-PCI Command Register   x46490FB1  Module Self-Test Passed LED On. 
                                     Delayed PCI Bus Reads Protocol: Enabled 
                                     Bridge to PCI Transactions:     Enabled 
                                     Bridge WILL NOT REQUEST 64 Bit Data Trans 
                                     Bridge ACCEPTS 64 Bit Data Transactions 
                                     PCI Address Parity Check:       Enabled 
                                     MC Bus CMD/Addr Parity Check:   Enabled 
                                     MC Bus NXM Check:               Enabled 
                                     Check ALL Transactions for Errors 
                                     Use MC_BMSK for 16 Byte Align Blk Mem Wrt 
                                     Wrt PEND_NUM Threshold:  9. 
                                     RD_TYPE Memory Prefetch Algorithm: Short 
                                     RL_TYPE Mem Rd Line Prefetch Type: Medium 
                                     RM_TYPE Mem Rd Multiple Cmd Type:  Long 
                                     ARB_MODE PCI Arbitration: Round Robin 
Mem Host Address Ext Reg  x00000000  HAE Sparse Mem Adr<31:27> x00000000 
IO Host Adr Ext Register  x00000000  PCI Upper Adr Bits<31:25> x00000000 
Interrupt Ctrl Register   x00000003  Write Device Interrupt Info Struct:Enabled 
Interrupt Request         x00800000  Interrupts asserted  x00000000 
                                     Hard Error 
Interrupt Mask0 Register  x00E51000 
Interrupt Mask1 Register  x00000000 
MC Error Info Register 0  x669221D0 
                                     MC Bus Trans Addr<31:4>: 669221D0 
MC Error Info Register 1  x800E8900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        xC0000000  Uncorrectable ECC err det by MDPB 
                                     MC error info latched 
PCI Bus Trans Error Adr   x00000000 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
** IOD SUBPACKET -> **               IOD 1 Register Subpacket 

WHOAMI                    x000004FA  Module Revision  1. 
                                     VCTY ASIC Rev = 0 
                                     Bcache Size = 4MB 
                                     CPU = 0 

This Bus Bridge Phy Addr  x000000FBE0000000 
                                     IOD# 1 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000002 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:       Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     Device Class: Host Bus to PCI Bridge 
MC-PCI Command Register   x46490FB1  Module Self-Test Passed LED On. 
                                     Delayed PCI Bus Reads Protocol: Enabled 
                                     Bridge to PCI Transactions:     Enabled 
                                     Bridge WILL NOT REQUEST 64 Bit Data Trans 
                                     Bridge ACCEPTS 64 Bit Data Transactions 
                                     PCI Address Parity Check:       Enabled 
                                     MC Bus CMD/Addr Parity Check:   Enabled 
                                     MC Bus NXM Check:               Enabled 
                                     Check ALL Transactions for Errors 
                                     Use MC_BMSK for 16 Byte Align Blk Mem Wrt 
                                     Wrt PEND_NUM Threshold:  9. 
                                     RD_TYPE Memory Prefetch Algorithm: Short 
                                     RL_TYPE Mem Rd Line Prefetch Type: Medium 
                                     RM_TYPE Mem Rd Multiple Cmd Type:  Long 
                                     ARB_MODE PCI Arbitration: Round Robin 
Mem Host Address Ext Reg  x00000000  HAE Sparse Mem Adr<31:27> x00000000 
IO Host Adr Ext Register  x00000000  PCI Upper Adr Bits<31:25> x00000000 
Interrupt Ctrl Register   x00000003  Write Device Interrupt Info Struct:Enabled 
Interrupt Request         x00800000  Interrupts asserted  x00000000 
                                     Hard Error 
Interrupt Mask0 Register  x00C11111 
Interrupt Mask1 Register  x00000000 
MC Error Info Register 0  x669221D0 
                                     MC Bus Trans Addr<31:4>: 669221D0 
MC Error Info Register 1  x800E8900  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read1-Mem 
                                     CPU0 Master at Time of Error 
                                     Device ID:   x00000002 
                                     MC error info valid 
CAP Error Register        xC0000000  Uncorrectable ECC err det by MDPB 
                                     MC error info latched 
PCI Bus Trans Error Adr   x00000000 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       

PALcode Revision                     Palcode Rev: 1.19-2 


******************************** ENTRY 4572 ******************************** 


Logging OS                        1. OpenVMS 
System Architecture               2. Alpha 
OS version                           V6.2-1H3 
Event sequence number          5807. 
Timestamp of occurrence              26-MAY-1997 11:14:58   
Time since reboot                    2 Day(s) 15:06:16 
Host name                            AXP2     

System Model                         AlphaServer 4100 5/400 4MB 

Entry type                       37. Crash Re-Start 

Bugcheck Minor class              1. Crash Re-start 

Bugcheck Msg                         MACHINECHK, Machine check while in kernel 
                                     mode 
Process ID                x000600D7 
Process Name                           
KSP                       x000000007FF91EC0 
ESP                       x000000007FF96000 
SSP                       x000000007FF9C100 
USP                       x000000007ED12E70 
R0                        x0000000000000000 
R1                        x000000007FF91EE0 
R2                        xFFFFFFFF927AE2B8 
R3                        xFFFFFFFF927AE810 
R4                        x0000000000000000 
R5                        x0000000000000180 
R6                        x0000000000000004 
R7                        x000000000861C100 
R8                        x0000000000000006 
R9                        x0000000000000000 
R10                       x0000000000000001 
R11                       x0000000000000000 
R12                       x0000000000000000 
R13                       x0000000000000000 
R14                       x0000000000000000 
R15                       x00000000009DF9B0 
R16                       x0000000000000215 
R17                       x0000000000000001 
R18                       x0000000000000001 
R19                       xFFFFFFFF81C1DF18 
R20                       x0000000000000008 
R21                       xFFFFFFFF81C1DF18 
R22                       x0000000000000100 
R23                       x0000000000000180 
R24                       xFFFFFFFF81C1DC00 
R25                       x0000000000000003 
R26                       x0000000000000210 
R27                       xFFFFFFFF927B6560 
R28                       xFFFFFFFF8003F0EC 
FP                        x000000007FF91EC0 
SP                        x000000007FF91EC0 
PC                        xFFFFFFFF8004E610 
PS                        x0000000000001F00 
PTBR                      x00000000000217D1 
Process Ctl Block Base Re x0000000045308080 
PRBR                      xFFFFFFFF81C20000 
VPTB                      x0000000200000000 
System Ctl Block Base Reg x0000000000000678 
Software Interrupt Summar x0000000000000000 
ASN                       x0000000000000045 
ASTSR ASTEN               x000000000000000F 
FEN                       x0000000000000001 
ASN                       x0000000000000045 
IPL                       x000000000000001F 
MCES                      x0000000000000001 
589.10. "Try swapping high/low members of upper 1GB memory option?" by HARMNY::CUMMINS Tue May 27 1997 11:29 (26 lines)
    There are several different error addresses in this log, all in the
    upper 1GB of memory: correctables early on, with an uncorrectable
    eventually.
    
    Correctables:
    EI_ADDR: xFFFFFF0066C181CF    FILL_SYN: D900  -->  data bit 05
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 04
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 04
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 04
    EI_ADDR: xFFFFFF006703B71F    FILL_SYN: DC00  -->  data bit 07
    
    Uncorrectable:
    EI_ADDR: xFFFFFF00669221CF
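
The "upper 1GB" observation can be checked with a little arithmetic. The sketch below is illustrative only. It assumes that the physical address sits in EI_ADDR bits <39:4> (the remaining bits read as ones in the values above) and that the machine holds 2GB in total, so that "upper 1GB" means addresses from 1GB up to 2GB. The function names are made up for this example.

```python
GIB = 1 << 30
EI_ADDR_MASK = 0xFF_FFFF_FFF0  # assumption: address lives in bits <39:4>


def physical_address(ei_addr):
    """Strip the always-one bits from a raw EI_ADDR value."""
    return ei_addr & EI_ADDR_MASK


def in_upper_gib(ei_addr, low=GIB, high=2 * GIB):
    """True if the address falls in [low, high); 2GB total is an assumption."""
    return low <= physical_address(ei_addr) < high


# Distinct EI_ADDR values quoted in this reply.
EI_ADDRS = [
    0xFFFFFF0066C181CF,
    0xFFFFFF0066C9A1CF,
    0xFFFFFF006703B71F,
    0xFFFFFF00669221CF,
]

for addr in EI_ADDRS:
    print(f"{addr:016X} -> {physical_address(addr):010X} "
          f"upper 1GB: {in_upper_gib(addr)}")
```

Under those assumptions every address listed here lands between 1GB and 2GB.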
    
    The IOD's MDPB chip saw the same errors.
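
One way to see the correspondence is to rebuild the MC bus address from the two IOD registers in the error log and compare it with the EV5 EI_ADDR. This is a rough sketch, not a statement of how the decoder works. It assumes that MC Error Info Register 0 carries address bits <31:4>, that the low byte of MC Error Info Register 1 carries bits <39:32>, and that a 64-byte block is the right granularity for the comparison (the low-order bits of the two views differ in this log). All function names are hypothetical.

```python
def ei_physical(ei_addr):
    """EV5 view: assume the address is in EI_ADDR bits <39:4>."""
    return ei_addr & 0xFF_FFFF_FFF0


def mc_physical(err_info0, err_info1):
    """IOD view: addr<31:4> from register 0, addr<39:32> from register 1.

    Taking addr<39:32> from the low byte of register 1 is an assumption
    made for this example.
    """
    return ((err_info1 & 0xFF) << 32) | (err_info0 & 0xFFFF_FFF0)


def same_block(addr_a, addr_b, block_bytes=64):
    """Compare two addresses at (assumed) 64-byte block granularity."""
    return addr_a // block_bytes == addr_b // block_bytes


# (EI_ADDR, MC Error Info 0, MC Error Info 1) taken from this topic.
PAIRS = [
    (0xFFFFFF006703B71F, 0x6703B700, 0x800E8800),  # correctable
    (0xFFFFFF00669221CF, 0x669221D0, 0x800E8900),  # uncorrectable
]

for ei_addr, info0, info1 in PAIRS:
    ev5 = ei_physical(ei_addr)
    iod = mc_physical(info0, info1)
    print(f"EV5 {ev5:010X}  IOD {iod:010X}  same block: {same_block(ev5, iod)}")
```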
    
    All errors originated in memory, since no DIRTY bit is set.
    
    Re: bad spares: this is quite possible, since from what I have seen the
    quality of the 4100/4000 spares, esp. memory, is just plain awful at best.
    
    Since the data points to MDPB always detecting the fault and the syndrome
    register always points to the high half of the transaction, if you're still
    in experimentation mode, you could try swapping the low/high halves of the
    upper 1GB pair to see if the problem follows the card or not. This would
    give you a good idea whether you are chasing a faulty memory spare versus a
    motherboard or some other systemic problem.
589.11. by HARMNY::CUMMINS Wed May 28 1997 14:07 (9 lines)
Note that the data bit callouts in reply -.1 are as described in the EV5 HW
spec. Since the upper byte of the syndrome is involved in each of the CRDs,
one actually needs to add 64 to these numbers:
 
    EI_ADDR: xFFFFFF0066C181CF    FILL_SYN: D900  -->  data bit 69
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 68
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 68
    EI_ADDR: xFFFFFF0066C9A1CF    FILL_SYN: D600  -->  data bit 68
    EI_ADDR: xFFFFFF006703B71F    FILL_SYN: DC00  -->  data bit 71
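
The "add 64" correction can be written out as a small sketch. It is illustrative only: the lookup table holds just the three syndrome values quoted in this topic (the full table is in the EV5 HW spec and is not reproduced here), and it assumes that the upper FILL_SYN byte covers the upper quadword (data bits 64 and up), that the lower byte covers the lower quadword, and that the same table applies to both. The names are hypothetical.

```python
# Syndrome byte -> data bit within one quadword. Only the three values
# quoted in this topic are listed; anything else returns None.
QUADWORD_BIT = {0xD9: 5, 0xD6: 4, 0xDC: 7}


def split_fill_syn(fill_syn):
    """Return (upper_byte, lower_byte) of a 16-bit FILL_SYN value."""
    return (fill_syn >> 8) & 0xFF, fill_syn & 0xFF


def data_bit(fill_syn):
    """Data bit callout, or None if the syndrome is not in the table."""
    upper, lower = split_fill_syn(fill_syn)
    if upper and not lower:
        bit = QUADWORD_BIT.get(upper)
        # Upper quadword: offset the quadword-relative bit by 64.
        return None if bit is None else bit + 64
    if lower and not upper:
        return QUADWORD_BIT.get(lower)
    return None


for syn in (0xD900, 0xD600, 0xDC00):
    print(f"FILL_SYN: {syn:04X}  -->  data bit {data_bit(syn)}")
```

For the three correctable syndromes above this gives 69, 68 and 71, matching the corrected list.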
589.12. "looking like memory all along (sigh....)" by GIDDAY::FLAWN Fri May 30 1997 10:07 (29 lines)
Thanks, this does now look like bad memory. The earlier failure to get
information was due to ERLBUFFERPAGES being too low and, I think, the machine
check handler blowing the kernel stack with SYSGEN param KSTACKPAGES at 1 (I
set it way up to 6; 2 would probably have done).

The customer has shifted to using a loan 2100 so we could run diags. It turns
out that DECVET wasn't necessary: the console diags pull consistent soft errors
at a reasonably regular rate, which is what we'll chase.

Moving MEM1H down to MEM0L shifts the errors down to the low 1GB, with the
syndrome bits indicting the low card. (Initially MEM1L and MEM1H were swapped,
and this showed the low card faulty in MEM1, so we have consistency.) We'll
now proceed to weed out the others.

This is a pretty rough rate of failure on new boards. Do you think
manufacturing is aware of these instances already?

I don't quite understand the distinction between looking at the syndrome info
and which MDP ASIC saw the error. My understanding is that without syndrome
information I can't pick the card (MDPB is the one seeing the errors even on
the low card). At one stage I thought that an error in the high 64 bits
(seen by MDPB) meant the high card, but that's clearly not the case. What I'm
missing is why, i.e. which card the bits of a given physical address are on.
I've looked at the SPM and the system spec but don't have a HW spec. Maybe
this should be obvious anyway.

Thanks for the help with this mess,
Dave.
589.13. "Not sliced on QW boundaries" by POBOXB::STEINMAN Fri May 30 1997 10:18 (6 lines)
    
    The system bus data bits (127:0) are not perfectly sliced between the
    HIGH and LOW modules of a given memory pair.  Due to routing, timing,
    layout, etc. the bits are somewhat scattered. 
    
    mo
589.14. "Can only isolate to mem pair member on EV5 CRDs" by HARMNY::CUMMINS Mon Jun 02 1997 10:30 (32 lines)
    Another issue is that all versions of the MDP chips have a bug in them
    which can result in data corruption if software accesses registers on the
    MDP chips that require involvement by the MDPB. There are four such CSRs:
    
      MDPA_STAT,
      MDPB_STAT,
      MDPA_SYNDROME, and
      MDPB_SYNDROME.
    
    Unfortunately, these are the registers you need to use to figure out which
    half of a given memory option is at fault in the case of IOD-detected CRDs.
    Accordingly, we changed PALcode early on to not collect state from these
    registers during PALcode CRD handling.
    
    At the time we made the SRM PALcode changes, we implemented a call to PAL
    (CSERVE) that could be used to enable reading of these registers on CRDs.
    The SRM console-based TEST command turns on this feature so that IOD stat
    and syndrome info is collected and displayed during error handling. TEST
    does no writes to media (in FIELD mode), so the risk of data corruption is
    essentially nil. Writes are performed in manufacturing mode TEST, but data
    corruption is recoverable in this environment, should it occur.
    
    Finally, UNIX/VMS PAL always scrubs CRDs, and so the thinking was that the
    MDP bug did not need to be fixed since there should be an EV5-detected CRD
    generated during the read of the faulty memory location, provided the error
    was not a transient. PAL collects syndrome information on EV5-detected CRD
    errors, and this data can be used to isolate to the correct half of the MEM
    pair. [I'm not sure whether HAL scrubs memory on CRDs in an NT environment,
    but will find out and post a reply here with the answer..]
    
    I'd be interested to know if you are seeing IOD-detected CRDs with no
    accompanying EV5-detected CRDs.
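
A rough way to answer that question from an error log is sketched below. It is only an illustration under stated assumptions: it matches IOD-detected and EV5-detected CRDs by address at 64-byte block granularity rather than by timestamp, it assumes the address occupies bits <39:4> of each value, and it takes its inputs already extracted from the log. The function and variable names are hypothetical.

```python
def block(addr, block_bytes=64):
    """Block number of an address; bits <39:4> are assumed to hold it."""
    return (addr & 0xFF_FFFF_FFF0) // block_bytes


def iod_only_crds(iod_crds, ev5_crd_addrs):
    """Return entry numbers of IOD-detected CRDs with no EV5-detected CRD
    in the same (assumed 64-byte) block."""
    ev5_blocks = {block(addr) for addr in ev5_crd_addrs}
    return [entry for entry, addr in iod_crds if block(addr) not in ev5_blocks]


# (error log entry, MC Bus Trans Addr) for the IOD soft errors in this topic.
IOD_CRDS = [(4569, 0x6703B700), (4570, 0x6703B700)]

# EI_ADDR values of the EV5-detected correctables listed in reply .10.
EV5_CRDS = [0xFFFFFF0066C181CF, 0xFFFFFF0066C9A1CF, 0xFFFFFF006703B71F]

print(iod_only_crds(IOD_CRDS, EV5_CRDS))
```

With the data posted so far the list comes back empty, since the IOD entries share a block with the EV5-detected CRD at 6703B71F. A non-empty result would point at IOD-detected CRDs for which no EV5 syndrome information exists.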