T.R | Title | User | Personal Name | Date | Lines |
---|
589.1 | | MAY21::CUMMINS | | Tue May 06 1997 15:06 | 30 |
| Nothing stands out.. The 84000000 in CAP_ERR from the INFO 5 output is
from the last MCHK taken (VMS takes NXM when probing the system bus
(empty slot)). Not a real error. Just a stale sizing error. There's
also soft error environment data in the frame, but that's from the
Power, Fan, Temp Status Normal message one gets across a reset/boot.
Some thoughts/comments:
1. Have you tried taking the newly-added memory pair and swapping it
for the original pair (and running with only 1GB) to see whether
the problem is software versus hardware? Problem could possibly be
the motherboard's second memory slot pair, though unlikely, so the
results of said experiment would not be foolproof / 100% obvious.
2. Next time you boot, do the following:
P00>>> b -h <device,flag,file_list>
.
.
P00>>> info 1
Are there any bad pages marked out of the bitmap passed to VMS?
3. Nothing in the VMS system error log? No recoverable errors, etc.?
Have you tried running V2.4 or V2.3 (with the latest KNL updates)
DECevent on this system?
Other than the above, I'm fresh out of ideas.. Without more data..
BC
|
589.2 | may be graphics card | WRKSYS::RICHARDSON | | Tue May 06 1997 15:10 | 5 |
| What graphics card is in this system? Several of them won't work with
>1G memory (not sure which are even supported on this particular system
anyhow).
/Charlotte
|
589.3 | could PCI0 problems cause this ? | GIDDAY::FLAWN | | Tue May 06 1997 15:39 | 36 |
| Thanks,
I'm waiting for detailed config information from the system - it looks
like there's no graphics card in it from what's shown, though if there
was it would be a reasonable idea to go back to the old S3 TRIO.
My inclination is to look at the hardware revs on parts and see if
anything shows.
I'm not sure if DECevent is on the system (it should be !) but because
very recent errors are going to be still held in the in memory errorlog
buffers we may luck out there but it's worth making sure it's been
checked - thanks.
Something I'm not fully clear on .... with the system not even
resetting would that imply that either :
- we're hung at hardware IPL
- PCI bus 0 is potentially having a problem
I'm not sure what happens with a reset if we're at hardware IPL here.
The other thing I was thinking of was to change the bus 0 configuration
by moving something like the FDDI and ethernet adapters away (assuming
both machines have the same - which is why I need the info). Or
removing the FDDI adapter altogether if it's not connected (the
link_unavail makes it look maybe unconnected). This would more or less
presuppose some kind of weirdo problem with the configuration, but it's
probably not all that common (to include CIPCA and multiple of what seem
to be KZPDAs etc).
If this is a path not worth pursuing please let us know !
Regards and thanks,
Dave Flawn
CSC Sydney
|
589.4 | | MAY21::CUMMINS | | Tue May 06 1997 16:16 | 15 |
| The only problem I'm aware of is that certain older rev DEFPAs
were causing problems when in the same PCI segment as certain
other devices. And I can't remember any more details than this..
I believe the symptom was hangs, though. Will try to get more
details and report back..
You are correct that halts come in over PCI bus zero logic. Halts
are unmaskable. Unless IOD0 is wedged, the halt should occur. It's
possible PAL/console got the halt interrupt (IPL 31), but was unable
to restart - though this would be an extremely unlikely scenario.
More likely, one of the PCI options is wedging the bus. The DEFPA
would be a good first candidate for removal.
BC
|
589.5 | becoming clearer now.... | GIDDAY::FLAWN | | Thu May 08 1997 11:17 | 47 |
| Hi,
Happened again, but more info now.
What I was thinking was not correct. I was working off the same info as in .0
which I took to mean the FDDI card was in PCI0 - it was in PCI1 (it seems the
PCI-PCI bridges make it really weird, though I've yet to physically see this).
So I don't think I understand the slot layout when the KZPDAs are there.
Anyway, the more important information I now have is that the system actually
loops with :
halted CPU 0
halt code = 2
kernel stack not valid halt
PC = ffffffff8004d290
CPU 0 restarting
halted CPU 0
halt code = 2
kernel stack not valid halt
PC = ffffffff8004bde0
CPU 0 restarting
etc.
So my take on this is that what I thought about PCI0 being stuck is wrong since
the console output (serial console) is still working. In any case, the unused
VGA card has now also been removed.
It looks like this is actually OpenVMS taking a kernel stack not valid crash,
probably because of a software problem. What I don't understand is why it's
looping like that, rather than taking a crash. My understanding is that the
console has control right at that point and outputs the kernel stack not valid
halt message.... but it should produce a dump.... In order to fix what now
looks like a software problem we need to get a crash dump out of it so I'm
open to any ideas on this, particularly as I may be missing something obvious
here.
AUTO_ACTION is set to RESTART.
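If the serial console output is being captured to a file, the halt records can be pulled out mechanically to see how many times it looped and whether the PC changes between halts. A minimal sketch in Python (not something from this system), using only the two records quoted above; the regular expression assumes the console always prints the four lines in this order:

```python
import re

# Console text copied from the serial-console capture quoted above.
CONSOLE_LOG = """\
halted CPU 0
halt code = 2
kernel stack not valid halt
PC = ffffffff8004d290
CPU 0 restarting
halted CPU 0
halt code = 2
kernel stack not valid halt
PC = ffffffff8004bde0
CPU 0 restarting
"""

# Assumed record layout: CPU line, halt code line, reason line, PC line.
HALT_RE = re.compile(
    r"halted CPU (?P<cpu>\d+)\s+"
    r"halt code = (?P<code>\d+)\s+"
    r"(?P<reason>[^\n]+)\s+"
    r"PC = (?P<pc>[0-9a-fA-F]+)"
)


def parse_halts(text):
    """Return one dict per halt record found in the console text."""
    return [
        {
            "cpu": int(m.group("cpu")),
            "code": int(m.group("code")),
            "reason": m.group("reason").strip(),
            "pc": int(m.group("pc"), 16),
        }
        for m in HALT_RE.finditer(text)
    ]


halts = parse_halts(CONSOLE_LOG)
distinct_pcs = sorted({h["pc"] for h in halts})
for h in halts:
    print(f"CPU {h['cpu']} halt code {h['code']} ({h['reason']}) PC={h['pc']:016x}")
print(f"{len(halts)} halts, {len(distinct_pcs)} distinct PCs")
```

In the two records above the PC differs between halts, so it is not failing at one fixed instruction.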
Regards and thanks,
Dave.
|
589.6 | | HARMNY::CUMMINS | | Thu May 08 1997 12:04 | 30 |
| Set auto_action halt and disable VMS bugcheck reboots.
Then use the console crash command if/when you get another KSNV. You
might want to type INFO 4, INFO 5, and INFO 8 before forcing the crash
(in case the crash doesn't work for some reason). The INFO output will
give you all GPR, FPR, IPR, and CSR state that you'd find in the crash
dump file. I.e. except the system memory dump..
I looked back at .0 and the FDDI was indeed in PCI0.
The 4100 has two separate PCI buses/hoses (0 and 1). The KZPDA is a
bridged QLOGIC option; i.e. it has a QLOGIC ISP1040 behind a PCI-PCI
bridge. The aforementioned hoses 0 and 1 are the top-level PCI buses.
The PCI-PCI bridge (PPB) spawns a secondary PCI bus, upon which the
QLOGIC device sits. Console (and VMS/UNIX) always reserve secondary
bus 1 for EISA, regardless of whether a given PCI hose spawns an
EISA bus. Therefore, console assigns the secondary buses associated
with the KZPDAs as bus 2 off their respective primary PCI buses.
Finally, there is a VMS issue with crash dumps when a VGA is in the
system. See note 423.12 for details (inability to crash dump on VMS
when VGA present). VMS will be changing how/when it clears MCES in
VMS 7.2. Console V4.8 and beyond includes a hack of sorts to help
resolve the inability to crash problem in most cases. You should
therefore update the console to V4.8-5 (V3.9 CD) at some point for
this customer - assuming he/she wants to put back the VGA card at
some point.
Let us know how things turn out.
BC
|
589.7 | thanks, will go take a look myself shortly ... | GIDDAY::FLAWN | | Thu May 08 1997 17:40 | 43 |
| Thanks,
And for explaining the bridge bus numbering. I'm told that the FDDI adapter was
in PCI bus 1 before .... (even though it looks like we both see it as being on
bus 0).
I'm travelling to the site today and will update here with the results. At this
stage I plan on (in only approximate order) :
1. Write a quick program to mungle a process kernel stack pointer and try
to reproduce. If reproducible try a different type of crash to see if
it's just these or all crash types.
2. Rerun ECU in case that does some good
3. Fix something I've seen now on console output showing 4 billion (looks like
MAXINT of some reasonably sized bitfield) environmental events - first
clear the events, if that doesn't work do the neat save/clear/restore NVRAM
thing (console is 4.8-6). If still no work try replacing the XICOR NVRAM
and finally saddle.
4. Increase sysgen parameter KSTACKPAGES to 6.
5. Sort out the device naming issue - move parts around till it makes sense
or indicates a problem.
6. Review software configuration and if nothing else has worked to nail the
problem down and the few software ECOs for known Digital caused kernel
stack invalid VMS crashes are relevant then apply as appropriate.
If I can reproduce this it should be possible to fix, even if we had to try
different console versions or a 300Mhz CPU to make it dump.
Thanks for taking the interest in this, and to Rawhide engineering in general
for their assistance thru notes. Problems such as this, while formally
warranting escalation due to severity, really need to be narrowed down or
resolved in the field.
Sorry about not having all the info on this earlier!
Thanks,
Dave.
|
589.8 | Comments on last reply | MAY21::CUMMINS | | Thu May 08 1997 17:59 | 62 |
| Feedback on your most recent reply..
And for explaining the bridge bus numbering. I'm told that the FDDI adapter was
in PCI bus 1 before .... (even though it looks like we both see it as being on
bus 0).
BC> Yes, log from base note definitely shows it hanging off PCI hose 0.
1. Write a quick program to mungle a process kernel stack pointer and try
to reproduce. If reproducible try a different type of crash to see if
it's just these or all crash types.
2. Rerun ECU in case that does some good
BC> I'm 99% sure you'll be wasting your time here.
3. Fix something I've seen now on console output showing 4 billion (looks like
MAXINT of some reasonably sized bitfield) environmental events - first
clear the events, if that doesn't work do the neat save/clear/restore NVRAM
thing (console is 4.8-6). If still no work try replacing the XICOR NVRAM
and finally saddle.
BC> This is a known problem that was caused when some systems slipped
BC> thru MFG without having their RCM NVRAMs properly initialized. PAL
BC> stores environmental events in the RCM NVRAM. You could have broken
BC> HW, but it's more likely you have one of the uninit'd machines.
BC>
BC> To check for HW presence/okay, type the following:
BC>
BC> P00>>>ls iic_rcm*
BC> iic_rcm_nvram0 iic_rcm_nvram1 iic_rcm_nvram2 iic_rcm_nvram3 iic_rcm_nvram4
BC> iic_rcm_nvram5 iic_rcm_nvram6 iic_rcm_nvram7 iic_rcm_temp
BC>
BC> You should see all of the above devices. If not, you're either missing
BC> all or part of the COMBO/RCM logic.
BC>
BC> Most likely you simply need to init the NVRAM. Type the following:
BC>
BC> P00>>> d iic_rcm_nvram6:4 -q 20000010057
BC>
BC> Does this make the SHOW POWER problem go away?
4. Increase sysgen parameter KSTACKPAGES to 6.
5. Sort out the device naming issue - move parts around till it makes sense
or indicates a problem.
BC> No problem with what I saw from the base note.. Other than someone
BC> apparently telling you DEFPA was in PCI1 (hose 1).
6. Review software configuration and if nothing else has worked to nail the
problem down and the few software ECOs for known Digital caused kernel
stack invalid VMS crashes are relevant then apply as appropriate.
If I can reproduce this it should be possible to fix, even if we had to try
different console versions or a 300Mhz CPU to make it dump.
BC> Don't understand the 300 MHz CPU comment. Latest V4.8 console does
BC> work around the VMS MCES and crash dumping issue (when VGA present).
BC> So, if VGA present, you should update to latest console. Not a bad
BC> idea to do so anyway, since provides various new features/fixes.
BC> See LFU release notes for details..
|
589.9 | looks like hw or environment | GIDDAY::FLAWN | | Mon May 26 1997 08:58 | 1249 |
| Hi,
It now looks like this is a hardware problem. We got a similar failure
after returning to the 2GB configuration, but this time, with ERLBUFFERPAGES
high enough for good error logging and DECevent installed, we got some
information and the system started to dump, but then hit what looks like the
VMS problem with MCES where it falls into XDELTA. While the customer is running
the latest console it looks like we may have hit an instance where that can
still happen (removing the VGA may help get more info).
Unfortunately I hadn't given them instructions on dumping the mchk
logout area as I hadn't expected this one....
It looks to me that this new behaviour (getting some error info and starting
to dump, rather than doing the kernel stack not valid halt loop) may be
because the machine check handler was overflowing the kernel stack before...
so we didn't get the machine check crash but instead got the KSTACKNV. Just
bad luck I suppose....
The error log info makes this look like a memory problem (the IOD and CPU
were seeing errors at this time) but given the number of memory options
attempted we can probably rule it out unless we've been unfortunate with
the spares.
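One way to cross-check the CPU and IOD views of these errors is to compare the addresses they report. The sketch below is in Python (not something from this system) and uses the Ext Interface Address Reg and MC Error Info Register 0 values from the DECevent entries at the end of this note. It makes two assumptions: that only bits <39:4> of the CPU's address register are meaningful, with the rest reading as ones, which is consistent with how the logged values line up; and that memory is mapped contiguously from address 0, so anything above 1GB exists only in the 2GB configuration.

```python
GIB = 1 << 30
ADDR_MASK_39_4 = 0xFFFFFFFFF0  # assumed: bits <39:4> are the valid address bits

# (CPU Ext Interface Address Reg, IOD MC Error Info Register 0), paired by
# adjacent error log entries: 4555/4556 and 4558/4559.
PAIRS = [
    (0xFFFFFF0066C181CF, 0x66C181C0),
    (0xFFFFFF0066C9A1CF, 0x66C9A1D0),
]


def cpu_phys_addr(ei_addr):
    """Strip the bits assumed not to carry address information."""
    return ei_addr & ADDR_MASK_39_4


def compare(ei_addr, mc_info0):
    """Compare the CPU-reported and IOD-reported addresses of one error."""
    cpu = cpu_phys_addr(ei_addr)
    iod = mc_info0 & 0xFFFFFFF0  # MC bus trans addr <31:4>; <39:32> is 0 in the log
    return {
        "cpu": cpu,
        "iod": iod,
        "same_64_byte_region": (cpu >> 6) == (iod >> 6),
        "above_1gb": cpu >= GIB and iod >= GIB,
    }


results = [compare(ei, mc) for ei, mc in PAIRS]
for r in results:
    print(f"CPU {r['cpu']:#012x}  IOD {r['iod']:#012x}  "
          f"same 64-byte region: {r['same_64_byte_region']}  "
          f"above 1GB: {r['above_1gb']}")
```

For the values logged, the first pair matches exactly and the second differs by 0x10. Both sit at roughly 1.6GB, so under the mapping assumption above they fall in the range that is only present with 2GB installed.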
Instead, and given that these failures seem to occur at about the same time
each day, but not on weekends, I think we're either seeing environmental
problems or a motherboard problem.... showing up when the higher physical
addresses are hit (though the customer says this is happening a bit after
their heaviest load time). This happens on two machines (we've not
tried 2GB with the higher KSTACKPAGES in the other machine which was doing
the KSTACKNV, so I can't be sure it's the same but it appears likely).
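To make the "same time each day, not on weekends" pattern easier to confirm once more incidents are logged, the error log timestamps can be bucketed by weekday and hour. A rough sketch in Python (not something from this system); the sample timestamps are the three distinct ones from the entries at the end of this note, so they cover a single incident and show the method only, not the pattern itself:

```python
from collections import Counter
from datetime import datetime

MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def parse_vms_time(stamp):
    """Parse a timestamp such as '26-MAY-1997 11:09:01'."""
    date_part, time_part = stamp.split()
    day, mon, year = date_part.split("-")
    hour, minute, second = (int(x) for x in time_part.split(":"))
    return datetime(int(year), MONTHS[mon.upper()], int(day), hour, minute, second)


def bucket_by_weekday_hour(stamps):
    """Count error timestamps per (weekday, hour) bucket."""
    counts = Counter()
    for stamp in stamps:
        t = parse_vms_time(stamp)
        counts[(WEEKDAYS[t.weekday()], t.hour)] += 1
    return counts


# Distinct "Timestamp of occurrence" values from the error log entries.
SAMPLE = [
    "26-MAY-1997 11:09:01",
    "26-MAY-1997 11:09:02",
    "26-MAY-1997 11:09:15",
]

buckets = bucket_by_weekday_hour(SAMPLE)
for (weekday, hour), count in sorted(buckets.items()):
    print(f"{weekday} {hour:02d}:00  {count} error(s)")
```

With timestamps from several days fed in, a cluster in one or two weekday/hour buckets and nothing on Sat/Sun would support the environmental theory.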
The parts in these systems are all fairly early FRS but I can't see any
known issues that would match these symptoms, so I'll try to reproduce
this with DECVET and/or swap the motherboard, but am open to any suggestions.
We'll also set up a Dranetz with RI monitoring to see if it shows anything.
Regards and thanks,
Dave.
******************************** ENTRY 4555 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5790.
Timestamp of occurrence 26-MAY-1997 11:09:01
Time since reboot 2 Day(s) 15:00:19
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF0066C181CF
Fill Syndrome Reg x000000000000D900
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4556 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5791.
Timestamp of occurrence 26-MAY-1997 11:09:01
Time since reboot 2 Day(s) 15:00:19
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C181C0
MC Bus Trans Addr<31:4>: 66C181C0
MC Error Info Register 1 x800E8900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4557 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5792.
Timestamp of occurrence 26-MAY-1997 11:09:01
Time since reboot 2 Day(s) 15:00:19
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C181C0
MC Bus Trans Addr<31:4>: 66C181C0
MC Error Info Register 1 x800E8900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4558 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5793.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF0066C9A1CF
Fill Syndrome Reg x000000000000D600
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4559 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5794.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4560 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5795.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4561 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5796.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF0066C9A1CF
Fill Syndrome Reg x000000000000D600
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4562 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5797.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4563 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5798.
Timestamp of occurrence 26-MAY-1997 11:09:02
Time since reboot 2 Day(s) 15:00:20
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4564 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5799.
Timestamp of occurrence 26-MAY-1997 11:09:15
Time since reboot 2 Day(s) 15:00:33
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF0066C9A1CF
Fill Syndrome Reg x000000000000D600
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4565 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5800.
Timestamp of occurrence 26-MAY-1997 11:09:15
Time since reboot 2 Day(s) 15:00:33
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4566 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5801.
Timestamp of occurrence 26-MAY-1997 11:09:15
Time since reboot 2 Day(s) 15:00:33
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x66C9A1D0
MC Bus Trans Addr<31:4>: 66C9A1D0
MC Error Info Register 1 x800E9800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4567 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5802.
Timestamp of occurrence 26-MAY-1997 11:09:31
Time since reboot 2 Day(s) 15:00:49
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 38. Time Stamp Entry
SWI Minor class 7. Timestamp
******************************** ENTRY 4568 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5803.
Timestamp of occurrence 26-MAY-1997 11:11:27
Time since reboot 2 Day(s) 15:02:45
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF4C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
I-ref fill
Ext Interface Address Reg xFFFFFF006703B71F
Fill Syndrome Reg x000000000000DC00
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000000 CPU0 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4569 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5804.
Timestamp of occurrence 26-MAY-1997 11:11:27
Time since reboot 2 Day(s) 15:02:45
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x6703B700
MC Bus Trans Addr<31:4>: 6703B700
MC Error Info Register 1 x800E8800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4570 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5805.
Timestamp of occurrence 26-MAY-1997 11:11:27
Time since reboot 2 Day(s) 15:02:45
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 6. Soft ECC Error
Memory Minor class 1. Soft ECC error
Software Flags x0000000000000000
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x6703B700
MC Bus Trans Addr<31:4>: 6703B700
MC Error Info Register 1 x800E8800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x90000000 Correctable ECC err det by MDPB
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4571 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5806.
Timestamp of occurrence 26-MAY-1997 11:14:58
Time since reboot 2 Day(s) 15:06:16
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 2. Machine Check
CPU Minor class 1. Machine check (670 entry)
Software Flags x0000000300000000
IOD 0 Register Subpkt Pres
IOD 1 Register Subpkt Pres
Active CPUs x00000001
Hardware Rev x00000000
System Serial Number
Module Serial Number
Module Type x0000
System Revision x00000000
* MCHK 670 Regs *
Flags: x00000000
PCI Mask x0000
Machine Check Reason x0098 Fatal Alpha Chip Detected Hard Error
PAL SHADOW REG 0 x0000000000000000
PAL SHADOW REG 1 x0000000000000000
PAL SHADOW REG 2 x0000000000000000
PAL SHADOW REG 3 x0000000000000000
PAL SHADOW REG 4 x0000000000000000
PAL SHADOW REG 5 x000000FBE0000000
PAL SHADOW REG 6 x6703B70006000221
PAL SHADOW REG 7 x90000000800E8800
PALTEMP0 x0000000000000001
PALTEMP1 x0000000000000004
PALTEMP2 xFFFFFFFF92850918
PALTEMP3 x0000000000004400
PALTEMP4 x00000000098BE130
PALTEMP5 x0000000000000180
PALTEMP6 x0000000000000004
PALTEMP7 x0000000000000016
PALTEMP8 x0000000000000004
PALTEMP9 x0000000000000003
PALTEMP10 x000000000012A04C
PALTEMP11 x0000000000000000
PALTEMP12 xFFFFFFFF83A25C80
PALTEMP13 x0000000000006E80
PALTEMP14 x0000000000000000
PALTEMP15 x00000000000F0000
PALTEMP16 x0000009806700001
PALTEMP17 x0000529323B0E1EE
PALTEMP18 xFFFFFFFF81C20000
PALTEMP19 x000000007FF92000
PALTEMP20 x0000000042FA2000
PALTEMP21 x0000000200000000
PALTEMP22 x0000000000CF0000
PALTEMP23 x0000000045308080
Exception Address Reg x000000000012A04C
Native-mode Instruction
Exception PC x000000000004A813
Exception Summary Reg x0000000000000000
Exception Mask Reg x0000000000000000
PAL Base Address Reg x0000000000008000
Base Addr for PALcode: x0000000000000002
Interrupt Summary Reg x0000000000200000
External HW Interrupt at IPL21
AST Requests 3-0: x0000000000000000
IBOX Ctrl and Status Reg x000000C144020000
Timeout Counter Bit Clear.
IBOX Timeout Counter Enabled.
Floating Point Instr's May be Issued.
PAL Shadow Registers Enabled.
Correctable Error Interrupts Enabled.
ICACHE BIST (Self Test) Was Successful.
TEST_STATUS_H Pin Asserted
Icache Par Err Stat Reg x0000000000000000
Dcache Par Err Stat Reg x0000000000000000
Virtual Address Reg x00000000009DF9C0
Memory Mgmt Flt Sts Reg x0000000000014310
If Err, Reference Resulted in DTB Miss
Fault Inst RA Field: x000000000000000C
Fault Inst Opcode: x0000000000000028
Scache Address Reg xFFFFFF000000F3AF
Scache Status Reg x0000000000000000
Bcache Tag Address Reg xFFFFFF80290D0FFF
Last Bcache Access Resulted in a Miss.
Value of Parity Bit for Tag Control Status
Bits Dirty, Shared & Valid is Clear.
Value of Tag Control Dirty Bit is Clear.
Value of Tag Control Shared Bit is Clear.
Value of Tag Control Valid Bit is Set.
Value of Parity Bit Covering Tag Store
Address Bits is Clear.
Tag Address<38:20> Is: x0000000000000290
Ext Interface Address Reg xFFFFFF00669221CF
Fill Syndrome Reg x0000000000001800
Ext Interface Status Reg xFFFFFFF141FFFFFF
Error Source is Memory or System
UNCORRECTABLE ECC ERROR
Error Occurred During D-ref Fill
LD LOCK xFFFFFF0043C5E7CF
** IOD SUBPACKET -> ** IOD 0 Register Subpacket
WHOAMI x000004FA Module Revision 1.
VCTY ASIC Rev = 0
Bcache Size = 4MB
CPU = 0
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC-PCI Command Register x46490FB1 Module Self-Test Passed LED On.
Delayed PCI Bus Reads Protocol: Enabled
Bridge to PCI Transactions: Enabled
Bridge WILL NOT REQUEST 64 Bit Data Trans
Bridge ACCEPTS 64 Bit Data Transactions
PCI Address Parity Check: Enabled
MC Bus CMD/Addr Parity Check: Enabled
MC Bus NXM Check: Enabled
Check ALL Transactions for Errors
Use MC_BMSK for 16 Byte Align Blk Mem Wrt
Wrt PEND_NUM Threshold: 9.
RD_TYPE Memory Prefetch Algorithm: Short
RL_TYPE Mem Rd Line Prefetch Type: Medium
RM_TYPE Mem Rd Multiple Cmd Type: Long
ARB_MODE PCI Arbitration: Round Robin
Mem Host Address Ext Reg x00000000 HAE Sparse Mem Adr<31:27> x00000000
IO Host Adr Ext Register x00000000 PCI Upper Adr Bits<31:25> x00000000
Interrupt Ctrl Register x00000003 Write Device Interrupt Info Struct:Enabled
Interrupt Request x00800000 Interrupts asserted x00000000
Hard Error
Interrupt Mask0 Register x00E51000
Interrupt Mask1 Register x00000000
MC Error Info Register 0 x669221D0
MC Bus Trans Addr<31:4>: 669221D0
MC Error Info Register 1 x800E8900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register xC0000000 Uncorrectable ECC err det by MDPB
MC error info latched
PCI Bus Trans Error Adr x00000000
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
** IOD SUBPACKET -> ** IOD 1 Register Subpacket
WHOAMI x000004FA Module Revision 1.
VCTY ASIC Rev = 0
Bcache Size = 4MB
CPU = 0
This Bus Bridge Phy Addr x000000FBE0000000
IOD# 1
Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001
B3040 Module Revision: x00000002
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
Device Class: Host Bus to PCI Bridge
MC-PCI Command Register x46490FB1 Module Self-Test Passed LED On.
Delayed PCI Bus Reads Protocol: Enabled
Bridge to PCI Transactions: Enabled
Bridge WILL NOT REQUEST 64 Bit Data Trans
Bridge ACCEPTS 64 Bit Data Transactions
PCI Address Parity Check: Enabled
MC Bus CMD/Addr Parity Check: Enabled
MC Bus NXM Check: Enabled
Check ALL Transactions for Errors
Use MC_BMSK for 16 Byte Align Blk Mem Wrt
Wrt PEND_NUM Threshold: 9.
RD_TYPE Memory Prefetch Algorithm: Short
RL_TYPE Mem Rd Line Prefetch Type: Medium
RM_TYPE Mem Rd Multiple Cmd Type: Long
ARB_MODE PCI Arbitration: Round Robin
Mem Host Address Ext Reg x00000000 HAE Sparse Mem Adr<31:27> x00000000
IO Host Adr Ext Register x00000000 PCI Upper Adr Bits<31:25> x00000000
Interrupt Ctrl Register x00000003 Write Device Interrupt Info Struct:Enabled
Interrupt Request x00800000 Interrupts asserted x00000000
Hard Error
Interrupt Mask0 Register x00C11111
Interrupt Mask1 Register x00000000
MC Error Info Register 0 x669221D0
MC Bus Trans Addr<31:4>: 669221D0
MC Error Info Register 1 x800E8900 MC bus trans addr <39:32> x00000000
MC Command is Read1-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register xC0000000 Uncorrectable ECC err det by MDPB
MC error info latched
PCI Bus Trans Error Adr x00000000
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.19-2
******************************** ENTRY 4572 ********************************
Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V6.2-1H3
Event sequence number 5807.
Timestamp of occurrence 26-MAY-1997 11:14:58
Time since reboot 2 Day(s) 15:06:16
Host name AXP2
System Model AlphaServer 4100 5/400 4MB
Entry type 37. Crash Re-Start
Bugcheck Minor class 1. Crash Re-start
Bugcheck Msg MACHINECHK, Machine check while in kernel
mode
Process ID x000600D7
Process Name
KSP x000000007FF91EC0
ESP x000000007FF96000
SSP x000000007FF9C100
USP x000000007ED12E70
R0 x0000000000000000
R1 x000000007FF91EE0
R2 xFFFFFFFF927AE2B8
R3 xFFFFFFFF927AE810
R4 x0000000000000000
R5 x0000000000000180
R6 x0000000000000004
R7 x000000000861C100
R8 x0000000000000006
R9 x0000000000000000
R10 x0000000000000001
R11 x0000000000000000
R12 x0000000000000000
R13 x0000000000000000
R14 x0000000000000000
R15 x00000000009DF9B0
R16 x0000000000000215
R17 x0000000000000001
R18 x0000000000000001
R19 xFFFFFFFF81C1DF18
R20 x0000000000000008
R21 xFFFFFFFF81C1DF18
R22 x0000000000000100
R23 x0000000000000180
R24 xFFFFFFFF81C1DC00
R25 x0000000000000003
R26 x0000000000000210
R27 xFFFFFFFF927B6560
R28 xFFFFFFFF8003F0EC
FP x000000007FF91EC0
SP x000000007FF91EC0
PC xFFFFFFFF8004E610
PS x0000000000001F00
PTBR x00000000000217D1
Process Ctl Block Base Re x0000000045308080
PRBR xFFFFFFFF81C20000
VPTB x0000000200000000
System Ctl Block Base Reg x0000000000000678
Software Interrupt Summar x0000000000000000
ASN x0000000000000045
ASTSR ASTEN x000000000000000F
FEN x0000000000000001
ASN x0000000000000045
IPL x000000000000001F
MCES x0000000000000001
|
589.10 | Try swapping high/low members of upper 1GB memory option? | HARMNY::CUMMINS | | Tue May 27 1997 11:29 | 26 |
| Several different error addresses in this log. All in upper 1GB of
memory. Correctables early on with uncorrectable eventually..
Correctables:
EI_ADDR: xFFFFFF0066C181CF FILL_SYN: D900 --> data bit 05
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 04
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 04
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 04
EI_ADDR: xFFFFFF006703B71F FILL_SYN: DC00 --> data bit 07
Uncorrectable:
EI_ADDR: xFFFFFF00669221CF
IOD's MDPB chip saw the same errors..
All errors originated in/from memory since no DIRTY bit set.
Re: bad spares.. This is quite possible since from what I have seen the
quality of the 4100/4000 spares, esp. memory, is just plain awful at best.
Since the data points to MDPB always detecting the fault and the syndrome
register always points to the high half of the transaction, if you're still
in experimentation mode, you could try swapping the low/high halves of the
upper 1GB pair to see if the problem follows the card or not. This would
give you a good idea whether you are chasing a faulty memory spare versus a
motherboard or some other systemic problem.
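As a rough cross-check of the "all in upper 1GB" reading, the EI_ADDR values quoted in this reply can be masked down to a physical address and compared against the 1GB boundary. This is only a sketch: it assumes EI_ADDR carries the address in bits <39:4> with the remaining bits reading as ones, and that the original pair occupies the first 1GB. The helper names are made up for illustration.

```python
GB = 1 << 30

def ei_addr_to_phys(ei_addr):
    """Keep bits <39:4> of EI_ADDR (assumed layout; other bits read as ones)."""
    return ei_addr & 0xFFFFFFFFF0

def in_upper_gb(ei_addr):
    """True if the address falls in the second 1GB (assumed to be the new pair)."""
    return GB <= ei_addr_to_phys(ei_addr) < 2 * GB

# EI_ADDR values quoted in this reply.
LOGGED = [
    0xFFFFFF0066C181CF,
    0xFFFFFF0066C9A1CF,
    0xFFFFFF006703B71F,
    0xFFFFFF00669221CF,  # the uncorrectable
]

for addr in LOGGED:
    print(f"{addr:016X} -> {ei_addr_to_phys(addr):010X} "
          f"upper 1GB: {in_upper_gb(addr)}")
```

Every address listed masks to something between x40000000 and x7FFFFFFF, which is consistent with the newly added pair if the assumed layout holds.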
|
589.11 | | HARMNY::CUMMINS | | Wed May 28 1997 14:07 | 9 |
| Note that the data bit callouts in reply -.1 are as described in the EV5 HW
spec. Since the upper byte of the syndrome is involved in each of the CRDs,
one actually needs to add 64 to these numbers..
EI_ADDR: xFFFFFF0066C181CF FILL_SYN: D900 --> data bit 69
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 68
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 68
EI_ADDR: xFFFFFF0066C9A1CF FILL_SYN: D600 --> data bit 68
EI_ADDR: xFFFFFF006703B71F FILL_SYN: DC00 --> data bit 71
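The add-64 adjustment can be sketched as below. It assumes FILL_SYN holds the lower-quadword syndrome in bits <7:0> and the upper-quadword syndrome in bits <15:8>. The lookup table holds only the three syndromes quoted in this thread, not the full EV5 table, and the function name is illustrative.

```python
# Syndrome -> data bit within a quadword, as quoted in reply .10.
# Only the three values seen in this thread; NOT the full EV5 table.
KNOWN_SYNDROMES = {0xD9: 5, 0xD6: 4, 0xDC: 7}

def crd_data_bit(fill_syn):
    """Map a single-bit FILL_SYN value to a data bit in the 128-bit fill."""
    lo = fill_syn & 0xFF          # assumed: lower quadword syndrome
    hi = (fill_syn >> 8) & 0xFF   # assumed: upper quadword syndrome
    if bool(lo) == bool(hi):
        raise ValueError("expected exactly one non-zero syndrome byte")
    syndrome, offset = (hi, 64) if hi else (lo, 0)
    if syndrome not in KNOWN_SYNDROMES:
        raise ValueError(f"syndrome x{syndrome:02X} not in the quoted table")
    return KNOWN_SYNDROMES[syndrome] + offset

for syn in (0xD900, 0xD600, 0xDC00):
    print(f"FILL_SYN {syn:04X} --> data bit {crd_data_bit(syn)}")
```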
|
589.12 | looking like memory all along (sigh....) | GIDDAY::FLAWN | | Fri May 30 1997 10:07 | 29 |
|
Thanks, this does now look like bad memory. The earlier failure to get
information was due to ERLBUFFERPAGES being too low and, I think, the machine
check handler blowing the kernel stack with SYSGEN param KSTACKPAGES at 1 (I
set it way up to 6; 2 would probably have done).
The customer has shifted to using a loan 2100 so we could run diags. It turns
out that DECVET wasn't necessary: the console diags pull consistent soft errors
at a reasonably regular rate, which is what we'll chase.
Moving MEM1H down to MEM0L shifts the errors down to the low 1GB with the
syndrome bits indicting the low card. (Initially MEM1L and MEM1H were swapped
and this showed the low card faulty in MEM1, so we have consistency). We'll
now proceed to weed out the others.
This is a pretty rough rate of failure on new boards - do you think
manufacturing is aware of these instances already ?
I don't quite understand the distinction between looking at the syndrome info
and which MDP ASIC saw the error - my understanding is that without syndrome
information I can't pick the card (MDPB is the one seeing the errors even on
the low card). At one stage I thought if the error was in the high 64 bits
(seen by MDPB) then that meant the high card but that's clearly not the case
(what I'm missing is why - i.e. which card the bits of a given physical
address are on - I've looked at the SPM and the system spec ... don't have a HW
spec).... maybe this should be obvious anyway....
Thanks for the help with this mess,
Dave.
|
589.13 | Not sliced on QW boundaries | POBOXB::STEINMAN | | Fri May 30 1997 10:18 | 6 |
|
The system bus data bits (127:0) are not perfectly sliced between the
HIGH and LOW modules of a given memory pair. Due to routing, timing,
layout, etc. the bits are somewhat scattered.
mo
|
589.14 | Can only isolate to mem pair member on EV5 CRDs | HARMNY::CUMMINS | | Mon Jun 02 1997 10:30 | 32 |
| Another issue is that all versions of the MDP chips have a bug in them
which can result in data corruption if software accesses registers on the
MDP chips that require involvement by the MDPB. There are four such CSRs:
MDPA_STAT,
MDPB_STAT,
MDPA_SYNDROME, and
MDPB_SYNDROME.
Unfortunately, these are the registers you need to use to figure out which
half of a given memory option is at fault in the case of IOD-detected CRDs.
Accordingly, we changed PALcode early on to not collect state from these
registers during PALcode CRD handling.
At the time we made the SRM PALcode changes, we implemented a call to PAL
(CSERVE) that could be used to enable reading of these registers on CRDs.
The SRM console-based TEST command turns on this feature so that IOD stat
and syndrome info is collected and displayed during error handling. TEST
does no writes to media (in FIELD mode), so the risk of data corruption is
essentially nil. Writes are performed in manufacturing mode TEST, but data
corruption is recoverable in this environment, should it occur.
Finally, UNIX/VMS PAL always scrubs CRDs, and so the thinking was that the
MDP bug did not need to be fixed since there should be an EV5-detected CRD
generated during the read of the faulty memory location, provided the error
was not a transient. PAL collects syndrome information on EV5-detected CRD
errors, and this data can be used to isolate to the correct half of the MEM
pair. [I'm not sure whether HAL scrubs memory on CRDs in an NT environment,
but will find out and post a reply here with the answer..]
I'd be interested to know if you are seeing IOD-detected CRDs with no
accompanying EV5-detected CRDs.
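One way to look for IOD-detected CRDs with no EV5-detected partner is to pair entries from the error log by timestamp. The sketch below is an illustration under assumptions: it treats machine check reason x0086 as EV5-detected and x0204 as IOD-detected (as in the entries posted earlier), assumes partners share an identical timestamp, and uses a hand-built list rather than parsing real DECevent output.

```python
EV5_CRD = 0x0086  # Alpha Chip Detected ECC Error, From Memory
IOD_CRD = 0x0204  # IOD Detected Soft Error

def orphan_iod_crds(entries):
    """Return (seq, timestamp) of IOD CRDs with no EV5 CRD at that timestamp."""
    ev5_times = {ts for _, ts, reason in entries if reason == EV5_CRD}
    return [(seq, ts) for seq, ts, reason in entries
            if reason == IOD_CRD and ts not in ev5_times]

# (event sequence number, timestamp, machine check reason), copied by hand
# from the soft ECC entries posted earlier in this topic.
ENTRIES = [
    (5799, "26-MAY-1997 11:09:15", EV5_CRD),
    (5800, "26-MAY-1997 11:09:15", IOD_CRD),
    (5801, "26-MAY-1997 11:09:15", IOD_CRD),
    (5803, "26-MAY-1997 11:11:27", EV5_CRD),
    (5804, "26-MAY-1997 11:11:27", IOD_CRD),
    (5805, "26-MAY-1997 11:11:27", IOD_CRD),
]

print(orphan_iod_crds(ENTRIES))  # prints []
```

For the six entries listed, each IOD-detected CRD has an EV5-detected CRD at the same timestamp, so nothing is reported as an orphan.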
|