T.R | Title | User | Personal Name | Date | Lines |
---|
469.1 | | POBOXB::BAK | | Thu Feb 06 1997 11:26 | 1 |
| You need to get the error logs from DECEvent....
|
469.2 | Memory problem, likely | POBOXB::STEINMAN | | Thu Feb 06 1997 12:17 | 26 |
|
Looking at the error output in .0, you have an old version of DECEvent
that isn't cracking the error info, but I was able to discern what the
problem is.
I believe you have a faulty memory module or pair.
The address that produced the correctable error is:
4D8FB340, so you can use the console SHOW MEM command to find
the base address of the memory pair that contains this address.
It appears that the error syndrome is B5 (though
I cannot be sure) which points to data<47> which would indicate the
high memory card of the pair that contains this address.
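The address lookup described above can be sketched in Python. This is only an illustration: the base addresses and sizes in `EXAMPLE_PAIRS` are invented placeholders, not values from this system, and `find_pair` is a hypothetical helper name. Substitute whatever the console SHOW MEM command actually reports.

```python
# Hypothetical memory map: (pair name, base address, size in bytes).
# These values are invented for illustration only; take the real
# ones from the console SHOW MEM output.
EXAMPLE_PAIRS = [
    ("pair 0", 0x00000000, 0x40000000),  # assumed 1 GB pair at 0
    ("pair 1", 0x40000000, 0x40000000),  # assumed 1 GB pair at 1 GB
]


def find_pair(addr, pairs):
    """Return the name of the pair whose address range contains addr."""
    for name, base, size in pairs:
        if base <= addr < base + size:
            return name
    return None  # address is outside every listed pair


failing_addr = 0x4D8FB340  # address that produced the correctable error
print(find_pair(failing_addr, EXAMPLE_PAIRS))
```

With the placeholder map above the address falls in the second pair; the real answer depends entirely on the actual base addresses.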
I also noticed that you reported the cache size as 3MB. The 400 MHz
module has a 4MB BCache, but it is reported as 3MB if you have the
wrong version of console. If this is indeed the case, I'd look into
upgrading the firmware as well.
If you need any further assistance, feel free to email me directly
at POBOXA::STEINMAN or call me at DTN: 223-3874
mo
|
469.3 | We'll test it tomorrow!! | MDR01::CARRANZ | MCS Madrid | Thu Feb 06 1997 12:49 | 7 |
| Many thanks for your quick reply. We are going to install the latest DECevent
version (2.3) and check the firmware version and the memory.
I'll update the note with news.
Carmen.
|
469.4 | Need some more information | POBOXB::DONALDSON | | Thu Feb 06 1997 18:13 | 28 |
| Hi,
There are a couple of curious items in your information. First, you say
the system is a 4-CPU system, yet the log says only 2 CPUs are present.
Second, it looks like the error you are getting is a correctable error,
but the error log you have posted is filtered, so the register information
is being suppressed. Can you post the full error log entry for one of
the 620 errors so we can see the registers? That way we can see what
component is causing the correctable errors.
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 9.
Timestamp of occurrence 31-JAN-1997 15:33:13
Host name rtrprd2
System type register x00000016 Systype 22. Not announced yet
---> Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
---> CPU Minor class 4. 620 System Correctable Error
|
469.5 | Probably memory problems. Thanks all!! | MDR01::CARRANZ | MCS Madrid | Mon Feb 10 1997 10:46 | 192 |
| Hi all,
Many thanks for your replies.
Sorry for the mistake: I included my own computer's System Startup entry
instead of the customer's.
I append it below.
(At startup the computer had just two CPUs because we were testing the
4-CPU backplane.)
We are now updating the firmware version and we are going to change the memory.
As has been said here, it looks like a memory problem.
We have updated to DECevent V2.3 and the "CPU EXCEPTION" errors now look like this:
dia -R -f ./binary.errlog -i cpus | more
******************************** ENTRY 35 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 8.
Timestamp of occurrence 31-JAN-1997 15:33:13
Host name rtrprd2
System type register x00000016 AlphaServer 4000 Series
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000001
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x00000003
Hardware Rev x00000000
System Serial Number AY62722544
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF004D8FB37F
Fill Syndrome Reg x00000000000000B5
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000001 CPU1 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
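A small Python sketch for reading the "Active CPUs" field in the entry above. It assumes the field is a bitmask with one bit per CPU number, which is consistent with the values in this thread (x00000003 alongside mpnum 2, x0000000F alongside mpnum 4) but is not confirmed by the log itself. `cpus_from_mask` is a hypothetical helper name.

```python
def cpus_from_mask(mask):
    """List the bit positions set in mask, lowest first.

    Assumption: each set bit in the "Active CPUs" field marks one
    active CPU, with bit 0 standing for CPU0.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(cpus_from_mask(0x00000003))  # mask from the two-CPU entry
print(cpus_from_mask(0x0000000F))  # mask seen when mpnum is 4
```

Under that assumption the first mask decodes to CPU0 and CPU1, and the second to CPU0 through CPU3.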
# uerf -R -r 300 -f ./binary.errlog | more
********************************* ENTRY 1. *********************************
----- EVENT INFORMATION -----
EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
SEQUENCE NUMBER 1.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Sun Feb 2 07:19:52 1997
OCCURRED ON SYSTEM rtrprd2
SYSTEM ID x00070016
SYSTYPE x00000000
MESSAGE Alpha boot: available memory from
_0x2972000 to 0x7fff6000
Digital UNIX V3.2F (Rev. 69.73); Fri
_Oct 25 14:41:14 MET DST 1996
physical memory = 2048.00 megabytes.
available memory = 2006.51 megabytes.
using 7856 buffers containing 61.37
_megabytes of memory
Master cpu at slot 0.
Firmware revision: 1.2
PALcode: Digital-UNIX/OSF version 1.21
AlphaServer 4100 5/400 3MB
pci1 at mcbus0 slot 5
psiop0 at pci1 slot 1
Loading SIOP: script c0001900, reg
_1222200, data c000d8f8
scsi0 at psiop0 slot 0
rz5 at scsi0 bus 0 target 5 lun 0 (DEC
_ RRD45 (C) DEC 0436)
pza0 at pci1 slot 2
pza0 firmware version: DEC P01 A10
_
scsi1 at pza0 slot 0
rz8 at scsi1 bus 1 target 0 lun 0 (DEC
_ HSZ40 V27Z)
rz9 at scsi1 bus 1 target 1 lun 0 (DEC
_ HSZ40 V27Z)
rz10 at scsi1 bus 1 target 2 lun 0
_(DEC HSZ40 V27Z)
rz11 at scsi1 bus 1 target 3 lun 0
_(DEC HSZ40 V27Z)
tu0: DECchip 21040-AA: Revision: 2.4
tu0 at pci1 slot 3
tu0: DEC TULIP Ethernet Interface,
_hardware address: 00-00-F8-21-ED-4A
tu0: auto sensing: selected UTP
_(10BaseT) port
pza1 at pci1 slot 4
pza1 firmware version: DEC P01 A10
_
scsi2 at pza1 slot 0
rz16 at scsi2 bus 2 target 0 lun 0
_(DEC HSZ40 V27Z)
rz17 at scsi2 bus 2 target 1 lun 0
_(DEC HSZ40 V27Z)
rz18 at scsi2 bus 2 target 2 lun 0
_(DEC HSZ40 V27Z)
rz19 at scsi2 bus 2 target 3 lun 0
_(DEC HSZ40 V27Z)
psiop1 at pci1 slot 5
Loading SIOP: script c162f900, reg
_1222000, data c163bcf8
scsi3 at psiop1 slot 0
gpc0 at eisa0
pci0 at mcbus0 slot 4
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
dns0 at eisa0
dns0: Digital WAN Device Driver
_Interface
dns1: Digital WAN Device Driver
_Interface
Initializing xcr0. Please wait.
Initializing xcr0. Please wait.
Initializing xcr0. Please wait.
Initializing xcr0. Please wait.
xcr0 at pci0 slot 2
re0 at xcr0 unit 0 (unit status =
_ONLINE, raid level = 1)
re1 at xcr0 unit 1 (unit status =
_ONLINE, raid level = 1)
fta0 DEC DEFPA FDDI Module, Hardware
_Revision 0
fta0 at pci0 slot 5
fta0: DMA Available.
fta0: DEC DEFPA (PDQ) FDDI Interface,
_Hardware address: 00-00-F8-40-F4-C1
fta0: Firmware rev: 2.46
Created FRU table configuration binary
_log packet
lvm0: configured.
lvm1: configured.
dli: configured
SuperLAT. Copyright 1993 Meridian
_Technology Corp. All rights
_reserved.
x25_access: configured
x25_relay: configured
wandd_base: configured
wandd_llc2: configured
wandd_lapb: configured
wan_utilities: configured
ctf_base: configured
Node ID is 00-00-f8-21-ed-4a (from
_device tu0)
dna_netman: configured
dna_dli: configured
Again, thank you very much.
Carmen.
|
469.6 | CPU problems!! | MDR01::CARRANZ | MCS Madrid | Wed Feb 12 1997 12:46 | 153 |
| Hello,
We have installed DECevent V2.3 and updated the computer's firmware (CD 3.8).
Could somebody please take a look at the log below to confirm our
"cpu exception" problem?
Many thanks and regards,
Carmen Arranz.
******** ****** ***** **** ****
Our errlog now looks like this:
# dia -R -i cpus -f ./binary.errlog
DECevent V2.3
******************************** ENTRY 2 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 74.
Timestamp of occurrence 09-FEB-1997 23:20:02
Host name rtrprd2
System type register x00000016 AlphaServer 4000 Series
Number of CPUs (mpnum) x00000004
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x0000000F
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x000000FBE0000000
Dev Type & Rev Register x06000231 CAP Chip Revision: x00000001
HORSE Module Revision: x00000003
SADDLE Module Revision: x00000002
SADDLE Module Type: Left Hand
Internal CAP Chip Arbiter: Enabled
PCI Class Code x00000600
MC Error Info Register 0 x4D8FB340
MC Bus Trans Addr<31:4>: 4D8FB340
MC Error Info Register 1 x800E8800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x88000000 Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 73.
Timestamp of occurrence 09-FEB-1997 23:20:02
Host name rtrprd2
System type register x00000016 AlphaServer 4000 Series
Number of CPUs (mpnum) x00000004
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x0000000F
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x000000F9E0000000
Dev Type & Rev Register x06008231 CAP Chip Revision: x00000001
HORSE Module Revision: x00000003
SADDLE Module Revision: x00000002
SADDLE Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
PCI Class Code x00000600
MC Error Info Register 0 x4D8FB340
MC Bus Trans Addr<31:4>: 4D8FB340
MC Error Info Register 1 x800E8800 MC bus trans addr <39:32> x00000000
MC Command is Read0-Mem
CPU0 Master at Time of Error
Device ID: x00000002
MC error info valid
CAP Error Register x88000000 Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
******************************** ENTRY 4 ********************************
.... and so on ....
I'll put our binary.errlog into:
chueca:: (51.195)
Thanks and regards,
Carmen Arranz & Elsa Soengas
|
469.7 | | MAY30::CUMMINS | | Wed Feb 12 1997 13:08 | 20 |
| See note 484. Your customer's system and the one described in note
484 are possibly experiencing the same symptoms. The noter in note 484
indicated there may have been 630 errors, but we're checking on this.
1024MB (1GB) EDO memory pairs were being used on the system described
in note 484. There's a problem with SYNC memories and older revision
motherboards that I didn't bother to discuss in the 484 note string.
Does your system have any SYNC memories? If so, do you know what the
system motherboard revision level is? The footprint of the problem I'm
describing is that only IOD-detected 620 CRD errors are ever seen. No
CPU-detected CRDs are ever logged.
Are there any 630 entries in the customer's error log? Have you tried
running the console TEST command with this version of console? If 630s,
use the FILL_SYNDROME data to determine the card pair member. Use the
error address and the SRM console SHOW MEMORY command to determine
which memory pair is faulty (assuming the problem is memory). If SYNC
memory, then the problem may well be an older-rev motherboard.
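The footprint check described above can be written down as a rough Python sketch. The two reason codes and their descriptions are taken from the DECevent output posted in this note string (x0086 and x0204); the function name `triage` and the wording of its results are illustrative only, and the sketch is not a substitute for checking the memory type and motherboard revision.

```python
# "Machine Check Reason" codes as they appear in the posted 620 entries.
CPU_DETECTED = 0x0086  # "Alpha Chip Detected ECC Error, From Memory"
IOD_DETECTED = 0x0204  # "IOD Detected Soft Error"


def triage(reasons):
    """Classify the Machine Check Reason codes collected from 620 entries.

    Illustrative only: the returned strings paraphrase the reasoning
    in this note, they are not DECevent output.
    """
    has_cpu = CPU_DETECTED in reasons
    has_iod = IOD_DETECTED in reasons
    if has_iod and not has_cpu:
        return "IOD-only: check for SYNC memory on an older-rev motherboard"
    if has_cpu:
        return "CPU-detected CRD: use syndrome and address to find the pair"
    return "no matching 620 CRD entries"


print(triage([0x0204, 0x0204]))  # only IOD-detected entries
print(triage([0x0086, 0x0204]))  # CPU-detected entry also present
```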
BC
|
469.8 | The rest of the story.... | POBOXB::STEINMAN | | Wed Feb 12 1997 16:35 | 141 |
|
Bill,
That was only a partial listing from DECEvent. Here is the complete
log of the error (including the EV5 detected error):
From: POBOXA::SHEPARD "GARY DTN 223-2499" 12-FEB-1997 15:35:23.74
To: POBOXB::STEINMAN
CC: SHEPARD
Subj: RE: DECEvent log + pointer to binary -- wanna have a look to confirm memory failure? Sure looks like it to me...thanks
Hi Mo,
Here is the CPU detected CRD followed by the IOD detected. It has
a syndrome of B5 just like you determined from the notes file.
This matches up with your analysis in the notes file.
Gary
******************************** ENTRY 7 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 30.
Timestamp of occurrence 05-FEB-1997 17:21:46
Host name rtrprd2
System type register x00000016 AlphaServer 4000 Series
Number of CPUs (mpnum) x00000004
CPU logging event (mperr) x00000003
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x0000000F
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0086 Alpha Chip Detected ECC Error, From Memory
Ext Interface Status Reg xFFFFFFF0C1FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
Ext Interface Address Reg xFFFFFF004D8E337F
Fill Syndrome Reg x00000000000000B5
Interrupt Summary Reg x0000000100000000
Correctable ECC Errors (IPL31)
AST Requests 3-0: x0000000000000000
WHOAMI x00000003 CPU3 Detected This Error
--IOD REGISTERS FOLLOW--
Base Addr of Bridge x0000000000000000
Register Contents Not Valid For This Error
Dev Type & Rev Register x00000000 Register Contents Not Valid For This Error
MC Error Info Register 0 x00000000 Register Contents Not Valid For This Error
MC Error Info Register 1 x00000000 Register Contents Not Valid For This Error
CAP Error Register x00000000 Register Contents Not Valid For This Error
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
******************************** ENTRY 8 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 31.
Timestamp of occurrence 05-FEB-1997 17:21:46
Host name rtrprd2
System type register x00000016 AlphaServer 4000 Series
Number of CPUs (mpnum) x00000004
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
Software Flags x0000000000000000
Active CPUs x0000000F
Hardware Rev x00000000
System Serial Number C1563
Module Serial Number
Module Type x0000
System Revision x00000000
Machine Check Reason x0204 IOD Detected Soft Error
Ext Interface Status Reg x0000000000000000
Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
Register Contents Not Valid For This Error
Fill Syndrome Reg x0000000000000000
Register Contents Not Valid For This Error
Interrupt Summary Reg x0000000000000000
Register Contents Not Valid For This Error
WHOAMI x00000000 Register Contents Not Valid For This Error
--IOD REGISTERS FOLLOW--
This Bus Bridge Phy Addr x000000F9E0000000
IOD# 0
Dev Type & Rev Register x06008231 CAP Chip Revision: x00000001
B3040 Module Revision: x00000003
B3050 Module Revision: x00000002
B3050 Module Type: Left Hand
PCI-EISA Bus Bridge Present on PCI Segment
Device Class: Host Bus to PCI Bridge
MC Error Info Register 0 x4D8E3340
MC Bus Trans Addr<31:4>: 4D8E3340
MC Error Info Register 1 x800FDA00 MC bus trans addr <39:32> x00000000
MC Command is ReadMod0-Mem
CPU3 OR IOD3 Master at Time of Error
Device ID: x00000007
MC error info valid
CAP Error Register x88000000 Correctable ECC err det by MDPA
MC error info latched
MDPA Status Register x00000000 MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid
MDPB Status Register x00000000 MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid
PALcode Revision Palcode Rev: 1.21-3
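One way to check that the CPU-detected and IOD-detected entries above refer to the same location is to compare their addresses after aligning both to a 64-byte block. The sketch below is Python and rests on two assumptions that are not stated in the log: that the upper bits of the Ext Interface Address Reg are not significant for this comparison, and that 64 bytes is the right block size. `block_addr` and `BLOCK_MASK` are hypothetical names.

```python
# Keep bits <31:6>: the low 32 bits, aligned down to a 64-byte block.
# Both the 32-bit truncation and the block size are assumptions.
BLOCK_MASK = 0xFFFFFFC0


def block_addr(reg_value):
    """Reduce a logged address register value to its aligned block address."""
    return reg_value & BLOCK_MASK


cpu_addr = 0xFFFFFF004D8E337F  # Ext Interface Address Reg, ENTRY 7
iod_addr = 0x4D8E3340          # MC Error Info Register 0, ENTRY 8

print(hex(block_addr(cpu_addr)))
print(block_addr(cpu_addr) == block_addr(iod_addr))
```

Under those assumptions both registers reduce to the same block address, which fits the reading that the two entries log the same memory CRD.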
|
469.9 | ...a little more of the rest of the story | POBOXB::STEINMAN | | Wed Feb 12 1997 16:38 | 6 |
|
Just to complete the story....486.2 is the DECEvent log for
the problem described in note 469. It is not a CPU problem, but
a memory CRD problem.
mo
|