| Title: | Alpha Workstation Conference |
| Notice: | See note 1.* for conference notices |
| Moderator: | WRKSYS::HOUSE |
| Created: | Wed Sep 07 1994 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 1996 |
| Total number of notes: | 9122 |
Hi,
Getting CPU EXCEPTION errors on an AlphaStation 500/333, 384MB memory,
running UNIX V3.2G.
System hasn't crashed yet but we are getting over 300 errors in a week.
DECevent is not yet installed on this system, and uerf doesn't tell us much.
(I suggested the customer look for DECevent on the UNIX software CD, V4.0 or higher.)
Meanwhile, could someone analyse the following uerf report,
or point me to the relevant documents?
Thanks for your help in advance.
CRDC/Mitz
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 382.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Apr 21 06:35:34 1997
OCCURRED ON SYSTEM NWSCMX2
SYSTEM ID x0005000F
SYSTYPE x00000000
----- UNIT INFORMATION -----
UNIT CLASS CPU
RECORD ENTRY DUMP:
RECORD HEADER
0000: 017E00A0 0005000F 00060101 335B0AB6 *..~...........[3*
0010: 4353574E 0032584D 00000000 00000000 *NWSCMX2.........*
0020: 00000001 00000000 15040064 00000000 *........d.......*
0030: 00000000 00000000 *........ *
RECORD BODY
0038: 00000060 80000000 00000018 00000038 *`...........8...*
0048: 00000086 00000000 020FF4DF FFFFFF00 *................*
            ^^
            |_____ Is this OSF/1 PAL err code 86, D-CACHE PE?
0058: 0000005D 00000000 C4FFFFFF FFFFFFF0 *]...............*
0068: 00000000 00000001 00000000 00000000 *................*
0078: 00000000 00000000 00000000 00000000 *................*
0088: 00000000 00000000 00000000 00000000 *................*
0098: 00000000 5E3C7E25 *....%~<^ *
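A rough sketch of how the record body above can be read by hand, assuming each printed group is a 32-bit longword in increasing address order and that adjacent pairs form 64-bit quadwords (low half first). The field names and their order are hypothetical labels, inferred only by comparing this dump with the DECevent translation in reply 1934.2, not taken from a specification.

```python
# Record body longwords exactly as printed by uerf, starting at offset 0x48.
# Assumption: groups are 32-bit values in increasing address order, and each
# adjacent pair (low half first) forms one 64-bit quadword.
BODY_FROM_0X48 = [
    0x00000086, 0x00000000, 0x020FF4DF, 0xFFFFFF00,
    0x0000005D, 0x00000000, 0xC4FFFFFF, 0xFFFFFFF0,
    0x00000000, 0x00000001,
]


def quadwords(longwords):
    """Pair up longwords (low half first) into 64-bit values."""
    return [lo | (hi << 32) for lo, hi in zip(longwords[0::2], longwords[1::2])]


# Hypothetical field labels; the order is inferred from the DECevent output.
FIELDS = ["mchk_code", "ei_addr", "fill_syndrome", "ei_stat", "isr"]
decoded = dict(zip(FIELDS, quadwords(BODY_FROM_0X48)))

for name, value in decoded.items():
    print(f"{name:14s} x{value:016X}")
```

Read this way, the quadword at offset 0x48 is the machine check code 86 that the question above asks about, and the next two quadwords are the EI address and the fill syndrome.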
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1934.1 | Some info | WRKSYS::DISCHLER | I don't wanna wait in vain | Wed Apr 23 1997 11:11 | 7 |
There was an ECO early on that placed two capacitors
on the write_enable lines going from the CIA chip to a buffer
between the DIMMs. You can experience corrected cache errors
without them.
Also, reseat your dimms.
RJD
| 1934.2 | they all are MC630's. | KAOFS::M_NAKAGAWA | | Wed Apr 23 1997 19:22 | 73 |
RE: .-1
Thanks for your reply. Do you happen to have the ECO number?
The customer has successfully installed DECevent on his system.
We are getting "MC630 Bcache error", "EV5 Detected Corr ECC Error".
All errors have the same FILL_SYNDROME 5D, so I suspect a bad DIMM.
This system has a total of 384MB memory.
I haven't confirmed the exact DIMM configuration with my customer yet,
but on the assumption that:
    BANK A = 256MB (64MB DIMM x4)
    BANK B = 128MB (32MB DIMM x4)
which DIMM location does FILL_SYNDROME 5D point to?
Jerry's nice machine check program V3.5 doesn't cover the AS500. The AS600
is very close, but its memory configuration is different from the AS500's.
Thanks,
CRDC/Mitz
**************************** ENTRY 565 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 382.
Timestamp of occurrence 21-APR-1997 06:35:34
Host name NWSCMX2
System type register x0000000F Alcor
Number of CPUs (mpnum) x00000001
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 3. Bcache error (630 entry)
Flags: x80000000 Retryable Error
Mchk Error Code x0000000000000086
EV5 Detected Corr ECC Error
EI ADDR xFFFFFF00020FF4DF
FILL SYNDROME x000000000000005D
EI STATUS xFFFFFFF0C4FFFFFF
Error occurred during D-ref fill
ISR x0000000100000000
Correctable ECC errors (IPL31)
AST requests 3 - 0 x0000000000000000
CIA Syndrome x0000000000000000
ECC Syndrome x0000000000000000
MEM ERR0 x0000000000000000
Memory Port Address x0000000000000000
MEM ERR1 x0000000000000000
Bits <33:32> of Memory Po x0000000000000000
Bit <39> of Memory Port x0000000000000000
Memory Command x0000000000000000
Mask When Err Occurred x0000000000000000
Mem Seq State Idle
Encoded Set Sel: Set 0 Selected
CIA ERR STAT x0000000000000000
Memory Cycle Source is PCI
IO Cmnd/Addr Queue Vld Bi x0000000000000000
CPU Cmnd/Addr Queue Vld B x0000000000000000
DM State: Idle
EV5 Resp. for DMA: No Response
CIA ERR x0000000000000000
| 1934.3 | AS500 dimm callout MC630's | CSC32::HUTMACHER | | Thu Apr 24 1997 09:27 | 236 |
Hi Mitz,
I got this handy dandy decoder text file for memory errors from the
WRKSYS::ALPHASTATION500 notes file, note 106.*, and here goes.
EI ADDR        xFFFFFF00020FF4DF    this is in the 32-33 MB region
                         -------    failing address
FILL SYNDROME  x000000000000005D    this is the failing syndrome
                              --    in bits <7:0>
*** If the memory setup is like this, the bad DIMM is in the BANK A set
    of DIMMs: DIMM J25
    BANK A = 256MB (64MB DIMM x4)
    BANK B = 128MB (32MB DIMM x4)
*** If the memory setup is like this, the bad DIMM is in the BANK B set
    of DIMMs: DIMM J22, since the system still sizes the larger DIMMs at
    the lowest memory address range first, no matter which bank they are
    plugged into.
    BANK A = 128MB (32MB DIMM x4)
    BANK B = 256MB (64MB DIMM x4)
Using the Fill Syndrome (2 bytes) and ei_address with the procedure in the
article below gives:
ei_address 20FF4DF has bit <4> = 1 (set)
                |_ D = 1101, so use the Odd QW columns on the chart
Fill Syndrome 005D has the syndrome in bits <7:0>, so use the Low column in
the Odd QW section of the chart
            Odd QW section
syndr       /   \
0x5D, 39,2,4,1,1, /* DAT43,115,188,260 */
             |
             |_ Low synd column, so failing DIMM slot = 1 from the chart
Now use the final decoder to go from this DIMM slot # to the actual Jxx slot
on the AS500's motherboard:
         * DIMM slot # we found above
         *
slot     1   2   3   4
----------------------
bank A  25  26  28  30
bank B  22  27  29  23
         |
         |_ so the failing DIMM is either J25 or J22, depending on which
            bank (A or B) the failing address 20FF4DF falls into.
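The arithmetic in this worked example can be checked with a few lines of Python. This is only a sketch: masking EI ADDR down to its low 32 bits is an assumption made here to match the 20FF4DF value used above, not something the decoder article states.

```python
EI_ADDR = 0xFFFFFF00020FF4DF
FILL_SYNDROME = 0x005D

# Assumption: only the low 32 bits are used as the failing address,
# which reproduces the 20FF4DF value quoted in this note.
addr = EI_ADDR & 0xFFFFFFFF

megabyte = addr // (1024 * 1024)      # whole megabytes below the address
odd_qw = bool(addr & (1 << 4))        # bit <4> set means an odd quadword
low_byte = FILL_SYNDROME & 0xFF       # syndrome bits <7:0>
high_byte = (FILL_SYNDROME >> 8) & 0xFF   # syndrome bits <15:8>

print(f"address x{addr:X} lies between {megabyte} and {megabyte + 1} MB")
print(f"odd QW: {odd_qw}, low syndrome x{low_byte:02X}, high syndrome x{high_byte:02X}")
```

The address falls between 32 and 33 MB, bit <4> is set, and only the low syndrome byte is non-zero, which is why the Odd QW / Low column of the chart is the one to read.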
jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
here's the decoder article for alphastation500
--------------------------------------------------------------------------------
Maverick AS500 DIMM lookup table
Use this table to determine the failing DIMM on an AS500 workstation.
Required inputs:
Fill Syndrome (2 bytes)
ei_address
Procedure:
Isolate the bank -
At the SRM console use the show memory command to determine the
system's memory configuration.
For example this system has only one bank. If 2 banks are present
use the base address to determine in which bank the error was
detected.
>>>show mem
Memory Size = 256Mb
Bank    Size/Sets   Base Addr  Speed
------  ----------  ---------  -----
00      256Mb/2     000000000  Fast
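As a minimal sketch of the bank-isolation step, the base address and size reported by show mem can be compared against the failing address. The `Bank` type and `find_bank` helper are hypothetical names, and the two-bank layout is purely illustrative (it is not taken from a real console listing).

```python
from typing import NamedTuple, Optional, Sequence

MB = 1024 * 1024


class Bank(NamedTuple):
    name: str
    base: int   # base address in bytes, from the "Base Addr" column
    size: int   # bank size in bytes, from the "Size/Sets" column


def find_bank(addr: int, banks: Sequence[Bank]) -> Optional[Bank]:
    """Return the bank whose address range contains addr, or None."""
    for bank in banks:
        if bank.base <= addr < bank.base + bank.size:
            return bank
    return None


# The single-bank system from the show mem example above.
single = [Bank("00", 0x000000000, 256 * MB)]

# Hypothetical two-bank layout: 256 MB at address 0, then 128 MB above it.
two_banks = [Bank("00", 0x00000000, 256 * MB), Bank("01", 0x10000000, 128 * MB)]

print(find_bank(0x020FF4DF, two_banks))
```

Note that the bank numbers printed by the console are not necessarily the A/B labels on the motherboard, so the result still has to be matched to the physical bank.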
Isolate the DIMM -
1 - if bit<4> is 0 then the QW is even
    if bit<4> is 1 then the QW is odd
    e.g.  ei_address   QW
          567020       even
          567030       odd
          etc.
2 - fill_syndrome is 2 bytes
    <15:8> high
    <7:0>  low
3 - Use the table below and scan down the left side for the matching
    syndrome; no match indicates a 2-bit error. Scan across the row and
    stop at the column for the correct combination of even/odd QW and
    hi/low syndrome byte.
    The number indicates the failing DIMM. You have determined the
    bank from the address; now use the table to determine the slot
    number.
    The numbers are the actual designators on the motherboard.
slot     1   2   3   4
----------------------
bank A  25  26  28  30
bank B  22  27  29  23
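The three steps above can be sketched as a small lookup. Only three rows of the chart are copied here to keep the example short, the function name `failing_dimms` is hypothetical, and the column order (even low, even high, odd low, odd high) is read from the chart header in this article.

```python
# Excerpt of the chart: syndrome -> (even low, even high, odd low, odd high).
# Only three rows are copied here; a real lookup needs the whole chart.
TABLE_EXCERPT = {
    0x5B: (2, 4, 2, 1),
    0x5D: (2, 4, 1, 1),
    0x62: (2, 4, 2, 1),
}

# DIMM slot number -> Jxx designator on the motherboard, per bank.
SLOT_TO_J = {
    "A": {1: 25, 2: 26, 3: 28, 4: 30},
    "B": {1: 22, 2: 27, 3: 29, 4: 23},
}


def failing_dimms(ei_address, fill_syndrome, bank, table=TABLE_EXCERPT):
    """Return one Jxx designator (or None for no match) per non-zero syndrome byte."""
    odd = bool(ei_address & (1 << 4))          # step 1: bit<4> selects even/odd QW
    halves = (
        (False, fill_syndrome & 0xFF),         # step 2: <7:0> is the low byte
        (True, (fill_syndrome >> 8) & 0xFF),   #         <15:8> is the high byte
    )
    results = []
    for is_high, byte in halves:
        if byte == 0:
            continue                           # no error reported in this byte
        row = table.get(byte)
        if row is None:
            results.append(None)               # step 3: no match in the table
            continue
        slot = row[2 * odd + is_high]          # pick the even/odd, low/high column
        results.append(f"J{SLOT_TO_J[bank][slot]}")
    return results


print(failing_dimms(0x020FF4DF, 0x005D, "A"))
print(failing_dimms(0x020FF4DF, 0x005D, "B"))
```

A `None` entry corresponds to the "no match" case in step 3, which the article attributes to a 2-bit error; with the excerpt it can also simply mean the row was not copied.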
Charts
------
/*
 * Columns, left to right:
 *   Syndrome, Data Bit, even QW low, even QW high, odd QW low, odd QW high
 *
 *          slot     1   2   3   4
 *          ----------------------
 *          bank A  25  26  28  30
 *          bank B  22  27  29  23
 *
 * Use this table to find the DIMM at fault!
 */
0x01,0xff00,4,4,2,1, /* DAT16,88,160,232 */
0x02,0xff01,4,4,2,1, /* DAT17,89,161,233 */
0x04,0xff02,4,4,1,2, /* DAT34,106,178,250 */
0x08,0xff03,4,4,1,2, /* DAT35,107,179,251 */
0x0b, 17,3,4,3,1, /* DAT19,91,163,235 */
0x0E, 16,3,4,3,1, /* DAT18,90,162,234 */
0x10,0xff04,4,4,1,1, /* DAT52,124,196,268 */
0x13, 18,3,4,3,1, /* DAT20,92,164,236 */
0x15, 19,3,4,3,1, /* DAT21,93,165,236 */
0x16, 20,3,4,3,1, /* DAT22,94,166,237 */
0x19, 21,3,4,3,1, /* DAT23,95,167,238 */
0x1A, 22,3,4,3,1, /* DAT24,96,168,239 */
0x1C, 23,3,4,3,1, /* DAT25,97,169,240 */
0x20,0xff05,4,4,1,1, /* DAT53,125,197,269 */
0x23, 8,3,4,3,1, /* DAT8,80,152,224 */
0x25, 9,3,4,3,1, /* DAT9,81,153,225 */
0x26, 10,3,4,3,1, /* DAT10,82,154,226 */
0x29, 11,3,4,3,1, /* DAT11,83,155,227 */
0x2A, 12,3,4,3,1, /* DAT12,84,156,228 */
0x2C, 13,3,4,3,1, /* DAT13,85,157,229 */
0x31, 14,3,4,3,1, /* DAT14,86,158,230 */
0x34, 15,3,4,3,1, /* DAT15,87,159,231 */
0x40,0xff06,4,4,1,1, /* DAT70,142,214,286 */
0x4A, 33,2,4,2,1, /* DAT37,109,181,253 */
0x4F, 32,2,4,2,1, /* DAT37,109,182,254 */
0x52, 34,2,4,2,1, /* DAT38,110,183,255 */
0x54, 35,2,4,2,1, /* DAT39,111,184,256 */
0x57, 36,2,4,2,1, /* DAT40,112,185,257 */
0x58, 37,2,4,2,1, /* DAT41,113,186,258 */
0x5B, 38,2,4,2,1, /* DAT42,114,187,259 */
0x5D, 39,2,4,1,1, /* DAT43,115,188,260 */
0x62, 56,2,4,2,1, /* DAT62,134,206,278 */
0x64, 57,2,4,2,1, /* DAT63,134,206,279 */
0x67, 58,2,4,2,1, /* DAT64,135,207,280 */
0x68, 59,2,4,2,1, /* DAT65,136,208,281 */
0x6B, 60,2,4,2,1, /* DAT66,137,209,282 */
0x6D, 61,2,4,2,1, /* DAT67,138,210,283 */
0x70, 62,2,4,2,1, /* DAT68,139,211,284 */
0x75, 63,2,4,2,1, /* DAT69,140,212,285 */
0x80,0xff07,4,4,1,1, /* DAT71,143,215,287 */
0x8A, 49,2,4,3,1, /* DAT55,126,198,270 */
0x8F, 48,2,4,3,1, /* DAT54,125,197,269 */
0x92, 50,2,4,3,1, /* DAT56,127,199,271 */
0x94, 51,2,4,3,1, /* DAT57,128,200,272 */
0x97, 52,1,4,3,1, /* DAT58,129,201,273 */
0x98, 53,1,4,3,1, /* DAT59,130,202,274 */
0x9B, 54,1,4,3,1, /* DAT60,131,203,275 */
0x9D, 55,1,4,3,1, /* DAT61,132,204,276 */
0xA2, 40,3,4,1,1, /* DAT44,116,188,260 */
0xA4, 41,3,4,1,1, /* DAT45,117,189,261 */
0xA7, 42,3,4,1,1, /* DAT46,118,190,262 */
0xA8, 43,3,4,1,1, /* DAT47,119,191,263 */
0xAB, 44,3,4,2,1, /* DAT48,120,192,264 */
0xAD, 45,3,4,2,1, /* DAT49,121,193,265 */
0xB0, 46,3,4,2,1, /* DAT50,122,194,266 */
0xB5, 47,3,4,2,1, /* DAT51,123,195,267 */
0xCB, 1,2,4,3,2, /* DAT1,73,145,217 */
0xCE, 0,2,4,3,2, /* DAT0,72,144,216 */
0xD3, 2,2,3,3,2, /* DAT2,74,146,218 */
0xD5, 3,2,3,3,2, /* DAT3,75,147,219 */
0xD6, 4,2,3,3,2, /* DAT4,76,148,220 */
0xD9, 5,2,3,3,2, /* DAT5,77,149,221 */
0xDA, 6,2,4,3,1, /* DAT6,78,150,222 */
0xDC, 7,2,4,3,1, /* DAT7,79,151,223 */
0xE3, 24,3,4,2,1, /* DAT26,98,170,242 */
0xE5, 25,3,4,2,1, /* DAT27,99,171,243 */
0xE6, 26,3,3,2,2, /* DAT28,100,172,244 */
0xE9, 27,3,3,2,2, /* DAT29,101,173,245 */
0xEA, 28,3,3,2,2, /* DAT30,102,174,246 */
0xEC, 29,3,3,2,2, /* DAT31,103,175,247 */
0xF1, 30,3,4,2,2, /* DAT32,104,176,248 */
0xF4, 31,3,4,2,2, /* DAT33,105,177,249 */
};
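Since the chart is laid out like a C initializer, its rows can be loaded into a dictionary rather than retyped. The following is a sketch only: `parse_chart` is a hypothetical helper, the regular expression assumes every row has the shape seen above, and the second field (including the 0xffNN values) is stored as-is without interpreting it.

```python
import re

# One chart row: syndrome, data-bit field, then four single-digit slot numbers.
ROW = re.compile(
    r"(0x[0-9A-Fa-f]+),\s*(0x[0-9A-Fa-f]+|\d+),\s*(\d),(\d),(\d),(\d),"
)


def parse_number(text):
    """Parse a decimal or 0x-prefixed hexadecimal field."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_chart(text):
    """Map syndrome -> (data_bit, (even_low, even_high, odd_low, odd_high))."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue                      # skip comments, braces, blank lines
        syndrome = int(match.group(1), 16)
        data_bit = parse_number(match.group(2))
        slots = tuple(int(g) for g in match.groups()[2:])
        table[syndrome] = (data_bit, slots)
    return table


SAMPLE = """\
0x01,0xff00,4,4,2,1, /* DAT16,88,160,232 */
0x5D, 39,2,4,1,1, /* DAT43,115,188,260 */
0xCE, 0,2,4,3,2, /* DAT0,72,144,216 */
};
"""

chart = parse_chart(SAMPLE)
print(chart[0x5D])
```

Feeding the whole chart text to `parse_chart` gives the full table; the trailing comments are ignored, so any inconsistencies in the DAT annotations do not affect the lookup.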
Top view of the AS500 module showing DIMM positions.
----------------------------------
| |
| A B |
---------------- |
| ================== J23 |
| ================== J30 |
| ================== J29 |
| ================== J28 |
| |
| ================== J27 | Front of system
| ================== J26 |
| ================== J22 |
| ================== J25 |
| |
| |
| |
| |
| |
| |
| |
| |
-------------------------------------------------
| 1934.4 | Thanks JIM !!!!! | KAOFS::M_NAKAGAWA | | Thu Apr 24 1997 13:08 | 12 |
Jim,
Thanks a lot. This is the second time you have helped me (it was the A4100
OPA2 problem before, I think).
I was not aware of the WRKSYS::ALPHASTATION500 notes file. I will add it to
my list right away.
Thanks again.
CRDC/Mitz
| 1934.5 | your welcome... | CSC32::HUTMACHER | | Thu Apr 24 1997 13:26 | 2 |
You're welcome. 1 1/2 feet of snow here in Colorado and still coming
down. I may not be going home today.... Take care, Jim
| 1934.6 | ECO details on AS500 | WRKSYS::DISCHLER | I don't wanna wait in vain | Wed Apr 30 1997 12:47 | 1 |
Contact Gordon Frye or Dan LeBlanc for ECO details.