
Conference wrksys::alphastation

Title: Alpha Workstation Conference
Notice: See note 1.* for conference notices
Moderator: WRKSYS::HOUSE
Created: Wed Sep 07 1994
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 1996
Total number of notes: 9122

1934.0. "AS500/333 CPU EXCEPTION errors, MEM prob??" by KAOFS::M_NAKAGAWA () Wed Apr 23 1997 00:31

Hi,
We are getting CPU EXCEPTION errors on an AlphaStation 500/333 with 384MB of
memory, running UNIX V3.2G.
The system hasn't crashed yet, but it has logged over 300 errors in a week.
DECevent is not yet installed on this system and uerf doesn't tell us much.
(I suggested the customer look for DECevent on the UNIX software CD, V4.0 or higher.)

Meanwhile, could someone analyse the following uerf report,
or point me to the relevant documents?
    
Thanks in advance for your help.
    
    CRDC/Mitz
    
     

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  100.     CPU EXCEPTION 
SEQUENCE NUMBER                382.
OPERATING SYSTEM                        DEC OSF/1 
OCCURRED/LOGGED ON                      Mon Apr 21 06:35:34 1997
OCCURRED ON SYSTEM                      NWSCMX2 
SYSTEM ID                 x0005000F
SYSTYPE                   x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU 

RECORD ENTRY DUMP:

  RECORD HEADER
0000:   017E00A0  0005000F  00060101  335B0AB6        *..~...........[3*
0010:   4353574E  0032584D  00000000  00000000        *NWSCMX2.........*
0020:   00000001  00000000  15040064  00000000        *........d.......*
0030:   00000000  00000000                            *........        *

  RECORD BODY
0038:   00000060  80000000  00000018  00000038        *`...........8...*
0048:   00000086  00000000  020FF4DF  FFFFFF00        *................*
              ^^
               |_____ Is this OSF/1 PAL err code 86, D-CACHE PE?

0058:   0000005D  00000000  C4FFFFFF  FFFFFFF0        *]...............*
0068:   00000000  00000001  00000000  00000000        *................*
0078:   00000000  00000000  00000000  00000000        *................*
0088:   00000000  00000000  00000000  00000000        *................*
0098:   00000000  5E3C7E25                            *....%~<^        *



    
1934.1. "Some info" by WRKSYS::DISCHLER (I don't wanna wait in vain) Wed Apr 23 1997 12:11 (7 lines)
    	There was an ECO early on that placed two capacitors
    on the write_enable lines going from the CIA chip to a buffer
    between the DIMMs. You can experience corrected cache errors
    without them.
    
    Also, reseat your DIMMs.
    				RJD
1934.2. "they all are MC630's." by KAOFS::M_NAKAGAWA () Wed Apr 23 1997 20:22 (73 lines)
    
RE: .-1
    Thanks for your reply. Do you happen to have the ECO number?
    
    
    The customer has successfully installed DECevent on his system.
    We are getting "MC630 Bcache error", "EV5 Detected Corr ECC Error".
    All errors have the same FILL_SYNDROME, 5D, so I suspect a bad DIMM.
    The system has a total of 384MB of memory.
    I haven't confirmed the exact DIMM configuration with my customer yet,
    but assuming:
        BANK A = 256MB (64MB DIMM x4)
        BANK B = 128MB (32MB DIMM x4)
    which DIMM location does FILL_SYNDROME 5D point to?
    Jerry's nice machine check program V3.5 doesn't cover the AS500; the AS600
    is very close, but its memory configuration is different from the AS500's.
    
    Thanks,
    CRDC/Mitz
    **************************** ENTRY  565 ******************************** 

Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number           382. 
Timestamp of occurrence              21-APR-1997 06:35:34   
Host name                            NWSCMX2 

System type register      x0000000F  Alcor 
Number of CPUs (mpnum)    x00000001 
CPU logging event (mperr) x00000000 

Event validity                    1. O/S claims event is valid 
Event severity                    1. Severe Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   3. Bcache error (630 entry)  

Flags:                    x80000000  Retryable Error 
Mchk Error Code           x0000000000000086 
                                     EV5 Detected Corr ECC Error 
EI ADDR                   xFFFFFF00020FF4DF 
FILL SYNDROME             x000000000000005D 
EI STATUS                 xFFFFFFF0C4FFFFFF 
                                     Error occurred during D-ref fill 
ISR                       x0000000100000000 
                                     Correctable ECC errors (IPL31) 
                                     AST requests 3 - 0  x0000000000000000 
CIA Syndrome              x0000000000000000 
                                     ECC Syndrome   x0000000000000000 
MEM ERR0                  x0000000000000000 
                                     Memory Port Address   x0000000000000000 
MEM ERR1                  x0000000000000000 
                                     Bits <33:32> of Memory Po x0000000000000000 

                                     Bit <39> of Memory Port  x0000000000000000 

                                     Memory Command  x0000000000000000 

                                     Mask When Err Occurred  x0000000000000000 

                                     Mem Seq State  Idle 
                                     Encoded Set Sel:  Set 0 Selected 
CIA ERR STAT              x0000000000000000 
                                     Memory Cycle Source is PCI 
                                     IO Cmnd/Addr Queue Vld Bi x0000000000000000 

                                     CPU Cmnd/Addr Queue Vld B x0000000000000000 

                                     DM State:  Idle 
                                     EV5 Resp. for DMA:  No Response 
CIA ERR                   x0000000000000000 
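
The DECevent fields above line up with the raw uerf RECORD BODY posted in .0 if each pair of adjacent 32-bit longwords is read as one quadword, low longword first. A minimal Python sketch of that reading follows. The field names and offsets are assumptions, inferred only by matching the two dumps against each other, not taken from a documented frame layout.

```python
# Longword pairs copied from the uerf RECORD BODY in .0 (low longword first).
# The field names and offsets are assumptions, inferred only by matching
# the values against the DECevent report above.
BODY = {
    0x48: ("mchk_code",     0x00000086, 0x00000000),
    0x50: ("ei_addr",       0x020FF4DF, 0xFFFFFF00),
    0x58: ("fill_syndrome", 0x0000005D, 0x00000000),
    0x60: ("ei_status",     0xC4FFFFFF, 0xFFFFFFF0),
    0x68: ("isr",           0x00000000, 0x00000001),
}


def quadword(low, high):
    """Join two 32-bit longwords into one 64-bit value."""
    return (high << 32) | low


decoded = {name: quadword(low, high) for name, low, high in BODY.values()}

for name, value in decoded.items():
    print(f"{name:14s} x{value:016X}")
```

Read this way, the value at offset 0x48 is the machine check code 86 that .0 asked about, and the quadword at 0x58 is the fill syndrome 5D.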
    
    
    
1934.3. "AS500 dimm callout MC630's" by CSC32::HUTMACHER () Thu Apr 24 1997 10:27 (236 lines)
    Hi Mitz
    
    I got this handy dandy decoder text file for memory errors from the
    WRKSYS::ALPHASTATION500 notes file, note 106.*, and here goes.
    
    EI ADDR                   xFFFFFF00020FF4DF this is 32-33meg region
                                        ------- failing address
    
    FILL SYNDROME             x000000000000005D this is failing syndrome
                                             -- in bits <7-0>
    
*** If memory is set up like this, the bad DIMM is in the BANK A set: DIMM J25
        BANK A = 256MB (64MB DIMM x4)
        BANK B = 128MB (32MB DIMM x4)
    
*** If memory is set up like this, the bad DIMM is in the BANK B set: DIMM J22,
    since the system would still size the larger DIMMs as the lowest memory
    address range no matter which bank they were plugged into.
        BANK A = 128MB (32MB DIMM x4)
        BANK B = 256MB (64MB DIMM x4)
    
    Using the Fill Syndrome (2 bytes) and ei_address with the following
    procedure gives:
    
    ei_address 20FF4DF has bit 4 = 1 (set)
                    |_ 1101, so use the Odd QW columns on the chart
    
    Fill Syndrome 005D has the syndrome in bits <7:0>, so use the Low column
                                         in the Odd QW section of the chart
    
    
               Odd Qw Section
    syndr           / \
    0x5D,    39,2,4,1,1,    /* DAT43,115,188,260 */
                    |
                    |Low synd column so failing dimm slot=1 from chart
    
    so use final decoder to go from this dimm slot # to actual Jxx slot
    on the as500's motherboard
    
             * dimm slot # we found above
             *
     slot    1   2   3   4
         -----------------
    bank A  25  26  28  30
    bank B  22  27  29  23
            |
            |so the failing DIMM is either J25 or J22, depending on
             which bank, A or B, the failing address 20FF4DF falls into.
    
    
    jim hutmacher mvhs colorado csc 800-354-9000 ext 25561
    
    
    here's the decoder article for alphastation500 
    
--------------------------------------------------------------------------------
    
    
Maverick AS500 DIMM lookup table
    
    
    Use this table to determine the failing DIMM on an AS500 workstation.
    
    Required inputs:
    
        Fill Syndrome (2 bytes)
        ei_address
    
    
    Procedure:
    
    Isolate the bank -
    
        At the SRM console, use the show memory command to determine the
        system's memory configuration.
    
        For example, this system has only one bank. If 2 banks are present,
        use the base address to determine in which bank the error was 
        detected.
    
    >>>show mem
                         
    Memory Size = 256Mb
    
    Bank      Size/Sets   Base Addr     Speed
    ------    ----------  ---------     -----
    00        256Mb/2     000000000     Fast
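
With two banks, the base-address check amounts to a range comparison. A rough Python sketch follows, assuming the 384MB layout discussed earlier in this topic (the 256MB set sized first at base 0, the 128MB set directly above it) and assuming that only the low 32 bits of EI ADDR carry the physical address, as the decode in this reply does. The bank list is hypothetical; take the real base addresses and sizes from `show mem`.

```python
MB = 1 << 20

# Hypothetical (label, base address, size) entries for a 384MB system.
# Real values must come from the console `show mem` output.
BANKS = [
    ("256MB bank", 0x00000000, 256 * MB),
    ("128MB bank", 256 * MB, 128 * MB),
]


def find_bank(ei_addr, banks=BANKS):
    """Return (label, physical address) for the bank containing ei_addr."""
    # Assumption: the upper 32 bits of EI ADDR are not address bits.
    phys = ei_addr & 0xFFFFFFFF
    for label, base, size in banks:
        if base <= phys < base + size:
            return label, phys
    return None, phys


label, phys = find_bank(0xFFFFFF00020FF4DF)
print(label, hex(phys), f"{phys // MB}MB region")
```

For the error in this topic the address lands in the 32MB region, so it belongs to whichever bank holds the lowest address range.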
    
    Isolate the DIMM -
    
     1 - if bit<4> is 0 then the QW is even
         if bit<4> is 1 then the QW is odd
    
        exa.   ei_address   QW
               567020       even
               567030       odd
               etc
                               
     2 - fill_syndrome is 2 bytes
    
          <15:8>    high
          <7:0>     low
    
    
     3 - Use the table below and scan down the left side for the matching
         syndrome; no match indicates a 2-bit error. Scan across the row and 
         stop at the column for the correct combination of even/odd QW and 
         hi/low syndrome byte.
         The number indicates the failing DIMM slot.  You have already
         determined the bank from the address; now use the table to map the
         slot number to a connector.
         The numbers are the actual designators on the motherboard.
    
     slot    1   2   3   4
         -----------------
    bank A  25  26  28  30
    bank B  22  27  29  23
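
Steps 1 and 2 are plain bit operations. A small Python sketch, with function names made up for illustration, checks them against the example addresses in step 1 and the values from this topic:

```python
def qw_parity(ei_address):
    """Step 1: bit <4> of ei_address selects the even or odd quadword."""
    return "odd" if (ei_address >> 4) & 1 else "even"


def syndrome_bytes(fill_syndrome):
    """Step 2: split the 2-byte fill syndrome into (high, low) bytes."""
    return (fill_syndrome >> 8) & 0xFF, fill_syndrome & 0xFF


for addr in (0x567020, 0x567030, 0x20FF4DF):
    print(hex(addr), qw_parity(addr))

high, low = syndrome_bytes(0x005D)
print(f"high={high:#04x} low={low:#04x}")
```

For fill syndrome 005D only the low byte is non-zero, which is why the worked example earlier in this reply reads the Low column.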
    
    
    
            Charts
            ------
    
                e
                v   o
    /*          e   d
    S         D n   d               slot    1   2   3   4
    y         a                     ---------------------
    d         t Q   Q               bank A  25  26  28  30
    r         a w   w               bank B  22  27  29  23
    o           __  __
    m         B l h l h             Use this table to find the DIMM at fault!
    e         i o i o i
              t w   w
    */
    0x01,0xff00,4,4,2,1,    /* DAT16,88,160,232 */
    0x02,0xff01,4,4,2,1,    /* DAT17,89,161,233 */
    0x04,0xff02,4,4,1,2,    /* DAT34,106,178,250 */
    0x08,0xff03,4,4,1,2,    /* DAT35,107,179,251 */
    0x0b,    17,3,4,3,1,    /* DAT19,91,163,235 */
    0x0E,    16,3,4,3,1,    /* DAT18,90,162,234 */
    0x10,0xff04,4,4,1,1,    /* DAT52,124,196,268 */
    0x13,    18,3,4,3,1,    /* DAT20,92,164,236 */
    0x15,    19,3,4,3,1,    /* DAT21,93,165,236 */
    0x16,    20,3,4,3,1,    /* DAT22,94,166,237 */
    0x19,    21,3,4,3,1,    /* DAT23,95,167,238 */
    0x1A,    22,3,4,3,1,    /* DAT24,96,168,239 */
    0x1C,    23,3,4,3,1,    /* DAT25,97,169,240 */
    0x20,0xff05,4,4,1,1,    /* DAT53,125,197,269 */
    0x23,     8,3,4,3,1,    /* DAT8,80,152,224 */
    0x25,     9,3,4,3,1,    /* DAT9,81,153,225 */
    0x26,    10,3,4,3,1,    /* DAT10,82,154,226 */
    0x29,    11,3,4,3,1,    /* DAT11,83,155,227 */
    0x2A,    12,3,4,3,1,    /* DAT12,84,156,228 */
    0x2C,    13,3,4,3,1,    /* DAT13,85,157,229 */
    0x31,    14,3,4,3,1,    /* DAT14,86,158,230 */
    0x34,    15,3,4,3,1,    /* DAT15,87,159,231 */
    0x40,0xff06,4,4,1,1,    /* DAT70,142,214,286 */
    0x4A,    33,2,4,2,1,    /* DAT37,109,181,253 */
    0x4F,    32,2,4,2,1,    /* DAT37,109,182,254 */
    0x52,    34,2,4,2,1,    /* DAT38,110,183,255 */
    0x54,    35,2,4,2,1,    /* DAT39,111,184,256 */
    0x57,    36,2,4,2,1,    /* DAT40,112,185,257 */
    0x58,    37,2,4,2,1,    /* DAT41,113,186,258 */
    0x5B,    38,2,4,2,1,    /* DAT42,114,187,259 */
    0x5D,    39,2,4,1,1,    /* DAT43,115,188,260 */
    0x62,    56,2,4,2,1,    /* DAT62,134,206,278 */
    0x64,    57,2,4,2,1,    /* DAT63,134,206,279 */
    0x67,    58,2,4,2,1,    /* DAT64,135,207,280 */
    0x68,    59,2,4,2,1,    /* DAT65,136,208,281 */
    0x6B,    60,2,4,2,1,    /* DAT66,137,209,282 */
    0x6D,    61,2,4,2,1,    /* DAT67,138,210,283 */
    0x70,    62,2,4,2,1,    /* DAT68,139,211,284 */
    0x75,    63,2,4,2,1,    /* DAT69,140,212,285 */
    0x80,0xff07,4,4,1,1,    /* DAT71,143,215,287 */
    0x8A,    49,2,4,3,1,    /* DAT55,126,198,270 */
    0x8F,    48,2,4,3,1,    /* DAT54,125,197,269 */
    0x92,    50,2,4,3,1,    /* DAT56,127,199,271 */
    0x94,    51,2,4,3,1,    /* DAT57,128,200,272 */
    0x97,    52,1,4,3,1,    /* DAT58,129,201,273 */
    0x98,    53,1,4,3,1,    /* DAT59,130,202,274 */
    0x9B,    54,1,4,3,1,    /* DAT60,131,203,275 */
    0x9D,    55,1,4,3,1,    /* DAT61,132,204,276 */
    0xA2,    40,3,4,1,1,    /* DAT44,116,188,260 */
    0xA4,    41,3,4,1,1,    /* DAT45,117,189,261 */
    0xA7,    42,3,4,1,1,    /* DAT46,118,190,262 */
    0xA8,    43,3,4,1,1,    /* DAT47,119,191,263 */
    0xAB,    44,3,4,2,1,    /* DAT48,120,192,264 */
    0xAD,    45,3,4,2,1,    /* DAT49,121,193,265 */
    0xB0,    46,3,4,2,1,    /* DAT50,122,194,266 */
    0xB5,    47,3,4,2,1,    /* DAT51,123,195,267 */
    0xCB,     1,2,4,3,2,    /* DAT1,73,145,217 */
    0xCE,     0,2,4,3,2,    /* DAT0,72,144,216 */
    0xD3,     2,2,3,3,2,    /* DAT2,74,146,218 */
    0xD5,     3,2,3,3,2,    /* DAT3,75,147,219 */
    0xD6,     4,2,3,3,2,    /* DAT4,76,148,220 */
    0xD9,     5,2,3,3,2,    /* DAT5,77,149,221 */
    0xDA,     6,2,4,3,1,    /* DAT6,78,150,222 */
    0xDC,     7,2,4,3,1,    /* DAT7,79,151,223 */
    0xE3,    24,3,4,2,1,    /* DAT26,98,170,242 */
    0xE5,    25,3,4,2,1,    /* DAT27,99,171,243 */
    0xE6,    26,3,3,2,2,    /* DAT28,100,172,244 */
    0xE9,    27,3,3,2,2,    /* DAT29,101,173,245 */
    0xEA,    28,3,3,2,2,    /* DAT30,102,174,246 */
    0xEC,    29,3,3,2,2,    /* DAT31,103,175,247 */
    0xF1,    30,3,4,2,2,    /* DAT32,104,176,248 */
    0xF4,    31,3,4,2,2,    /* DAT33,105,177,249 */
    };
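
The chart lookup itself can be sketched in Python as below. Only three rows are copied from the chart, so this is an excerpt and not the full table. The column order (even/low, even/high, odd/low, odd/high) is read off the chart heading, and the function and variable names are hypothetical.

```python
# Excerpt of the chart above: syndrome -> DIMM slot number per column,
# in chart order (even/low, even/high, odd/low, odd/high).
CHART_EXCERPT = {
    0x5B: (2, 4, 2, 1),
    0x5D: (2, 4, 1, 1),
    0x62: (2, 4, 2, 1),
}

COLUMNS = {
    ("even", "low"): 0,
    ("even", "high"): 1,
    ("odd", "low"): 2,
    ("odd", "high"): 3,
}

# Slot number (1 to 4) -> motherboard designator, from the slot table.
J_SLOTS = {"A": (25, 26, 28, 30), "B": (22, 27, 29, 23)}


def failing_dimm(syndrome, qw, half, bank, chart=CHART_EXCERPT):
    """Return the J designator, or None if the syndrome is not in the chart."""
    row = chart.get(syndrome)
    if row is None:
        # Per step 3, no match in the full chart indicates a 2-bit error;
        # with this excerpt it may simply be a row that was not copied.
        return None
    slot = row[COLUMNS[(qw, half)]]
    return f"J{J_SLOTS[bank][slot - 1]}"


print(failing_dimm(0x5D, "odd", "low", "A"))
print(failing_dimm(0x5D, "odd", "low", "B"))
```

For syndrome 5D in the odd QW, low byte, this gives slot 1, which is J25 in bank A or J22 in bank B, matching the callout at the top of this reply.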
    
        Top view of the AS500 module showing DIMM positions.
    
    
    
                       ----------------------------------
                       |                                |
                       |              A    B            |
        ----------------                                |
        |         ==================      J23           |
        |         ================== J30                |
        |         ==================      J29           |
        |         ================== J28                |
        |                                               |
        |         ==================      J27           |   Front of system
        |         ================== J26                |
        |         ==================      J22           |
        |         ================== J25                |
        |                                               |
        |                                               |
        |                                               |
        |                                               |
        |                                               |
        |                                               |
        |                                               |
        |                                               |
        -------------------------------------------------
1934.4. "Thanks JIM !!!!!" by KAOFS::M_NAKAGAWA () Thu Apr 24 1997 14:08 (12 lines)
    Jim,
    
    Thanks a lot.  This is the second time you have helped me (the first
    was the A4100 OPA2 problem, I think).
    
    I was not aware of the WRKSYS::ALPHASTATION500 notes file; I will add it
    to my list right away.
    
    Thanks again.
    
    CRDC/Mitz
    
1934.5. "your welcome..." by CSC32::HUTMACHER () Thu Apr 24 1997 14:26 (2 lines)
    You're welcome. 1 1/2 feet of snow here in Colorado and still coming
    down. I may not be going home today.... Take care.    jim 
1934.6. "ECO details on AS500" by WRKSYS::DISCHLER (I don't wanna wait in vain) Wed Apr 30 1997 13:47 (1 line)
    	Contact Gordon Frye or Dan LeBlanc for ECO details.