
Conference mvblab::alphaserver_4100

Title: AlphaServer 4100
Moderator: MOVMON::DAVISS
Created: Tue Apr 16 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 648
Total number of notes: 3158

484.0. "A4100 too many cpu error" by NETRIX::"[email protected]" (Romeo Cesarato) Tue Feb 11 1997 11:52

      Hi.
      I have a problem with an A4100:

      SROM VERSION V1.1  -   VERSION 2.0-3, 21-AUG-1996 14:31:24
      OPENVMS PALCODE V1.18-8  DIGITAL UNIX PALCODE V1.21-12
       
      With the "TEST CPU" command in console mode I get the following
      output:

      PROCESS TIME CPU0: SOFT ERROR DETECTED, VECTOR 00620
      MCHK_CODE: 00000000  02040000
   
      The system goes into a loop and a hardware restart is necessary.
      I tried downgrading the firmware to version 1.2-4,
      and that fixed the problem.
      But I need version 2.0-3 because Digital UNIX v3.2-g is
      installed.
      
      Can anybody help me please?

          Thanks in advance.    Regards, Romeo Cesarato
[Posted by WWW Notes gateway]
T.R  Title  User  Personal Name  Date  Lines
484.1  MAY30::CUMMINS  Tue Feb 11 1997 19:24  24
    The system is experiencing single-bit (CRD) memory errors. V1.2-4
    console did not automatically report soft errors. This feature had
    to be enabled by doing SET D_LOGSOFT ON prior to running the V1.2-4
    TEST command. V2.0-3 and later consoles automatically enable soft
    error reporting. This is why you only see the errors with V2.0-3.
    
    The fact that it "goes into a loop" suggests that the system has 
    *lots* of these errors - this happens quite frequently. Note: older
    revision PCI bridge cards can cause IOD-detected CRD errors with a 
    particular footprint. I believe older revision motherboards can cause
    similar symptoms. You should have your system inventoried as to
    revision levels to see whether your hardware is fully up to rev.
    
    What does the SRM console SHOW FRU command return re: part numbers,
    serial numbers, and revision levels? It would be helpful if you could
    post the SHOW FRU display output as a reply to this note. It would also
    be helpful if you could post three successive 620/630 error reports.
    This will help us diagnose the problem.
    
    It's quite possible the machine in question has bad memory. But we need
    more data before we can know whether this is faulty memory or some
    other out of rev module.
    
    BC
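
    [Editor's sketch] The reply above says that V2.0-3 and later consoles
    report soft errors by default, while V1.2-4 needed SET D_LOGSOFT ON
    first. A minimal way to check a console version string against that
    threshold is sketched below. The function names are hypothetical, and
    the "Vmajor.minor-patch" layout is assumed from the version strings
    quoted in this thread; this is not a DEC tool.

```python
import re


def parse_console_version(text):
    """Turn a string such as 'V2.0-3' into a comparable tuple (2, 0, 3)."""
    m = re.fullmatch(r"\s*[Vv]?(\d+)\.(\d+)-(\d+)\s*", text)
    if m is None:
        raise ValueError(f"unrecognized console version: {text!r}")
    return tuple(int(part) for part in m.groups())


# Threshold taken from the reply above: V2.0-3 and later consoles
# enable soft error reporting automatically.
AUTO_SOFT_ERROR_REPORTING = parse_console_version("V2.0-3")


def reports_soft_errors_by_default(version):
    """True if this console version reports soft errors without D_LOGSOFT."""
    return parse_console_version(version) >= AUTO_SOFT_ERROR_REPORTING
```

    Comparing integer tuples rather than raw strings keeps a version such
    as V3.0-10 ordered after V3.0-9.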
484.2  MEM0H -or- MEM0L, which one?  KAOFS::M_NAKAGAWA  Tue Feb 11 1997 21:04  231
    
re .-1
    
I have some DECevent samples here.
I believe we have a memory problem with this system.
It has 1024MB of memory (a pair of B3030-FA) and I wanted to know which
half (MEM0H or MEM0L) is causing it.
I tried the MACHINE CHECK program(4100.digital_unix) but it didn't help.
 
Some memory problems are discussed in the following BLITZes:

[TD 2109] Alpha Server 4100 - Memory Errors 
[TD 2226] Alpha Server 4100 - SRM Console V4.8-3


    Thanks for your help,
    CRDC/Mitz
    
------------------- MC620/630 DECevent Sample-----------------------

Timestamp of occurrence              03-FEB-1997 15:52:37
CPU Minor class                   4. 620 System Correctable Error
Timestamp of occurrence              03-FEB-1997 15:52:37
CPU Minor class                   4. 620 System Correctable Error
Timestamp of occurrence              03-FEB-1997 15:52:37
CPU Minor class                   3. Bcache error (630 entry)


******************************** ENTRY    6 ******************************** 


Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number            37. 
Timestamp of occurrence              03-FEB-1997 15:52:37   
Host name                            emcsats004 

System type register      x00000016  AlphaStation 4x00 
Number of CPUs (mpnum)    x00000002 
CPU logging event (mperr) x00000000 

Event validity                    1. O/S claims event is valid 
Event severity                    5. Low Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   4. 620 System Correctable Error 

Software Flags            x0000000000000000 
Active CPUs               x00000003 
Hardware Rev              x00000000 
System Serial Number                 C1563 
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 
Ext Interface Status Reg  x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Ext Interface Address Reg x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Fill Syndrome Reg         x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Interrupt Summary Reg     x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 

WHOAMI                    x00000000  Module Revision  0. 
                                     MID  0. 
                                     GID  0. 

Sys Environmental Regs    x00000000 
Base Addr of Bridge       x000000FBE0000000 
Dev Type & Rev Register   x06000221  CAP Chip Revision:        x00000001 
                                     HORSE  Module Revision:   x00000002 
                                     SADDLE Module Revision:   x00000002 
                                     SADDLE Module Type:        Left Hand 
                                     Internal CAP Chip Arbiter: Enabled 
                                     PCI Class Code            x00000600 
MC Error Info Register 0  x193E3C40 
                                     MC Bus Trans Addr<31:4>: 193E3C40 
MC Error Info Register 1  x800F4800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     IOD1 Master at Time of Error 
                                     Device ID 2  x00000005 
                                     MC error info valid 
CAP Error Register        x89000000  Error Detected but Not Logged 
                                     Correctable ECC err det by MDPA 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
PALcode Revision                     Palcode Rev: 1.21-3 


******************************** ENTRY    7 ******************************** 


Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number            36. 
Timestamp of occurrence              03-FEB-1997 15:52:37   
Host name                            emcsats004 

System type register      x00000016  AlphaStation 4x00 
Number of CPUs (mpnum)    x00000002 
CPU logging event (mperr) x00000000 

Event validity                    1. O/S claims event is valid 
Event severity                    5. Low Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   4. 620 System Correctable Error 

Software Flags            x0000000000000000 
Active CPUs               x00000003 
Hardware Rev              x00000000 
System Serial Number                 C1563 
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 
Ext Interface Status Reg  x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Ext Interface Address Reg x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Fill Syndrome Reg         x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 
Interrupt Summary Reg     x0000000000000000 
                                     Not Valid for 620 System 
                                     Correctable Errors 

WHOAMI                    x00000000  Module Revision  0. 
                                     MID  0. 
                                     GID  0. 

Sys Environmental Regs    x00000000 
Base Addr of Bridge       x000000F9E0000000 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     HORSE  Module Revision:   x00000002 
                                     SADDLE Module Revision:   x00000002 
                                     SADDLE Module Type:        Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     PCI Class Code            x00000600 
MC Error Info Register 0  x193E3C40 
                                     MC Bus Trans Addr<31:4>: 193E3C40 
MC Error Info Register 1  x800F4800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     IOD1 Master at Time of Error 
                                     Device ID 2  x00000005 
                                     MC error info valid 
CAP Error Register        x89000000  Error Detected but Not Logged 
                                     Correctable ECC err det by MDPA 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
PALcode Revision                     Palcode Rev: 1.21-3 


******************************** ENTRY    8 ******************************** 


Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number            35. 
Timestamp of occurrence              03-FEB-1997 15:52:37   
Host name                            emcsats004 

System type register      x00000016  AlphaStation 4x00 
Number of CPUs (mpnum)    x00000002 
CPU logging event (mperr) x00000000 

Event validity                    1. O/S claims event is valid 
Event severity                    3. High Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   3. Bcache error (630 entry)  

Software Flags            x0000000000000000 
Active CPUs               x00000003 
Hardware Rev              x00000000 
System Serial Number                 C1563 
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 
Ext Interface Status Reg  x0000000000000000 
Ext Interface Address Reg x0000000000000000 
Fill Syndrome Reg         x0000000000000000 
Interrupt Summary Reg     x0000000000000000 

WHOAMI                    x00000000  Module Revision  0. 
                                     MID  0. 
                                     GID  0. 

Sys Environmental Regs    x00000000 
Base Addr of Bridge       x000000F9E0000000 
Dev Type & Rev Register   x06008221  CAP Chip Revision:        x00000001 
                                     HORSE  Module Revision:   x00000002 
                                     SADDLE Module Revision:   x00000002 
                                     SADDLE Module Type:        Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     PCI Class Code            x00000600 
MC Error Info Register 0  x193E3C40 
                                     MC Bus Trans Addr<31:4>: 193E3C40 
MC Error Info Register 1  x800F4800  MC bus trans addr <39:32> x00000000 
                                     MC Command is Read0-Mem 
                                     IOD1 Master at Time of Error 
                                     Device ID 2  x00000005 
                                     MC error info valid 
CAP Error Register        x89000000  Error Detected but Not Logged 
                                     Correctable ECC err det by MDPA 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
PALcode Revision                     Palcode Rev: 1.21-3 

		===========================================
    
484.3  MAY30::CUMMINS  Wed Feb 12 1997 10:30  55
    This lengthy note is leading somewhere. Bear with me..
    
    The AlphaServer 4100/4000's PCI bridge's ASIC has a bug in it that can
    cause data corruption given a certain sequence of events. Basically,
    the sequence involves a CSR read from the IOD's B chip while a DMA is
    in progress. There are only a couple registers implemented in the B
    chip. These include the SYNDROME and STAT CSRs which are used, in part,
    to provide I/O-detected, single-bit error syndrome status. The only
    software that would normally access these registers is PALcode. The
    operating systems never touch them, as all error data collection is
    handled by PAL (and NT HAL) code on the 4100/4000 platform.
    
    We discovered the above data corruption problem prior to FRS. The
    program opted to not re-spin the ASIC. Instead, we modified PALcode to
    not collect B chip CSR error info on CRD or MCHK errors. Since VMS/UNIX
    PALcode attempts to scrub all single-bit memory errors, it was felt
    that more often than not, the EV5 would also detect a single-bit error
    during the course of scrubbing the location, assuming the error was not
    a transient.
    
    The impact of all of this is:
    
      1. The data corruption problem is worked around and effectively
         made moot by changes to PALcode. Customers should not ever see
         this problem (though you wouldn't want to write an application
         that periodically polled these STAT and SYNDROME CSRs!)
      2. The side effect of (1) is that SYNDROME and STAT registers will
         always read as zero on I/O-detected CRD errors logged in the
         system error log. [See note below..] This will obviously hinder
         isolation to a memory pair member.
      3. More often than not, PAL scrubbing, which involves reading and
         then writing back the data, will generate an EV5-detected CRD
         (630 or 620) error. PAL will snapshot the EV5 FILL_SYNDROME IPR
         which will then enable isolation to a pair member.
    
    Note: the V3.0-10 SRM console and later versions added an automatic
    enable of PAL collection of I/O SYNDROME and STAT data on I/O-detected
    620 CRD errors. This is because TEST performs read-only operations to
    disk/tape/floppy. No writes.. Therefore, V3.0-10 and greater consoles
    can be used to diagnose to a memory pair member assuming the problem is
    repeatable under console TEST. And very often it is..
    
    In summary, if your error log does not have any EV5-detected 620 or 630
    CRD error entries, then you will not be able to diagnose to a memory
    pair member. Are there any CPU-detected CRD errors in the system error
    log? The ones you posted were all I/O-detected errors... If you update
    to V3.0-10, and re-run the TEST command, you may see EV5-detected CRD
    errors. In this case, the error frames displayed will include syndrome
    data for isolation to a pair member.
    
    The V3.0-10 console is available on the V3.8 Firmware Update CD (as
    well as via our firmware web site..)
    
    If questions, let me know.
    BC
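
    [Editor's sketch] The summary above comes down to one check: does any
    error log entry carry non-zero syndrome data? A minimal sketch of that
    check over DECevent text output follows. It assumes the report layout
    shown in reply .2 (an "ENTRY n" banner line, then labelled fields
    starting in column one) and the field label "Fill Syndrome Reg". The
    function names are hypothetical and this is not part of DECevent.

```python
import re

# Banner lines look like: "******** ENTRY    6 ********"
ENTRY_BANNER = re.compile(r"^\*+ ENTRY\s+(\d+) \*+[ \t]*$", re.MULTILINE)


def split_entries(report):
    """Return {entry_number: entry_text} for a DECevent-style report."""
    parts = ENTRY_BANNER.split(report)
    # parts is [preamble, number, body, number, body, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


def field_hex(entry, label):
    """Read the hex value printed after a field label, or None if absent."""
    m = re.search(rf"^{re.escape(label)}\s+x([0-9A-Fa-f]+)", entry, re.MULTILINE)
    return int(m.group(1), 16) if m else None


def entries_with_syndrome(report, label="Fill Syndrome Reg"):
    """List the entry numbers whose syndrome field is present and non-zero."""
    found = []
    for number, body in split_entries(report).items():
        value = field_hex(body, label)
        if value:  # skips both a missing field (None) and an all-zero value
            found.append(number)
    return sorted(found)
```

    An empty result for a given log would match the situation described
    above: without syndrome data there is nothing to isolate a pair member.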
484.4  Thanks  KAOFS::M_NAKAGAWA  Wed Feb 12 1997 12:23  10
    Thanks for the info.
    
     >Are there any CPU-detected CRD errors in the system error log?
    
    No, they all are I/O-detected. 
    A few MC620 errors followed by a MC630 as described in the TD #2109.
    
    Thanks again,
    CRDC/Mitz
    
484.5  MAY30::CUMMINS  Wed Feb 12 1997 12:33  5
    Not sure what you meant in your previous reply. 630 CRD errors are
    *always* EV5-detected; never I/O-detected. Are you saying you see 630
    errors in the error log?
    
    BC
484.6  620,620,620 then 630  KAOFS::M_NAKAGAWA  Wed Feb 12 1997 17:23  97
re:last   >Are you saying you see 630 errors in the error log?

    
Please refer to 484.2 DECevent entry #8, two MC620's followed by a MC630.

Sometimes we get only MC620s, but sometimes an MC630 follows
immediately after the MC620s, as in the listing below.
When we get an MC630, the "MC Error Info Register 0" always contains
"x193E3C40", and I was wondering whether someone could tell MEM0H from
MEM0L if we have a memory problem here.
The "CAP Error Register" (err sum) says "Correctable ECC err det by MDPA".

		---------------------------------------

Timestamp of occurrence              04-FEB-1997 08:46:38   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              04-FEB-1997 08:46:38   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              04-FEB-1997 08:46:38   
CPU Minor class                   4. 620 System Correctable Error 

Timestamp of occurrence              03-FEB-1997 15:52:37   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              03-FEB-1997 15:52:37   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              03-FEB-1997 15:52:37   
CPU Minor class                   3. Bcache error (630 entry) <----- !!!  

Timestamp of occurrence              02-FEB-1997 12:00:02   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              02-FEB-1997 12:00:02   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              02-FEB-1997 12:00:02   
CPU Minor class                   3. Bcache error (630 entry) <----- !!!

Timestamp of occurrence              31-JAN-1997 16:45:03   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 16:45:03   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 16:45:03   
CPU Minor class                   3. Bcache error (630 entry) <----- !!!

Timestamp of occurrence              31-JAN-1997 14:05:09   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 14:05:09   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 14:05:08   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 14:05:08   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 14:05:08   
CPU Minor class                   3. Bcache error (630 entry) <------ !!! 

Timestamp of occurrence              31-JAN-1997 11:03:30   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 11:03:30   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 11:03:30   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 11:03:27   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 11:03:27   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 11:03:27   
CPU Minor class                   3. Bcache error (630 entry) <----- !!!
  
Timestamp of occurrence              31-JAN-1997 10:24:51   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 10:24:51   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 10:24:51   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 10:24:51   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              31-JAN-1997 10:24:51   
CPU Minor class                   3. Bcache error (630 entry) <----- !!!

Timestamp of occurrence              30-JAN-1997 17:31:38   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              30-JAN-1997 17:31:38   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              30-JAN-1997 17:31:38   
CPU Minor class                   4. 620 System Correctable Error 

Timestamp of occurrence              30-JAN-1997 15:52:50   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              30-JAN-1997 15:52:50   
CPU Minor class                   4. 620 System Correctable Error 
Timestamp of occurrence              30-JAN-1997 15:52:50   
CPU Minor class                   4. 620 System Correctable Error 

---------------------------------------------------------------------------
Thanks again for your help.
CRDC/Mitz
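
[Editor's sketch] The listing above was grouped by hand. A minimal way to
group the same timestamp/class pairs into bursts and count 620 and 630
entries per burst is sketched below. It assumes the two-line layout shown
above and the minor class numbers from this log (4 for a 620 entry, 3 for
a 630 entry). The function names and the 5-second grouping gap are
hypothetical choices, not DECevent behaviour.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# One record is a "Timestamp of occurrence" line followed by a
# "CPU Minor class" line.
RECORD = re.compile(
    r"Timestamp of occurrence\s+(\d{2})-([A-Z]{3})-(\d{4}) "
    r"(\d{2}):(\d{2}):(\d{2})[^\n]*\n"
    r"CPU Minor class\s+(\d+)\."
)


def read_records(summary):
    """Yield (datetime, minor_class) for each timestamp/class pair."""
    for m in RECORD.finditer(summary):
        day, mon, year, hh, mm, ss, minor = m.groups()
        when = datetime(int(year), MONTHS[mon], int(day),
                        int(hh), int(mm), int(ss))
        yield when, int(minor)


def group_bursts(summary, gap_seconds=5):
    """Group consecutive records whose timestamps lie within gap_seconds."""
    groups = []
    for when, minor in read_records(summary):
        if groups and abs((when - groups[-1][-1][0]).total_seconds()) <= gap_seconds:
            groups[-1].append((when, minor))
        else:
            groups.append([(when, minor)])
    return groups


def summarize_bursts(summary, gap_seconds=5):
    """Return (first timestamp, 620 count, 630 count) for each burst."""
    return [
        (g[0][0],
         sum(1 for _, c in g if c == 4),   # minor class 4: 620 entry
         sum(1 for _, c in g if c == 3))   # minor class 3: 630 entry
        for g in group_bursts(summary, gap_seconds)
    ]
```

Bursts with a non-zero 630 count correspond to the entries marked
"<----- !!!" in the listing.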

    
484.7  MAY30::CUMMINS  Thu Feb 13 1997 10:50  11
    I looked again at the error log snippets from reply .2 and there are
    indeed 630 entries. Unfortunately, no EV5 error info is presented. I
    am assuming this is because you are running older DECevent.
    
    I strongly recommend that you update all of your customers to DECevent
    V2.3 with the latest 4100/4000 KNL files. Several problems involving
    CRD error interpretation / reporting have been resolved in DECevent.
    Among other additions and fixes...
    
    Without the syndrome information, it is not possible to diagnose to a
    card pair member.