Conference mvblab::alphaserver_4100

Title: AlphaServer 4100
Moderator: MOVMON::DAVISS
Created: Tue Apr 16 1996
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 648
Total number of notes: 3158

488.0. "AS4100 mach check code 203 - no err bit set" by CSC32::BULLION () Wed Feb 12 1997 18:38

    
     Need help or ideas on the cause of these machine checks on an AS4100.
     Digital UNIX V3.2G (Rev. 62); physical memory = 2048.00 megabytes.
     Three CPUs, two 1 GB memory modules. Firmware revision: 3.0
     PALcode: Digital-UNIX/OSF version 1.21, AlphaServer 4100 5/400 4MB
    
     System has been running since 10 Feb with no errors, but also no users.
     I suspect the problem will start again when users get on the system.
     
       Any ideas as to the cause of the machine checks? They are
       not getting logged to the error log. What is mchk code 203?
    
      Thanks
      Carl Bullion
      Hardware Support, Colorado
    
    //////////////////////////////////////////////////////////////////////////////////
    09-Feb-1997 20:19:24 Machine Check SYSTEM Fatal Abort
    Machine check code = 0x2030000
    pal temp[0-1]           = 0000000000000040 0000000000000000
    pal temp[2-3]           = fffffc0000470810 0000000000004400
    pal temp[4-5]           = 0000000000000002 ffffffffffffff40
    pal temp[6-7]           = 0000000000000000 fffffc0000470290
    pal temp[8-9]           = 1f1e171515020100 fffffc0000470580
    pal temp[10-11]         = 000003ff800dad70 fffffc00004703e0
    pal temp[12-13]         = fffffc0000470780 0000000000006e80
    pal temp[14-15]         = 0000000000000000 00000000000f0000
    pal temp[16-17]         = 0000020306600001 0000000000000000
    pal temp[18-19]         = 000000011fffe760 ffffffffb8273a58
    pal temp[20-21]         = 00000000194e6000 fffffc00004707b0
    pal temp[22-23]         = fffffc0000615790 00000000194cba58
    shadow[0-1]             = 0000000000000000 0000000000000000
    shadow[2-3]             = 0000000000000000 0000000000000000
    shadow[4-5]             = 0000000000000000 0000000000000000
    shadow[6-7]             = 0000000000000000 0000000000000000
    Addr of excepting instruction   = 000003ff800dad70
    Summary of arithmetic traps     = 0000000000000000
    Exception mask                  = 0000000000000000
    Base address for PALcode        = 0000000000014000
    Interrupt Status Reg            = 0000000080000000
    CURRENT SETUP OF EV5 IBOX       = 000000c164020000
    I-CACHE Reg Tag parity error    = 0000000000000000
    D-CACHE error Reg               = 0000000000000000
    Effective VA                    = 0000000000146008
    Reason for D-stream             = 00000000000058d0
    EV5 SCache address              = ffffff000001900f
    EV5 SCache TAG/Data parity      = 0000000000000000
    EV5 BC_TAG_ADDR                 = ffffff8010cdafff
    EV5 EI_ADDR: Phys addr of Xfer  = ffffff0075f0617f
    Fill Syndrome                   = 000000000000002a
    EI_STAT reg                     = fffffff001ffffff
    LD_LOCK                         = ffffff00002007ff
    IOD 0 register dump:
    Base Addr of PCI bridge = 000000f9e0000000
    Whami reg.      = 0000103a    ! DTAG parity error? All other Whami values = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06008232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00000000
    INT_MASK0 reg.  = 00c51110
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    IOD 1 register dump:
    Base Addr of PCI bridge = 000000fbe0000000
    Whami reg.      = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06000232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00000000
    INT_MASK0 reg.  = 00c51111
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    Machine Check SYSTEM Fatal Abort
    .
    .
    (several entries, all the same)
    .
    .
    10-Feb-1997 12:22:47  Machine check code = 0x2030000
    pal temp[0-1]           = 0000000000000007 0000000000000001
    pal temp[2-3]           = fffffc0000470810 0000000000004400
    pal temp[4-5]           = 0000000000000004 0000000000000000
    pal temp[6-7]           = fffffc0000005ce0 fffffc0000470290
    pal temp[8-9]           = 1f1e171515020100 fffffc0000470580
    pal temp[10-11]         = fffffc0000480ae4 fffffc00004703e0
    pal temp[12-13]         = fffffc0000470780 0000000000006e80
    pal temp[14-15]         = 0000000000000000 00000000000f0000
    pal temp[16-17]         = 0000020306600001 0000000000000000
    pal temp[18-19]         = 0000000000000000 ffffffffb691b978
    pal temp[20-21]         = 00000000009ae000 fffffc00004707b0
    pal temp[22-23]         = fffffc0000615790 000000007fc67a58
    shadow[0-1]             = 0000000000000000 0000000000000000
    shadow[2-3]             = 0000000000000000 0000000000000000
    shadow[4-5]             = 00000b2600000000 0000000000000000
    shadow[6-7]             = 0000000000000000 0000000000000000
    Addr of excepting instruction   = fffffc0000480ae4
    Summary of arithmetic traps     = 0000000000000000
    Exception mask                  = 0000000000000000
    Base address for PALcode        = 0000000000014000
    Interrupt Status Reg            = 0000000080e00000
    CURRENT SETUP OF EV5 IBOX       = 000000c160020000
    I-CACHE Reg Tag parity error    = 0000000000000000
    D-CACHE error Reg               = 0000000000000000
    Effective VA                    = ffffffffb6919f50
    Reason for D-stream             = 0000000000016e91
    EV5 SCache address              = ffffff000001904f
    EV5 SCache TAG/Data parity      = 0000000000000000
    EV5 BC_TAG_ADDR                 = ffffff80004d1fff
    EV5 EI_ADDR: Phys addr of Xfer  = ffffff007e00000f
    Fill Syndrome                   = 0000000000000c00
    EI_STAT reg                     = fffffff001ffffff
    LD_LOCK                         = ffffff0000005b6f
    IOD 0 register dump:
    Base Addr of PCI bridge = 000000f9e0000000
    Whami reg.      = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06008232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00011000
    INT_MASK0 reg.  = 00c51110
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    IOD 1 register dump:
    Base Addr of PCI bridge = 000000fbe0000000
    Whami reg.      = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06000232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00001100
    INT_MASK0 reg.  = 00c51111
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    
    10-Feb-1997 12:22:50  Machine Check SYSTEM Fatal Abort
    10-Feb-1997 12:22:50  Machine check code = 0x2030000
    pal temp[0-1]           = 0000000000000007 0000000000000001
    pal temp[2-3]           = fffffc0000470810 0000000000004400
    pal temp[4-5]           = 0000000000000004 0000000000000000
    pal temp[6-7]           = fffffc0000005ce0 fffffc0000470290
    pal temp[8-9]           = 1f1e171515020100 fffffc0000470580
    pal temp[10-11]         = fffffc0000480ae4 fffffc00004703e0
    pal temp[12-13]         = fffffc0000470780 0000000000006e80
    pal temp[14-15]         = 0000000000000000 00000000000f0000
    pal temp[16-17]         = 0000020306600001 0000000000000000
    pal temp[18-19]         = 0000000000000000 ffffffffb691b978
    pal temp[20-21]         = 00000000009ae000 fffffc00004707b0
    pal temp[22-23]         = fffffc0000615790 000000007fc67a58
    shadow[0-1]             = 0000000000000000 0000000000000000
    shadow[2-3]             = 0000000000000000 0000000000000000
    shadow[4-5]             = 00000b2600000000 0000000000000000
    shadow[6-7]             = 0000000000000000 0000000000000000
    Addr of excepting instruction   = fffffc0000480ae4
    Summary of arithmetic traps     = 0000000000000000
    Exception mask                  = 0000000000000000
    Base address for PALcode        = 0000000000014000
    Interrupt Status Reg            = 0000000080e00000
    CURRENT SETUP OF EV5 IBOX       = 000000c160020000
    I-CACHE Reg Tag parity error    = 0000000000000000
    D-CACHE error Reg               = 0000000000000000
    Effective VA                    = ffffffffb6919f50
    Reason for D-stream             = 0000000000016e91
    EV5 SCache address              = ffffff000001904f
    EV5 SCache TAG/Data parity      = 0000000000000000
    EV5 BC_TAG_ADDR                 = ffffff80004dafff
    EV5 EI_ADDR: Phys addr of Xfer  = ffffff007e00000f
    Fill Syndrome                   = 0000000000000c00
    EI_STAT reg                     = fffffff001ffffff
    LD_LOCK                         = ffffff0000005b6f
    IOD 0 register dump:
    Base Addr of PCI bridge = 000000f9e0000000
    Whami reg.      = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06008232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00011000
    INT_MASK0 reg.  = 00c51110
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    IOD 1 register dump:
    Base Addr of PCI bridge = 000000fbe0000000
    Whami reg.      = 000004fa
    Sys. Env. reg.  = 00000000
    PCI Rev. reg.   = 06000232
    CAP_CTL reg.    = 46470ff1
    HAE_MEM reg.    = 00000000
    HAE_IO reg.     = 00000000
    INT_CTL reg.    = 00000003
    INT_REG reg.    = 00001100
    INT_MASK0 reg.  = 00c51111
    INT_MASK1 reg.  = 00000000
    MC_ERR0 reg.    = e0000000
    MC_ERR1 reg.    = 000e88fd
    CAP_ERR reg.    = 00000000
    PCI_ERR1 reg.   = 00000000
    MDPA_STAT reg.  = 00000000
    MDPA_SYN reg.   = 00000000
    MDPB_STAT reg.  = 00000000
    MDPB_SYN reg.   = 00000000
    
    10-Feb-1997 12:22:59  halted CPU 0
    10-Feb-1997 12:22:59
    10-Feb-1997 12:22:59  halt code = 2
    10-Feb-1997 12:22:59  kernel stack not valid halt
    10-Feb-1997 12:22:59  PC = fffffc0000432624
    
    
    
    
T.R  Title  User  Personal Name  Date  Lines
488.1  DTAG parity error on CPU0  MAY30::CUMMINS  Thu Feb 13 1997 12:00  13
    Yes, this log indicates that CPU 0 (MID=2 in WHOAMI) took a DTAG parity
    error. Bit 12 of WHOAMI is set.
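
    As an illustration, the two Whami values from the console log above can
    be decoded with a few lines of Python. This is only a sketch: the bit 12
    position for DTAG parity comes from this note, but the MID mask (low
    three bits) is an assumption that happens to give MID=2 for both logged
    values, and decode_whoami is a made-up helper name. Check the service
    manual for the real field layout.

```python
# Sketch only. Bit 12 = DTAG parity error is taken from the note above.
# MID_MASK is an ASSUMPTION (low three bits); it yields MID=2 for both
# values in the console log but is not confirmed by the thread.
DTAG_PARITY_BIT = 12
MID_MASK = 0x7  # assumed field width


def decode_whoami(value):
    """Return (mid, dtag_parity_error) for a raw WHOAMI value."""
    mid = value & MID_MASK
    dtag_pe = bool((value >> DTAG_PARITY_BIT) & 1)
    return mid, dtag_pe


# Values copied from the console log in 488.0.
samples = [
    ("first entry, IOD 0", 0x0000103A),
    ("all later reads", 0x000004FA),
]

for label, raw in samples:
    mid, dtag_pe = decode_whoami(raw)
    print(f"{label}: WHOAMI={raw:08x} MID={mid} DTAG_PE={dtag_pe}")
```

    Only the first logged value has bit 12 set; every later Whami value in
    the log decodes with the DTAG parity bit clear.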
    
    Info: when SW reads any one of the IOD's WHOAMI CSRs, information about
    the particular CPU is included in the bus transaction. This data
    includes CPU node ID (MID), CPU revision info, and DTAG PARITY and FILL
    ERROR bits. The two error bits are implemented in HW as flops on the
    CPU. The act of reading WHOAMI clears the error flop. Consequently,
    when PALcode collects system error state during error handling, it
    reads WHOAMI off IOD0 and saves this error state for the MCHK frame
    it eventually passes to higher level software. Subsequent reads of the
    same WHOAMI or other IOD's WHOAMI registers will show DTAG PE and FILL
    ERROR clear (assuming there are no new errors).
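
    The read-to-clear behaviour described above can be modelled with a toy
    class. This is a sketch of the idea only, not the actual hardware or
    PALcode: WhoamiModel and its method names are hypothetical, and only the
    DTAG parity bit (bit 12, per this note) is modelled.

```python
class WhoamiModel:
    """Toy model of a read-to-clear error flop (hypothetical, not real HW)."""

    DTAG_PE = 1 << 12  # bit position taken from the note above

    def __init__(self):
        self._dtag_pe = False

    def latch_dtag_parity_error(self):
        # The CPU sets the flop when it detects a DTAG parity error.
        self._dtag_pe = True

    def read(self):
        # Report the error bit, then clear the flop as a side effect.
        value = self.DTAG_PE if self._dtag_pe else 0
        self._dtag_pe = False
        return value


whoami = WhoamiModel()
whoami.latch_dtag_parity_error()

# The first read (e.g. PALcode reading WHOAMI off IOD0) sees the error.
# Later reads see it clear unless a new error has been latched.
reads = [whoami.read() for _ in range(3)]
print([f"{r:04x}" for r in reads])
```

    In this model only the first read after an error returns the bit, which
    matches the log: the error bit appears in one Whami value and is clear in
    every later read.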
488.2  thanks for info:whami = Dtag P.E.  CSC32::BULLION  Thu Feb 13 1997 12:51  16
    
     Thanks for the info. I'll suggest field service replace CPU 0 if the
     errors come back. However, there were many of these errors over several
     hours, and only the first entry had the error bit set in the Whami
     register. If I had missed the first entry ... ? So when are we able to
     latch another error in the Whami register? Also, none of these were
     in the error log, only in the console log! DECevent logged memory errors
     (bad mem module) before and after these machine checks.
    
     Would be nice to have a fault management spec manual on these systems.
    
    Thanks
    Carl B.
    
    
     
488.3  HARMNY::CUMMINS  Tue May 06 1997 12:40  1
    The 4100/4000 Service Manual provides fault management / errors info.