Title: | AlphaServer 4100 |
Moderator: | MOVMON::DAVIS S |
Created: | Tue Apr 16 1996 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 648 |
Total number of notes: | 3158 |
Need help or ideas on the cause of these machine checks on an AS4100.

Digital UNIX V3.2G (Rev. 62); physical memory = 2048.00 megabytes.
Three CPUs, two 1 GB memory modules.
Firmware revision: 3.0
PALcode: Digital-UNIX/OSF version 1.21
AlphaServer 4100 5/400 4MB

The system has been running since 10 Feb with no errors, but also with no users. I suspect the problem will start again when users get on the system. Any ideas as to the cause of the machine checks? These are not getting logged to the error log. What is mchk code 203?

Thanks
Carl Bullion
Hardware Support-colorado

//////////////////////////////////////////////////////////////////////////////////

09-Feb-1997 20:19:24 Machine Check SYSTEM Fatal Abort
Machine check code = 0x2030000
pal temp[0-1] = 0000000000000040 0000000000000000
pal temp[2-3] = fffffc0000470810 0000000000004400
pal temp[4-5] = 0000000000000002 ffffffffffffff40
pal temp[6-7] = 0000000000000000 fffffc0000470290
pal temp[8-9] = 1f1e171515020100 fffffc0000470580
pal temp[10-11] = 000003ff800dad70 fffffc00004703e0
pal temp[12-13] = fffffc0000470780 0000000000006e80
pal temp[14-15] = 0000000000000000 00000000000f0000
pal temp[16-17] = 0000020306600001 0000000000000000
pal temp[18-19] = 000000011fffe760 ffffffffb8273a58
pal temp[20-21] = 00000000194e6000 fffffc00004707b0
pal temp[22-23] = fffffc0000615790 00000000194cba58
shadow[0-1] = 0000000000000000 0000000000000000
shadow[2-3] = 0000000000000000 0000000000000000
shadow[4-5] = 0000000000000000 0000000000000000
shadow[6-7] = 0000000000000000 0000000000000000
Addr of excepting instruction = 000003ff800dad70
Summary of arithmetic traps = 0000000000000000
Exception mask = 0000000000000000
Base address for PALcode = 0000000000014000
Interrupt Status Reg = 0000000080000000
CURRENT SETUP OF EV5 IBOX = 000000c164020000
I-CACHE Reg Tag parity error = 0000000000000000
D-CACHE error Reg = 0000000000000000
Effective VA = 0000000000146008
Reason for D-stream = 00000000000058d0
EV5 SCache address = ffffff000001900f
EV5 SCache TAG/Data parity = 0000000000000000
EV5 BC_TAG_ADDR = ffffff8010cdafff
EV5 EI_ADDR: Phys addr of Xfer = ffffff0075f0617f
Fill Syndrome = 000000000000002a
EI_STAT reg = fffffff001ffffff
LD_LOCK = ffffff00002007ff

IOD 0 register dump:
Base Addr of PCI bridge = 000000f9e0000000
Whami reg. = 0000103a ! ???? dtag par err? all other whami=04fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06008232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00000000
INT_MASK0 reg. = 00c51110
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

IOD 1 register dump:
Base Addr of PCI bridge = 000000fbe0000000
Whami reg. = 000004fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06000232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00000000
INT_MASK0 reg. = 00c51111
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

Machine Check SYSTEM Fatal Abort
.
. several entries, all the same
.
10-Feb-1997 12:22:47 Machine check code = 0x2030000
pal temp[0-1] = 0000000000000007 0000000000000001
pal temp[2-3] = fffffc0000470810 0000000000004400
pal temp[4-5] = 0000000000000004 0000000000000000
pal temp[6-7] = fffffc0000005ce0 fffffc0000470290
pal temp[8-9] = 1f1e171515020100 fffffc0000470580
pal temp[10-11] = fffffc0000480ae4 fffffc00004703e0
pal temp[12-13] = fffffc0000470780 0000000000006e80
pal temp[14-15] = 0000000000000000 00000000000f0000
pal temp[16-17] = 0000020306600001 0000000000000000
pal temp[18-19] = 0000000000000000 ffffffffb691b978
pal temp[20-21] = 00000000009ae000 fffffc00004707b0
pal temp[22-23] = fffffc0000615790 000000007fc67a58
shadow[0-1] = 0000000000000000 0000000000000000
shadow[2-3] = 0000000000000000 0000000000000000
shadow[4-5] = 00000b2600000000 0000000000000000
shadow[6-7] = 0000000000000000 0000000000000000
Addr of excepting instruction = fffffc0000480ae4
Summary of arithmetic traps = 0000000000000000
Exception mask = 0000000000000000
Base address for PALcode = 0000000000014000
Interrupt Status Reg = 0000000080e00000
CURRENT SETUP OF EV5 IBOX = 000000c160020000
I-CACHE Reg Tag parity error = 0000000000000000
D-CACHE error Reg = 0000000000000000
Effective VA = ffffffffb6919f50
Reason for D-stream = 0000000000016e91
EV5 SCache address = ffffff000001904f
EV5 SCache TAG/Data parity = 0000000000000000
EV5 BC_TAG_ADDR = ffffff80004d1fff
EV5 EI_ADDR: Phys addr of Xfer = ffffff007e00000f
Fill Syndrome = 0000000000000c00
EI_STAT reg = fffffff001ffffff
LD_LOCK = ffffff0000005b6f

IOD 0 register dump:
Base Addr of PCI bridge = 000000f9e0000000
Whami reg. = 000004fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06008232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00011000
INT_MASK0 reg. = 00c51110
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

IOD 1 register dump:
Base Addr of PCI bridge = 000000fbe0000000
Whami reg. = 000004fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06000232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00001100
INT_MASK0 reg. = 00c51111
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

10-Feb-1997 12:22:50 Machine Check SYSTEM Fatal Abort
10-Feb-1997 12:22:50 Machine check code = 0x2030000
pal temp[0-1] = 0000000000000007 0000000000000001
pal temp[2-3] = fffffc0000470810 0000000000004400
pal temp[4-5] = 0000000000000004 0000000000000000
pal temp[6-7] = fffffc0000005ce0 fffffc0000470290
pal temp[8-9] = 1f1e171515020100 fffffc0000470580
pal temp[10-11] = fffffc0000480ae4 fffffc00004703e0
pal temp[12-13] = fffffc0000470780 0000000000006e80
pal temp[14-15] = 0000000000000000 00000000000f0000
pal temp[16-17] = 0000020306600001 0000000000000000
pal temp[18-19] = 0000000000000000 ffffffffb691b978
pal temp[20-21] = 00000000009ae000 fffffc00004707b0
pal temp[22-23] = fffffc0000615790 000000007fc67a58
shadow[0-1] = 0000000000000000 0000000000000000
shadow[2-3] = 0000000000000000 0000000000000000
shadow[4-5] = 00000b2600000000 0000000000000000
shadow[6-7] = 0000000000000000 0000000000000000
Addr of excepting instruction = fffffc0000480ae4
Summary of arithmetic traps = 0000000000000000
Exception mask = 0000000000000000
Base address for PALcode = 0000000000014000
Interrupt Status Reg = 0000000080e00000
CURRENT SETUP OF EV5 IBOX = 000000c160020000
I-CACHE Reg Tag parity error = 0000000000000000
D-CACHE error Reg = 0000000000000000
Effective VA = ffffffffb6919f50
Reason for D-stream = 0000000000016e91
EV5 SCache address = ffffff000001904f
EV5 SCache TAG/Data parity = 0000000000000000
EV5 BC_TAG_ADDR = ffffff80004dafff
EV5 EI_ADDR: Phys addr of Xfer = ffffff007e00000f
Fill Syndrome = 0000000000000c00
EI_STAT reg = fffffff001ffffff
LD_LOCK = ffffff0000005b6f

IOD 0 register dump:
Base Addr of PCI bridge = 000000f9e0000000
Whami reg. = 000004fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06008232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00011000
INT_MASK0 reg. = 00c51110
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

IOD 1 register dump:
Base Addr of PCI bridge = 000000fbe0000000
Whami reg. = 000004fa
Sys. Env. reg. = 00000000
PCI Rev. reg. = 06000232
CAP_CTL reg. = 46470ff1
HAE_MEM reg. = 00000000
HAE_IO reg. = 00000000
INT_CTL reg. = 00000003
INT_REG reg. = 00001100
INT_MASK0 reg. = 00c51111
INT_MASK1 reg. = 00000000
MC_ERR0 reg. = e0000000
MC_ERR1 reg. = 000e88fd
CAP_ERR reg. = 00000000
PCI_ERR1 reg. = 00000000
MDPA_STAT reg. = 00000000
MDPA_SYN reg. = 00000000
MDPB_STAT reg. = 00000000
MDPB_SYN reg. = 00000000

10-Feb-1997 12:22:59 halted CPU 0
10-Feb-1997 12:22:59
10-Feb-1997 12:22:59 halt code = 2
10-Feb-1997 12:22:59 kernel stack not valid halt
10-Feb-1997 12:22:59 PC = fffffc0000432624
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
488.1 | DTAG parity error on CPU0 | MAY30::CUMMINS | Thu Feb 13 1997 12:00 | 13 | |
Yes, this log indicates that CPU 0 (MID=2 in WHOAMI) took a DTAG parity error. Bit 12 of WHOAMI is set. Info: when SW reads any one of the IOD's WHOAMI CSRs, information about the particular CPU is included in the bus transaction. This data includes CPU node ID (MID), CPU revision info, and DTAG PARITY and FILL ERROR bits. The two error bits are implemented in HW as flops on the CPU. The act of reading WHOAMI clears the error flop. Consequently, when PALcode collects system error state during error handling, it reads WHOAMI off IOD0 and saves this error state for the MCHK frame it eventually passes to higher level software. Subsequent reads of the same WHOAMI or other IOD's WHOAMI registers will show DTAG PE and FILL ERROR clear (assuming there are no new errors). | |||||
488.2 | thanks for info:whami = Dtag P.E. | CSC32::BULLION | Thu Feb 13 1997 12:51 | 16 | |
Thanks for the info - I'll suggest field service replace CPU 0 if the errors come back. However, there were many of these errors over several hours, and only the first entry had the error bit set in the whami register. If I had missed the first entry - - - - ? So when are we able to latch another error in the whami register? Also, none of these were in the error log, only in the console log! DECevent logged memory errors (bad mem module) before and after these machine checks. It would be nice to have a fault management spec manual on these systems. Thanks, Carl B. | |||||
488.3 | HARMNY::CUMMINS | Tue May 06 1997 12:40 | 1 | ||
The 4100/4000 Service Manual provides fault management / errors info. |
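
As a rough illustration of the WHOAMI decoding described in reply 488.1, here is a minimal Python sketch applied to the two Whami values from the 09-Feb dump. Bit 12 as the DTAG parity error flag comes from that reply. Treating the low three bits as the MID field is only an assumption that happens to give MID=2 for both values; the function and constant names are hypothetical, and the actual field layout should be checked against the service manual.

```python
# Bit position taken from reply 488.1: bit 12 of WHOAMI = DTAG parity error.
DTAG_PARITY_ERROR_BIT = 12

# ASSUMPTION (not stated in the thread): the CPU node ID (MID) sits in the
# low three bits. This is consistent with MID=2 for both logged values.
ASSUMED_MID_MASK = 0x7


def decode_whoami(value: int) -> dict:
    """Split a raw WHOAMI value into the two fields discussed in the thread."""
    return {
        "mid": value & ASSUMED_MID_MASK,
        "dtag_parity_error": bool((value >> DTAG_PARITY_ERROR_BIT) & 1),
    }


if __name__ == "__main__":
    # Whami values copied from the 09-Feb-1997 20:19:24 entry.
    for label, raw in (("IOD 0", 0x0000103A), ("IOD 1", 0x000004FA)):
        print(f"{label}: whami={raw:08x} {decode_whoami(raw)}")
```

Because reading WHOAMI clears the error flop, only the first read after a new error (here the IOD 0 value) would be expected to show the bit set, which matches the later entries all reading 04fa.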