Title: | AlphaServer 4100 |
Moderator: | MOVMON::DAVIS S |
Created: | Tue Apr 16 1996 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 648 |
Total number of notes: | 3158 |
Hi. I have a problem with an A4100: SROM VERSION V1.1 - VERSION 2.0-3, 21-AUG-1996 14:31:24 OPENVMS PALCODE V1.18-8 DIGITAL UNIX PALCODE V1.21-12 With the "TEST CPU" command in console mode I get the following output: PROCESS TIME CPU0: SOFT ERROR DETECTED, VECTOR 00620 MCHK_CODE: 00000000 02040000 The system goes into a loop and a hardware restart is necessary. I tried downgrading the firmware to version 1.2-4, and that fixed the problem. But I need version 2.0-3 because Digital UNIX V3.2-G is installed. Can anybody help me, please? Thanks in advance. Regards, Romeo Cesaeato [Posted by WWW Notes gateway]
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
484.1 | MAY30::CUMMINS | Tue Feb 11 1997 19:24 | 24 | ||
The system is experiencing single-bit (CRD) memory errors. The V1.2-4 console did not automatically report soft errors. This feature had to be enabled by doing SET D_LOGSOFT ON prior to running the V1.2-4 TEST command. V2.0-3 and later consoles automatically enable soft error reporting. This is why you only see the errors with V2.0-3. The fact that it "goes into a loop" suggests that the system has *lots* of these errors - this happens quite frequently. Note: older revision PCI bridge cards can cause IOD-detected CRD errors with a particular footprint. I believe older revision motherboards can cause similar symptoms. You should have your system inventoried as to revision levels to see whether your hardware is fully up to rev. What does the SRM console SHOW FRU command return re: part numbers, serial numbers, and revision levels? It would be helpful if you could post the SHOW FRU display output as a reply to this note. It would also be helpful if you could post three successive 620/630 error reports. This will help us diagnose the problem. It's quite possible the machine in question has bad memory. But we need more data before we can know whether this is faulty memory or some other out-of-rev module. BC | |||||
484.2 | MEM0H -or- MEM0L, which one? | KAOFS::M_NAKAGAWA | Tue Feb 11 1997 21:04 | 231 | |
re .-1 I have some DECevent samples here. I believe we have memory problem with this system. It has 1024MB memory(a pair of B3030-FA) and I wanted to know which half(MEM0H or MEM0L) is causing it. I tried the MACHINE CHECK program(4100.digital_unix) but it didn't help. Some memory problems are discussed in following BLITZ: [TD 2109] Alpha Server 4100 - Memory Errors [TD 2226] Alpha Server 4100 - SRM Console V4.8-3 Thanks for your help, CRDC/Mitz ------------------- MC620/630 DECevent Sample----------------------- Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 3. Bcache error (630 entry) ******************************** ENTRY 6 ******************************** Logging OS 2. Digital UNIX System Architecture 2. Alpha Event sequence number 37. Timestamp of occurrence 03-FEB-1997 15:52:37 Host name emcsats004 System type register x00000016 AlphaStation 4x00 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1. O/S claims event is valid Event severity 5. Low Priority Entry type 100. CPU Machine Check Errors CPU Minor class 4. 620 System Correctable Error Software Flags x0000000000000000 Active CPUs x00000003 Hardware Rev x00000000 System Serial Number C1563 Module Serial Number Module Type x0000 System Revision x00000000 Machine Check Reason x0204 IOD Detected Soft Error Ext Interface Status Reg x0000000000000000 Not Valid for 620 System Correctable Errors Ext Interface Address Reg x0000000000000000 Not Valid for 620 System Correctable Errors Fill Syndrome Reg x0000000000000000 Not Valid for 620 System Correctable Errors Interrupt Summary Reg x0000000000000000 Not Valid for 620 System Correctable Errors WHOAMI x00000000 Module Revision 0. MID 0. GID 0. 
Sys Environmental Regs x00000000 Base Addr of Bridge x000000FBE0000000 Dev Type & Rev Register x06000221 CAP Chip Revision: x00000001 HORSE Module Revision: x00000002 SADDLE Module Revision: x00000002 SADDLE Module Type: Left Hand Internal CAP Chip Arbiter: Enabled PCI Class Code x00000600 MC Error Info Register 0 x193E3C40 MC Bus Trans Addr<31:4>: 193E3C40 MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000 MC Command is Read0-Mem IOD1 Master at Time of Error Device ID 2 x00000005 MC error info valid CAP Error Register x89000000 Error Detected but Not Logged Correctable ECC err det by MDPA MC error info latched MDPA Status Register x00000000 MDPA Status Register Data Not Valid MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid MDPB Status Register x00000000 MDPB Status Register Data Not Valid MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid PALcode Revision Palcode Rev: 1.21-3 ******************************** ENTRY 7 ******************************** Logging OS 2. Digital UNIX System Architecture 2. Alpha Event sequence number 36. Timestamp of occurrence 03-FEB-1997 15:52:37 Host name emcsats004 System type register x00000016 AlphaStation 4x00 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1. O/S claims event is valid Event severity 5. Low Priority Entry type 100. CPU Machine Check Errors CPU Minor class 4. 
620 System Correctable Error Software Flags x0000000000000000 Active CPUs x00000003 Hardware Rev x00000000 System Serial Number C1563 Module Serial Number Module Type x0000 System Revision x00000000 Machine Check Reason x0204 IOD Detected Soft Error Ext Interface Status Reg x0000000000000000 Not Valid for 620 System Correctable Errors Ext Interface Address Reg x0000000000000000 Not Valid for 620 System Correctable Errors Fill Syndrome Reg x0000000000000000 Not Valid for 620 System Correctable Errors Interrupt Summary Reg x0000000000000000 Not Valid for 620 System Correctable Errors WHOAMI x00000000 Module Revision 0. MID 0. GID 0. Sys Environmental Regs x00000000 Base Addr of Bridge x000000F9E0000000 Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001 HORSE Module Revision: x00000002 SADDLE Module Revision: x00000002 SADDLE Module Type: Left Hand PCI-EISA Bus Bridge Present on PCI Segment PCI Class Code x00000600 MC Error Info Register 0 x193E3C40 MC Bus Trans Addr<31:4>: 193E3C40 MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000 MC Command is Read0-Mem IOD1 Master at Time of Error Device ID 2 x00000005 MC error info valid CAP Error Register x89000000 Error Detected but Not Logged Correctable ECC err det by MDPA MC error info latched MDPA Status Register x00000000 MDPA Status Register Data Not Valid MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid MDPB Status Register x00000000 MDPB Status Register Data Not Valid MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid PALcode Revision Palcode Rev: 1.21-3 ******************************** ENTRY 8 ******************************** Logging OS 2. Digital UNIX System Architecture 2. Alpha Event sequence number 35. Timestamp of occurrence 03-FEB-1997 15:52:37 Host name emcsats004 System type register x00000016 AlphaStation 4x00 Number of CPUs (mpnum) x00000002 CPU logging event (mperr) x00000000 Event validity 1. 
O/S claims event is valid Event severity 3. High Priority Entry type 100. CPU Machine Check Errors CPU Minor class 3. Bcache error (630 entry) Software Flags x0000000000000000 Active CPUs x00000003 Hardware Rev x00000000 System Serial Number C1563 Module Serial Number Module Type x0000 System Revision x00000000 Machine Check Reason x0204 IOD Detected Soft Error Ext Interface Status Reg x0000000000000000 Ext Interface Address Reg x0000000000000000 Fill Syndrome Reg x0000000000000000 Interrupt Summary Reg x0000000000000000 WHOAMI x00000000 Module Revision 0. MID 0. GID 0. Sys Environmental Regs x00000000 Base Addr of Bridge x000000F9E0000000 Dev Type & Rev Register x06008221 CAP Chip Revision: x00000001 HORSE Module Revision: x00000002 SADDLE Module Revision: x00000002 SADDLE Module Type: Left Hand PCI-EISA Bus Bridge Present on PCI Segment PCI Class Code x00000600 MC Error Info Register 0 x193E3C40 MC Bus Trans Addr<31:4>: 193E3C40 MC Error Info Register 1 x800F4800 MC bus trans addr <39:32> x00000000 MC Command is Read0-Mem IOD1 Master at Time of Error Device ID 2 x00000005 MC error info valid CAP Error Register x89000000 Error Detected but Not Logged Correctable ECC err det by MDPA MC error info latched MDPA Status Register x00000000 MDPA Status Register Data Not Valid MDPA Error Syndrome Reg x00000000 MDPA Syndrome Register Data Not Valid MDPB Status Register x00000000 MDPB Status Register Data Not Valid MDPB Error Syndrome Reg x00000000 MDPB Syndrome Register Data Not Valid PALcode Revision Palcode Rev: 1.21-3 =========================================== | |||||
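The three entries above all report the same pair of MC Error Info values. As a rough sketch of how the two registers might be combined into a single bus address: the bit layout below is an assumption read off the DECevent labels ("MC Bus Trans Addr<31:4>" and "MC bus trans addr <39:32>"), not taken from a hardware manual, and the function name `mc_bus_address` is hypothetical.

```python
def mc_bus_address(err_info0: int, err_info1: int) -> int:
    """Combine the two MC Error Info registers into one bus address.

    Assumed layout (inferred from the DECevent field labels only):
      - register 0 carries address bits <31:4> in place
      - the low byte of register 1 carries address bits <39:32>
    """
    low = err_info0 & 0xFFFFFFF0   # keep bits <31:4>, drop the low nibble
    high = err_info1 & 0xFF        # bits <39:32>, assumed position
    return (high << 32) | low


# Values taken from the entries above.
addr = mc_bus_address(0x193E3C40, 0x800F4800)
print(f"MC bus address: {addr:#x}")
```

An address on its own does not say whether MEM0H or MEM0L is at fault; it only shows that the same location is failing each time.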
484.3 | MAY30::CUMMINS | Wed Feb 12 1997 10:30 | 55 | ||
This lengthy note is leading somewhere. Bear with me.. The AlphaServer 4100/4000's PCI bridge's ASIC has a bug in it that can cause data corruption given a certain sequence of events. Basically, the sequence involves a CSR read from the IOD's B chip while a DMA is in progress. There are only a couple of registers implemented in the B chip. These include the SYNDROME and STAT CSRs which are used, in part, to provide I/O-detected, single-bit error syndrome status. The only software that would normally access these registers is PALcode. The operating systems never touch them, as all error data collection is handled by PAL (and NT HAL) code on the 4100/4000 platform. We discovered the above data corruption problem prior to FRS. The program opted to not re-spin the ASIC. Instead, we modified PALcode to not collect B chip CSR error info on CRD or MCHK errors. Since VMS/UNIX PALcode attempts to scrub all single-bit memory errors, it was felt that more often than not, the EV5 would also detect a single-bit error during the course of scrubbing the location, assuming the error was not a transient. The impact of all of this is: 1. The data corruption problem is worked around and made effectively moot by changes to PALcode. Customers should not ever see this problem (though you wouldn't want to write an application that periodically polled these STAT and SYNDROME CSRs!) 2. The side effect of (1) is that SYNDROME and STAT registers will always read as zero on I/O-detected CRD errors logged in the system error log. [See note below..] This will obviously hinder isolation to a memory pair member. 3. More often than not, PAL scrubbing, which involves reading and then writing back the data, will generate an EV5-detected CRD (630 or 620) error. PAL will snapshot the EV5 FILL_SYNDROME IPR which will then enable isolation to a pair member. 
Note: the V3.0-10 SRM console and later versions added an automatic enable of PAL collection of I/O SYNDROME and STAT data on I/O-detected 620 CRD errors. This is because TEST performs read-only operations to disk/tape/floppy. No writes.. Therefore, V3.0-10 and greater consoles can be used to diagnose to a memory pair member assuming the problem is repeatable under console TEST. And very often it is.. In summary, if your error log does not have any EV5-detected 620 or 630 CRD error entries, then you will not be able to diagnose to a memory pair member. Are there any CPU-detected CRD errors in the system error log? The ones you posted were all I/O-detected errors... If you update to V3.0-10, and re-run the TEST command, you may see EV5-detected CRD errors. In this case, the error frames displayed will include syndrome data for isolation to a pair member. The V3.0-10 console is available on the V3.8 Firmware Update CD (as well as via our firmware web site..) If questions, let me know. BC | |||||
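The rule of thumb in this reply can be written down as a small check. This is only a sketch of the reasoning as stated above: the `CrdEntry` record, its field names, and both function names are hypothetical, and real DECevent output would need to be parsed into this shape first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrdEntry:
    code: int                 # 620 or 630
    detected_by: str          # "EV5" or "IOD"
    syndrome: Optional[int]   # None when no syndrome data was captured


def parse_console_version(text: str) -> tuple:
    """Turn a console version string such as 'V3.0-10' into (3, 0, 10)."""
    major_minor, _, build = text.lstrip("Vv").partition("-")
    major, minor = major_minor.split(".")
    return (int(major), int(minor), int(build or 0))


def can_isolate_pair_member(entries, console="V2.0-3", from_console_test=False):
    """Return True if any entry carries syndrome data usable for isolation.

    EV5-detected entries qualify whenever a syndrome was captured.
    I/O-detected entries qualify only when they came from console TEST
    on V3.0-10 or later, per the note above.
    """
    new_console = parse_console_version(console) >= (3, 0, 10)
    for entry in entries:
        if entry.syndrome is None:
            continue
        if entry.detected_by == "EV5":
            return True
        if entry.detected_by == "IOD" and from_console_test and new_console:
            return True
    return False


# An error log holding only I/O-detected 620s, as posted in reply .2:
print(can_isolate_pair_member([CrdEntry(620, "IOD", None)]))
```

The version comparison uses plain tuple ordering, so "V4.8-3" sorts after "V3.0-10" without any special casing.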
484.4 | Thanks | KAOFS::M_NAKAGAWA | Wed Feb 12 1997 12:23 | 10 | |
Thanks for the info. >Are there any CPU-detected CRD errors in the system error log? No, they all are I/O-detected. A few MC620 errors followed by a MC630 as described in the TD #2109. Thanks again, CRDC/Mitz | |||||
484.5 | MAY30::CUMMINS | Wed Feb 12 1997 12:33 | 5 | ||
Not sure what you meant in your previous reply. 630 CRD errors are *always* EV5-detected; never I/O-detected. Are you saying you see 630 errors in the error log? BC | |||||
484.6 | 620,620,620 then 630 | KAOFS::M_NAKAGAWA | Wed Feb 12 1997 17:23 | 97 | |
re:last >Are you saying you see 630 errors in the error log? Please refer to 484.2 DECevent entry #8, two MC620's followed by a MC630. Sometimes we just get only MC620's but sometimes MC630 follows immediately after MC620's something like below. When we get MC630, the "MC Error Info Register 0" always contains "x193E3C40" and I was wondering if someone could tell MEM0H or MEM0L if we have memory problem here. The "CAP Error Register"(err sum) says "Correctable ECC err det by MDPA". --------------------------------------- Timestamp of occurrence 04-FEB-1997 08:46:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 04-FEB-1997 08:46:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 04-FEB-1997 08:46:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 3. Bcache error (630 entry) <----- !!! Timestamp of occurrence 02-FEB-1997 12:00:02 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 02-FEB-1997 12:00:02 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 02-FEB-1997 12:00:02 CPU Minor class 3. Bcache error (630 entry) <----- !!! Timestamp of occurrence 31-JAN-1997 16:45:03 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 16:45:03 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 16:45:03 CPU Minor class 3. Bcache error (630 entry) <----- !!! Timestamp of occurrence 31-JAN-1997 14:05:09 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 14:05:09 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 14:05:08 CPU Minor class 4. 
620 System Correctable Error Timestamp of occurrence 31-JAN-1997 14:05:08 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 14:05:08 CPU Minor class 3. Bcache error (630 entry) <------ !!! Timestamp of occurrence 31-JAN-1997 11:03:30 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 11:03:30 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 11:03:30 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 11:03:27 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 11:03:27 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 11:03:27 CPU Minor class 3. Bcache error (630 entry) <----- !!! Timestamp of occurrence 31-JAN-1997 10:24:51 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 10:24:51 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 10:24:51 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 10:24:51 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 31-JAN-1997 10:24:51 CPU Minor class 3. Bcache error (630 entry) <----- !!! Timestamp of occurrence 30-JAN-1997 17:31:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 30-JAN-1997 17:31:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 30-JAN-1997 17:31:38 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 30-JAN-1997 15:52:50 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 30-JAN-1997 15:52:50 CPU Minor class 4. 620 System Correctable Error Timestamp of occurrence 30-JAN-1997 15:52:50 CPU Minor class 4. 620 System Correctable Error --------------------------------------------------------------------------- Thanks again for your help. CRDC/Mitz | |||||
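A summary listing like the one above is easier to read when grouped by timestamp. The sketch below assumes the line format shown in this note ("Timestamp of occurrence ... CPU Minor class N.") and assumes, from the labels in the listing, that minor class 4 marks a 620 entry and minor class 3 a 630 entry. The function name `count_bursts` is hypothetical.

```python
import re
from collections import Counter

# Matches one summary record as it appears in the listing above.
RECORD = re.compile(
    r"Timestamp of occurrence (\S+ \S+)\s+CPU Minor class (\d+)\."
)

# Mapping taken from the labels in the listing (assumed to be complete).
MINOR_CLASS = {"4": "620", "3": "630"}


def count_bursts(text: str) -> dict:
    """Group summary records by timestamp and count 620 and 630 entries."""
    bursts = {}
    for stamp, minor in RECORD.findall(text):
        label = MINOR_CLASS.get(minor)
        if label is None:
            continue  # some other minor class; not counted here
        bursts.setdefault(stamp, Counter())[label] += 1
    return bursts


sample = (
    "Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. "
    "620 System Correctable Error "
    "Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 4. "
    "620 System Correctable Error "
    "Timestamp of occurrence 03-FEB-1997 15:52:37 CPU Minor class 3. "
    "Bcache error (630 entry)"
)
for stamp, counts in count_bursts(sample).items():
    print(stamp, dict(counts))
```

Bursts whose count includes a 630 are the ones worth pulling in full, since those are the entries that can carry EV5 syndrome data.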
484.7 | MAY30::CUMMINS | Thu Feb 13 1997 10:50 | 11 | ||
I looked again at the error log snippets from reply .2 and there are indeed 630 entries. Unfortunately, no EV5 error info is presented. I am assuming this is because you are running older DECevent. I strongly recommend that you update all of your customers to DECevent V2.3 with the latest 4100/4000 KNL files. Several problems involving CRD error interpretation / reporting have been resolved in DECevent. Among other additions and fixes... Without the syndrome information, it is not possible to diagnose to a card pair member. |