
Conference mvblab::alphaserver_4100

Title:AlphaServer 4100
Moderator:MOVMON::DAVISS
Created:Tue Apr 16 1996
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:648
Total number of notes:3158

469.0. "MULTIPLE CPU EXCEPTION. URGENT HELP!!!!! PLEASE" by MDR01::CARRANZ (MCS Madrid) Thu Feb 06 1997 11:20

We have two rack-mounted AlphaServer 4100 5/400 systems with the following configuration:

	Digital Unix V3.2F
	ASEBASE130 ----> DECSAFE
	ASECMS130 ----> DECSAFE
	2 GB of Memory (4*512MB).
	3 power supplies
	4 CPUs, 3MB cache
	
	PCI/IO subsystem configuration:
	2*KZPSA connected to 2 HSZ40 FW V2.7
	1*KZPSC (3 Channels)
	2*DNSES (X.25)
	1*DE435
	1*DEFPA PCI TO FDDI ADAPTER
	1*KZPAA

	The system disk is attached to KZPSC (RAID 1).
	
	
Both 4100s have the same HW/SW configuration.

Our problem is that one of the systems intermittently logs the following errors:

	# uerf -R -o full -f binary.errlog -r 100| more
                                              
                                                  uerf version 4.2-011 (122)


********************************* ENTRY     1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  9.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     2. *********************************


----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  8.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000001

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     3. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  7.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     4. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  6.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     5. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  5.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     6. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  4.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     7. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                 10.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     8. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  9.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY     9. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  8.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000001

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY    10. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  7.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY    11. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  6.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY    12. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  5.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU
********************************* ENTRY    13. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  4.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:47 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
PROCESSOR COUNT                  4.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU

********************************* ENTRY    14. *********************************
-.....----...... 
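To see how the events cluster in time, the timestamps in the uerf output can be tallied. This is a minimal sketch, assuming the report has been saved as text; SAMPLE reproduces only the timestamp lines of entries 1 to 13 above, and the name tally_by_timestamp is illustrative, not part of uerf.

```python
import re
from collections import Counter

# Abbreviated sample: only the OCCURRED/LOGGED ON lines of entries 1-13.
SAMPLE = """\
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED/LOGGED ON                      Fri Jan 31 15:33:13 1997
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED/LOGGED ON                      Fri Jan 31 14:02:06 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:22:36 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:53 1997
OCCURRED/LOGGED ON                      Tue Jan 28 12:21:47 1997
"""

def tally_by_timestamp(text):
    """Count error-log entries per OCCURRED/LOGGED ON timestamp."""
    pattern = re.compile(r"^OCCURRED/LOGGED ON\s+(.+?)\s*$", re.MULTILINE)
    return Counter(pattern.findall(text))

counts = tally_by_timestamp(SAMPLE)
for stamp, n in counts.items():
    print(f"{n:3d}  {stamp}")
```

On this sample the events arrive in small bursts that share a timestamp rather than as isolated entries.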

	# dia -a -f ./binary.errlog -R -o full -i cpu| more
DECevent V2.1


******************************** ENTRY   34 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             9.
Timestamp of occurrence              31-JAN-1997 15:33:13
Host name                            rtrprd2

System type register      x00000016  Systype 22. Not announced yet
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Entry Body Size:          x000000C0
Entry body:

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:    00000000  00000000  00000000  00000001   *................*
 0010:    35323237  32365941  00000000  00000003   *........AY627225*
 0020:    00000000  00000000  00000000  00003434   *44..............*
 0030:    00000000  00000000  00000000  00000000   *................*
 0040:    00000038  00000018  80000000  00000070   *p...........8...*
 0050:    00000000  00000000  00000000  02040000   *................*
 0060:    00000000  00000000  00000000  00000000   *................*
 0070:    00000000  00000000  00000000  00000000   *................*
 0080:    4D8FB340  06000231  000000FB  E0000000   *........1...@..M*
 0090:    00000000  00000000  88000000  800EDB00   *................*
 00A0:    FFFFFC00  004E21B0  00000000  00000000   *.........!N.....*
 00B0:    5E3C7E25  00000000  0020000B  00020115   *...... .....%~<^*



******************************** ENTRY   35 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             8.
Timestamp of occurrence              31-JAN-1997 15:33:13
Host name                            rtrprd2

System type register      x00000016  Systype 22. Not announced yet
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000001

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Entry Body Size:          x000000C0
Entry body:

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:    00000000  00000000  00000000  00000001   *................*
 0010:    35323237  32365941  00000000  00000003   *........AY627225*
 0020:    00000000  00000000  00000000  00003434   *44..............*
 0030:    00000000  00000000  00000000  00000000   *................*
 0040:    00000038  00000018  80000000  00000070   *p...........8...*
 0050:    FFFFFFF0  C1FFFFFF  00000000  00860000   *................*
 0060:    00000000  000000B5  FFFFFF00  4D8FB37F   *...M............*
 0070:    00000000  00000001  00000001  00000000   *................*
 0080:    00000000  00000000  00000000  00000000   *................*
 0090:    00000000  00000000  00000000  00000000   *................*
 00A0:    00000000  00000000  00000000  00000000   *................*
 00B0:    5E3C7E25  00000000  0020000B  00020115   *...... .....%~<^*



******************************** ENTRY   36 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             7.
Timestamp of occurrence              31-JAN-1997 15:33:13
Host name                            rtrprd2

System type register      x00000016  Systype 22. Not announced yet
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Entry Body Size:          x000000C0
Entry body:

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:    00000000  00000000  00000000  00000001   *................*
 0010:    35323237  32365941  00000000  00000003   *........AY627225*
 0020:    00000000  00000000  00000000  00003434   *44..............*
 0030:    00000000  00000000  00000000  00000000   *................*
 0040:    00000038  00000018  80000000  00000070   *p...........8...*
 0050:    00000000  00000000  00000000  02040000   *................*
 0060:    00000000  00000000  00000000  00000000   *................*
 0070:    00000000  00000000  00000000  00000000   *................*
 0080:    4D8FB340  06008231  000000F9  E0000000   *........1...@..M*
 0090:    00000000  00000000  88000000  800EDB00   *................*
 00A0:    FFFFFC00  004E21B0  00000000  00000000   *.........!N.....*
 00B0:    5E3C7E25  00000000  0020000B  00020115   *...... .....%~<^*
                                                                           

******************************** ENTRY   37 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             6.
Timestamp of occurrence              31-JAN-1997 14:02:06
Host name                            rtrprd2

System type register      x00000016  Systype 22. Not announced yet
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Entry Body Size:          x000000C0
Entry body:
          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:    00000000  00000000  00000000  00000001   *................*
 0010:    35323237  32365941  00000000  00000003   *........AY627225*
 0020:    00000000  00000000  00000000  00003434   *44..............*
 0030:    00000000  00000000  00000000  00000000   *................*
 0040:    00000038  00000018  80000000  00000070   *p...........8...*
 0050:    00000000  00000000  00000000  02040000   *................*
 0060:    00000000  00000000  00000000  00000000   *................*
 0070:    00000000  00000000  00000000  00000000   *................*
 0080:    4D8FB340  06000231  000000FB  E0000000   *........1...@..M*
 0090:    00000000  00000000  88000000  800E8B00   *................*
 00A0:    FFFFFC00  004E21B0  00000000  00000000   *.........!N.....*
 00B0:    5E3C7E25  00000000  0020000B  00020115   *...... .....%~<^*



******************************** ENTRY   38 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             5.
Timestamp of occurrence              31-JAN-1997 14:02:06
Host name                            rtrprd2

System type register      x00000016  Systype 22. Not announced yet
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Entry Body Size:          x000000C0
Entry body:

          15--<-12  11--<-08  07--<-04  03--<-00   :Byte Order
 0000:    00000000  00000000  00000000  00000001   *................*
 0010:    35323237  32365941  00000000  00000003   *........AY627225*
 0020:    00000000  00000000  00000000  00003434   *44..............*
 0030:    00000000  00000000  00000000  00000000   *................*
 0040:    00000038  00000018  80000000  00000070   *p...........8...*
 0050:    00000000  00000000  00000000  02040000   *................*
 0060:    00000000  00000000  00000000  00000000   *................*
 0070:    00000000  00000000  00000000  00000000   *................*
 0080:    4D8FB340  06008231  000000F9  E0000000   *........1...@..M*
 0090:    00000000  00000000  88000000  800E8B00   *................*
 00A0:    FFFFFC00  004E21B0  00000000  00000000   *.........!N.....*
 00B0:    5E3C7E25  00000000  0020000B  00020115   *...... .....%~<^*


                                                                               
---......---- (and so on)
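The raw entry bodies above print each 16-byte row with the highest byte on the left (the "15--<-12 ... 03--<-00" header), while the ASCII column on the right runs from byte 0 upward. A minimal sketch of that mapping, using rows 0010 and 0020 of the dump; decode_row is an illustrative helper, not a DECevent function.

```python
def decode_row(words):
    """Turn one dump row into its ASCII column.

    words: the four 32-bit hex words as printed, left (bytes 15-12)
    to right (bytes 3-0).
    """
    raw = b"".join(bytes.fromhex(w) for w in words)  # bytes 15 down to 0
    raw = raw[::-1]                                  # reorder to bytes 0 up to 15
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)

row_0010 = ["35323237", "32365941", "00000000", "00000003"]
row_0020 = ["00000000", "00000000", "00000000", "00003434"]

print(decode_row(row_0010))  # ........AY627225
print(decode_row(row_0020))  # 44..............
```

Read together, the printable characters of the two rows give AY62722544.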

	Here is a boot sequence:

	# uerf -R -r 300 | more
                                                  uerf version 4.2-011 (122)


********************************* ENTRY     1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             OPERATIONAL EVENT
OS EVENT TYPE                  300.     SYSTEM STARTUP
SEQUENCE NUMBER                  0.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Thu Feb  6 09:16:57 1997
OCCURRED ON SYSTEM                      contento
SYSTEM ID                 x00060009     CPU TYPE:  DEC 2100
SYSTYPE                   x00000000
MESSAGE                                 PCXAL keyboard, language Espanol

                                        Alpha boot: available memory from
                                         _0xa24000 to 0x7ffe000
                                        Digital UNIX V3.2C  (Rev. 148); Thu
                                         _Jan 30 17:27:12 MET 1997
                                        physical memory = 128.00 megabytes.
                                        available memory = 117.85 megabytes.
                                        using 483 buffers containing 3.77
                                         _megabytes of memory
                                        Firmware revision: 4.4
                                        PALcode: OSF version 1.45
                                        ibus0 at nexus
                                        AlphaServer 2000 4/275
                                        gpc0 at ibus0
                                        pci0 at ibus0 slot 0
                                        psiop0 at pci0 slot 1
                                        Loading SIOP: script 1000700, reg
                                         _81000000, data 4078c6e8
                                        scsi0 at psiop0 slot 0
                                        rz0 at scsi0 bus 0 target 0 lun 0 (DEC
                                         _    RZ28     (C) DEC 442D)
                                        rz1 at scsi0 bus 0 target 1 lun 0 (DEC
                                         _    RZ26L    (C) DEC 442D)
                                        rz6 at scsi0 bus 0 target 6 lun 0 (DEC
                                         _    RRD45   (C) DEC  1645)
                                        eisa0 at pci0
                                        ace0 at eisa0
                                        ace1 at eisa0
                                        lp0 at eisa0
                                        fdi0 at eisa0
                                        fd0 at fdi0 unit 0
                                        vga0 at eisa0
                                         1024x768 (ATI64   )
                                        vga0: ATI Mach64-GX Rev. 1
                                        Attempt to disable non-existant
                                         _interrupt -1
                                        tu0: DECchip 21040-AA: Revision: 2.4
                                        tu0 at pci0 slot 7
                                        tu0: DEC TULIP Ethernet Interface,
                                         _hardware address: 00-00-F8-20-0E-93
                                        tu0: console mode: selecting BNC
                                         _(10Base2) port
                                        lvm0: configured.
                                        lvm1: configured.
                                        dli: configured
                                        SuperLAT. Copyright 1993 Meridian
                                         _Technology Corp. All rights
                                         _reserved.
                                                               
	We have put the following files in:

	 latina::hard$disk:[mas.airtel]
	(51.695)	

	> messages
	> binary.errlog
	> rtrprd2.html  (sys_check utility)
	
	
        This is a very critical customer.

	We have done the following HW troubleshooting since September 1996:

		* Replaced all four CPUs
		* Replaced the system motherboard
		* Replaced the PCI motherboard
		* Replaced both KZPSAs
		
T.R  Title  User  Personal Name  Date  Lines
469.1  POBOXB::BAK  Thu Feb 06 1997 11:26  1
You need to get the error logs from DECEvent....
469.2  Memory problem, likely  POBOXB::STEINMAN  Thu Feb 06 1997 12:17  26
    
    Looking at the error output in .0, you have an old version of DECEvent
    that isn't cracking the error info, but I was able to discern what the
    problem is.
    
    I believe you have a faulty memory module or pair.  
    
    The address that produced the correctable error is:
    4D8FB340, so you can use the console SHOW MEM command to find
    the base address of the memory pair that contains this address.
    
    It appears that the error syndrome is B5 (though
    I cannot be sure) which points to data<47> which would indicate the
    high memory card of the pair that contains this address.
    
    I also noticed that you reported the cache size as 3MB.  The 400 MHz
    module has a 4MB BCache, but it is reported as 3MB if you have the
    wrong version of console.  If this is indeed the case, I'd look into
    upgrading the firmware as well.
    
    If you need any further assistance, feel free to email me directly
    at POBOXA::STEINMAN or call me at DTN: 223-3874
    
    	mo
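
The lookup described above (take the failing address and find the memory pair whose range contains it) can be sketched as follows. The pair layout in PAIRS is HYPOTHETICAL and only there to make the example run; the real base addresses and sizes have to be read from the console SHOW MEM output. MemPair and find_pair are illustrative names.

```python
from typing import NamedTuple, Optional, Sequence

class MemPair(NamedTuple):
    name: str
    base: int   # base physical address of the pair
    size: int   # size of the pair in bytes

# HYPOTHETICAL layout for illustration only; replace with SHOW MEM values.
PAIRS = [
    MemPair("pair 0", 0x0000_0000, 0x4000_0000),
    MemPair("pair 1", 0x4000_0000, 0x4000_0000),
]

def find_pair(addr: int, pairs: Sequence[MemPair]) -> Optional[MemPair]:
    """Return the pair whose [base, base + size) range contains addr."""
    for p in pairs:
        if p.base <= addr < p.base + p.size:
            return p
    return None

hit = find_pair(0x4D8FB340, PAIRS)
print(hit.name if hit else "address not in any listed pair")
```

With the made-up layout this prints "pair 1"; with the real SHOW MEM values the result may differ.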
    

469.3  We'll test it tomorrow!!  MDR01::CARRANZ  MCS Madrid  Thu Feb 06 1997 12:49  7
    Many thanks for your quick reply. We are going to install the latest DECevent
    version (2.3) and check the FW version and the memory.
    
    I'll update the note with news.
    
    
    Carmen.
469.4  Need some more information  POBOXB::DONALDSON  Thu Feb 06 1997 18:13  28
Hi,

There are a couple of curious items in your information.  First, you say
the system is a 4-CPU system, yet the log says only 2 CPUs are present.
Second, it looks like the error you are getting is a correctable error,
but the error log you have posted is filtered, so the register information
is being suppressed.  Can you post the full error log entry for one of
the 620 errors so we can see the registers?  That way we can see which
component is causing the correctable errors.


     Logging OS                        2. Digital UNIX
     System Architecture               2. Alpha
     Event sequence number             9.
     Timestamp of occurrence              31-JAN-1997 15:33:13
     Host name                            rtrprd2

     System type register      x00000016  Systype 22. Not announced yet
---> Number of CPUs (mpnum)    x00000002
     CPU logging event (mperr) x00000000

     Event validity                    1. O/S claims event is valid
     Event severity                    5. Low Priority
     Entry type                      100. CPU Machine Check Errors

---> CPU Minor class                   4. 620 System Correctable Error

          
469.5  Probably memory problem. Thanks all!!  MDR01::CARRANZ  MCS Madrid  Mon Feb 10 1997 10:46  192
Hi all,

Many thanks for your replies.

Sorry for the mistake: I included the System Startup entry from my own
computer instead of the customer's.

I append the correct one below.

(At that startup the system had only two CPUs, because we were testing the
4-CPU backplane.)

We are now updating the firmware and we are going to replace the memory.

As has been said here, it looks like a memory problem.

We have updated to DECevent V2.3, and the "CPU EXCEPTION" errors now look like this:

dia -R -f ./binary.errlog -i cpus | more


******************************** ENTRY   35 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             8.
Timestamp of occurrence              31-JAN-1997 15:33:13
Host name                            rtrprd2

System type register      x00000016  AlphaServer 4000 Series
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000001

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Software Flags            x0000000000000000
Active CPUs               x00000003
Hardware Rev              x00000000
System Serial Number                 AY62722544
Module Serial Number
Module Type                   x0000
System Revision           x00000000

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF
                                     DATA SOURCE IS MEMORY OR SYSTEM
                                     CORRECTABLE ECC ERROR
                                     D-ref fill
Ext Interface Address Reg xFFFFFF004D8FB37F
Fill Syndrome Reg         x00000000000000B5
Interrupt Summary Reg     x0000000100000000
                                     Correctable ECC Errors (IPL31)
                                     AST Requests 3-0:  x0000000000000000

WHOAMI                    x00000001  CPU1 Detected This Error

--IOD REGISTERS FOLLOW--
Base Addr of Bridge       x0000000000000000
                                     Register Contents Not Valid For This Error
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error
CAP Error Register        x00000000  Register Contents Not Valid For This Error
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid

PALcode Revision                     Palcode Rev: 1.21-3
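
Reply .2 quotes the failing address as 4D8FB340, while the Ext Interface Address Reg in this entry ends in 4D8FB37F. The two are consistent if both fall in the same cache block. A rough sketch of that check, under two assumptions not stated in the log: that only the low 40 bits of the register carry the address, and that the block size is 64 bytes. block_base is an illustrative name.

```python
EI_ADDR = 0xFFFFFF004D8FB37F  # Ext Interface Address Reg from the entry above
ADDR_BITS = 40                # ASSUMPTION: low 40 bits hold the address
BLOCK_SIZE = 64               # ASSUMPTION: 64-byte cache block

def block_base(reg, addr_bits=ADDR_BITS, block=BLOCK_SIZE):
    """Mask off the non-address bits, then align down to the block start."""
    addr = reg & ((1 << addr_bits) - 1)
    return addr & ~(block - 1)

print(hex(block_base(EI_ADDR)))  # 0x4d8fb340
```

Under those assumptions the register value points into the block starting at 4D8FB340, the address quoted in .2.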

                                                                                

	# uerf -R -r 300 -f ./binary.errlog | more


********************************* ENTRY     1. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             OPERATIONAL EVENT
OS EVENT TYPE                  300.     SYSTEM STARTUP
SEQUENCE NUMBER                  1.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Sun Feb  2 07:19:52 1997
OCCURRED ON SYSTEM                      rtrprd2
SYSTEM ID                 x00070016
SYSTYPE                   x00000000
MESSAGE                                 Alpha boot: available memory from
                                         _0x2972000 to 0x7fff6000
                                        Digital UNIX V3.2F (Rev. 69.73); Fri
                                         _Oct 25 14:41:14 MET DST 1996
                                        physical memory = 2048.00 megabytes.
                                        available memory = 2006.51 megabytes.
                                        using 7856 buffers containing 61.37
                                         _megabytes of memory
                                        Master cpu at slot 0.
                                        Firmware revision: 1.2
                                        PALcode: Digital-UNIX/OSF version 1.21
                                        AlphaServer 4100 5/400 3MB
                                        pci1 at mcbus0 slot 5
                                        psiop0 at pci1 slot 1
                                        Loading SIOP: script c0001900, reg
                                         _1222200, data c000d8f8
                                        scsi0 at psiop0 slot 0
                                        rz5 at scsi0 bus 0 target 5 lun 0 (DEC
                                         _    RRD45   (C) DEC  0436)
                                        pza0 at pci1 slot 2
                                        pza0 firmware version: DEC  P01  A10
                                         _
                                        scsi1 at pza0 slot 0
                                        rz8 at scsi1 bus 1 target 0 lun 0 (DEC
                                         _    HSZ40            V27Z)
                                        rz9 at scsi1 bus 1 target 1 lun 0 (DEC
                                         _    HSZ40            V27Z)
                                        rz10 at scsi1 bus 1 target 2 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        rz11 at scsi1 bus 1 target 3 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        tu0: DECchip 21040-AA: Revision: 2.4
                                        tu0 at pci1 slot 3
                                        tu0: DEC TULIP Ethernet Interface,
                                         _hardware address: 00-00-F8-21-ED-4A
                                        tu0: auto sensing: selected UTP
                                         _(10BaseT) port
                                        pza1 at pci1 slot 4
                                        pza1 firmware version: DEC  P01  A10
                                         _
                                        scsi2 at pza1 slot 0
                                        rz16 at scsi2 bus 2 target 0 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        rz17 at scsi2 bus 2 target 1 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        rz18 at scsi2 bus 2 target 2 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        rz19 at scsi2 bus 2 target 3 lun 0
                                         _(DEC     HSZ40            V27Z)
                                        psiop1 at pci1 slot 5
                                        Loading SIOP: script c162f900, reg
                                         _1222000, data c163bcf8
                                        scsi3 at psiop1 slot 0
                                        gpc0 at eisa0
                                        pci0 at mcbus0 slot 4
                                        eisa0 at pci0
                                        ace0 at eisa0
                                        ace1 at eisa0
                                        lp0 at eisa0
                                        fdi0 at eisa0
                                        fd0 at fdi0 unit 0
                                        dns0 at eisa0
                                        dns0: Digital WAN Device Driver
                                         _Interface
                                        dns1: Digital WAN Device Driver
                                         _Interface
                                        Initializing xcr0.  Please wait.
                                        Initializing xcr0.  Please wait.
                                        Initializing xcr0.  Please wait.
                                        Initializing xcr0.  Please wait.
                                        xcr0 at pci0 slot 2
                                        re0 at xcr0 unit 0 (unit status =
                                         _ONLINE, raid level = 1)
                                        re1 at xcr0 unit 1 (unit status =
                                         _ONLINE, raid level = 1)
                                        fta0 DEC DEFPA FDDI Module, Hardware
                                         _Revision 0
                                        fta0 at pci0 slot 5
                                        fta0: DMA Available.
                                        fta0: DEC DEFPA (PDQ) FDDI Interface,
                                         _Hardware address: 00-00-F8-40-F4-C1
                                        fta0: Firmware rev: 2.46
                                        Created FRU table configuration binary
                                         _log packet
                                        lvm0: configured.
                                        lvm1: configured.
                                        dli: configured
                                        SuperLAT. Copyright 1993 Meridian
                                         _Technology Corp. All rights
                                         _reserved.
                                        x25_access: configured
                                        x25_relay: configured
                                        wandd_base: configured
                                        wandd_llc2: configured
                                        wandd_lapb: configured
                                        wan_utilities: configured
                                        ctf_base: configured
                                        Node ID is 00-00-f8-21-ed-4a (from
                                         _device tu0)
                                        dna_netman: configured
                                        dna_dli: configured
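
The boot log above wraps long lines, and the wrapped part starts with an underscore. As a rough sketch only, the wrapped lines could be rejoined and the rz disks grouped by SCSI bus as shown here. The function names `unwrap` and `disks_by_bus` are invented for this illustration, and the idea that a leading "_" marks a continuation (with one space lost at the break) is an assumption read off this listing, not documented behaviour.

```python
import re

# Matches lines such as "rz8 at scsi1 bus 1 target 0 lun 0 (DEC ...)".
DISK_RE = re.compile(r"^(rz\d+) at (scsi\d+) bus (\d+) target (\d+) lun (\d+)")


def unwrap(lines):
    """Rejoin wrapped log lines.

    Assumption: a line whose first non-blank character is '_' continues
    the previous line, and the wrap consumed one space at the break.
    """
    out = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        if text.startswith("_") and out:
            out[-1] = (out[-1] + " " + text[1:]).rstrip()
        else:
            out.append(text)
    return out


def disks_by_bus(lines):
    """Return {scsi bus name: [(disk name, target number), ...]}."""
    buses = {}
    for line in unwrap(lines):
        m = DISK_RE.match(line)
        if m:
            buses.setdefault(m.group(2), []).append((m.group(1), int(m.group(4))))
    return buses
```

Feeding the saved boot log to `disks_by_bus(open(path).read().splitlines())` would give a quick view of which HSZ40 units sit behind which KZPSA, which may help when comparing the two machines.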


Again, thank you very much.

Carmen.
469.6. "CPU problems!!" by MDR01::CARRANZ (MCS Madrid) Wed Feb 12 1997 12:46
Hello,

We have installed DECevent V2.3 and updated the computer's firmware (CD 3.8).

Could somebody please take a look at the log below to confirm our
"cpu exception" problem?

Many thanks and regards,


Carmen Arranz.

******** ****** ***** **** ****

Our errlog now looks like this:

	# dia -R -i cpus -f ./binary.errlog 

DECevent V2.3


******************************** ENTRY    2 ********************************


Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            74.
Timestamp of occurrence              09-FEB-1997 23:20:02
Host name                            rtrprd2

System type register      x00000016  AlphaServer 4000 Series
Number of CPUs (mpnum)    x00000004
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error
                                                                                
Software Flags            x0000000000000000
Active CPUs               x0000000F
Hardware Rev              x00000000
System Serial Number                 C1563
Module Serial Number
Module Type                   x0000
System Revision           x00000000

Machine Check Reason          x0204  IOD Detected Soft Error

Ext Interface Status Reg  x0000000000000000
                                     Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
                                     Register Contents Not Valid For This Error
Fill Syndrome Reg         x0000000000000000
                                     Register Contents Not Valid For This Error
Interrupt Summary Reg     x0000000000000000
                                     Register Contents Not Valid For This Error
WHOAMI                    x00000000  Register Contents Not Valid For This Error

--IOD REGISTERS FOLLOW--
Base Addr of Bridge       x000000FBE0000000
Dev Type & Rev Register   x06000231  CAP Chip Revision:        x00000001
                                     HORSE  Module Revision:   x00000003
                                     SADDLE Module Revision:   x00000002
                                     SADDLE Module Type:        Left Hand
                                     Internal CAP Chip Arbiter: Enabled
                                     PCI Class Code            x00000600
MC Error Info Register 0  x4D8FB340
                                     MC Bus Trans Addr<31:4>: 4D8FB340
MC Error Info Register 1  x800E8800  MC bus trans addr <39:32> x00000000
                                     MC Command is Read0-Mem
                                     CPU0 Master at Time of Error
                                     Device ID:   x00000002
                                     MC error info valid
CAP Error Register        x88000000  Correctable ECC err det by MDPA
                                     MC error info latched
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid

PALcode Revision                     Palcode Rev: 1.21-3
                                                                                

Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number            73.
Timestamp of occurrence              09-FEB-1997 23:20:02
Host name                            rtrprd2

System type register      x00000016  AlphaServer 4000 Series
Number of CPUs (mpnum)    x00000004
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error

Software Flags            x0000000000000000
Active CPUs               x0000000F
Hardware Rev              x00000000
System Serial Number                 C1563
Module Serial Number
Module Type                   x0000
System Revision           x00000000

Machine Check Reason          x0204  IOD Detected Soft Error

Ext Interface Status Reg  x0000000000000000
                                     Register Contents Not Valid For This Error
Ext Interface Address Reg x0000000000000000
                                     Register Contents Not Valid For This Error
Fill Syndrome Reg         x0000000000000000
                                     Register Contents Not Valid For This Error
Interrupt Summary Reg     x0000000000000000
                                     Register Contents Not Valid For This Error
WHOAMI                    x00000000  Register Contents Not Valid For This Error

--IOD REGISTERS FOLLOW--
Base Addr of Bridge       x000000F9E0000000
Dev Type & Rev Register   x06008231  CAP Chip Revision:        x00000001
                                     HORSE  Module Revision:   x00000003
                                     SADDLE Module Revision:   x00000002
                                     SADDLE Module Type:        Left Hand
                                     PCI-EISA Bus Bridge Present on PCI Segment
                                     PCI Class Code            x00000600
MC Error Info Register 0  x4D8FB340
                                     MC Bus Trans Addr<31:4>: 4D8FB340
MC Error Info Register 1  x800E8800  MC bus trans addr <39:32> x00000000
                                     MC Command is Read0-Mem
                                     CPU0 Master at Time of Error
                                     Device ID:   x00000002
                                     MC error info valid
CAP Error Register        x88000000  Correctable ECC err det by MDPA
                                     MC error info latched
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid

PALcode Revision                     Palcode Rev: 1.21-3


******************************** ENTRY    4 ********************************

.... and so on ....
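
Each entry above reports "Number of CPUs x00000004" together with "Active CPUs x0000000F". A minimal sketch of reading that mask follows. It assumes that bit n set means CPU n is active, which fits the two values shown but is not confirmed anywhere in this thread. The function name `active_cpus` is made up for the example.

```python
def active_cpus(mask):
    """List CPU numbers whose bit is set in an 'Active CPUs' mask.

    Assumption: bit n of the mask corresponds to CPU n.
    """
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


if __name__ == "__main__":
    # Value taken from the entries above.
    print(active_cpus(0x0000000F))
```

A mask with a missing bit in a later log would be one quick hint that a CPU had dropped out, under the same assumption.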

I'll put our binary.errlog into:

	chueca:: (51.195)

Thanks and regards,

Carmen Arranz & Elsa Soengas
469.7. by MAY30::CUMMINS Wed Feb 12 1997 13:08
    See note 484. Your system and the customer system described in note
    484 are possibly experiencing the same symptoms. The noter in note 484
    indicated there may have been 630 errors, but we're checking on this.
    
    1024MB (1GB) EDO memory pairs were being used on the system described
    in note 484. There's a problem with SYNC memories and older revision
    motherboards that I didn't bother to discuss in the 484 note string.
    Does your system have any SYNC memories? If so, do you know what the
    system motherboard revision level is? The footprint of the problem I'm
    describing is that only IOD-detected 620 CRD errors are ever seen. No
    CPU-detected CRDs are ever logged.
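
One way to check for the footprint described here (only IOD-detected CRDs, never CPU-detected ones) is to tally the "Machine Check Reason" lines in a saved DECevent report. The following is a sketch under assumptions: the report was saved as text in the layout shown in this note string, and the helper names `tally_reasons` and `only_iod_detected` are invented for the illustration.

```python
import re
from collections import Counter

# Example line: "Machine Check Reason          x0204  IOD Detected Soft Error"
REASON_RE = re.compile(
    r"^[ \t]*Machine Check Reason[ \t]+x([0-9A-Fa-f]+)[ \t]+(.*\S)",
    re.MULTILINE,
)


def tally_reasons(report):
    """Count entries per (reason code, reason text) in a DECevent text report."""
    return Counter((code.lower(), text) for code, text in REASON_RE.findall(report))


def only_iod_detected(counts):
    """True if at least one entry was tallied and all of them are IOD-detected."""
    return bool(counts) and all("IOD Detected" in text for _, text in counts)
```

If `only_iod_detected(tally_reasons(text))` came back true over a long log, that would be consistent with the footprint above. A single CPU-detected entry would argue against it.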
    
    Are there any 630 entries in the customer's error log? Have you tried
    running the console TEST command with this version of console? If there
    are 630s, use the FILL_SYNDROME data to determine the card pair member.
    Use the error address and the SRM console SHOW MEMORY command to
    determine which memory pair is faulty (assuming the problem is memory).
    If SYNC memory, then the problem may well be an older-rev motherboard.
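
The address-to-pair step can be sketched as a simple range lookup. The base addresses and sizes below are HYPOTHETICAL placeholders, not this system's layout. The real values would have to be copied from the console's SHOW MEMORY output, whose exact format is not reproduced in this thread. The names `MemoryPair` and `find_pair` are likewise made up for the example.

```python
from typing import NamedTuple


class MemoryPair(NamedTuple):
    name: str
    base: int  # starting physical address of the pair
    size: int  # size of the pair in bytes


# HYPOTHETICAL layout for illustration only; replace with real SHOW MEMORY data.
HYPOTHETICAL_PAIRS = [
    MemoryPair("pair 0", 0x00000000, 0x40000000),
    MemoryPair("pair 1", 0x40000000, 0x40000000),
]


def find_pair(addr, pairs):
    """Return the pair whose address range contains addr, or None."""
    for pair in pairs:
        if pair.base <= addr < pair.base + pair.size:
            return pair
    return None
```

With the real layout filled in, `find_pair(0x4D8FB340, pairs)` would name the pair that holds the error address logged in 469.6. This does not identify which member of the pair is at fault; that part depends on the syndrome.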
    
    BC
469.8. "The rest of the story...." by POBOXB::STEINMAN Wed Feb 12 1997 16:35
    
    Bill,
    
    That was only a partial listing from DECEvent.  Here is the complete
    log of the error (including the EV5 detected error):
    
From:	POBOXA::SHEPARD      "GARY DTN 223-2499" 12-FEB-1997 15:35:23.74
To:	POBOXB::STEINMAN
CC:	SHEPARD
Subj:	RE: DECEvent log + pointer to binary -- wanna have a look to confirm memory failure? Sure looks like it to me...thanks

Hi Mo,

Here is the CPU detected CRD followed by the IOD detected.  It has
a syndrome of B5 just like you determined from the notes file.
This matches up with your analysis in the notes file.


Gary


******************************** ENTRY    7 ******************************** 


Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number            30. 
Timestamp of occurrence              05-FEB-1997 17:21:46   
Host name                            rtrprd2 

System type register      x00000016  AlphaServer 4000 Series 
Number of CPUs (mpnum)    x00000004 
CPU logging event (mperr) x00000003 

Event validity                    1. O/S claims event is valid 
Event severity                    5. Low Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   4. 620 System Correctable Error 

Software Flags            x0000000000000000 
Active CPUs               x0000000F 
Hardware Rev              x00000000 
System Serial Number                 C1563 
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0086  Alpha Chip Detected ECC Error, From Memory 

Ext Interface Status Reg  xFFFFFFF0C1FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
Ext Interface Address Reg xFFFFFF004D8E337F 
Fill Syndrome Reg         x00000000000000B5 
Interrupt Summary Reg     x0000000100000000 
                                     Correctable ECC Errors (IPL31) 
                                     AST Requests 3-0:  x0000000000000000 
                                       
WHOAMI                    x00000003  CPU3 Detected This Error 
                                       
--IOD REGISTERS FOLLOW--               
Base Addr of Bridge       x0000000000000000 
                                     Register Contents Not Valid For This Error 
Dev Type & Rev Register   x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 0  x00000000  Register Contents Not Valid For This Error 
MC Error Info Register 1  x00000000  Register Contents Not Valid For This Error 
CAP Error Register        x00000000  Register Contents Not Valid For This Error 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.21-3 


******************************** ENTRY    8 ******************************** 


Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number            31. 
Timestamp of occurrence              05-FEB-1997 17:21:46   
Host name                            rtrprd2 

System type register      x00000016  AlphaServer 4000 Series 
Number of CPUs (mpnum)    x00000004 
CPU logging event (mperr) x00000000 

Event validity                    1. O/S claims event is valid 
Event severity                    5. Low Priority 
Entry type                      100. CPU Machine Check Errors 

CPU Minor class                   4. 620 System Correctable Error 

Software Flags            x0000000000000000 
Active CPUs               x0000000F 
Hardware Rev              x00000000 
System Serial Number                 C1563 
Module Serial Number                   
Module Type                   x0000 
System Revision           x00000000 

Machine Check Reason          x0204  IOD Detected Soft Error 

Ext Interface Status Reg  x0000000000000000 
                                     Register Contents Not Valid For This Error 
Ext Interface Address Reg x0000000000000000 
                                     Register Contents Not Valid For This Error 
Fill Syndrome Reg         x0000000000000000 
                                     Register Contents Not Valid For This Error 
Interrupt Summary Reg     x0000000000000000 
                                     Register Contents Not Valid For This Error 
WHOAMI                    x00000000  Register Contents Not Valid For This Error 
                                       
--IOD REGISTERS FOLLOW--               
This Bus Bridge Phy Addr  x000000F9E0000000 
                                     IOD# 0 
Dev Type & Rev Register   x06008231  CAP Chip Revision:        x00000001 
                                     B3040 Module Revision:    x00000003 
                                     B3050 Module Revision:    x00000002 
                                     B3050 Module Type:   Left Hand 
                                     PCI-EISA Bus Bridge Present on PCI Segment 
                                     Device Class: Host Bus to PCI Bridge 
MC Error Info Register 0  x4D8E3340 
                                     MC Bus Trans Addr<31:4>: 4D8E3340 
MC Error Info Register 1  x800FDA00  MC bus trans addr <39:32> x00000000 
                                     MC Command is ReadMod0-Mem 
                                     CPU3 OR IOD3 Master at Time of Error 
                                     Device ID:   x00000007 
                                     MC error info valid 
CAP Error Register        x88000000  Correctable ECC err det by MDPA 
                                     MC error info latched 
MDPA Status Register      x00000000  MDPA Status Register Data Not Valid 
MDPA Error Syndrome Reg   x00000000  MDPA Syndrome Register Data Not Valid 
MDPB Status Register      x00000000  MDPB Status Register Data Not Valid 
MDPB Error Syndrome Reg   x00000000  MDPB Syndrome Register Data Not Valid 
                                       
PALcode Revision                     Palcode Rev: 1.21-3 
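
Entries 7 and 8 carry the same timestamp, and their addresses appear to point at the same place. A small sketch of that comparison follows. It assumes that only bits <39:0> of the Ext Interface Address Reg are address bits (the upper bits reading as ones) and that the two reports can be compared at 64-byte block granularity. Both are assumptions for illustration, and `block_base` is an invented helper.

```python
BLOCK_BYTES = 64            # assumed comparison granularity
ADDR_MASK = (1 << 40) - 1   # assumed: address lives in bits <39:0>


def block_base(reg_value):
    """Strip non-address bits and round down to the start of the block."""
    return (reg_value & ADDR_MASK) & ~(BLOCK_BYTES - 1)


# Values copied from the two entries above.
EI_ADDR_ENTRY_7 = 0xFFFFFF004D8E337F   # Ext Interface Address Reg (CPU3)
MC_ADDR_ENTRY_8 = 0x4D8E3340           # MC Error Info Register 0 (IOD)

if __name__ == "__main__":
    same = block_base(EI_ADDR_ENTRY_7) == block_base(MC_ADDR_ENTRY_8)
    print(hex(block_base(EI_ADDR_ENTRY_7)), hex(block_base(MC_ADDR_ENTRY_8)), same)
```

Under those assumptions both values reduce to the same block, which fits Gary's reading that the CPU-detected and IOD-detected CRDs describe one memory read.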
    
469.9. "...a little more of the rest of the story" by POBOXB::STEINMAN Wed Feb 12 1997 16:38
    
    Just to complete the story....486.2 is the DECEvent log for
    the problem described in note 469.  It is not a CPU problem, but
    a memory CRD problem.
    
    	mo