
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

1795.0. "rmerror_int: Error_count = 1 unit = 0 Err_reg = 0xffffffffa0000002 Node = 2 panic (cpu 0): rmerror_int: fatal error and no alternate mc to failover " by TUXEDO::SWEENEY (Tom Sweeney in LKG) Mon Dec 23 1996 15:09

T.R  Title  User  Personal Name  Date  Lines
1795.1  Also...  TUXEDO::SWEENEY  Tom Sweeney in LKG  Tue Jan 07 1997 14:23  13
1795.2  Still the problem....  TUXEDO::SWEENEY  Tom Sweeney in LKG  Fri Jan 24 1997 15:58  10
    Hi All,
    
    	I'm still having problems.  We changed the size of the swap space
    on SKULK, and now the cluster stays up for about 10 hours.  But then it
    crashes, SKULK comes back up, but CELL goes into a panic and stays
    there until it's powered up.
    
    	Any suggestions, or should I put in a QAR?
    
    	tom
1795.3  SMURF::MARSHALL  Rob Marshall - USEG  Fri Jan 24 1997 17:07  21
    Hi Tom,
    
    Well, before you open a QAR, or worse a CLD (gasp! :-), what version 
    are you running?  There are some timing problems with the rm driver
    in v1.4 for which I could give you some test patches.
    
    Also, the panic, as such, means exactly what it says, i.e. it got an 
    error on the one and only memory channel it had, so it panics.  We
    could probably argue about whether this is the right way to do things,
    but...
    
    You can get a better idea of what is causing the panic by looking at
    the core file and using some of Bill Grava's tools (look at his home
    page: http://www.zk3.dec.com/~grava) to annotate the register contents.
    
    
    Hope this gives you a better starting point,
    
    Rob Marshall
    USEG
1795.4  That's a Jump  TUXEDO::SWEENEY  Tom Sweeney in LKG  Wed Jan 29 1997 13:52  15
Hi Rob,

	Thanks for the note.  Apologies for not replying sooner,
life's been crazy.

	We're running 3.2g Unix with v1.3 ASEmgr.  Is that what 
you were looking for?  For the moment, we have to keep this cluster
at 3.2g Unix and can't go to 4.0.

	I'll take a look at Bill's home page.  Thanks for the pointer.
I'll post an update as soon as I can.

	Thanks!
	
	Tom Sweeney
1795.5  Finally some further progress.  TUXEDO::SWEENEY  Tom Sweeney in LKG  Tue Feb 18 1997 17:24  422
Thanks for the pointer to Mr Grava's page.  It was quite helpful in eliminating 
some potential problems.  I pulled over the rmstuf executable and ran it as 
suggested.  The output is below.  Does it shed any light on what the root
problem might be?  From my limited knowledge, it looks like we have a ton of 
interrupts that probably shouldn't be happening.  But I don't know what to do
about them if that's the problem.  It also seems to indicate that the hub might
not be too happy.  Am I reading this correctly?
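
One quick sanity check on the counters in the dump below: the sketch here is plain Python, not part of rmstuf. The field names and values are copied from the rmPrimary output; that errorInts, stateInts and notifInts are the per-type pieces of interruptCount is an assumption, which the code tests rather than takes for granted.

```python
# Counter values copied from the rmPrimary dump in this note.
counters = {
    "interruptCount": 0x1ab9b,
    "errorInts": 0x1,
    "stateInts": 0x2,
    "notifInts": 0x1ab98,
}


def breakdown(c):
    """Return (total, sum of per-type counts, notifInts share of total)."""
    parts = c["errorInts"] + c["stateInts"] + c["notifInts"]
    share = c["notifInts"] / c["interruptCount"]
    return c["interruptCount"], parts, share


total, parts, share = breakdown(counters)
print(f"total={total} parts={parts} notifInts share={share:.4%}")
```

If the per-type counts do add up to the total, then almost every interrupt is of the notifInts kind and only one is an error interrupt, so the raw interrupt count by itself may not point at a fault.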

Plus there are some abbreviations here that were not listed in the documentation
for the tool and I've no idea what they stand for.


Thanks!

	Tom

skulk.lkg.dec.com> rmstuf -f temp.txt
thread 0xfffffc001fe41c00 stopped at  [boot:1760 ,0xfffffc00003e195c]    Source not available
(dbx) px *rmPrimary
struct {
    boardRev = 0xb
    rmZen = 0x11f1b26d  ==> PRI ONLINE AOK HOK HONL
    errorCount = 0x1
    interruptCount = 0x1ab9b
    errorInts = 0x1
    errorsProcessed = 0x1
    stateInts = 0x2
    notifInts = 0x1ab98
    droppedAcks = 0x0
    utilityCounter = 0x1ab98
    rebounds = 0x0
    lastIntTime = 0x8903032d9c58d       ==> Mon 01/13/97 00:18:05.561200(musec)
    ISRentryErr = 0x4040001     ==> FATAL;
    ISRentryPort = 0x42420001   ==>  !AOK, HOK, !HONL, TYP=1 NID=2
    ISRentryLcsr = 0x84002      ==>  (SC) (RCV err) UPNODES ( 0 )
    ISRentryLastLcsr = 0x280078 ==>  UPNODES ( 0 2 )
    lastLCSR = 0x84002  ==>  (SC) (RCV err) UPNODES ( 0 )
    lastLCSR1 = 0x0
    lastRMPORT = 0x42420001     ==>  !AOK, HOK, !HONL, TYP=1 NID=2
    lastRMERR = 0x4040001       ==> FATAL;
    pIPconfig = 0xfffffc001fe04300
    rmel = struct {
        rmspurFlag = 0x180011
        rmspurTbar = 0x0
        rmspurRbar = 0x280000b
        rmspurLcsr = 0xf800
        rmspurRmerr = 0xc0000000
        rmspurGubar = 0x124
        rmspurPCIRevID = 0x0
        rmspurPCIStatus = 0x84002
        rmspurPCICommand = 0x8000000
    }
    errIndex = 0x4040001
    queueCount = 0x42420001
    errors = {
        [0] struct {
            when = 0x5  ==> Wed 12/31/69 19:00:05.0(musec)
            zenState = 0x8f35000000004  ==> ALT OFFLINE !AOK !HOK !HONL
            bolt = 0x8a1106d
            errType = 0x1259
            lcsr = 0x0
            lastLcsr = 0x3      ==>  (XMT err) (RCV err) UPNODES ( none )
            rmerr = 0x84000     ==>
            rmport = 0x80878    ==>  !AOK, !HOK, !HONL, TYP=0 NID=8
            statusView = 0x1
            sequence = 0x0
        }
        [1] struct {
            when = 0x1040052010001      ==> Tue 08/06/13 09:54:09.66560(musec)
            zenState = 0x8f35000000004  ==> ALT OFFLINE !AOK !HOK !HONL
            bolt = 0x1a1106d
            errType = 0x1259
            lcsr = 0x0
            lastLcsr = 0x3      ==>  (XMT err) (RCV err) UPNODES ( none )
            rmerr = 0x80000     ==>
            rmport = 0x80000    ==>  !AOK, !HOK, !HONL, TYP=0 NID=8
            statusView = 0x1
            sequence = 0x0
        }
        [2] struct {
            when = 0x3040052010001      ==> Tue 08/06/13 09:54:09.197632(musec)
            zenState = 0x8f35000000004  ==> ALT OFFLINE !AOK !HOK !HONL
            bolt = 0x9a1106d
            errType = 0x1259
            lcsr = 0x0
            lastLcsr = 0x3      ==>  (XMT err) (RCV err) UPNODES ( none )
            rmerr = 0x84000     ==>
            rmport = 0x80078    ==>  !AOK, !HOK, !HONL, TYP=0 NID=8
            statusView = 0x1
            sequence = 0x0
        }
        [3] struct {
            when = 0x1040072420001      ==> Sun 09/29/30 12:00:01.66560(musec)
            zenState = 0x928b000000004  ==> ALT OFFLINE !AOK !HOK !HONL
            bolt = 0x1e1b26d
            errType = 0x1267
            lcsr = 0x0
            lastLcsr = 0x3      ==>  (XMT err) (RCV err) UPNODES ( none )
            rmerr = 0x280000    ==> PCT-RPE
            rmport = 0x280000   ==>  !AOK, !HOK, !HONL, TYP=0 NID=40
            statusView = 0x1
            sequence = 0x0
        }
        [4] struct {
            when = 0x3040072420001      ==> Sun 09/29/30 12:00:01.197632(musec)
            zenState = 0x8903032d9c58d  ==> ALT OFFLINE !AOK HOK HONL
            bolt = 0x1f1b26d
            errType = 0x3bf463f
            lcsr = 0x0
            lastLcsr = 0x3      ==>  (XMT err) (RCV err) UPNODES ( none )
            rmerr = 0x84002     ==>
            rmport = 0x280078   ==>  !AOK, !HOK, !HONL, TYP=0 NID=40
            statusView = 0x1
            sequence = 0x404    ==>
        }
        [5] struct {
            when = 0x40042420001        ==> Wed 03/23/05 18:47:13.1024(musec)
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [6] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [7] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [8] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [9] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [10] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [11] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [12] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [13] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [14] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
        [15] struct {
            when = 0x0
            zenState = 0x0
            bolt = 0x0
            errType = 0x0
            lcsr = 0x0
            lastLcsr = 0x0
            rmerr = 0x0
            rmport = 0x0
            statusView = 0x0
            sequence = 0x0
        }
    }
    dmaMapCalls = 0x0
    dmaPagesMapped = 0x0
    dmaUnmapCalls = 0x59
    dmaPagesUnmapped = 0xba
    ulong1 = 0x500000003
    ulong2 = 0x0
    pScatGath = (nil)
    pErrHandler = 0xfffffc001fa16360
    pNotifHandlr = 0xfffffc00004e3730
    consoleConfigHeader = struct {
        vendor_id = 0x7f90
        device_id = 0x4d
        command = 0xfc00
        status = 0xffff
        rev_id = 0x11
        class_code = struct {
            pio_int = 0x10
            sub_class = 0x18
            base = 0x0
        }
        cache_line_size = 0x6
        latency_timer = 0x0
        hdr_type = 0x0
        bist = 0x4
        hdrtype_u = union {
            type0 = struct {
                bar0 = 0xf8000280000b
                bar1 = 0x83c0c0000000
                bar2 = 0x820000000000
                bar3 = 0x820000000000
                bar4 = 0x820000000000
                bar5 = 0x820000000000
                rsvd0 = 0x820000000000
                rsvd1 = 0x0
                exp_rom_bar = 0x0
                cis_ptr = 0x820000000000
                sub_vendor_id = 0x0
                sub_device_id = 0x0
                rsvd3 = 0x0
                intr_line = 0x0
                intr_pin = 0x0
                min_gnt = 0x0
                max_lat = 0x0
                rsvd4 = 0x0
            }
            type1 = struct {
                bar0 = 0xf8000280000b
                bar1 = 0x83c0c0000000
                pri_bus_num = 0x0
                sec_bus_num = 0x0
                sub_bus_num = 0x0
                sec_max_lat = 0x0
                io_base = 0x0
                io_limit = 0x82
                sec_status = 0x0
                mem_base = 0x0
                mem_limit = 0x0
                premem_base = 0x8200
                premem_limit = 0x0
                premem_base_upper = 0x0
                premem_limit_upper = 0x8200
                io_base_upper = 0x0
                io_limit_upper = 0x0
                rsvd5 = {
                    [0] 0x8200
                    [1] 0x0
                    [2] 0x8200
                }
                ppb_exp_rom_bar = 0x0
                rsvd6 = {
                    [0] 0x0
                    [1] 0x820000000000
                    [2] 0x0
                }
                intr_line = 0x0
                intr_pin = 0x0
                bridge_ctl = 0x0
                rsvd7 = 0x0
            }
        }
        private = 0x124
        config_base = 0x0
    }
    OSFConfigHeader = struct {
        vendor_id = 0x3800
        device_id = 0x0
        command = 0x8390
        status = 0x0
        rev_id = 0x11
        class_code = struct {
            pio_int = 0x10
            sub_class = 0x18
            base = 0x0
        }
        cache_line_size = 0x46
        latency_timer = 0x1
        hdr_type = 0x0
        bist = 0x4
        hdrtype_u = union {
            type0 = struct {
                bar0 = 0xf8000280000b
                bar1 = 0x83c0c0000000
                bar2 = 0x820000000000
                bar3 = 0x820000000000
                bar4 = 0x820000000000
                bar5 = 0x820000000000
                rsvd0 = 0x820000000000
                rsvd1 = 0x0
                exp_rom_bar = 0x0
                cis_ptr = 0x820000000000
                sub_vendor_id = 0x0
                sub_device_id = 0x0
                rsvd3 = 0x0
                intr_line = 0x0
                intr_pin = 0x0
                min_gnt = 0x0
                max_lat = 0x0
                rsvd4 = 0x0
            }
            type1 = struct {
                bar0 = 0xf8000280000b
                bar1 = 0x83c0c0000000
                pri_bus_num = 0x0
                sec_bus_num = 0x0
                sub_bus_num = 0x0
                sec_max_lat = 0x0
                io_base = 0x0
                io_limit = 0x82
                sec_status = 0x0
                mem_base = 0x0
                mem_limit = 0x0
                premem_base = 0x8200
                premem_limit = 0x0
                premem_base_upper = 0x0
                premem_limit_upper = 0x8200
                io_base_upper = 0x0
                io_limit_upper = 0x0
                rsvd5 = {
                    [0] 0x8200
                    [1] 0x0
                    [2] 0x8200
                }
                ppb_exp_rom_bar = 0x0
                rsvd6 = {
                    [0] 0x0
                    [1] 0x820000000000
                    [2] 0x0
                }
                intr_line = 0x0
                intr_pin = 0x0
                bridge_ctl = 0x0
                rsvd7 = 0x0
            }
        }
        private = 0x124
        config_base = 0x0
    }
    pCtlr = 0x839000003800
    pDevice = 0xfffffc00005fdb18
    statCmd = 0x0
    plcsr = (nil)
    ptbar = 0xfffffc83c0040000
    prbar = 0xfffffc83c0040004
    pgubar = 0xfffffc83c0040008
    prmerr = 0xfffffc83c0041008 ==> FATAL; PCT-XPE HUB-TO
    prmport = 0xfffffc83c0041000        ==>  VHM, !AOK, HOK, !HONL, TYP=0 NID=4
    ISRid = 0xfffffc83c0041004
    defaultCMD = 0x1fe50030
    pTransmitBase = 0xffff0146l1 address 0xffff0146 not mapped, pte 0x0
 = "l1 address 0xffff0146 not mapped, pte 0x0

can't read from process (address 0xffff0146)
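
The `==>` date annotations in the dump above look consistent with a 64-bit value that packs a struct timeval: seconds since the Unix epoch in the low 32 bits and microseconds in the high 32 bits. That layout is inferred from the numbers, not taken from the rmstuf documentation, and `decode_timeval` and `as_utc` are hypothetical helper names. A minimal Python sketch:

```python
from datetime import datetime, timedelta, timezone


def decode_timeval(raw):
    """Split a packed 64-bit value into (seconds, microseconds)."""
    sec = raw & 0xFFFFFFFF            # low 32 bits: assumed tv_sec
    usec = (raw >> 32) & 0xFFFFFFFF   # high 32 bits: assumed tv_usec
    return sec, usec


def as_utc(raw):
    """Render the packed value as a timezone-aware UTC datetime."""
    sec, usec = decode_timeval(raw)
    base = datetime.fromtimestamp(sec, tz=timezone.utc)
    return base + timedelta(microseconds=usec)


# lastIntTime from the dump
print(decode_timeval(0x8903032d9c58d), as_utc(0x8903032d9c58d))
```

For lastIntTime this gives 05:18:05.561200 UTC on 13 Jan 1997, which matches the 00:18:05.561200 printed by the tool if it reports US Eastern time. The same split also reproduces the 66560 and 197632 microsecond fields shown for the errors[] entries.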

1795.6  Any Ideas?  TUXEDO::SWEENEY  Tom Sweeney in LKG  Tue Feb 25 1997 13:28  9
Still hoping for some ideas.  I don't know what else to test on our
cluster, and it's down right now.  We'd really like to get it up
so we can test our product on it.

Does the previous reply shed ANY light on our problem?

thanks!

tom
1795.7  Any Clue?  TUXEDO::SWEENEY  Tom Sweeney in LKG  Wed Mar 05 1997 14:14  8
Hi,

	My last reply was almost two weeks ago, and that
too was a call for help.  Does anyone know what's going 
on in .5 of this thread?  I've given you all the input that
you've asked for, but don't know where else to turn now.

	tom
1795.8  out of rev hardware???  AFW4::CLEMENCE  Fri Mar 07 1997 16:39  11
I'm not too familiar with how to read the thread, but I would do this:

At the console prompt type 'show config' the MC adapters will give a rev
number. The modules should be at rev 'b' or '8b' or greater. If either one is 
not then replace it....


	If this doesn't work then replace the hub box..... (I don't know how to
read the rev of that thing)

1795.9  Thanks!  TUXEDO::SWEENEY  Tom Sweeney in LKG  Mon Mar 10 1997 13:15  25
>At the console prompt type 'show config' the MC adapters will give a rev
>number. The modules should be at rev 'b' or '8b' or greater. If either one is 
>not then replace it....

	I believe both are at or above rev 'b'.  I checked that at one
point.  If that fails, should I really replace the entire hub?  Or just
the cards?

	I'm a little hesitant to do this as we experienced similar
problems when we had the cluster set up in virtual hub mode.  But to be
truthful, I can't recall if we went the virtual hub mode route
after or before we replaced the Memory channel cards in the machines.

	Ok, looks like my next steps are:

1)	Check the revs on the two adapter cards.
2)	If not in rev, replace them.

3)	If in rev, then switch the cluster to virtual hub mode.
4)	If virtual hub mode is successful, replace the entire hub?

5)	If virtual hub mode is not successful, punt once again.

	tom
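
The plan above can also be written down as a small decision function. This is only a sketch of the steps as listed: the function name and the wording of the returned actions are made up for illustration, and Python is used purely for convenience.

```python
from typing import Optional


def next_step(adapters_in_rev: bool, virtual_hub_ok: Optional[bool] = None) -> str:
    """Return the next action from the numbered plan.

    virtual_hub_ok is None until the virtual hub test has been run.
    """
    if not adapters_in_rev:
        return "replace the out-of-rev adapter cards"      # steps 1-2
    if virtual_hub_ok is None:
        return "switch the cluster to virtual hub mode"    # step 3
    if virtual_hub_ok:
        return "consider replacing the hub"                # step 4
    return "punt: the hub is probably not the cause"       # step 5


print(next_step(adapters_in_rev=True))
```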
	
1795.10  Still No Joy In DCE-ville  TUXEDO::SWEENEY  Tom Sweeney in LKG  Thu Mar 13 1997 14:36  26
I double checked the boards on SKULK and CELL and 
both are at rev b.  For giggles and grins I decided
to switch the cables on the hub, so the cluster node
id was switched between the two machines.  This seemed
to help slightly as the cluster stayed up for 24+ 
hours, but then it crashed.

To eliminate the hub as being the problem, I reset
the jumper pins, wired the machines in a virtual hub
configuration, and defined a tie-breaker disk.  They
passed all of the MC_* tests, and rebooted as a cluster.

Three hours later, we experienced the same problem 
AGAIN....

I don't think it's a hub problem.  The CCMAA cards and
cables have all been replaced.  Are there known problems
clustering between 2100 and 2100A?  

One of the engineers here said he had tried to mount a 
disk just moments before the crash.  Could I be suffering
from some kind of SCSI craziness?

Where to next?

Tom
1795.11  Could be the SCSI at that....  AFW4::CLEMENCE  Fri Mar 14 1997 09:32  23
>I don't think it's a hub problem.  The CCMAA cards and
>cables have all been replaced.  Are there known problems
>clustering between 2100 and 2100A?  

	I'll agree that it looks like it is not a hub problem. There are no 
known problems with that configuration that I know of...

>One of the engineers here said he had tried to mount a 
>disk just moments before the crash.  Could I be suffering
>from some kind of SCSI craziness?

	That could certainly be true. An improperly configured SCSI bus
could cause random crashes.

	During the development stages of MC I do recall a problem discovered
with a particular rev of KZPSA that sent voltages from one system to the
other, i.e. the PCI bus on the system that was turned off would get some voltage
via the clustered SCSI cable and confuse the MC adapters....


	I would check the revs of the KZPSAs and confirm your SCSI 
configuration is proper; not too long and terminated properly, ETC....
1795.12  SCSI here I go...  TUXEDO::SWEENEY  Tom Sweeney in LKG  Tue Mar 18 1997 09:42  6
Thanks for the reply,

	I'll go reverify the SCSI and KZPSA stuff on the two
machines then....

	t
1795.13  Still crashing.....  TUXEDO::SWEENEY  Tom Sweeney in LKG  Thu Mar 20 1997 12:07  643
>	I would check the revs of the KZPSAs and confirm your SCSI 
>configuration is proper; not too long and terminated properly, ETC....


	That all appears to be ok.  The rev on the boards is A10,
and according to the book that I have from the "TruCluster 
Software and Configuration Management" class that I took, 
everything is within spec for termination and length of cables.


	Just another random thought.  The 2100A is a dual-CPU
machine.  That shouldn't be a problem, right?

	Latest Crash data below.  Mean time between crashes,
12 hours.

	tom
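
For reference, the uptime before this crash can be read directly from the _boottime and _crashtime structs in the crash data below. A small Python sketch; the helper name `uptime_seconds` is made up here, and the values are copied from the dump:

```python
from datetime import datetime, timezone

# Values copied from the _boottime and _crashtime structs in the crash data.
boottime = {"tv_sec": 858775450, "tv_usec": 940864}
crashtime = {"tv_sec": 858815431, "tv_usec": 605120}


def uptime_seconds(boot, crash):
    """Elapsed time between two timeval-style dicts, in seconds."""
    whole = crash["tv_sec"] - boot["tv_sec"]
    frac = (crash["tv_usec"] - boot["tv_usec"]) / 1e6
    return whole + frac


up = uptime_seconds(boottime, crashtime)
crash_utc = datetime.fromtimestamp(crashtime["tv_sec"], tz=timezone.utc)
print(f"uptime: {up:.1f} s = {up / 3600:.2f} h, crashed at {crash_utc}")
```

That works out to roughly 11 hours of uptime, in line with the 12-hour figure above, with the crash at 23:50:31 UTC on 19 Mar 1997, a few minutes before the collection time recorded in the dump.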

#
# Crash Data Collection (Version 1.4)
#
_crash_data_collection_time: Wed Mar 19 18:54:46 EST 1997
_current_directory: /
_crash_kernel: /var/adm/crash/vmunix.30
_crash_core: /var/adm/crash/vmcore.30
_crash_arch: alpha
_crash_os: Digital UNIX
Digital UNIX
_host_version: Digital UNIX TruCluster V1.0 (Rev. 432); 03/12/96 18:37 
Digital UNIX V3.2G (Rev. 62); Mon Jan 27 12:09:15 EST 1997 
_crash_version: Digital UNIX TruCluster V1.0 (Rev. 432); 03/12/96 18:37 
Digital UNIX V3.2G (Rev. 62); Mon Jan 27 12:09:15 EST 1997 

_crashtime:  struct {
    tv_sec = 858815431
    tv_usec = 605120
} 
_boottime:  struct {
    tv_sec = 858775450
    tv_usec = 940864
} 
_config:  struct {
    sysname = "OSF1"
    nodename = "skulk.lkg.dec.com"
    release = "V3.2"
    version = "62"
    machine = "alpha"
} 
_cpu:  41 
_system_string:  0xffffffffff8010b8 = "AlphaServer 2100A 5/300" 
_ncpus:  1 
_avail_cpus:  1 
_partial_dump:  1 
_physmem(MBytes):  511 
_panic_string:  0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n" 
_paniccpu:  0 
_panic_thread:  0xfffffc000a219b80 
_preserved_message_buffer_begin: 
struct {
    msg_magic = 0x63061
    msg_bufx = 0xc72
    msg_bufr = 0xb66
    msg_bufc = "PCXAL keyboard, language English (American)

Alpha boot: available memory from 0x1022000 to 0x1fffe000
Digital UNIX V3.2G (Rev. 62); Mon Jan 27 12:09:15 EST 1997 
physical memory = 512.00 megabytes.
available memory = 495.85 megabytes.
using 1958 buffers containing 15.29 megabytes of memory
Firmware revision: 4.5
PALcode: OSF version 1.21
ibus0 at nexus
AlphaServer 2100A 5/300
cpu 0 EV-5 4mb b-cache
gpc0 at ibus0
pci0 at ibus0 slot 0
eisa0 at pci0
ace0 at eisa0
ace1 at eisa0
lp0 at eisa0
fdi0 at eisa0
fd0 at fdi0 unit 0
pci2000 at pci0 slot 3
psiop0 at pci2000 slot 1
Loading SIOP: script 1007a00, reg 81810000, data 406719a0
scsi0 at psiop0 slot 0
rz0 at scsi0 bus 0 target 0 lun 0 (DEC     RZ28D    (C) DEC 0008)
rz1 at scsi0 bus 0 target 1 lun 0 (DEC     RZ28D    (C) DEC 0008)
rz2 at scsi0 bus 0 target 2 lun 0 (DEC     RZ28D    (C) DEC 0008)
rz3 at scsi0 bus 0 target 3 lun 0 (DEC     RZ28D    (C) DEC 0010)
rz6 at scsi0 bus 0 target 6 lun 0 (DEC     RRD45   (C) DEC  0436)
tu0: DECchip 21040-AA: Revision: 2.4
tu0 at pci2000 slot 6
tu0: DEC TULIP Ethernet Interface, hardware address: 00-00-F8-22-E0-70
tu0: console mode: selecting 10Base5 (AUI) port
tu1: DECchip 21140-AA: Revision: 1.2
tu1 at pci2000 slot 7
tu1: DEC Fast Ethernet Interface, hardware address: 00-00-F8-01-80-5D
tu1: console mode: selecting 100BaseTX (UTP) port: half duplex: no link
vga0 at pci2000 slot 8
 1024x768 (S3TRIO  )
MC: probing Rev 11 unit 0 Board #1
	Adapter is in pci bus 0, slot 7
MC: Rev 11 unit 0 board #1 is primary adapter
Memory Channel jumpered as STD (connect to real hub) mode
rmspur0 at pci0 slot 7
pza0 at pci0 slot 8
pza0 firmware version: DEC  P01  A10   
scsi1 at pza0 slot 0
rz9 at scsi1 bus 1 target 1 lun 0 (DEC     RZ28M    (C) DEC 0616)
rz10 at scsi1 bus 1 target 2 lun 0 (DEC     RZ28M    (C) DEC 0616)
rz11 at scsi1 bus 1 target 3 lun 0 (DEC     RZ28M    (C) DEC 0568)
rz14 at scsi1 bus 1 target 6 lun 0 (DEC     RZ28M    (C) DEC 0616)
dli: configured
SuperLAT. Copyright 1993 Meridian Technology Corp. All rights reserved.
clubase: configured
Cluster Memory Channel primary adapter is online.
	Rev 11 adapter is the primary channel (pci bus 0, slot 7)
	connected to a real hub (STD) as node 0.
drd: configured.
dlmsl: configured
cnxagent: configured
dlm: configured.
memory channel thread init
checking for existing memory channel nodes
booting as primary memory channel node on mc0
memory channel software inited - node 0 on mc0
ccomsub: configured
mcnet: configured
cnxlock: acquired director lock: entering CNX_RUN state
rmerror_state_change: unit = 0  Err_reg = 0x1802 node = 2
memory channel status request from node 2
memory channel request from node 2
memory channel update request from node 2
memory channel - adding node 2
AM contacted from host at bus 1 target 5 lun 7
cnxagent: added node mccell
cnxagent: mcskulk is now a cluster member
dlm_agent: resuming lock activity
cnxagent: resuming
rmerror_int: Error_count = 1 unit = 0 Err_reg = 0xffffffffa0000000 Node = 0
panic (cpu 0): rmerror_int: fatal error and no alternate mc to failover

syncing disks... DUMP.prom: dev SCSI 0 2001 0 0 0 0 0, block 131072
DUMP.prom: dev SCSI 0 2001 0 0 0 0 0, block 131072
"
} 
_preserved_message_buffer_end: 
_kernel_process_status_begin: 
  PID	COMM
00000	kernel idle
00001	init
00003	kloadsrv
00019	update
03110	cfgmgr
01114	dxconsole
01153	csh
01154	rlogind
01165	csh
00150	syslogd
00152	binlogd
03283	rlogin
03284	rlogin
00219	portmap
00264	bssd
00266	cfgmgr
00267	rwhod
00268	cnxmond
00270	cnxpingd
00271	cnxagentd
00273	cnxmgrd
00355	aselogger
00365	aseagent
02508	rpc.lockd
02519	rpc.statd
00499	mountd
00506	nfsd
00508	nfsiod
00510	nfsiod
00511	nfsiod
00512	nfsiod
00513	nfsiod
00514	nfsiod
00515	nfsiod
00518	rpc.statd
00525	rpc.lockd
00528	asehsm
00582	sendmail
00637	snmpd
00638	os_mibs
00671	inetd
00676	cron
00772	lpd
00832	xdm
00879	Xdec
00881	getty
00882	asedirector
00973	xdm
03049	tractd
03068	submon
_kernel_process_status_end: 
_current_pid:  267 
_current_tid:  0xfffffc000a219b80 
_proc_thread_list_begin: 
thread 0xfffffc000a219b80 stopped at  [boot:1760 ,0xfffffc00003eb91c]	 Source not available
_proc_thread_list_end: 
_dump_begin: 
>  0 boot(0x0, 0x4, 0xfffffc0000334340, 0xfffffc0000200100, 0xfffffc00003a8eec)
["../../../../src/kernel/arch/alpha/machdep.c":1760, 0xfffffc00003eb91c]

   1 panic(s = 0xfffffc00005ad4e0 = "thread_block: interrupt level call")
["../../../../src/kernel/bsd/subr_prf.c":673, 0xfffffc00003aba38]
pcpu = 0x5
i = 3845864
bootopt = 7
mycpu = 6580472
spl = 5
prevcc = 18446739675666885876
nextcc = 0
timer = -4397563142416
limit = 168

   2 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1768, 0xfffffc00003ddef8]
thread = 0xfffffc000a219b80
new_thread = 0xfffffc000a219b80
mycpu = 0
myprocessor = (nil)
s = 5
pset = 0xfffffc00003e0884
prev = 0xfffffc00005e3f30

   3 thread_preempt(thread = 0xfffffc000a219b80, processor = 0xfffffc0000200100)
["../../../../src/kernel/kern/sched_prim.c":3515, 0xfffffc00003e089c]
s = 2
pri = 4110288
pset = 0xfffffc000064cf20

   4 boot(0x0, 0x0, 0xfffffc00005e3f30, 0xfffffc0000724000, 0xb66)
["../../../../src/kernel/arch/alpha/machdep.c":1704, 0xfffffc00003eb7f4]

   5 panic(s = 0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n")
["../../../../src/kernel/bsd/subr_prf.c":757, 0xfffffc00003abbf4]
pcpu = 0xffffffffa0252000
i = 3
bootopt = 6
mycpu = 0
spl = 3
prevcc = 18446739675667269852
nextcc = 18446739675667289380
timer = 32
limit = -4398042262236

   6 rm_reboot(0x0, 0x1, 0x0, 0xffffffffa0000000, 0x0) ["../../../../src/kernel/rm/rm_kern.c":5144,
0xfffffc00004e8300]

   7 rmerror_int(0xfffffc001fa06000, 0xffffffffa0000000, 0x0, 0x400, 0xfffffc0008000000)
["../../../../src/kernel/rm/rm_error.c":783, 0xfffffc00004ed830]

   8 rmspurISR(0xfffffc001fa06000, 0xfffffffffffffe51, 0x0, 0xfffffc00003ee8f4, 0xfffffc00003eb5e0)
["../../../../src/kernel/io/dec/pci/rm_spur.c":3349, 0xfffffc00004cee78]

   9 intr_dispatch_post(0xfffffc001fe42fc0, 0xfffffc0000619910, 0x1, 0xfffffc00003eb5e0, 0x7f)
["../../../../src/kernel/arch/alpha/hal/shared_intr.c":238, 0xfffffc000040eda8]

  10 _XentInt(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/locore.s":934, 0xfffffc00003e86bc]

  11 swap_ipl(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/spl.s":131, 0xfffffc00003f8d40]

  12 outputWire(tap = 0xfffffc00007114a0, flags = 17, callID = -4397533396240, userIOVecCount = 0, ioBufs = (nil),
m = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0, type = 0, freeFunc = 0xfffffc0000540260, msgNum =
0xffffffffa085b020, callLevel = 4) ["../../../../src/kernel/rm/ccomsub.c":8706, 0xfffffc0000519424]
len = 128
s = 0
element = 0xffffffff802f6000
element0 = 0xfffffc0000473510
looper = 0
lenFrameChain = 1
frameNum = 1
currErrorNum = 1
firstErrorNum = 1
secondErrorNum = -2147483647
noData = 0
isMsg = 1
queuePtr = 0xfffffc0000711648
num_sent = 36401
hp = 0xffffffffa04fa000
lh = struct {
    seqNum = 36401
    replyNum = 18446744073709551615
    frameNum = 1
    type = 0
    synched = 0
    checksum = 0
    len = 128
    numVectors = 1
}

  13 sendWire(tap = 0xfffffc00007114a0, flags = 17, callID = -4397533396240, countVecs = 0, sendBufs = (nil),
mbufs = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0, type = 0, reconstructMsg = (nil), freeFunc =
0xfffffc0000540260, order = 0, numReplies = 0, provideReplyFrames = 0, numReplyFramesPerReply = 0, replyFrames =
(nil), msgNum = 0xffffffffa085b020) ["../../../../src/kernel/rm/ccomsub.c":8199, 0xfffffc00005182a0]
tempMsgNum = -1
tempResult = 3882260
result = 0

  14 mcnet_output_internal(0xfffffc0002c50b00, 0xfffffc001e9582f0, 0xfffffc001e9582f0, 0x1, 0xfffffc0002c51000)
["../../../../src/kernel/io/dec/netif/if_rm.c":979, 0xfffffc0000540108]

  15 mcnet_output(0xfffffc0000646ad8, 0xfffffc001d66e340, 0xfffffc0002c50b00, 0xfffffc0002c50b8c,
0xfffffc000071cf38) ["../../../../src/kernel/io/dec/netif/if_rm.c":883, 0xfffffc000053ff50]

  16 ether_output(0xfffffc000071cf38, 0xfffffc0002c50b00, 0xfffffc001d66e340, 0xfffffc001dd1ea00,
0xfffffc00002e8988) ["../../../../src/kernel/net/if_ethersubr.c":963, 0xfffffc00003e1918]

  17 ip_output(0xfffffc0002c50b00, 0x0, 0xfffffc001d66e338, 0x20, 0x0)
["../../../../src/kernel/netinet/ip_output.c":524, 0xfffffc00002f8424]

  18 udp_output(inp = 0xfffffc00074d0278, m = 0xfffffc000a219b80, addr = 0xfffffc0002c51c00, control =
0xfffffc0000000000) ["../../../../src/kernel/netinet/udp_usrreq.c":989, 0xfffffc00003020f8]
ui = 0xfffffc0002c50b8c
len = 84
laddr = struct {
    s_addr = 0
}
error = 0

  19 udp_usrreq(so = 0xfffffc0002c50b00, req = 168, m = 0xfffffc0002c50b00, addr = 0x23ac, control = (nil))
["../../../../src/kernel/netinet/udp_usrreq.c":1100, 0xfffffc0000302368]
inp = 0xfffffc001d66e300
error = 2450924

  20 sosend(0xfffffc001f97fb00, 0xfffffc0002c51c00, 0xffffffffa085b6a0, 0xfffffc0002c50b00, 0x0)
["../../../../src/kernel/bsd/uipc_socket.c":1076, 0xfffffc00002566dc]

  21 sendit(0x1d4, 0xffffffffa085b728, 0x0, 0xffffffffa085b8b8, 0x2f1000800000)
["../../../../src/kernel/bsd/uipc_syscalls.c":785, 0xfffffc000025ab50]

  22 sendto(0xfffffc000a219210, 0xffffffffa085b8c8, 0xfffffc00003f97a8, 0x1, 0x0)
["../../../../src/kernel/bsd/uipc_syscalls.c":624, 0xfffffc000025a740]

  23 syscall(0x3, 0x140000508, 0x11ffffa14, 0xffffffffa0858000, 0x85)
["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc00003f9214]

  24 _Xsyscall(0x8, 0x3ff800f1168, 0x140008550, 0x5, 0x1400009e0)
["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc00003e8854]

_dump_end: 

warning: Files compiled -g3: parameter values probably wrong
_kernel_thread_list_begin: 
thread 0xfffffc001fe0e000 stopped at   [thread_run:2302 ,0xfffffc00003dea14]	 Source not available
thread 0xfffffc001fe0e400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe40800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe40c00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe41000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe41400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe41800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001fe41c00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94a000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94a400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94ac00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94b000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94b400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94b800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001e94bc00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001f95e000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001f95e400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001f95e800 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc001f95ec00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001f95f000 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc001f95f800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8a000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8a400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8a800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8ac00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8b000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8b400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8b800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb8bc00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb7e000 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc001cb7e800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb7ec00 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc001cb7f000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb7f400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb7f800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc001cb7fc00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061be000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061be400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061be800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061bec00 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc00061bf400 stopped at   [thread_block:1934 ,0xfffffc00003de228]	 Source not available
thread 0xfffffc00061bf800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061bf000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc00061bfc00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0004ed2000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0004ed2400 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0004ed2800 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0004ed2c00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0004ed3000 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
thread 0xfffffc0005e43c00 stopped at   [thread_block:1919 +0x28,0xfffffc00003de1b8]	 Source not available
_kernel_thread_list_end: 
_savedefp:  (nil) 
_kernel_memory_fault_data_begin:  
struct {
    fault_va = 0x0
    fault_pc = 0x0
    fault_ra = 0x0
    fault_sp = 0x0
    access = 0x0
    status = 0x0
    cpunum = 0x0
    count = 0x0
    pcb = (nil)
    thread = (nil)
    task = (nil)
    proc = (nil)
} 
_kernel_memory_fault_data_end:  
Invalid character in input
_uptime: 11.10 hours

paniccpu: 0x0 
machine_slot[paniccpu]: struct {
    is_cpu = 0x1
    cpu_type = 0xf
    cpu_subtype = 0x18
    running = 0x1
    cpu_ticks = {
        [0] 0x8bfc
        [1] 0x0
        [2] 0x14d72
        [3] 0x26d4487
        [4] 0x1b877
    }
    clock_freq = 0x400
    error_restart = 0x0
    cpu_panicstr = 0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n"
    cpu_panic_thread = 0xfffffc000a219b80
} 
tset machine_slot[paniccpu].cpu_panic_thread: 
Begin Trace for machine_slot[paniccpu].cpu_panic_thread: 
>  0 boot(0x0, 0x4, 0xfffffc0000334340, 0xfffffc0000200100, 0xfffffc00003a8eec)
["../../../../src/kernel/arch/alpha/machdep.c":1760, 0xfffffc00003eb91c]
   1 panic(s = 0xfffffc00005ad4e0 = "thread_block: interrupt level call")
["../../../../src/kernel/bsd/subr_prf.c":673, 0xfffffc00003aba38]
   2 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1768, 0xfffffc00003ddef8]
   3 thread_preempt(thread = 0xfffffc000a219b80, processor = 0xfffffc0000200100)
["../../../../src/kernel/kern/sched_prim.c":3515, 0xfffffc00003e089c]
   4 boot(0x0, 0x0, 0xfffffc00005e3f30, 0xfffffc0000724000, 0xb66)
["../../../../src/kernel/arch/alpha/machdep.c":1704, 0xfffffc00003eb7f4]
   5 panic(s = 0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n")
["../../../../src/kernel/bsd/subr_prf.c":757, 0xfffffc00003abbf4]
   6 rm_reboot(0x0, 0x1, 0x0, 0xffffffffa0000000, 0x0) ["../../../../src/kernel/rm/rm_kern.c":5144,
0xfffffc00004e8300]
   7 rmerror_int(0xfffffc001fa06000, 0xffffffffa0000000, 0x0, 0x400, 0xfffffc0008000000)
["../../../../src/kernel/rm/rm_error.c":783, 0xfffffc00004ed830]
   8 rmspurISR(0xfffffc001fa06000, 0xfffffffffffffe51, 0x0, 0xfffffc00003ee8f4, 0xfffffc00003eb5e0)
["../../../../src/kernel/io/dec/pci/rm_spur.c":3349, 0xfffffc00004cee78]
   9 intr_dispatch_post(0xfffffc001fe42fc0, 0xfffffc0000619910, 0x1, 0xfffffc00003eb5e0, 0x7f)
["../../../../src/kernel/arch/alpha/hal/shared_intr.c":238, 0xfffffc000040eda8]
  10 _XentInt(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/locore.s":934, 0xfffffc00003e86bc]
  11 swap_ipl(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/spl.s":131, 0xfffffc00003f8d40]
  12 outputWire(tap = 0xfffffc00007114a0, flags = 0x11, callID = 0xfffffc001e9582f0, userIOVecCount = 0x0, ioBufs
= (nil), m = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0x0, type = 0x0, freeFunc = 0xfffffc0000540260, msgNum =
0xffffffffa085b020, callLevel = 0x4) ["../../../../src/kernel/rm/ccomsub.c":8706, 0xfffffc0000519424]
  13 sendWire(tap = 0xfffffc00007114a0, flags = 0x11, callID = 0xfffffc001e9582f0, countVecs = 0x0, sendBufs =
(nil), mbufs = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0x0, type = 0x0, reconstructMsg = (nil), freeFunc =
0xfffffc0000540260, order = 0x0, numReplies = 0x0, provideReplyFrames = 0x0, numReplyFramesPerReply = 0x0,
replyFrames = (nil), msgNum = 0xffffffffa085b020) ["../../../../src/kernel/rm/ccomsub.c":8199, 0xfffffc00005182a0]
  14 mcnet_output_internal(0xfffffc0002c50b00, 0xfffffc001e9582f0, 0xfffffc001e9582f0, 0x1, 0xfffffc0002c51000)
["../../../../src/kernel/io/dec/netif/if_rm.c":979, 0xfffffc0000540108]
  15 mcnet_output(0xfffffc0000646ad8, 0xfffffc001d66e340, 0xfffffc0002c50b00, 0xfffffc0002c50b8c,
0xfffffc000071cf38) ["../../../../src/kernel/io/dec/netif/if_rm.c":883, 0xfffffc000053ff50]
  16 ether_output(0xfffffc000071cf38, 0xfffffc0002c50b00, 0xfffffc001d66e340, 0xfffffc001dd1ea00,
0xfffffc00002e8988) ["../../../../src/kernel/net/if_ethersubr.c":963, 0xfffffc00003e1918]
  17 ip_output(0xfffffc0002c50b00, 0x0, 0xfffffc001d66e338, 0x20, 0x0)
["../../../../src/kernel/netinet/ip_output.c":524, 0xfffffc00002f8424]
  18 udp_output(inp = 0xfffffc00074d0278, m = 0xfffffc000a219b80, addr = 0xfffffc0002c51c00, control =
0xfffffc0000000000) ["../../../../src/kernel/netinet/udp_usrreq.c":989, 0xfffffc00003020f8]
  19 udp_usrreq(so = 0xfffffc0002c50b00, req = 0xa8, m = 0xfffffc0002c50b00, addr = 0x23ac, control = (nil))
["../../../../src/kernel/netinet/udp_usrreq.c":1100, 0xfffffc0000302368]
  20 sosend(0xfffffc001f97fb00, 0xfffffc0002c51c00, 0xffffffffa085b6a0, 0xfffffc0002c50b00, 0x0)
["../../../../src/kernel/bsd/uipc_socket.c":1076, 0xfffffc00002566dc]
  21 sendit(0x1d4, 0xffffffffa085b728, 0x0, 0xffffffffa085b8b8, 0x2f1000800000)
["../../../../src/kernel/bsd/uipc_syscalls.c":785, 0xfffffc000025ab50]
  22 sendto(0xfffffc000a219210, 0xffffffffa085b8c8, 0xfffffc00003f97a8, 0x1, 0x0)
["../../../../src/kernel/bsd/uipc_syscalls.c":624, 0xfffffc000025a740]
  23 syscall(0x3, 0x140000508, 0x11ffffa14, 0xffffffffa0858000, 0x85)
["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc00003f9214]
  24 _Xsyscall(0x8, 0x3ff800f1168, 0x140008550, 0x5, 0x1400009e0)
["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc00003e8854]
End Trace for machine_slot[paniccpu].cpu_panic_thread: 

"cpu_data" is not an array
_stack_trace[0]_begin: 
>  0 boot(0x0, 0x4, 0xfffffc0000334340, 0xfffffc0000200100, 0xfffffc00003a8eec)
["../../../../src/kernel/arch/alpha/machdep.c":1760, 0xfffffc00003eb91c]
   1 panic(s = 0xfffffc00005ad4e0 = "thread_block: interrupt level call")
["../../../../src/kernel/bsd/subr_prf.c":673, 0xfffffc00003aba38]
   2 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1768, 0xfffffc00003ddef8]
   3 thread_preempt(thread = 0xfffffc000a219b80, processor = 0xfffffc0000200100)
["../../../../src/kernel/kern/sched_prim.c":3515, 0xfffffc00003e089c]
   4 boot(0x0, 0x0, 0xfffffc00005e3f30, 0xfffffc0000724000, 0xb66)
["../../../../src/kernel/arch/alpha/machdep.c":1704, 0xfffffc00003eb7f4]
   5 panic(s = 0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n")
["../../../../src/kernel/bsd/subr_prf.c":757, 0xfffffc00003abbf4]
   6 rm_reboot(0x0, 0x1, 0x0, 0xffffffffa0000000, 0x0) ["../../../../src/kernel/rm/rm_kern.c":5144,
0xfffffc00004e8300]
   7 rmerror_int(0xfffffc001fa06000, 0xffffffffa0000000, 0x0, 0x400, 0xfffffc0008000000)
["../../../../src/kernel/rm/rm_error.c":783, 0xfffffc00004ed830]
   8 rmspurISR(0xfffffc001fa06000, 0xfffffffffffffe51, 0x0, 0xfffffc00003ee8f4, 0xfffffc00003eb5e0)
["../../../../src/kernel/io/dec/pci/rm_spur.c":3349, 0xfffffc00004cee78]
   9 intr_dispatch_post(0xfffffc001fe42fc0, 0xfffffc0000619910, 0x1, 0xfffffc00003eb5e0, 0x7f)
["../../../../src/kernel/arch/alpha/hal/shared_intr.c":238, 0xfffffc000040eda8]
  10 _XentInt(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/locore.s":934, 0xfffffc00003e86bc]
  11 swap_ipl(0x0, 0xfffffc00003f8d44, 0xfffffc0000619910, 0x0, 0xffffffff802f6000)
["../../../../src/kernel/arch/alpha/spl.s":131, 0xfffffc00003f8d40]
  12 outputWire(tap = 0xfffffc00007114a0, flags = 17, callID = -4397533396240, userIOVecCount = 0, ioBufs = (nil),
m = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0, type = 0, freeFunc = 0xfffffc0000540260, msgNum =
0xffffffffa085b020, callLevel = 4) ["../../../../src/kernel/rm/ccomsub.c":8706, 0xfffffc0000519424]
  13 sendWire(tap = 0xfffffc00007114a0, flags = 17, callID = -4397533396240, countVecs = 0, sendBufs = (nil),
mbufs = 0xfffffc0002c50b00, frame = (nil), lenMsg = 0, type = 0, reconstructMsg = (nil), freeFunc =
0xfffffc0000540260, order = 0, numReplies = 0, provideReplyFrames = 0, numReplyFramesPerReply = 0, replyFrames =
(nil), msgNum = 0xffffffffa085b020) ["../../../../src/kernel/rm/ccomsub.c":8199, 0xfffffc00005182a0]
  14 mcnet_output_internal(0xfffffc0002c50b00, 0xfffffc001e9582f0, 0xfffffc001e9582f0, 0x1, 0xfffffc0002c51000)
["../../../../src/kernel/io/dec/netif/if_rm.c":979, 0xfffffc0000540108]
  15 mcnet_output(0xfffffc0000646ad8, 0xfffffc001d66e340, 0xfffffc0002c50b00, 0xfffffc0002c50b8c,
0xfffffc000071cf38) ["../../../../src/kernel/io/dec/netif/if_rm.c":883, 0xfffffc000053ff50]
  16 ether_output(0xfffffc000071cf38, 0xfffffc0002c50b00, 0xfffffc001d66e340, 0xfffffc001dd1ea00,
0xfffffc00002e8988) ["../../../../src/kernel/net/if_ethersubr.c":963, 0xfffffc00003e1918]
  17 ip_output(0xfffffc0002c50b00, 0x0, 0xfffffc001d66e338, 0x20, 0x0)
["../../../../src/kernel/netinet/ip_output.c":524, 0xfffffc00002f8424]
  18 udp_output(inp = 0xfffffc00074d0278, m = 0xfffffc000a219b80, addr = 0xfffffc0002c51c00, control =
0xfffffc0000000000) ["../../../../src/kernel/netinet/udp_usrreq.c":989, 0xfffffc00003020f8]
  19 udp_usrreq(so = 0xfffffc0002c50b00, req = 168, m = 0xfffffc0002c50b00, addr = 0x23ac, control = (nil))
["../../../../src/kernel/netinet/udp_usrreq.c":1100, 0xfffffc0000302368]
  20 sosend(0xfffffc001f97fb00, 0xfffffc0002c51c00, 0xffffffffa085b6a0, 0xfffffc0002c50b00, 0x0)
["../../../../src/kernel/bsd/uipc_socket.c":1076, 0xfffffc00002566dc]
  21 sendit(0x1d4, 0xffffffffa085b728, 0x0, 0xffffffffa085b8b8, 0x2f1000800000)
["../../../../src/kernel/bsd/uipc_syscalls.c":785, 0xfffffc000025ab50]
  22 sendto(0xfffffc000a219210, 0xffffffffa085b8c8, 0xfffffc00003f97a8, 0x1, 0x0)
["../../../../src/kernel/bsd/uipc_syscalls.c":624, 0xfffffc000025a740]
  23 syscall(0x3, 0x140000508, 0x11ffffa14, 0xffffffffa0858000, 0x85)
["../../../../src/kernel/arch/alpha/syscall_trap.c":519, 0xfffffc00003f9214]
  24 _Xsyscall(0x8, 0x3ff800f1168, 0x140008550, 0x5, 0x1400009e0)
["../../../../src/kernel/arch/alpha/locore.s":1094, 0xfffffc00003e8854]
_stack_trace[0]_end: 

_kdbx_sum_start:
Hostname : skulk.lkg.dec.com
cpu: AlphaServer 2100A 5/300	avail: 1
Boot-time:	Wed Mar 19 07:44:10 1997
Time:	Wed Mar 19 18:50:31 1997
Kernel : OSF1 release V3.2 version 62 (alpha)
_kdbx_sum_end:
_kdbx_swap_start:

       Swap device name              Size       In Use       Free
--------------------------------  ----------  ----------  ----------
/dev/rz0b                            200704k       7776k     192928k
                                      25088p        972p      24116p

/dev/rz2b                            200704k       6808k     193896k
                                      25088p        851p      24237p
--------------------------------  ----------  ----------  ----------
Total swap partitions:    2          401408k      14584k     386824k
                                      50176p       1823p      48353p
_kdbx_swap_end:
_kdbx_proc_start:
Addr        PID   PPID  PGRP  UID   NICE SIGCATCH P_SIG    Event       Flags
=========== ===== ===== ===== ===== ==== ======== ======== =========== ============
k0x1fe07210     0     0     0     0    0 00000000 00000000        NULL in sys
k0x1cb8d210     1     0     1     0    0 307a7eff 00000000        NULL in pagv exec
k0x1cb87210     3     1     2     0    0 00004006 00000000        NULL in pagv exec
k0x1cb82210    19     1    19     0    0 00002000 00000000        NULL in pagv
k0x04de9210  3110  3068  3068     0    0 00000000 00000000        NULL in pagv exec
k0x07283210  1114     1  1040     0    0 00000000 00000000        NULL in pagv
k0x07282210  1153  1154  1153  1331    0 01882003 00000000        NULL in pagv ctty exec
k0x0b9b4210  1154   671  1154     0    0 00084027 00000000        NULL in pagv exec
k0x062f7210  1165  1153  1165     0    0 00082002 00000000        NULL in pagv ctty exec
k0x1ccf9210   150     1   150     0    0 00086001 00000000        NULL in pagv
k0x07417210   152     1   152     0    0 00004001 00000000        NULL in pagv
k0x052fd210  3283  1165  3283     0    0 28089005 00000000        NULL in pagv ctty exec
k0x07256210  3284  3283  3283     0    0 20009005 00000000        NULL in pagv ctty
k0x1cb6b210   219     1   219     0    0 00080628 00000000        NULL in pagv
k0x1d52f210   264     1   264     0    0 00000000 00000000        NULL in pagv
k0x1cb6a210   266   264   264     0    0 00000000 00000000        NULL in pagv exec
k0x0a219210   267     1   267     0    0 00002001 00000000        NULL in pagv
k0x07201210   268     1   268     0  -24 70086000 00000000        NULL in pagv
k0x0a218210   270     1   270     0  -24 30086000 00000000        NULL in pagv
k0x07257210   271   270   270     0  -24 10004000 00000000        NULL in pagv exec
k0x1d52e210   273   268   268     0  -24 70004000 00000000        NULL in pagv exec
k0x069f9210   355     1   355     0   -5 00004000 00000000        NULL in pagv
k0x07200210   365     1   365     0   -5 00084603 00000000        NULL in pagv
k0x1eaa7210  2508     1  2508     0  -15 00002000 00000000        NULL in pagv
k0x1eaa6210  2519     1     0     0  -15 00002000 00000000        NULL in pagv
k0x069f1210   499     1   499     0    0 66006001 00000000        NULL in pagv
k0x059d9210   506     1   506     0    0 00000000 00000000        NULL in pagv
k0x04de8210   508     1   508     0    0 00000000 00000000        NULL in pagv
k0x059d8210   510   508   508     0    0 00000000 00000000        NULL in pagv
k0x05d7e210   511   508   508     0    0 00000000 00000000        NULL in pagv
k0x05d7f210   512   508   508     0    0 00000000 00000000        NULL in pagv
k0x0556f210   513   508   508     0    0 00000000 00000000        NULL in pagv
k0x0556e210   514   508   508     0    0 00000000 00000000        NULL in pagv
k0x06e74210   515   508   508     0    0 00000000 00000000        NULL in pagv
k0x06e75210   518     1     0     0    0 00002000 00000000        NULL in pagv ctty
k0x069f0210   525     1   525     0    0 00002000 00000000        NULL in pagv
k0x05a44210   528     1   528     0  -24 20004000 00000000        NULL in pagv
k0x05a45210   582     1     0     0    0 00086000 00000000        NULL in pagv
k0x062f6210   637     1   637     0    0 20004002 00000000        NULL in pagv
k0x069f8210   638     1   638     0    0 00004002 00000000        NULL in pagv
k0x052fc210   671     1   671     0    0 00086001 00000000        NULL in pagv
k0x1d306210   676     1   676     0    0 00002000 00000000        NULL in pagv
k0x059cd210   772     1   772     0    0 00084007 00000000        NULL in pagv
k0x1d307210   832     1     0     0    0 20084003 00000000        NULL in pagv
k0x059cc210   879   832   879     0   -2 00004003 00000000        NULL in pagv exec
k0x1cb86210   881     1   881     0    0 00000000 00000000        NULL in pagv ctty exec
k0x1ccf8210   882     1   882     0   -5 00004000 00000000        NULL in pagv
k0x1cb83210   973   832   973     0    0 20000002 00000000        NULL in pagv
k0x0b9b5210  3049     1  3049     0    0 00084001 00000000        NULL in pagv
k0x095c5210  3068     1  3068     0    0 00004003 00000000        NULL in pagv
_kdbx_proc_end:

Audit subsystem disabled

No audit data to be saved
#
_crash_data_collection_finished:

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: from skulk.lkg.dec.com by us1rmc.bb.dec.com (5.65/rmc-22feb94) id AA22726; Thu, 20 Mar 97 11:46:51
-0500
% Received: by skulk.lkg.dec.com; id AA02259; Thu, 20 Mar 1997 11:49:10 -0500
% Date: Thu, 20 Mar 1997 11:49:10 -0500
% From: system PRIVILEGED account <[email protected]>
% Message-Id: <[email protected]>
% To: tuxedo::sweeney
% Subject: crash-data
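
A side note on reading the trace in the crash data above: it contains two panic() frames. Frame 5 is the rmerror_int panic, and frame 1 is a second panic ("thread_block: interrupt level call") raised while boot() was already handling the first one. A minimal sketch, assuming the trace has been saved as plain text in the layout shown above, that lists the panic strings oldest first so the original one is not mistaken for the later one. The regular expression and the helper name panic_frames are illustrative only and are not part of kdbx or any Digital UNIX tool.

```python
import re

# A few frames copied from the stack trace above (raw string, so the
# literal "\n" inside the panic message is kept as printed by the dump).
TRACE = r'''>  0 boot(0x0, 0x4, 0xfffffc0000334340, 0xfffffc0000200100, 0xfffffc00003a8eec)
   1 panic(s = 0xfffffc00005ad4e0 = "thread_block: interrupt level call")
   4 boot(0x0, 0x0, 0xfffffc00005e3f30, 0xfffffc0000724000, 0xb66)
   5 panic(s = 0xfffffc00005e3f30 = "rmerror_int: fatal error and no alternate mc to failover\n")
'''

# Matches lines such as:   5 panic(s = 0x... = "message")
PANIC_FRAME = re.compile(
    r'^[ >]*(\d+) panic\(s = 0x[0-9a-f]+ = "(.*?)"\)', re.MULTILINE
)


def panic_frames(trace):
    """Return (frame_number, message) pairs, oldest panic first.

    Higher frame numbers sit deeper in the stack, so they were called
    earlier; sorting in descending order puts the original panic first.
    """
    found = [(int(num), msg) for num, msg in PANIC_FRAME.findall(trace)]
    return sorted(found, reverse=True)


for number, message in panic_frames(TRACE):
    print(f"frame {number}: {message}")
```

The first line printed is the panic to chase; the second looks like a side effect of the shutdown path and is probably not a separate problem.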
1795.14HW revs not SW revsAFW4::CLEMENCETue Apr 08 1997 09:5719
>>	I would check the revs of the KZPSAs and confirm your SCSI 
>>configuration is proper; not too long and terminated properly, ETC....
>
>
>	That all appears to be ok.  The rev on the boards is A10,
>and according to the book that I have from the "TruCluster 
>Software and Configuration Management" class that I took, every
>thing is within spec for termination and length of cables.

Sorry, I meant the HW rev on the module, like b03 or something. Software
rev a10 could also be loaded into the modules that we were seeing the problems
on. There was no way in the software to check whether that power ECO was done.

	To play it safe, and since it has been a year now with your systems not
running (March 1996 to March 1997), why don't you just replace all your KZPSAs
with the latest (HW) rev modules?


						Bill
1795.15Just checked the setupTUXEDO::SWEENEYTom Sweeney in LKGTue Apr 08 1997 17:4611
Thanks for the reply.

	As it happens, I just came back from the lab after 
rechecking everything one more time.  Everything looks 
perfectly fine, but the machine is still crashing.  I think
I'll try what you suggested, Bill, and have the KZPSAs 
replaced.  I might do the CCMAAs while I'm at it.  I don't
know what else the problem could be at this point.

	t 
	
1795.16Now What?TUXEDO::SWEENEYTom Sweeney in LKGFri May 02 1997 18:0462
Hi Again,

	Well, it took us two weeks to schedule a time for the customer
service guy to come in and replace the memory channel cards and the
KZPSAs on both Skulk and Cell, as suggested here in earlier notes.  After they
were replaced, the cluster came up fine and ran well for the day.

	But then, 14 hours after coming up, it crashed.  Skulk itself crashed
and sent a panic to Cell.  Skulk came back up, but Cell remained in the panic
state and completely inaccessible until I powered it down and back up.  

	There were no crash logs on Cell.  Skulk had the following entry in
its crash log.



Cluster Memory Channel primary adapter is online.
        Rev 11 adapter is the primary channel (pci bus 0, slot 7)
        connected to a real hub (STD) as node 0.
clubase: configured

drd: configured.
dlmsl: configured
cnxagent: configured
dlm: configured.
memory channel thread init
checking for existing memory channel nodes
booting as primary memory channel node on mc0
memory channel software inited - node 0 on mc0
ccomsub: configured
mcnet: configured
cnxlock: acquired director lock: entering CNX_RUN state
cnxagent: mcskulk is now a cluster member
cnxagent: resuming
dlm_agent: resuming lock activity
ISR: LOSING CONNECTION WITH HUB (primary adapter)
rmerror_int: Error_count = 1 unit = 0 Err_reg = 0xffffffff80000000 Node = 0
panic (cpu 0): rmerror_int: fatal error and no alternate mc to failover

syncing disks... done
DUMP.prom: dev SCSI 0 2001 0 0 0 0 0, block 131072
DUMP.prom: dev SCSI 0 2001 0 0 0 0 0, block 131072
"
}
_preserved_message_buffer_end:
_kernel_process_status_begin:
  PID   COMM
00000   kernel idle
000


	What's my next step?  Having a cluster used to be a nice-to-have item
for us.  It is now becoming a necessity.  I must get this thing up on its feet.
Crashing every 10 to 16 hours is not acceptable, especially since one of the
nodes will not come up by itself.

	Thanks!

	Tom Sweeney
	DCE Engineering
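
For comparing the Err_reg values between crashes: the value in the log above (0xffffffff80000000) and the one in the topic title (0xffffffffa0000002) both have the upper 32 bits set to all ones, which is consistent with a 32-bit register value that was sign-extended to 64 bits when printed. That is an assumption based only on the printed values, not on the rm driver source. A minimal sketch that strips the extension and lists which bit positions are set, so the two crashes can be compared. It does not assign meanings to the bits; those would have to come from the driver source or the adapter documentation. The function names are illustrative.

```python
def low32(value):
    """Keep only the low 32 bits of a 64-bit printed value."""
    return value & 0xFFFFFFFF


def looks_sign_extended(value):
    """True if the upper 32 bits are copies of bit 31."""
    upper = value >> 32
    bit31 = (value >> 31) & 1
    return upper == (0xFFFFFFFF if bit31 else 0)


def set_bits(reg):
    """Return the positions of the set bits in a 32-bit value, high to low."""
    return [bit for bit in range(31, -1, -1) if (reg >> bit) & 1]


# Err_reg values as printed: topic title first, then the log above.
for raw in (0xffffffffa0000002, 0xffffffff80000000):
    reg = low32(raw)
    print(f"{raw:#018x} -> {reg:#010x}, bits set: {set_bits(reg)}")
```

Bit 31 is the only bit set in both values; bits 29 and 1 appear only in the earlier crash.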