
Conference spezko::cluster

Title:+ OpenVMS Clusters - The best clusters in the world! +
Notice:This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:PROXY::MOORE
Created:Fri Aug 26 1988
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:5320
Total number of notes:23384

5240.0. "MC shows offline" by CSC32::S_DANNEN (Live long and slobber) Fri Feb 28 1997 14:35

    I have been working with a customer and Field Engineer on a brand new
    Alpha 4100 cluster running V7.1. The memory channel comes up as
    "offline". Each system is configured identically with CIPCA, CCMAA-AA,
    100Mbit Ethernet, 1.5GB memory. They share a common system disk off
    of an HSJ50 (they have several of these as well), and there is a quorum
    disk, too. The MC_SERVICES_Pn parameters are set correctly, and the MC and PM drivers
    do get loaded. An SDA> show ports lists PMA as valid. The SCA path
    between these machines is the CIPCA. They have PEDRIVER loaded, but
    that is not being used. There is no MC hub, just point to point 3m
    cable. 
    
    I am looking for any ideas that you folks may have as to why the MC
    shows "offline". An SDA> show device MCA0 or PMA0 has 00000000's for
    device status. This is the first "problem" call we have received here
    in the CSC for MC on OpenVMS. Are there any diagnostics or anything
    we can use to try to figure this out?
    
    Thanks,
    Steve  
T.R  Title  User  Personal Name  Date  Lines
5240.1. by EEMELI::MOSER (Orienteers do it in the bush...) Sun Mar 02 1997 12:58, 18 lines
    Since you have a virtual hub, one is the master and one is the slave.
    Have you correctly set the jumpers on the MC board (CCMAA)? You could
    check this from the running system via SDA> show dev mca0/pdt, and do
    this on both machines. One should be the virtual master and one the
    virtual slave.
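
As a rough illustration of that check, the sketch below (Python on a workstation, not something that runs under VMS) takes the adapter role noted down for each of the two nodes and confirms that the pair has exactly one master and one slave. The role strings are expanded from the "VIRTUAL_HUB_MASTER/SLAVE" wording in the MCSHOW listing in 5240.7; whether SDA prints them in exactly this form is an assumption, and the function name is hypothetical.

```python
# Hypothetical helper: the roles are typed in by hand from each node's display.
# The two strings are an assumed expansion of "VIRTUAL_HUB_MASTER/SLAVE".
VIRTUAL_ROLES = {"VIRTUAL_HUB_MASTER", "VIRTUAL_HUB_SLAVE"}

def virtual_hub_roles_ok(role_a, role_b):
    """True when a two-node, hubless pair has one master and one slave."""
    return {role_a, role_b} == VIRTUAL_ROLES

print(virtual_hub_roles_ok("VIRTUAL_HUB_MASTER", "VIRTUAL_HUB_SLAVE"))
print(virtual_hub_roles_ok("VIRTUAL_HUB_MASTER", "VIRTUAL_HUB_MASTER"))
```

Two masters or two slaves both fail the comparison, which is the mis-jumpered case described above.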
    
    You mention that the MC_xxx sysgen parameters are set correctly. By
    this I assume you mean that all 'current' values are the same as the
    'default', right?
    
    As a next step, I would do the following:
    - power-cycle both machines
    - then boot first the virtual hub master system all the way up
    - check that MCA0 is 'unavailable'
    - now boot the virtual hub slave
    - both MCA0 should become online
    
    /cmos
5240.2. by CSC32::S_DANNEN (Live long and slobber) Mon Mar 03 1997 09:00, 12 lines
    Yes, all of the MC_SERVICES_Pn are at default, except _P7. I was
    hoping that maybe this would give us some additional verbiage during
    boot to try to determine the cause of the problem. I will ask the
    Field Engineer to go on-site and verify the jumpers, and use your
    procedure to verify that it is working and post a reply here.
    
    Would anyone have any information as to whether the following are
    available on-line?
    EK-PCIRM-SV-A01 & EK-PCIRM-UG-A01
    
    Thanks!
    Steve
5240.3. "Almost working now, but..." by CSC32::S_DANNEN (Live long and slobber) Fri Mar 14 1997 18:36, 33 lines
        It turns out that the jumpers were set incorrectly. The local
        Field Engineers set them up for virtual master and virtual slave,
        and got this on boot:
        %CNXMAN,  Using local access method for quorum disk
        %CNXMAN,  Established "connection" to quorum disk
        %MCA0 CPU00:  Init retries exceeded, going off-line.
        %MCA0 CPU00:  Too many initialization retries.
    
        The MCA0 from $ show device is again offline
    
        We then booted the master first, then the slave, and now we do
        indeed see the MCA0 and PMA0 in "Unavailable". A $ show clu/cont
        with LPORT shows the PMA0 port correctly. The customer rebooted
        the machines - out of sequence - slave first then master, and
        again the %MCA0 CPU00:  Too many initialization retries.
    
        Is this the intended behavior, to be forced to boot the master
        first, then the slave, to get these things to synch up and become
        usable?
    
        If it is, this really shoots holes in the "High availability model"
        for OpenVMS Clusters. Since the hardware is strapped for master and
        slave, I would hope that this would be automagic, and not require
        operator intervention to get things right.
    
        /steve
    
5240.4. by STAR::NCARR (Talk dates & features - but never together....) Mon Mar 17 1997 15:57, 13 lines
The boot order dependency for Memory Channel clusters without a hub (master first,
then the slave) is a temporary restriction.

This problem was found very late in testing, and does not occur with all AlphaServer
pairings. (It's system speed dependent.)


The restriction will be lifted in the soon-to-be-available Memory Channel patch
kit.

If this becomes a serious issue at your customer, loan them a proper Memory Channel
hub until the patch kit comes through. This problem does not occur with real hubs.
(Of course, I realise that hubs are not exactly thick on the ground....) 
5240.5. "thanks, anticipating the patch" by CSC32::S_DANNEN (Live long and slobber) Tue Mar 18 1997 12:10, 7 lines
    Thanks Nick, the customer is not yet in production for this cluster
    so they can wait for the patch to become available. We are starting
    to see several calls come into the CSC recently about this boot
    dependency issue, and a few Field Engineers have been swapping the
    hardware in an attempt to correct the behavior.
    
    /steve
5240.6. by STAR::NCARR (Talk dates & features - but never together....) Tue Mar 18 1997 14:39, 2 lines
I should have added that this restriction is not documented anywhere. The patch kit
will be available before we could update the documentation....
5240.7. "MCSHOW.EXE" by EVMS::SCHUETZ (VMS Clusters Memory Channel 381-6075) Wed Apr 30 1997 18:33, 70 lines
    There exists a small utility that uses the call interface to show
    some basic information on Memory Channel.  If someone could host it,
    I'll MAIL/FOREIGN it to you.
    
    	$run MCSHOW.EXE
    
    
     Memory Channel information for community 0		0 is MCA, 1 is MCB etc
          device_name       = MCA
          adapter_type      = 49
          adapter_version   = e		"e" is V1.5 - CCMAA-AB	65MB
    					"b" is V1.0 - CCMAA-AA	35MB
          software_revision = 1
          PDT               = FFFFFFFF80CA5200
          adapter is          REAL_HUB_SLAVE	or VIRTUAL_HUB_MASTER/SLAVE
    >If this line is blank, you have a real hardware problem.
          node ID           = 4		0=Master, 1=Slave, else = slot on hub
          max nodes         = 0
          current state     = 66
          total error count = 0
          tags size         = 81920	function of MC_SERVICES_P3
          regions size      = 32768			       _P4	
          locks size        = 32768
          channels size     = 10526720	(_P6+56) * _P9 * 64 + some more
    >THIS IS REAL PHYSICAL MEMORY THAT GETS CONSUMED !
    >You can lower P6 to the SYSGEN MIN of 544, and P9 to 50 or less
    >if you're not pushing a lot of traffic to get back memory if
    >you need to.  In the TIMA kit, MC will go OFFLINE unless you have at
    >least 32 MB free AFTER MC is installed. i.e. don't put 2 adapters on
    >a 64 MB system.
          channel msg size  = 992	MC_SERVICES_P6
          sync addr ptr     = 0
                            = 3a
                            = 0
                            = 83c24000
                            = 2
                            = 0
    >On a hub, the node numbers come from which linecard you're plugged into.
          MC node 0 state is 0, name is , device .
          MC node 1 state is 1, name is BEEF    , device MCA.
          MC node 2 state is 0, name is , device .
          MC node 3 state is 1, name is TSTDOS  , device MCA.
          MC node 4 state is 1, name is LYNX03  , device MCA.
          MC node 5 state is 0, name is , device .
          MC node 6 state is 1, name is PRMMC4  , device MCA.
          MC node 7 state is 0, name is , device .
    
    From a fully-redundant cluster:
          MC node 0 state is 1, name is RAWHI2  , device MCA.
          MC node 1 state is 1, name is LYNX05  , device MCA.
          MC node 2 state is 0, name is , device .
          MC node 3 state is 1, name is PRMMC2  , device MCA.
          MC node 4 state is 1, name is PRMMC3  , device MCA.
          MC node 5 state is 1, name is FLAM29  , device MCA.
          MC node 6 state is 1, name is SABL5   , device MCA.
          MC node 7 state is 1, name is SABL6   , device MCA.
    
          MC node 0 state is 1, name is RAWHI2  , device MCB.
          MC node 1 state is 1, name is LYNX05  , device MCB.
          MC node 2 state is 0, name is , device .
          MC node 3 state is 1, name is PRMMC2  , device MCB.
          MC node 4 state is 1, name is PRMMC3  , device MCB.
          MC node 5 state is 1, name is FLAM29  , device MCB.
          MC node 6 state is 1, name is SABL5   , device MCB.
          MC node 7 state is 1, name is SABL6   , device MCB.
    Note that VMS does NOT require you to wire up MCB in the same order as
    MCA, but it sure makes maintenance simpler if you do.
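
If you capture the MCSHOW output to a file, the wiring comparison can be done mechanically. The following is only a sketch in Python, run off-line on the captured text: the regular expression is fitted to the "MC node ..." lines shown above, treating state 0 as an unused slot is a guess based on the blank names, and the function names are made up for the example.

```python
import re

# Fitted to lines such as:
#   MC node 1 state is 1, name is BEEF    , device MCA.
NODE_LINE = re.compile(
    r"MC node (\d+) state is (\d+), name is\s*([^\s,]*)\s*, device\s*([^\s.]*)\."
)

def parse_nodes(text):
    """Return {device: {node_number: name}} for entries with nonzero state."""
    table = {}
    for line in text.splitlines():
        m = NODE_LINE.search(line)
        if not m:
            continue
        node, state, name, device = int(m[1]), int(m[2]), m[3], m[4]
        if state != 0:  # assumption: state 0 means nothing is plugged in
            table.setdefault(device, {})[node] = name
    return table

def wiring_mismatches(table, dev_a="MCA", dev_b="MCB"):
    """Node numbers whose occupant differs between the two adapters."""
    a, b = table.get(dev_a, {}), table.get(dev_b, {})
    return sorted(n for n in set(a) | set(b) if a.get(n) != b.get(n))

# A few lines copied from the fully-redundant listing above.
SAMPLE = """
      MC node 0 state is 1, name is RAWHI2  , device MCA.
      MC node 1 state is 1, name is LYNX05  , device MCA.
      MC node 2 state is 0, name is , device .
      MC node 0 state is 1, name is RAWHI2  , device MCB.
      MC node 1 state is 1, name is LYNX05  , device MCB.
      MC node 2 state is 0, name is , device .
"""

print(wiring_mismatches(parse_nodes(SAMPLE)))
```

An empty list means MCB is wired in the same order as MCA; any node numbers returned are the slots where the two differ.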
    
    /Chris
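
To put numbers on the "channels size" remark in 5240.7, here is a small worked sketch in Python. It evaluates only the stated part of the formula, (_P6+56) * _P9 * 64, and leaves out the unspecified "some more". P6 = 992 is taken from the listing; the listing does not show P9, so the value 150 used here is an assumption picked for illustration.

```python
def channels_bytes(p6, p9):
    """Stated part of the formula only: (P6 + 56) * P9 * 64."""
    return (p6 + 56) * p9 * 64

reported = 10526720               # "channels size" from the listing
base = channels_bytes(992, 150)   # P6 from the listing; P9 = 150 is assumed
print(base)                       # stated part of the formula
print(reported - base)            # what is left over as "some more"

# The lowered settings suggested in the note: P6 = 544, P9 = 50.
print(channels_bytes(544, 50))
```

Under those assumptions the stated part comes to 10,060,800 bytes, about 466 KB short of the reported figure, while the lowered settings come to 1,920,000 bytes, roughly a fifth of the memory per adapter.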