
Conference spezko::cluster

Title: + OpenVMS Clusters - The best clusters in the world! +
Notice: This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator: PROXY::MOORE
Created: Fri Aug 26 1988
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 5320
Total number of notes: 23384

5317.0. "VAX-ALPHA-HS121-FDDI-GIGASWITCH cluster problems/question" by SUOBOS::SCHWIEZER () Wed May 28 1997 10:19

In the past my customer had several problems with his 
cluster configuration:

                                FDDI ring
                                    |
            +----------------------------------------------+
            |       Gigaswitch                             |
            +----------------------------------------------+
             |        |    |        |        |   |        |
 FDDI   +----+   +----+    |        |        |   +----+   +----+
        |        |         |        |        |        |        |
    +-----+   +-----+   +-----+   +---------------+ +------+ +------+
    |V6610|   |V6640|   |V7620|   |AS1000| |AS1000| |AS2100| |AS2100|
    +-----+   +-----+   +-----+   |---+--+ +---+--| +------+ +------+
       |         |         |      |   | HS121  |  |
       |         |         |      |   |        |  |
       |       +---+       |      |   +--DSSI--+  |
   CI  +-------|SC |-------+      |disks and tapes|
               +---+              +---------------+
                 |
                 |
           different HSJ
          disks and tapes


Each VAX system has 1 vote. None of the Alpha systems has a vote.
LOCKDIRWT = 1 and RECNXINTERVAL = 180 for every CPU.

The only NI interconnect is FDDI (no Ethernet). Each system is connected
to its own port on the Gigaswitch.

The three VAX systems are connected via CI to several HSJ controllers.
The system disk for the VAXes is located on the HSJ controllers.

Each AS1000 system of the HS121 has its own system disk.
The HS121 serves the system disk for the two AS2100 systems via MSCP
(Booting by MOP).

FDDI controller: VAX    - DEMFA
                 AS1000 - EISA-FDDI
                 AS2100 - EISA-FDDI

Operating System Versions: VAX   - V6.2 (V6610 - V5.5-2)
                           ALPHA - V6.2

The systems are currently located in one computer room but will be
distributed across the customer's site in the near future for disaster
tolerance (with a second Gigaswitch).


Problems:

1.) When the customer interrupted the FDDI connection for a short time
    (i.e. 5 seconds to move the cable to another port) all other Alpha 
    systems crashed after a while (crash type unknown, probably CLUEXIT???).
    This happened with a lower value for RECNXINTERVAL (80).
    
    We expected the interrupted Alpha to continue operation if 
    the connection is reestablished within RECNXINTERVAL?!?


2.) When we disconnected the FDDI cable at one AS1000 system (of HS121)
    last week and reconnected it after 3 seconds we saw that FDDI was 
    operating again after about 30 seconds (listening and learning
    of gigaswitch port).
    But the system (with FDDI interruption) still hung 
    (until RECNXINTERVAL???) and crashed. 
    The second AS1000 of the HS121 hung for the same period and 
    resumed operation only once the first AS1000 had crashed.

    All disks within the cluster which are served by the HS121 were not 
    accessible until the crash.

    We expected continuous availability of all HS121-connected disks in
    the case of one AS1000 failure. 


3.) Last Monday the V6610 lost the connection to all Alpha systems. Three
    minutes later the four Alphas crashed with CLUEXIT.
    It seems that even for a short interruption of the FDDI connection 
    (within RECNXINTERVAL, in the range of a few seconds) the connection
    is not reestablished. 
    There are no logs on the Gigaswitch from which to determine how long
    the FDDI interruption lasted.

    If this period is beyond 180 seconds the three VAXes should survive
    and the Alphas should crash.
    If the interruption interval is below 180 seconds we expect the
    systems to reestablish connection to each other and to continue 
    operation rather than crash. All nodes of the cluster should survive!?!



Questions:

What happens/should happen if the operation of the HS121 stalls and the 
two AS2100 systems cannot access their system disk (because it is served
via the HS121)?

What happens/should happen if one AS1000 system of the HS121 suffers a short 
FDDI interruption (link reestablished within RECNXINTERVAL)?
How does the DSSI contribute to/influence cluster transition?

What happens/should happen if one or two AS2100 system/s are disconnected
from FDDI for a short while (link reestablished within RECNXINTERVAL)? 

Suggestion for VOTES, LOCKDIRWT, ... for optimum availability of the
cluster?
(VAX and CI at one computer room, HS121 and AS2100's at another computer
room)

Should Ethernet additionally be connected to the systems to provide 
a second NI?


Any help/idea/suggestion/answer highly appreciated.

Regards
Hermann
5317.1. "Could be talking over CI" by ESSB::JNOLAN (John Nolan) Thu Jun 05 1997 16:47
    
      Have you stopped SCS traffic on the CI? My memory of interconnect
    priority is CI, FDDI, DSSI, Ethernet. Maybe when the node becomes
    unavailable on the FDDI, the VAXen switch over to using the CI,
    leaving the Alphas to 'voluntarily leave the VMScluster'.
    
5317.2. "Add Interconnects..." by XDELTA::HOFFMAN (Steve, OpenVMS Engineering) Thu Jun 05 1997 18:25
   If you want availability, configure one or more additional interconnects.

   Right now, I'd expect a disconnected/wedged Gigaswitch to
   cause one or more `lobes' of the partitioned VMScluster to crash when
   the connection is restored, if RECNXINTERVAL is exceeded.  (Each lobe
   with low/no votes will crash.)  And given the current configuration,
   the VAX systems likely maintained communications via the CI -- the
   other Gigaswitch nodes will likely all crash if RECNXINTERVAL passes.

   When timing the total disconnect time here, you'll want to factor in any
   connection lag caused by the Gigaswitch -- if it takes some time to
   reconfigure and restart communications, then you will have to factor
   this into your calculations.
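
   As a rough illustration of that timing budget, here is a minimal Python
   sketch. The function name is hypothetical, and the sample figures (3
   seconds with the cable unplugged, roughly 30 seconds of Gigaswitch port
   listening and learning) are simply the ones reported in problem 2 of .0;
   the real reconnection lag would have to be measured on site.

```python
def outage_within_recnxinterval(cable_down_s, switch_relearn_s, recnxinterval_s):
    """Return (total_outage_s, fits) for one interrupted FDDI link.

    total_outage_s adds the switch's listening/learning delay to the time
    the cable was actually unplugged; fits is True when that total is
    still below RECNXINTERVAL.
    """
    total = cable_down_s + switch_relearn_s
    return total, total < recnxinterval_s


# Figures from problem 2 in .0: 3 s unplugged, about 30 s port relearn.
print(outage_within_recnxinterval(3, 30, 180))  # (33, True)

# The same outage against the earlier RECNXINTERVAL of 80.
print(outage_within_recnxinterval(3, 30, 80))   # (33, True)
```

   This only budgets the link outage itself. It does not explain why a node
   might still hang or crash when the total stays below RECNXINTERVAL, which
   is the behaviour reported in .0.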

   The HS121 is an OpenVMS Alpha system, and will react like any other
   OpenVMS Alpha system in a VMScluster.

   The failure of the HS121 (AlphaServer 1000) FDDI to restart after the
   disconnected FDDI link is reconnected might point to a software
   problem -- depending on the particular state of the system, I'd
   normally expect to see it successfully restart operations after a
   short FDDI outage.  (You might want to pursue this through IPMT
   channels.)

   As for suggestions for parameters, I'd divide the two configurations
   up into lobes, and I would tend to give the three VAX systems and the
   two Sable (2100) systems votes.  And I'd set EXPECTED_VOTES to five,
   accordingly.
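
   To see what that vote assignment would mean if the Gigaswitch path
   partitions the cluster, here is a small Python sketch. It uses the
   standard OpenVMS quorum rule, quorum = (EXPECTED_VOTES + 2) / 2 with
   the result truncated to an integer. The helper names and the node
   labels are illustrative only, and giving the two AS1000 systems zero
   votes is an assumption inferred from EXPECTED_VOTES being five.

```python
def quorum(expected_votes):
    """OpenVMS quorum rule: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def lobe_has_quorum(lobe_votes, expected_votes):
    """True if the votes present in one lobe reach quorum."""
    return sum(lobe_votes.values()) >= quorum(expected_votes)


# Suggested assignment: three VAX systems and two AS2100s with one vote
# each; the two AS1000s (HS121) are assumed to have none.
# Node labels are placeholders, not real SCS node names.
vax_lobe = {"V6610": 1, "V6640": 1, "V7620": 1}
alpha_lobe = {"AS2100_A": 1, "AS2100_B": 1, "AS1000_A": 0, "AS1000_B": 0}

print(quorum(5))                       # 3
print(lobe_has_quorum(vax_lobe, 5))    # True
print(lobe_has_quorum(alpha_lobe, 5))  # False
```

   Under these assumptions the VAX lobe reaches quorum on its own, while
   the Alpha lobe by itself does not.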

   For best availability, you'll want to look at the MDF/BRS package.
   And you'll want to provide a communications path for each lobe -- the
   VAX systems have CI, but the AlphaServer systems could certainly use
   another communications path in parallel to the Gigaswitch/FDDI.