
Conference spezko::cluster

Title:	+ OpenVMS Clusters - The best clusters in the world! +
Notice:	This conference is COMPANY CONFIDENTIAL. See #1.3
Moderator:	PROXY::MOORE

Created:	Fri Aug 26 1988
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	5320
Total number of notes:	23384

5317.0. "VAX-ALPHA-HS121-FDDI-GIGASWITCH cluster problems/question" by SUOBOS::SCHWIEZER () Wed May 28 1997 09:19

In the past my customer had several problems with his 
cluster configuration:

                                FDDI ring
                                    |
            +----------------------------------------------+
            |       Gigaswitch                             |
            +----------------------------------------------+
             |        |    |        |        |   |        |
 FDDI   +----+   +----+    |        |        |   +----+   +----+
        |        |         |        |        |        |        |
    +-----+   +-----+   +-----+   +---------------+ +------+ +------+
    |V6610|   |V6640|   |V7620|   |AS1000| |AS1000| |AS2100| |AS2100|
    +-----+   +-----+   +-----+   |---+--+ +---+--| +------+ +------+
       |         |         |      |   | HS121  |  |
       |         |         |      |   |        |  |
       |       +---+       |      |   +--DSSI--+  |
   CI  +-------|SC |-------+      |disks and tapes|
               +---+              +---------------+
                 |
                 |
           different HSJ
          disks and tapes


Each VAX system has 1 vote. None of the Alpha systems has a vote.
LOCKDIRWT = 1 and RECNXINTERVAL = 180 on every system.

The only NI interconnect is FDDI (no Ethernet). Each system is connected
to its own port of the Gigaswitch.

The three VAX systems are connected via CI to several HSJ controllers.
The system disk for the VAXes is located on an HSJ controller.

Each AS1000 system of the HS121 has its own system disk.
The HS121 serves the system disk for the two AS2100 systems via MSCP
(Booting by MOP).

FDDI controller: VAX    - DEMFA
                 AS1000 - EISA-FDDI
                 AS2100 - EISA-FDDI

Operating System Versions: VAX   - V6.2 (V6610 - V5.5-2)
                           ALPHA - V6.2

The systems are currently located in one computer room but will be
distributed across the customer's site in the near future for disaster
tolerance (with a second Gigaswitch).


Problems:

1.) When the customer interrupted the FDDI connection for a short time
    (i.e. 5 seconds to move the cable to another port) all other Alpha 
    systems crashed after a while (crash type unknown, probably CLUEXIT???).
    This happened with a lower value of RECNXINTERVAL (80).
    
    We expected the interrupted Alpha to continue operation if 
    the connection is reestablished within RECNXINTERVAL?!?


2.) When we disconnected the FDDI cable at one AS1000 system (of HS121)
    last week and reconnected it after 3 seconds we saw that FDDI was 
    operating again after about 30 seconds (listening and learning
    of gigaswitch port).
    But the system (with FDDI interruption) still hung 
    (until RECNXINTERVAL???) and crashed. 
    The second AS1000 of the HS121 hung for the same period and 
    resumed operation only once the first AS1000 had crashed.

    All disks within the cluster which are served by the HS121 were not 
    accessible until the crash.

    We expected continuous availability of all HS121 connected disks in
    the case of one AS1000 failure. 


3.) Last Monday the V6610 lost the connection to all Alpha systems. Three
    minutes later the four Alphas crashed with CLUEXIT.
    It seems that even for a short interruption of the FDDI connection 
    (within RECNXINTERVAL in the range of a few seconds) connection is 
    not reestablished. 
    There aren't any logs at the gigaswitch to determine the period
    for FDDI interruption.

    If this period is beyond 180 seconds the three vaxes should survive
    and the Alphas should crash.
    If the interruption interval is below 180 seconds we expect the
    systems to reestablish connection to each other and to continue 
    operation but not a crash. All nodes of the cluster should survive!?!
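
The timing argument in problems 1 to 3 can be sketched as simple arithmetic.
This is only a rough model, not a description of what the connection manager
actually does: it assumes the time a node is unreachable is the cable
interruption plus the Gigaswitch port's listening/learning delay, and that
the connection is expected to recover if that sum stays below RECNXINTERVAL.
The function names are made up for illustration, and the 30 second learning
delay is the figure observed in problem 2, assumed here to apply to
problem 1 as well.

```python
def effective_outage(cable_outage_s, switch_relearn_s):
    """Seconds a node is unreachable over FDDI under the simplified model."""
    return cable_outage_s + switch_relearn_s


def expected_outcome(cable_outage_s, switch_relearn_s, recnxinterval_s):
    """Return 'reconnect' if the outage fits inside RECNXINTERVAL, else 'timeout'."""
    outage = effective_outage(cable_outage_s, switch_relearn_s)
    if outage < recnxinterval_s:
        return "reconnect"
    return "timeout"


# Problem 2: 3 s cable pull, about 30 s listening/learning, RECNXINTERVAL = 180
print(expected_outcome(3, 30, 180))

# Problem 1: 5 s cable move, RECNXINTERVAL = 80
# (30 s learning delay is an assumption carried over from problem 2)
print(expected_outcome(5, 30, 80))
```

Under this model both cases stay well inside RECNXINTERVAL (33 and 35
seconds), which is why a reconnect rather than a crash was expected.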



Questions:

What happens/should happen if the operation of the HS121 stalls and the 
two other AS2100 are not capable of accessing their system disk (because
it's served via HS121)?

What happens/should happen if one AS1000 system of HS121 suffers a short 
FDDI interruption (and the link is reestablished within RECNXINTERVAL)?
How does the DSSI contribute to/influence cluster transition?

What happens/should happen if one or two AS2100 system/s are disconnected
from FDDI for a short while (link reestablished within RECNXINTERVAL)? 

Any suggestions for VOTES, LOCKDIRWT, ... for optimum availability of the
cluster?
(VAX and CI in one computer room, HS121 and AS2100's in another computer
room)

Should Ethernet be connected additionally to the systems to provide 
a second NI?


Any help/idea/suggestion/answer highly appreciated.

Regards
Hermann

T.R	Title	User	Personal Name	Date	Lines
5317.1	Could be talking over CI	ESSB::JNOLAN	John Nolan	Thu Jun 05 1997 15:47	6
	Have you stopped SCS traffic on the CI? My memory of the interconnect priority is CI, FDDI, DSSI, Ethernet. Maybe when the node becomes unavailable on the FDDI, the VAXen switch over to using the CI, leaving the Alphas to 'voluntarily leave the VMScluster'.
5317.2	Add Interconnects...	XDELTA::HOFFMAN	Steve, OpenVMS Engineering	Thu Jun 05 1997 17:25	35
	If you want availability, configure one or more additional interconnects.

	Right now, I'd expect to see a disconnected/wedged Gigaswitch would cause one or more `lobes' of the partitioned VMScluster to crash when the connection is restored, if RECNXINTERVAL is exceeded. (Each lobe with low/no votes will crash.) And given the current configuration, the VAX systems likely maintained communications via the CI -- the other Gigaswitch nodes will likely all crash if RECNXINTERVAL passes.

	When timing the total disconnect time here, you'll want to factor in any connection lag caused by the Gigaswitch -- if it takes some time to reconfigure and restart communications, then you will have to factor this into your calculations.

	The HS121 is an OpenVMS Alpha system, and will react like any other OpenVMS Alpha system in a VMScluster.

	The failure of the HS121 (AlphaServer 1000) FDDI to restart after the disconnected FDDI link is reconnected might point to a software problem -- depending on the particular state of the system, I'd normally expect to see it successfully restart operations after a short FDDI outage. (You might want to pursue this through IPMT channels.)

	As for suggestions for parameters, I'd divide the two configurations up into lobes, and I would tend to give the three VAX systems and the two Sable (2100) systems votes. And I'd set EXPECTED_VOTES to five, accordingly.

	For best availability, you'll want to look at the MDF/BRS package. And you'll want to provide a communications path for each lobe -- the VAX systems have CI, but the AlphaServer systems could certainly use another communications path in parallel to the Gigaswitch/FDDI.
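
	The vote arithmetic behind these suggestions can be sketched as follows. This is a minimal illustration, not output from any real system. It uses the usual VMScluster quorum rule, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction discarded. The EXPECTED_VOTES value of 3 for the current setup is an assumption (the base note does not give it), and the lobe names and the -A/-B node labels are invented for the example.

```python
def quorum(expected_votes):
    """VMScluster quorum: (EXPECTED_VOTES + 2) / 2, fraction discarded."""
    return (expected_votes + 2) // 2


def lobes_with_quorum(lobes, expected_votes):
    """Map each lobe name to True if its votes alone reach quorum."""
    q = quorum(expected_votes)
    return {name: sum(votes.values()) >= q for name, votes in lobes.items()}


# Current setup: only the VAXes vote; EXPECTED_VOTES = 3 is an assumption.
current = {
    "VAX/CI lobe": {"V6610": 1, "V6640": 1, "V7620": 1},
    "Alpha lobe": {"AS1000-A": 0, "AS1000-B": 0, "AS2100-A": 0, "AS2100-B": 0},
}

# Suggested setup: the three VAXes and both AS2100s vote, EXPECTED_VOTES = 5.
proposed = {
    "VAX/CI lobe": {"V6610": 1, "V6640": 1, "V7620": 1},
    "Alpha lobe": {"AS1000-A": 0, "AS1000-B": 0, "AS2100-A": 1, "AS2100-B": 1},
}

print(quorum(3), lobes_with_quorum(current, 3))
print(quorum(5), lobes_with_quorum(proposed, 5))
```

	In both cases only the VAX lobe holds quorum once the FDDI path between the lobes is gone, so votes alone do not keep the Alpha lobe running through a partition. That is consistent with the advice to give the AlphaServer systems a second communications path.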