Title: | + OpenVMS Clusters - The best clusters in the world! + |
Notice: | This conference is COMPANY CONFIDENTIAL. See #1.3 |
Moderator: | PROXY::MOORE |
Created: | Fri Aug 26 1988 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 5320 |
Total number of notes: | 23384 |
In the past my customer had several problems with his cluster configuration:

                              FDDI ring
                                  |
  +----------------------------------------------------------------+
  |                           Gigaswitch                           |
  +----------------------------------------------------------------+
     |        |        |         |        |          |         |  FDDI
  +-----+  +-----+  +-----+   +---------------+   +------+  +------+
  |V6610|  |V6640|  |V7620|   |AS1000| |AS1000|   |AS2100|  |AS2100|
  +-----+  +-----+  +-----+   |---+--+ +---+--|   +------+  +------+
     |        |        |      |   |  HS121 |  |
     |      +---+      |      |   +--DSSI--+  |
     +--CI--|SC |------+      |disks and tapes|
            +---+             +---------------+
              |
different HSJ disks and tapes

Each VAX system has 1 vote. None of the Alpha systems has a vote.
LOCKDIRWT = 1 and RECNXINTERVAL = 180 for every CPU.

The only NI interconnect is FDDI (no Ethernet). Each system is connected to its own port of the Gigaswitch.
The three VAX systems are connected via CI to several HSJ controllers. The system disk for the VAXes is located on an HSJ controller.
Each AS1000 system of the HS121 has its own system disk. The HS121 serves the system disk for the two AS2100 systems via MSCP (booting by MOP).

FDDI controllers:
VAX - DEMFA
AS1000 - EISA-FDDI
AS2100 - EISA-FDDI

Operating system versions:
VAX - V6.2 (V6610 - V5.5-2)
ALPHA - V6.2

The systems are currently located in one computer room but will be distributed across the customer's site in the near future for disaster tolerance (with a second Gigaswitch).

Problems:

1.) When the customer interrupted the FDDI connection for a short time (i.e. 5 seconds to put the cable on another port), all other Alpha systems crashed after a while (crash type unknown, probably CLUEXIT???). This happened with a lower value for RECNXINTERVAL (80). We expected the interrupted Alpha to continue operation if the connection is reestablished within RECNXINTERVAL?!?

2.) When we disconnected the FDDI cable at one AS1000 system (of the HS121) last week and reconnected it after 3 seconds, we saw that FDDI was operating again after about 30 seconds (listening and learning of the Gigaswitch port). But the system (with the FDDI interruption) still hung (until RECNXINTERVAL???) and crashed. The second AS1000 of the HS121 hung for the same period and continued operation until the first AS1000 crashed. All disks within the cluster that are served by the HS121 were not accessible until the crash. We expected continuous availability of all HS121-connected disks in the case of one AS1000 failure.

3.) Last Monday the V6610 lost the connection to all Alpha systems. Three minutes later the four Alphas crashed with CLUEXIT.

It seems that even after a short interruption of the FDDI connection (a few seconds, well within RECNXINTERVAL) the connection is not reestablished.
There aren't any logs at the Gigaswitch to determine the duration of the FDDI interruption. If this period is beyond 180 seconds, the three VAXes should survive and the Alphas should crash. If the interruption is below 180 seconds, we expect the systems to reestablish connection to each other and to continue operation, not to crash. All nodes of the cluster should survive!?!

Questions:

What happens/should happen if the operation of the HS121 stalls and the two AS2100 systems are not capable of accessing their system disk (because it's served via the HS121)?
What happens/should happen if one AS1000 system of the HS121 suffers a short FDDI interruption and the link is reestablished within RECNXINTERVAL?
How does the DSSI contribute to/influence cluster transition?
What happens/should happen if one or two AS2100 systems are disconnected from FDDI for a short while (link reestablished within RECNXINTERVAL)?
Suggestions for VOTES, LOCKDIRWT, ... for optimum availability of the cluster? (VAX and CI in one computer room, HS121 and AS2100s in another computer room)
Should Ethernet be connected additionally to the systems to provide a second NI?

Any help/idea/suggestion/answer highly appreciated.

Regards
Hermann
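The timing expectation in problem 2 can be written down as simple arithmetic. The sketch below is a simplified model, not a description of how the connection manager actually times out: it only adds the cable-out time to the Gigaswitch listening/learning delay and compares the sum with RECNXINTERVAL. The function names are hypothetical; the numbers (3 seconds, about 30 seconds, 80 and 180) are the ones quoted in the note.

```python
def effective_outage_s(cable_out_s: float, switch_relearn_s: float) -> float:
    """Total time without FDDI traffic: physical disconnect plus the
    time the Gigaswitch port spends listening/learning (assumed additive)."""
    return cable_out_s + switch_relearn_s


def within_recnxinterval(outage_s: float, recnxinterval_s: float) -> bool:
    """True if the outage is shorter than RECNXINTERVAL (simplified model)."""
    return outage_s < recnxinterval_s


# Values from problem 2: 3 s cable pull, about 30 s port listening/learning.
outage = effective_outage_s(3, 30)

for recnx in (80, 180):
    status = "within" if within_recnxinterval(outage, recnx) else "beyond"
    print(f"outage {outage} s is {status} RECNXINTERVAL={recnx}")
```

Under this model the observed outage is well inside both RECNXINTERVAL settings, which is why the hang and crash in problem 2 look unexpected.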
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
5317.1 | Could be talking over CI | ESSB::JNOLAN | John Nolan | Thu Jun 05 1997 16:47 | 6 |
Have you stopped SCS traffic on the CI? As I remember it, the interconnect priority was CI, FDDI, DSSI, Ethernet. Maybe when the node becomes unavailable on the FDDI, the VAXen switch over to using the CI, leaving the Alphas to 'voluntarily leave the VMScluster'.
5317.2 | Add Interconnects... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Thu Jun 05 1997 18:25 | 35 |
If you want availability, configure one or more additional interconnects.

Right now, I'd expect a disconnected/wedged Gigaswitch to cause one or more `lobes' of the partitioned VMScluster to crash when the connection is restored, if RECNXINTERVAL is exceeded. (Each lobe with low/no votes will crash.) And given the current configuration, the VAX systems likely maintained communications via the CI -- the other Gigaswitch nodes will likely all crash if RECNXINTERVAL passes.

When timing the total disconnect time here, you'll want to factor in any connection lag caused by the Gigaswitch -- if it takes some time to reconfigure and restart communications, then you will have to factor this into your calculations.

The HS121 is an OpenVMS Alpha system, and will react like any other OpenVMS Alpha system in a VMScluster.

The failure of the HS121 (AlphaServer 1000) FDDI to restart after the disconnected FDDI link is reconnected might point to a software problem -- depending on the particular state of the system, I'd normally expect to see it successfully restart operations after a short FDDI outage. (You might want to pursue this through IPMT channels.)

As for suggestions for parameters, I'd divide the two configurations up into lobes, and I would tend to give the three VAX systems and the two Sable (2100) systems votes. And I'd set EXPECTED_VOTES to five, accordingly.

For best availability, you'll want to look at the MDF/BRS package. And you'll want to provide a communications path for each lobe -- the VAX systems have CI, but the AlphaServer systems could certainly use another communications path in parallel to the Gigaswitch/FDDI.
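To make the vote suggestion concrete, here is a small sketch of the quorum arithmetic. OpenVMS computes quorum as (EXPECTED_VOTES + 2) / 2, truncated to an integer. Everything else below is an assumption for illustration: the Alpha node labels are hypothetical, the "current" EXPECTED_VOTES of 3 is assumed from the three VAX votes, and the model only checks whether a lobe's votes reach quorum; it ignores quorum disks and the timing of cluster transitions.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(lobe_votes: int, expected_votes: int) -> bool:
    """True if a partitioned lobe still holds enough votes to keep running."""
    return lobe_votes >= quorum(expected_votes)


# Votes per node; Alpha node labels are hypothetical.
current = {"V6610": 1, "V6640": 1, "V7620": 1,
           "AS1000_A": 0, "AS1000_B": 0, "AS2100_A": 0, "AS2100_B": 0}
# Suggested: also give each AS2100 one vote, EXPECTED_VOTES = 5.
proposed = dict(current, AS2100_A=1, AS2100_B=1)

vax_lobe = ["V6610", "V6640", "V7620"]
alpha_lobe = ["AS1000_A", "AS1000_B", "AS2100_A", "AS2100_B"]

for name, votes, expected in (("current", current, 3), ("proposed", proposed, 5)):
    vax_votes = sum(votes[n] for n in vax_lobe)
    alpha_votes = sum(votes[n] for n in alpha_lobe)
    print(f"{name}: quorum={quorum(expected)} "
          f"VAX lobe ok={has_quorum(vax_votes, expected)} "
          f"Alpha lobe ok={has_quorum(alpha_votes, expected)}")
```

In this simplified model the VAX lobe holds quorum in both setups and the Alpha lobe on its own does not. With five votes, however, the cluster as a whole tolerates the loss of up to two voting nodes, for example four remaining votes against a quorum of three.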