| Title: | + OpenVMS Clusters - The best clusters in the world! + |
| Notice: | This conference is COMPANY CONFIDENTIAL. See #1.3 |
| Moderator: | PROXY::MOORE |
| Created: | Fri Aug 26 1988 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 5320 |
| Total number of notes: | 23384 |
In the past my customer had several problems with his
cluster configuration:
                              FDDI ring
                                  |
     +---------------------------------------------------------+
     |                       Gigaswitch                        |
     +---------------------------------------------------------+
FDDI    |       |       |        |        |       |        |
        |       |       |        |        |       |        |
     +-----+ +-----+ +-----+ +---------------+ +------+ +------+
     |V6610| |V6640| |V7620| |AS1000| |AS1000| |AS2100| |AS2100|
     +-----+ +-----+ +-----+ |---+--+ +---+--| +------+ +------+
        |       |       |    |   | HS121  |  |
        |       |       |    |   |        |  |
        |     +---+     |    |   +--DSSI--+  |
CI      +-----|SC |-----+    |disks and tapes|
              +---+          +---------------+
                |
                |
          different HSJ
         disks and tapes
Each VAX system has 1 vote. None of the Alpha systems has a vote.
LOCKDIRWT = 1 and RECNXINTERVAL = 180 on every node.
The only NI interconnect is FDDI (no Ethernet). Each system is connected
to its own port on the Gigaswitch.
The three VAX systems are connected via CI to several HSJ controllers.
The system disk for the VAXes is located on an HSJ controller.
Each AS1000 system of the HS121 has its own system disk.
The HS121 serves the system disk for the two AS2100 systems via MSCP
(Booting by MOP).
FDDI controllers:           VAX    - DEMFA
                            AS1000 - EISA-FDDI
                            AS2100 - EISA-FDDI
Operating system versions:  VAX    - V6.2 (V6610 - V5.5-2)
                            Alpha  - V6.2
The systems are currently located in one computer room but will soon be
distributed across the customer's site for disaster tolerance (with a
second Gigaswitch).
Problems:
1.) When the customer interrupted the FDDI connection for a short time
(i.e. 5 seconds, to move the cable to another port), all other Alpha
systems crashed after a while (crash type unknown, probably CLUEXIT?).
This happened with a lower value for RECNXINTERVAL (80).
We expected the interrupted Alpha to continue operation if
the connection is reestablished within RECNXINTERVAL.
2.) When we disconnected the FDDI cable at one AS1000 system (of the HS121)
last week and reconnected it after 3 seconds, we saw that FDDI was
operating again after about 30 seconds (listening and learning
on the Gigaswitch port).
But the system (with the FDDI interruption) still hung
(until RECNXINTERVAL?) and crashed.
The second AS1000 of the HS121 hung for the same period and
resumed operation only once the first AS1000 had crashed.
None of the disks in the cluster that are served by the HS121 were
accessible until the crash.
We expected continuous availability of all HS121-connected disks in
the case of a single AS1000 failure.
3.) Last Monday the V6610 lost the connection to all Alpha systems. Three
minutes later the four Alphas crashed with CLUEXIT.
It seems that even after a short interruption of the FDDI connection
(a few seconds, well within RECNXINTERVAL) the connection is
not reestablished.
There are no logs on the Gigaswitch to determine how long the
FDDI interruption lasted.
If it lasted longer than 180 seconds, the three VAXes should survive
and the Alphas should crash.
If it lasted less than 180 seconds, we expect the
systems to reestablish their connections to each other and to continue
operation, not to crash. All nodes of the cluster should survive.
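The timing expectation in the three problems can be checked with simple
arithmetic. The sketch below is only a rough model, not the actual
connection manager logic: the helper names are hypothetical, and it assumes
the roughly 30 seconds of Gigaswitch listening and learning is added on top
of the time the cable was unplugged.

```python
RECNXINTERVAL = 180  # seconds; value quoted in the post


def effective_outage(cable_out_s, relearn_s):
    """Hypothetical helper: seconds the node is unreachable.

    Assumes the switch port relearn time starts after the cable
    is plugged back in, so the two intervals add up.
    """
    return cable_out_s + relearn_s


def within_recnxinterval(outage_s, recnxinterval_s=RECNXINTERVAL):
    """True if the outage ends before the reconnection interval expires."""
    return outage_s < recnxinterval_s


# Problem 2: cable out for 3 s, FDDI operating again about 30 s later.
outage = effective_outage(cable_out_s=3, relearn_s=30)
print(outage, within_recnxinterval(outage))
```

Under these assumptions the outage in problem 2 is about 33 seconds, far
below RECNXINTERVAL, which is why a crash was not expected there.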
Questions:
What happens/should happen if the operation of the HS121 stalls and the
two AS2100s cannot access their system disk (because it is served via
the HS121)?
What happens/should happen if one AS1000 system of the HS121 suffers a short
FDDI interruption and the link is reestablished within RECNXINTERVAL?
How does the DSSI contribute to/influence cluster transition?
What happens/should happen if one or both AS2100 systems are disconnected
from FDDI for a short while (link reestablished within RECNXINTERVAL)?
Suggestions for VOTES, LOCKDIRWT, ... for optimum availability of the
cluster?
(VAX and CI in one computer room, HS121 and AS2100s in another computer
room)
Should Ethernet be connected to the systems additionally to provide
a second NI?
Any help/idea/suggestion/answer highly appreciated.
Regards
Hermann
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 5317.1 | Could be talking over CI | ESSB::JNOLAN | John Nolan | Thu Jun 05 1997 15:47 | 6 |
Have you stopped SCS traffic on the CI? My memory of the interconnect
priority is CI, FDDI, DSSI, Ethernet. Maybe when a node becomes
unavailable on the FDDI, the VAXen switch over to using the
CI, leaving the Alphas to 'voluntarily leave the VMScluster'.
| 5317.2 | Add Interconnects... | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Thu Jun 05 1997 17:25 | 35 |
If you want availability, configure one or more additional
interconnects.

Right now, I'd expect that a disconnected/wedged Gigaswitch would
cause one or more `lobes' of the partitioned VMScluster to crash when
the connection is restored, if RECNXINTERVAL is exceeded. (Each lobe
with low/no votes will crash.) And given the current configuration,
the VAX systems likely maintained communications via the CI -- the
other Gigaswitch nodes will likely all crash if RECNXINTERVAL passes.

When timing the total disconnect time here, you'll want to factor in
any connection lag caused by the Gigaswitch -- if it takes some time
to reconfigure and restart communications, then you will have to
factor this into your calculations.

The HS121 is an OpenVMS Alpha system, and will react like any other
OpenVMS Alpha system in a VMScluster.

The failure of the HS121 (AlphaServer 1000) FDDI to restart after the
disconnected FDDI link is reconnected might point to a software
problem -- depending on the particular state of the system, I'd
normally expect to see it successfully restart operations after a
short FDDI outage. (You might want to pursue this through IPMT
channels.)

As for suggestions for parameters, I'd divide the two configurations
up into lobes, and I would tend to give the three VAX systems and the
two Sable (2100) systems votes. And I'd set EXPECTED_VOTES to five,
accordingly.

For best availability, you'll want to look at the MDF/BRS package.
And you'll want to provide a communications path for each lobe -- the
VAX systems have CI, but the AlphaServer systems could certainly use
another communications path in parallel to the Gigaswitch/FDDI.
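To make the vote suggestion concrete, here is a minimal sketch of the quorum
arithmetic for a split between the VAX lobe and the Alpha lobe. It uses the
documented OpenVMS rule quorum = (EXPECTED_VOTES + 2) / 2, truncated. The
function names are hypothetical, and EXPECTED_VOTES = 3 for the current
configuration is an assumption, since the base note does not state it.

```python
def quorum(expected_votes):
    """Quorum as OpenVMS derives it: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def lobes_with_quorum(lobe_votes, expected_votes):
    """Hypothetical helper: map each lobe name to True if it keeps quorum."""
    needed = quorum(expected_votes)
    return {name: votes >= needed for name, votes in lobe_votes.items()}


# Current setup: only the three VAXes vote (EXPECTED_VOTES = 3 is assumed).
current = lobes_with_quorum({"VAX": 3, "Alpha": 0}, expected_votes=3)

# Suggested setup: three VAXes plus the two AS2100s vote, EXPECTED_VOTES = 5.
proposed = lobes_with_quorum({"VAX": 3, "Alpha": 2}, expected_votes=5)

print(current)
print(proposed)
```

In both cases only the VAX lobe keeps quorum after a split, so the Alpha
lobe still depends on a second communications path to avoid being
partitioned in the first place.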