T.R | Title | User | Personal Name | Date | Lines |
---|
6652.1 | commit only supported | FRAIS::KHAN | | Tue May 06 1997 06:52 | 12 |
| I have had lots of discussions of this kind ... basically they need a
few cluster functionalities and do not want to buy the license.
Yes, one can run such a configuration, ignore this message, and take
care not to do this or that ... but that does not make it a supported
one!! In my opinion, the customer is trying to find someone who will
commit himself and carry the responsibility.
I do not argue with them. I just tell them (like your note .0) all the
technical reasons why we do not recommend such a configuration, and also
that with a cluster they can move into the 'supported' one.
I have seen such configurations running, even with VAXes. The problem is
that once the application is running 'hot' the customer reacts
differently ("I asked DIGITAL and they said yes ...").
|
6652.2 | I think it's dangerous, but it's their choice | SUBSYS::BROWN | SCSI and DSSI advice given cheerfully | Tue May 06 1997 09:34 | 31 |
| Although there is nothing to stop the customer from running an
unsupported configuration, and technically astute customers are
welcome to take informed risks, I sense that this customer doesn't
know what the risks are.
I don't sense that the customer knows enough about the SCSI protocol
and about host-based volume shadowing to appreciate all the issues,
but it would be prudent to summarize some of the obvious problems.
First, the customer shouldn't use DWZZBs with KZPAAs. KZPAAs are
already single-ended. Perhaps the customer intends to use KZPSAs?
Second, it would be prudent to use at least the VMS version that
would have been required for a supported configuration. Such a version
might have driver fixes and volume-shadowing fixes that would help
shared buses and failover work properly.
Third, the customer should realize that host-based shadowing stores
some information about the shadow-set on the host, not on the disk.
When you boot the second system, it won't know that the two disks
in the BA356 are supposed to be a shadow-set. If you mount them as
a shadow-set, the new shadow-set may not be in the same state as
the shadow set on the first system. Writes may not have completed,
master-slave relationships (rebuilds) may be lost, etc.
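A toy sketch may make this concrete. This is purely illustrative Python, not how HBVS is actually implemented, and all the names are invented; the point is only that some shadow-set state lives in host memory rather than on the members, so a second, non-clustered host that mounts the same disks sees a different picture.

```python
# Toy model: a host mirrors writes to two members of a "shadow set",
# but the list of outstanding copy operations lives only in that
# host's memory, not on the disks themselves.

class Disk:
    def __init__(self):
        self.blocks = {}           # block number -> data

class Host:
    """A host that mirrors writes to two members of a shadow set."""
    def __init__(self, members):
        self.members = members
        self.pending = []          # writes not yet on all members

    def write(self, block, data):
        # The write lands on member 0 immediately; the copy to
        # member 1 is still queued in this host's memory only.
        self.members[0].blocks[block] = data
        self.pending.append((1, block, data))

    def flush(self):
        for idx, block, data in self.pending:
            self.members[idx].blocks[block] = data
        self.pending = []

d0, d1 = Disk(), Disk()
host_a = Host([d0, d1])
host_a.write(7, "new payload")     # host A fails before flush()

# Host B boots, sees two plain disks, and mounts them as a fresh
# "shadow set" with no knowledge of host A's pending copies.
host_b = Host([d0, d1])
fresh = d0.blocks.get(7)           # "new payload"
stale = d1.blocks.get(7)           # None -- member 1 never got it
print(fresh, stale)                # members disagree
```

Which member host B ends up trusting for block 7 is exactly the kind of state the two systems cannot agree on without a cluster.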
In general, unsupported configurations work most of the time. The
two main reasons for not supporting some configurations are bugs
(known or anticipated) and lack of testing time. The problem with
this configuration is potential data loss or data corruption, depending
on the state of the shadow-set at the time of the failover. If your
customer is willing to take that risk, that's their call.
|
6652.3 | what to say? | SAYER::ELMORE | Steve [email protected] 4123645893 | Tue May 06 1997 10:33 | 17 |
| Based on .2, may I say that system "B" will NOT generate traffic on
the SCSI that could cause the other system to crash, or its disks to
become corrupted when system A has "control". Also, may I say that
system "A" can't crash system "B" ?
I understand the nature of "unsupported" configurations. I understand
the long term implication of that too. We can tell the customer that
we will not service it, we won't support it, we won't configure it. We
can write that in a contract. But, will it RUN under the circumstances
we've outlined?
I suppose what we are really asking is what events are happening on the
SCSI bus, if any, and how do the systems (hardware and OS) handle those
events? Will those events damage hardware or crash the OS?
Thanks,
Steve
|
6652.4 | specific problems to watch for | SUBSYS::BROWN | SCSI and DSSI advice given cheerfully | Tue May 06 1997 13:28 | 14 |
| In VMS V6.2, there were problems with one of the fast wide adapters
(KZPSA or QLogic, I forget which) being the target of an INQUIRY
command from the other system. With the KZPAA, that shouldn't be
a problem. The problem was fixed in 6.2-1H1 and 7.1.
Also, booting the second system will cause SCSI bus resets. We've seen
crashes with the KZPAA with lots of resets in a short period, when it
had a heavy I/O load, but the one or two resets generated during a boot
are very unlikely to cause problems. Still, since a reset can cause a
command in progress to fail, you should probably check in VMSNOTES to see
if bus resets can cause problems with HBVS.
Also, don't put tape drives on the shared bus. We haven't proved it
works, even in a cluster.
|
6652.5 | Partitioned VMScluster: Prelude to Data Corruption | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Tue May 06 1997 16:58 | 57 |
|
This configuration is a partitioned VMScluster.
Partitioned VMScluster configurations are bad configurations.
If you want to tell the customer something, tell them that we do not
recommend this configuration, and we have seen *massive* corruptions
result, and that the supported configuration involves a VMScluster,
or involves configuring disjoint (non-shared) SCSI buses.
: Based on .2, may I say that system "B" will NOT generate traffic on
: the SCSI that could cause the other system to crash, or its disks to
: become corrupted when system A has "control". Also, may I say that
: system "A" can't crash system "B" ?
I would not say anything of the kind.
I would expect each host would detect the other's SCSI controller on
the shared SCSI, and I would expect I might have to alter a few console
variables to keep the systems from squawking about the SCSI controllers.
: I understand the nature of "unsupported" configurations. I understand
: the long term implication of that too. We can tell the customer that
: we will not service it, we won't support it, we won't configure it. We
: can write that in a contract. But, will it RUN under the circumstances
: we've outlined?
It might run. It might randomly crash. It might randomly corrupt
the user and system data. And the customer gets to find all this
out -- whether this configuration works, and whether or not the
customer can correctly (and safely) manage this configuration.
I've already seen cases where the two nodes were incorrectly booted
from the same system root -- which is *very* easy to do in this
particular configuration -- and *massive* data corruptions resulted.
I will admit to having configured and run partitioned VMScluster
configurations -- which is what this is -- but I have also seen
these lead to trashed disks and random system crashes. Things
get very interesting during upgrades, too.
If a customer is asking the questions raised here, I'd recommend
against this configuration.
: I suppose what we are really asking is what events are happening on the
: SCSI bus, if any, and how do the systems (hardware and OS) handle those
: events? Will those events damage hardware or crash the OS?
You are headed off into dangerous territory, territory where the
VMScluster connection manager and the distributed lock manager were
explicitly designed to prevent just the sorts of problems your
customer may/will see with this configuration.
If the customer wants something like this, I'd look for a SCSI
bus switch or similar widget -- hardware that can prevent two
systems from being on the same bus at the same time...
|
6652.6 | Thanks | SAYER::ELMORE | Steve [email protected] 4123645893 | Tue May 06 1997 21:10 | 6 |
| Thank you all for the information. We'll tell the customer no.
I'd sure like to find a SCSI switch though. Anyone ever heard of one?
Thanks again,
Steve
|
6652.7 | | JACEK::waldek.rpw.dec.com::agatka::calka | Waldek Calka | Wed May 07 1997 08:46 | 6 |
| Have a look at ANCOT; my customer has 30 of them and they are
working. Check the ANNECY::WDX notes conference to learn more.
Regards/Waldemar
|
6652.8 | we do something like this every day for 10 years now. | EPS::VANDENHEUVEL | Hein | Thu May 08 1997 10:34 | 35 |
|
IMHO the tone of the replies in general, and Steve's in .5 in particular,
is overly pessimistic and avoids core technical questions like:
- what are the expected results from bus resets after reboots?
- does MOUNT reserve the target device on the SCSI bus?
- is it shadowing that makes this impossible, because critical
reconstruction data may be present on another disk on the other node?
(it couldn't be in memory, because shadowing should be able to cope with crashes)
The way I read .0, this customer perfectly understands that the same disk
should never, ever be mounted from two non-clustered systems at the same
time. And they show that they understand the consequences of getting it wrong.
It would seem to me that we can possibly simply tell them that:
- this is an unsupported configuration which we recommend against
- electrically it will work (no smoke)
- we would encourage them to buy a (minimal) cluster license, which
will solve their problem and add more value to boot, but
- we do expect this configuration to work just fine (don't we?!)
- it is ultimately their choice; we can never condone the config
fwiw, we have had a 'sneaker' disk here in the lab for the past 10+ years
which is dual-ported and mounted either on one cluster or on another
cluster, never on both at the same time. The HSCs + VMS actually make sure
that you cannot mount it twice. It never came close to any corruption.
VMS MOUNT on SCSI also 'reserves' the device, does it not?
If it does not, we may want to encourage the customer to come up
with a 'token' that the mount procedure looks for to be present
before continuing. I think that token _could_ be information on the
very disk. For example changing the volume label to reflect who's
got it. It could perhaps be a network name, or a dongle behind
a com port or whatever.
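The 'token' guard described above could be sketched roughly like this. This is a hypothetical Python stand-in for what would really be a site-specific DCL mount procedure; the token path and node names are invented, and a plain file stands in for the token (which might really be a volume label, a network name, or a dongle).

```python
# Hypothetical sketch of a mount-time token guard: before mounting
# the shared volume, a node checks a token recording who currently
# owns it, and claims it atomically.  The file here stands in for
# whatever real token the site chooses.
import os

TOKEN = "/tmp/shared_volume.owner"   # invented path for illustration

if os.path.exists(TOKEN):            # start clean for this demo
    os.remove(TOKEN)

def try_claim(node):
    """Claim the volume for `node`; fail if another node holds it."""
    try:
        # O_EXCL makes creation atomic: only one claimant can win.
        fd = os.open(TOKEN, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(TOKEN) as f:
            return False, f.read()   # someone else has it mounted
    with os.fdopen(fd, "w") as f:
        f.write(node)
    return True, node

def release(node):
    """Give the volume back, but only if we actually own it."""
    with open(TOKEN) as f:
        if f.read() == node:
            os.remove(TOKEN)

ok_a, owner = try_claim("NODEA")     # NODEA mounts the disk: succeeds
ok_b, owner_b = try_claim("NODEB")   # NODEB is refused; owner_b names NODEA
release("NODEA")                     # NODEA dismounts and frees the token
```

Of course such a token is purely advisory: nothing at the SCSI level enforces it, so an operator who bypasses the procedure can still mount the disk twice.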
2¢,
Hein
|
6652.9 | VMS MOUNTing on SCSI ***DOES NOT*** RESERVE (i.e. SCSI RESERVE command) the DEVICE. | STAR::WCLOGHER | | Thu May 08 1997 11:06 | 0 |
6652.10 | | LEFTY::CWILLIAMS | CD or not CD, that's the question | Thu May 08 1997 11:24 | 4 |
| It is so easy to screw this up that it is not worth the risk.
Just the bus scanning and resets from a reboot could cause problems.
Bad idea.
|
6652.11 | Removal of "Blade Guards"? | XDELTA::HOFFMAN | Steve, OpenVMS Engineering | Mon May 12 1997 11:05 | 22 |
| : fwiw, we have a 'sneaker' disk here in the lab for the past 10+ years
: which is dual ported and either mounted to one cluster or to an other
: cluster, never at the same time. The HSC's + VMS actually make sure
: that you can not mount it twice. Never came close to any corruption.
Which `sneaker' disk are you referring to? RA-series disks -- you say
HSC, so I'm assuming RAs -- are quite a bit different from SCSI here,
as they can be mounted from only one controller or the other.
I removed one of the major sets of RA-series `sneaker' disks used here
in OpenVMS engineering last year...
There is nothing to prevent this configuration from working as expected.
With HSJs in the same allocation class, one can operate fairly well.
But there is every reason to assume a small user screwup will massively
corrupt the user and system disks, and probably at some critical time
in the user's operations environment. (And with the multi-host SCSI
configuration, bootstrapping multiple systems off the same SYSn root
is trivially easy, and extremely dangerous...)
We _have_ a solution to this problem -- the VMScluster.
|