T.R | Title | User | Personal Name | Date | Lines |
---|
1848.1 | | usr505.zko.dec.com::Marshall | Rob Marshall | Tue Jan 28 1997 14:50 | 24 |
| > After this message, which should only mean that a disk has failed, the
> service should still be available. Not only was the service not
> automatically relocated, but it was not possible to stop the service
> or relocate it manually because it was in the wrong state.
> Changed the failed disk, everything returned to work properly.
> Do you have any ideas....?
Hi,
I'm not really sure what you mean by this. For one, a disk failure should not
cause the service to relocate; perhaps this is just a misunderstanding of how
ASE deals with different kinds of errors? Plus, when using LSM this should
not have prevented you from stopping/starting the service, unless the failure
is such that LSM hangs when trying to talk to the disk, which could cause the
lsm_dg_action script to time out. But that doesn't seem to be the case here.
Also, I'm not sure I understand how you tried to relocate the service. It
should always be possible to set a service offline and then online it.
Basically, I would need more detail about exactly what happened, and exactly
what you did, before I could try to tell you why you saw this problem.
Rob
|
1848.2 | A bit more info | ROMOIS::CIARAMELLA | | Wed Jan 29 1997 17:57 | 26 |
|
Rob, Thanks for your answer.
I am sorry, but I do not have more detail. What happened, in simpler words,
is that a member of a mirrored set failed, and after this event the
disk service was not available.
This installation has been running for about one year, and during the
acceptance test phase, service availability following a mirror set
failure was tested (by removing one of the members of the mirror
set), so during normal operations the functionality is provided.
Several things were tried: we tried to understand the error, but it is not
reported in the messages list, and we tried to stop/start or relocate the
services to the other cluster node, with no success.
A bit more detailed description of the services, and of operations:
The service is composed of three parts:
- a cluster internet alias node
- mount of database disks
- starting of Oracle applications (which are on the shared disks)
Regards,
enzo
|
1848.3 | ASE_PARTIAL_MIRRORING ?? | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Fri Jan 31 1997 03:23 | 15 |
| Hi
Is the ASE_PARTIAL_MIRRORING variable not set to OFF?
In this case, the service should continue at the time of the
error, simply sending you a mail about the device failure.
But, if later (I repeat, later) you stop/start (or relocate or reboot)
the service, then the above variable prevents you from starting the service
when ASE discovers the "PARTIAL MIRROR" of one of the volumes of the
service.
Read the manual...
Regards, Manu.
|
1848.4 | | USCTR1::ASCHER | Dave Ascher | Fri Jan 31 1997 05:45 | 13 |
| Manu,
Certainly this is a design bug if it works as you describe.
How can it be acceptable for the service to be unable to restart
when ASE is there when the volumes would be available without
ASE? ASE is supposed to enhance availability, not degrade it.
Are you sure?
d
|
1848.5 | | SMURF::MARSHALL | Rob Marshall - USEG | Fri Jan 31 1997 14:28 | 45 |
| Hi,
Yes, Manu is sure. If you have ASE_PARTIAL_MIRRORING set to 'off' ASE
will not start a service unless all plexes are available. This is to
try to prevent data corruption, or stale data.
Assume the following situation:
+-------+ scsi0               scsi0 +-------+
| hostA |------+--------------------| hostB |
|       |-------------+-------------|       |
+-------+ scsi1|      |       scsi1 +-------+
               |      |
               | vol0 |
            +------------+
            | pl0    pl1 |
            +------------+
OK, vol0 is an LSM volume consisting of plexes pl0 (attached to scsi0)
and pl1 (attached to scsi1). Let's assume that ASE_PARTIAL_MIRRORING
is set to "on" on both machines.
ASE_PARTIAL_MIRRORING="on" says that, if during startup of the service,
it notices that not all of the plexes are available, ASE will still
start the service.
Now assume that hostA initially has the service with vol0. During the
time that hostA is running, scsi1 on hostB breaks. ASE does nothing
because hostB isn't offering a service. Shortly after that, however,
scsi0 on hostA breaks. No biggy, one plex is still available (pl1) so
everything keeps on truckin'. This goes on for a while (with lots of data
being written to pl1).
Suddenly hostA crashes. ASE relocates the service to hostB, which then
tries to start the service. hostB discovers that one plex is not
available (in this case pl1, because scsi1 on hostB is broken), but since
ASE_PARTIAL_MIRRORING is set to "on" it starts the service anyway without
pl1. The users get the stale data off pl0.
Now you have a real mess :-)
That's why ASE_PARTIAL_MIRRORING should be used wisely.
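The sequence above can be sketched as a toy model in Python. This is only an illustration of the reasoning, assuming pl0 sits on scsi0 and pl1 on scsi1 as in the diagram; the class names, the version counter, and can_start() are made up for the sketch and are not part of ASE or LSM.

```python
from dataclasses import dataclass


@dataclass
class Plex:
    name: str
    bus: str          # shared SCSI bus the plex is attached to
    version: int = 0  # toy counter: generation of the last write it received


@dataclass
class Host:
    name: str
    working_buses: set  # buses this host's adapters can still reach

    def visible(self, plexes):
        return [p for p in plexes if p.bus in self.working_buses]


def write(host, plexes):
    """Apply one update to every plex the host can still reach."""
    new_version = max(p.version for p in plexes) + 1
    for p in host.visible(plexes):
        p.version = new_version


def can_start(host, plexes, partial_mirroring):
    """Simplified start-time check (hypothetical, not ASE code)."""
    seen = host.visible(plexes)
    if not seen:
        return False
    return len(seen) == len(plexes) or partial_mirroring


def run_scenario(partial_mirroring):
    pl0, pl1 = Plex("pl0", "scsi0"), Plex("pl1", "scsi1")
    plexes = [pl0, pl1]
    host_a = Host("hostA", {"scsi0", "scsi1"})
    host_b = Host("hostB", {"scsi0", "scsi1"})

    write(host_a, plexes)                  # healthy mirror, both plexes updated
    host_b.working_buses.discard("scsi1")  # scsi1 on hostB breaks
    host_a.working_buses.discard("scsi0")  # scsi0 on hostA breaks
    write(host_a, plexes)                  # only pl1 receives these writes
    write(host_a, plexes)

    # hostA crashes; ASE tries to start the service on hostB.
    if not can_start(host_b, plexes, partial_mirroring):
        return (False, None, False)
    served = max(host_b.visible(plexes), key=lambda p: p.version)
    newest = max(p.version for p in plexes)
    return (True, served.name, served.version < newest)


print(run_scenario(partial_mirroring=True))   # (True, 'pl0', True)
print(run_scenario(partial_mirroring=False))  # (False, None, False)
```

Each result is (started, plex served, stale?). With the setting on, hostB starts the service from the only plex it can see, which missed the later writes; with it off, the start is refused instead.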
Rob Marshall
USEG
|
1848.6 | This situation is referred as the "triple I/O Failure" ... | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Tue Feb 04 1997 03:19 | 0 |
1848.7 | | USCTR1::ASCHER | Dave Ascher | Tue Feb 04 1997 08:15 | 28 |
| re: <<< Note 1848.5 by SMURF::MARSHALL "Rob Marshall - USEG" >>>
I misunderstood Manu's response wherein he stated
"Is the ASE_PARTIAL_MIRRORING variable not set to OFF?" and
subsequently described what should happen if ASE_PARTIAL_MIRRORING
IS set to OFF (or at least is NOT set to ON).
Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT
set to ON then the failure of a single mirrored plex will
cause failover to be initiated? That's what we all thought
this parameter (with value = ON) was invented to 'fix'.
Generally, the failure of a single mirrored plex is not perceived
by customers to be a good reason for interrupting service.
Clearly there is at least the one scenario that you describe
that would lead to serious consequences; however, this seems
to be only one of many possible holes in the DECsafe design.
I tell customers that DECsafe is not bulletproof, but we can
effectively minimize loss of application availability using
it. Single plex failures are certainly a lot higher on the
list of potential problems that customers believe they may
encounter than the triple failure scenario... hence we opt
for ASE_PARTIAL_MIRRORING ON.
d
|
1848.8 | | usr405.zko.dec.com::Marshall | Rob Marshall | Tue Feb 04 1997 21:06 | 16 |
| Hi,
> Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT
> set to ON then the failure of a single mirrored plex will
> cause failover to be initiated? That's what we all thought
> this parameter (with value = ON) was invented to 'fix'.
I think the answer to your question is: ASE_PARTIAL_MIRRORING has nothing
to do with whether, or not, a service gets relocated. It has to do with
whether, or not, a service should be started when plexes are missing from
a mirrored volume.
ASE will not relocate a service if a plex fails.
Rob
|
1848.9 | | USCTR1::ASCHER | Dave Ascher | Thu Feb 06 1997 08:47 | 27 |
|
> ASE will not relocate a service if a plex fails.
You must mean that in your opinion ASE will not relocate
a service if a plex fails... or only if one plex fails?
See note 309 (my note from 2 1/2 years ago)
I have also just tested at another customer site what happens
when a pair of dual-redundant HSZ40s gets powered off when
ASE_PARTIAL_MIRRORING is OFF and we have mirrors on another
pair of dual-redundant HSZ40s. The short story is that the
service got moved to the other node (which had
ASE_PARTIAL_MIRRORING=ON).
If you would like an opportunity to see how DECsafe works
in the real world I would be delighted to help set up a
visit to one of our customers who are attempting to use
it with SAP R/3 to run their most important business critical
applications - and want very much to have the most available
system that they can have.
thanks,
d
|
1848.10 | | USCTR1::ASCHER | Dave Ascher | Thu Feb 06 1997 08:54 | 6 |
| You might also see note 568.8... Doug Franks's explanation
certainly implies that ASE_PARTIAL_MIRRORING makes the difference
between whether or not a single failing plex will cause the
service to fail.
d
|
1848.11 | | USCTR1::ASCHER | Dave Ascher | Thu Feb 06 1997 09:02 | 3 |
| There is also Manu's note 957.1 and Doug's note 921.1.
|
1848.12 | an attempt to unmuddy the waters | USCTR1::ASCHER | Dave Ascher | Sat Mar 01 1997 18:16 | 29 |
|
In an attempt to help future searchers through this conference:
ASE_PARTIAL_MIRRORING=ON is there to allow ASE to start a service
even though all of the disks may not be available. If each
volume has a good mirror plex and if ASE_PARTIAL_MIRRORING=ON
the service will start (or at least make the attempt). Otherwise,
if any of the disks is not available, the service will not start.
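As a rough sketch of that start-time rule, assuming the setting has already been reduced to a boolean and each volume is just a list of per-plex availability flags (both are simplifications for illustration; service_may_start() is a hypothetical name, not an ASE function):

```python
def service_may_start(volumes, partial_mirroring):
    """volumes maps a volume name to a list of booleans, one per plex,
    True meaning the plex is available."""
    for plexes in volumes.values():
        if not any(plexes):
            return False  # no good plex left, nothing to start from
        if not all(plexes) and not partial_mirroring:
            return False  # partial mirror and the setting is not ON
    return True


print(service_may_start({"vol0": [True, False]}, partial_mirroring=True))
print(service_may_start({"vol0": [True, False]}, partial_mirroring=False))
```

The first call allows the start because vol0 still has one good plex; the second refuses it because the mirror is partial and the setting is not ON.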
No matter what the setting of ASE_PARTIAL_MIRRORING (on, ON, off,
OFF, undefined, abc, xyz) whenever any disk in the service fails
ASE is supposed to check if that disk is part of any of the
volumes that it knows about and, if so, whether there is a good mirror
plex for the plexes on that disk. If so, nothing is supposed to
happen. If either no good plex exists or the disk is not part of an
LSM volume, then - I will let Engineering describe what should
happen.
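The check described above might be sketched like this. It is a minimal illustration only: the data layout, the disk names, and reaction_to_disk_failure() are assumptions, and the branch that this note leaves to Engineering is deliberately returned as "unspecified" rather than guessed at.

```python
from enum import Enum


class Reaction(Enum):
    NOTHING = "nothing happens"
    UNSPECIFIED = "not described in this note"


def reaction_to_disk_failure(failed_disk, volumes):
    """volumes maps a volume name to a list of plexes; each plex is a dict
    with a set of disk names ("disks") and a health flag ("good")."""
    disk_in_some_volume = False
    for plexes in volumes.values():
        if not any(failed_disk in p["disks"] for p in plexes):
            continue  # this volume does not use the failed disk
        disk_in_some_volume = True
        survivors = [p for p in plexes
                     if failed_disk not in p["disks"] and p["good"]]
        if not survivors:
            return Reaction.UNSPECIFIED  # no good mirror plex remains
    if not disk_in_some_volume:
        return Reaction.UNSPECIFIED      # disk is not part of an LSM volume
    return Reaction.NOTHING


# Hypothetical disk names, for illustration only.
vols = {"vol0": [{"disks": {"rz10"}, "good": True},
                 {"disks": {"rz20"}, "good": True}]}
print(reaction_to_disk_failure("rz10", vols))
```

Note that the setting of ASE_PARTIAL_MIRRORING does not appear anywhere in this function, which matches the claim that it plays no part in the reaction to a disk failure.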
Some of the behaviors that I have been observing are bugs -
somewhere between LSM, ASE, and the HSZ40 something is not
cooperating in identifying what is going on out there with the disks.
I hope that the bugs will be identified and repaired some time
soon.
I hope that makes things clear.
d
|