
Conference smurf::ase

Title:ase
Moderator:SMURF::GROSSO
Created:Thu Jul 29 1993
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2114
Total number of notes:7347

1848.0. "Critical ASE error hard device error" by ROMOIS::CIARAMELLA () Tue Jan 28 1997 12:28

    Hello,
    
    I have a customer who is running DECsafe 1.2a on OSF Rel 3.2C. The
    hardware is a pair of Alpha 1000 systems with a pair of KZPSA SCSI
    controllers and two BA350s where the data disks are installed.
     
    The only DECsafe service available consists of an internet alias
    address, a database disk mount and, finally, the Oracle startup. The
    data disks are mirrored using LSM.
    On the console of the system that manages the service, the following
    message appeared:
    
    Critical ASE error: hard device error on /dev/rz26g from "hostname"
    
    This message should only mean that a disk has failed, and the service
    should still be available. Instead, not only was the service not
    automatically relocated, but it was also not possible to stop the
    service or relocate it manually, because it was in the wrong state.
    Once the failed disk was replaced, everything returned to working
    properly.
    Do you have any ideas....?
    
    Thanks in advance
    						enzo

1848.1. "" by usr505.zko.dec.com::Marshall (Rob Marshall) Tue Jan 28 1997 14:50, 24 lines
>    This message should only mean that a disk has failed, and the service
>    should still be available. Instead, not only was the service not
>    automatically relocated, but it was also not possible to stop the
>    service or relocate it manually, because it was in the wrong state.
>    Once the failed disk was replaced, everything returned to working
>    properly.
>    Do you have any ideas....?

Hi,

I'm not really sure what you mean by this.  For one, a disk failure should not
cause the service to relocate; perhaps this is just a misunderstanding of how
ASE deals with different kinds of errors?  Also, when using LSM, this should
not have prevented you from stopping/starting the service, unless the failure
is such that LSM hangs when trying to talk to the disk, which could cause the
lsm_dg_action script to time out.  But that doesn't seem to be the case here.

Also, I'm not sure I understand how you tried to relocate the service.  It
should always be possible to set a service offline and then online it.

Basically, I would need more detail about exactly what happened, and exactly
what you did, before I could try to tell you why you saw this problem.

Rob

1848.2. "A bit more info" by ROMOIS::CIARAMELLA () Wed Jan 29 1997 17:57, 26 lines
    
    Rob, thanks for your answer.
    
    I am sorry, but I do not have more detail. What happened, in simpler
    words, is that a member of a mirrored set failed, and after this event
    the disk service was not available.
    
    This installation has been running for about one year. During the
    acceptance test phase, service availability following a mirror set
    failure was tested (by removing one of the members of the mirror
    set), so during normal operations the functionality is provided.
    
    A few things were tried: we looked for the error in order to
    understand it, but it is not reported in the messages list, and we
    tried to stop/start the service or relocate it to the other cluster
    node, with no success.
    
    A slightly more detailed description of the service and its operations:
    
    The service is composed of three parts:
    - a cluster internet alias node
    - mount of database disks
    - starting of Oracle applications (which are on the shared disks)
    
    Regards,
    
    							enzo

1848.3. "ASE_PARTIAL_MIRRORING ??" by BACHUS::DEVOS (Manu Devos DEC/SI Brussels 856-7539) Fri Jan 31 1997 03:23, 15 lines
    Hi
    
    Is the ASE_PARTIAL_MIRRORING variable not set to OFF? 
    
    In this case, the service should continue at the time of the
    error, simply sending you a mail about the device failure.
    
    But, if later (I repeat, later) you stop/start (or relocate or reboot)
    the service, then the above variable prevents you from starting the
    service when ASE discovers the "PARTIAL MIRROR" of one of the volumes
    of the service.
    
    Read the manual...
    
    Regards, Manu.

1848.4. "" by USCTR1::ASCHER (Dave Ascher) Fri Jan 31 1997 05:45, 13 lines
Manu,
    
    Certainly this is a design bug if it works as you describe.
    
    How can it be acceptable for the service to be unable to restart
    when ASE is there when the volumes would be available without
    ASE? ASE is supposed to enhance availability, not degrade it.
    
    Are you sure?
    
    d
    
    
1848.5. "" by SMURF::MARSHALL (Rob Marshall - USEG) Fri Jan 31 1997 14:28, 45 lines
    Hi,
    
    Yes, Manu is sure.  If you have ASE_PARTIAL_MIRRORING set to 'off' ASE
    will not start a service unless all plexes are available.  This is to
    try to prevent data corruption, or stale data.
    
    Assume the following situation:
    	
        +-------+ scsi0        scsi0 +-------+
        | hostA |------+-------------| hostB |
        |       |-------------+------|       |
        +-------+ scsi1|      |scsi1 +-------+
                       |      |
                       | vol0 |
                    +------------+
                    | pl0    pl1 |
                    +------------+
    
    OK, vol0 is an LSM volume consisting of plexes pl0 (attached to scsi0)
    and pl1 (attached to scsi1).  Let's assume that ASE_PARTIAL_MIRRORING 
    is set to "on" on both machines.
    
    ASE_PARTIAL_MIRRORING="on" says that if, during startup of the service,
    ASE notices that not all of the plexes are available, it will still
    start the service.
    
    Now assume that hostA initially has the service with vol0.  While
    hostA is running the service, scsi1 on hostB breaks.  ASE does nothing
    because hostB isn't offering a service.  Shortly after that, however,
    scsi0 on hostA breaks.  No biggy, one plex is still available (pl1) so
    everything keeps on truckin'.  This goes on for a while (with lots of
    data being written to pl1).
    
    Suddenly hostA crashes.  ASE relocates the service to hostB, which then
    tries to start the service.  hostB discovers that one plex is not
    available (in this case pl1, since scsi1 on hostB is broken) but since
    ASE_PARTIAL_MIRRORING is set to "on" it starts the service anyway
    without pl1.  The users get the stale data off pl0.
    
    Now you have a real mess :-)
    
    That's why ASE_PARTIAL_MIRRORING should be used wisely.
    
    Rob Marshall
    USEG
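
The scenario above can be replayed as a small simulation. This is only an illustrative Python sketch, not ASE or LSM code: the Plex and Host classes, start_service() and run_scenario() are hypothetical names, the write counter stands in for "how current the data on a plex is", and matching "on" case-insensitively is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Plex:
    name: str
    bus: str          # SCSI bus the plex's disk is attached to
    writes: int = 0   # number of updates this plex has received


@dataclass
class Host:
    name: str
    broken_buses: set = field(default_factory=set)

    def visible(self, plexes):
        """Plexes this host can still reach over its working buses."""
        return [p for p in plexes if p.bus not in self.broken_buses]


def start_service(host, plexes, partial_mirroring):
    """Return the plexes the service starts on, or None if it is refused."""
    seen = host.visible(plexes)
    if len(seen) == len(plexes):
        return seen
    if partial_mirroring.lower() == "on" and seen:
        return seen          # start anyway on whatever plexes are left
    return None              # "off": refuse to start with a partial mirror


def run_scenario(partial_mirroring):
    pl0, pl1 = Plex("pl0", "scsi0"), Plex("pl1", "scsi1")
    vol0 = [pl0, pl1]
    host_a, host_b = Host("hostA"), Host("hostB")

    host_b.broken_buses.add("scsi1")   # failure 1: hostB loses its path to pl1
    host_a.broken_buses.add("scsi0")   # failure 2: hostA loses its path to pl0
    for _ in range(100):               # hostA keeps running; only pl1 is written
        for plex in host_a.visible(vol0):
            plex.writes += 1

    # failure 3: hostA crashes and the service is relocated to hostB
    started_on = start_service(host_b, vol0, partial_mirroring)
    if started_on is None:
        return "refused"
    newest = max(p.writes for p in vol0)
    stale = [p.name for p in started_on if p.writes < newest]
    return f"started on stale {stale}" if stale else "started on current data"


print(run_scenario("on"))    # started on stale ['pl0']
print(run_scenario("off"))   # refused
```

With "on" the service comes up on hostB using pl0, which never saw the writes made while hostA was running on pl1 alone; with "off" the start is refused and the stale plex is never served.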

1848.6. "This situation is referred to as the "triple I/O Failure" ..." by BACHUS::DEVOS (Manu Devos DEC/SI Brussels 856-7539) Tue Feb 04 1997 03:19, 0 lines

1848.7. "" by USCTR1::ASCHER (Dave Ascher) Tue Feb 04 1997 08:15, 28 lines
re:          <<< Note 1848.5 by SMURF::MARSHALL "Rob Marshall - USEG" >>>

   
    I misunderstood Manu's response wherein he stated
    
    "Is the ASE_PARTIAL_MIRRORING variable not set to OFF?" and
    subsequently described what should happen if ASE_PARTIAL_MIRRORING
    IS set to OFF (or at least is NOT set to ON).
    
    Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT
    set to ON, then the failure of a single mirrored plex will
    cause failover to be initiated? That's what we all thought
    this parameter (with value = ON) was invented to 'fix'.
    
    Generally, the failure of a single mirrored plex is not perceived
    by customers to be a good reason for interrupting service.

    Clearly there is at least the one scenario that you describe
    that would lead to serious consequences; however, this seems
    to be only one of many possible holes in the DECsafe design.
    I tell customers that DECsafe is not bulletproof, but we can
    effectively minimize loss of application availability using
    it.  Single plex failures are certainly a lot higher on the
    list of potential problems that customers believe they may
    encounter than the triple failure scenario... hence we opt
    for ASE_PARTIAL_MIRRORING ON.
    
    d

1848.8. "" by usr405.zko.dec.com::Marshall (Rob Marshall) Tue Feb 04 1997 21:06, 16 lines
Hi,

>    Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT
>    set to ON, then the failure of a single mirrored plex will
>    cause failover to be initiated? That's what we all thought
>    this parameter (with value = ON) was invented to 'fix'.

I think the answer to your question is: ASE_PARTIAL_MIRRORING has nothing
to do with whether or not a service gets relocated.  It has to do with
whether or not a service should be started when plexes are missing from
a mirrored volume.

ASE will not relocate a service if a plex fails.

Rob
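
The distinction drawn in this reply, two separate decisions of which only one looks at ASE_PARTIAL_MIRRORING, can be written down as a short Python sketch. The function names are hypothetical and this is not ASE's implementation; requiring at least one surviving plex and matching "on" case-insensitively are assumptions.

```python
def relocates_on_plex_failure(partial_mirroring: str) -> bool:
    """Run-time decision as described in this reply: a plex failure alone
    never relocates the service, whatever the variable is set to."""
    return False


def may_start_service(partial_mirroring: str, plexes_total: int,
                      plexes_available: int) -> bool:
    """Start-time decision: the only place the variable is consulted."""
    if plexes_available == plexes_total:
        return True          # complete mirror: always allowed to start
    # Assumptions: at least one plex must remain, and "on" is matched
    # case-insensitively.
    return plexes_available > 0 and partial_mirroring.lower() == "on"
```

For a two-plex volume with one plex missing, may_start_service("off", 2, 1) is False and may_start_service("on", 2, 1) is True, while relocates_on_plex_failure() returns False for both settings.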

1848.9. "" by USCTR1::ASCHER (Dave Ascher) Thu Feb 06 1997 08:47, 27 lines
    
    ASE will not relocate a service if a plex fails.

    
       You must mean that in your opinion ASE will not relocate
       a service if a plex fails... or only if one plex fails?
       See note 309 (my note from 2 1/2 years ago) 
    
       I have also just tested at another customer site what happens
       when a pair of dual-redundant HSZ40s gets powered off when
       ASE_PARTIAL_MIRRORING is OFF and we have mirrors on another
       pair of dual-redundant HSZ40s. The short story is that the
       service got moved to the other node (which had
       ASE_PARTIAL_MIRRORING=ON).
       
       If you would like an opportunity to see how DECsafe works
       in the real world I would be delighted to help set up a
       visit to one of our customers who are attempting to use
       it with SAP R/3 to run their most important business critical
       applications - and want very much to have the most available
       system that they can have.
       
       thanks,
       
       d 

1848.10. "" by USCTR1::ASCHER (Dave Ascher) Thu Feb 06 1997 08:54, 6 lines
    You might also see note 568.8... Doug Franks's explanation
    certainly implies that ASE_PARTIAL_MIRRORING makes the difference
    between whether or not a single failing plex will cause the
    service to fail.
    
    d 

1848.11. "" by USCTR1::ASCHER (Dave Ascher) Thu Feb 06 1997 09:02, 3 lines
There is also Manu's note 957.1 and Doug's note 921.1.
    
    
1848.12. "an attempt to unmuddy the waters" by USCTR1::ASCHER (Dave Ascher) Sat Mar 01 1997 18:16, 29 lines
    
    In an attempt to help future searchers through this conference:
    
    ASE_PARTIAL_MIRRORING=ON is there to allow ASE to start a service
    even though not all of the disks may be available. If each
    volume has a good mirror plex and if ASE_PARTIAL_MIRRORING=ON
    the service will start (or at least make the attempt). Otherwise,
    if any of the disks are not available the service will not start.
    
    No matter what the setting of ASE_PARTIAL_MIRRORING (on, ON, off,
    OFF, undefined, abc, xyz), whenever any disk in the service fails
    ASE is supposed to check if that disk is part of any of the
    volumes that it knows about and, if so, if there is a good mirror
    plex for the plexes on that disk. If so, nothing is supposed to
    happen. If either no good plex exists or the disk is not part of an
    LSM volume, then - I will let Engineering describe what should
    happen.
    
    Some of the behaviors that I have been observing are bugs -
    somewhere between LSM, ASE, and the HSZ40 something is not
    cooperating in identifying what is going on out there with the disks.
    I hope that the bugs will be identified and repaired some time
    soon.
    
    I hope that makes things clear.
    
    d
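
The two rules summarized in this note can be sketched in Python as follows. This is an illustration of the description only, not ASE's code: the function names, the Outcome values, the volume layout and the disk names (rz26, rz34, rz99) are hypothetical, and matching "on" case-insensitively is an assumption. The case the note leaves to Engineering is returned as UNSPECIFIED rather than guessed at.

```python
from enum import Enum


class Outcome(Enum):
    NO_ACTION = "nothing is supposed to happen"
    UNSPECIFIED = "left for Engineering to describe"


def plex_is_good(disks, failed):
    """A plex is good when none of its disks has failed."""
    return not (disks & failed)


def on_disk_failure(disk, already_failed, volumes):
    """Run-time rule; the ASE_PARTIAL_MIRRORING setting is not consulted.

    volumes maps volume name -> {plex name -> set of disk names}.
    """
    failed = already_failed | {disk}
    affected = [plexes for plexes in volumes.values()
                if any(disk in disks for disks in plexes.values())]
    if not affected:
        return Outcome.UNSPECIFIED       # disk is not part of an LSM volume
    if all(any(plex_is_good(d, failed) for d in plexes.values())
           for plexes in affected):
        return Outcome.NO_ACTION         # a good mirror plex still exists
    return Outcome.UNSPECIFIED           # no good plex left


def may_start(partial_mirroring, failed, volumes):
    """Start-time rule: the only place the variable matters."""
    service_disks = set().union(
        *(d for plexes in volumes.values() for d in plexes.values()))
    if not (service_disks & failed):
        return True                      # every disk is available
    every_volume_has_good_plex = all(
        any(plex_is_good(d, failed) for d in plexes.values())
        for plexes in volumes.values())
    return partial_mirroring.lower() == "on" and every_volume_has_good_plex


volumes = {"vol0": {"pl0": {"rz26"}, "pl1": {"rz34"}}}
print(on_disk_failure("rz26", set(), volumes))   # Outcome.NO_ACTION
print(may_start("OFF", {"rz26"}, volumes))       # False
print(may_start("ON", {"rz26"}, volumes))        # True
```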