
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

1848.0. "Critical ASE error hard device error" by ROMOIS::CIARAMELLA () Tue Jan 28 1997 12:28

    Hello,

    I have a customer running DECsafe 1.2a on OSF Rel 3.2C. The hardware
    is a pair of Alpha 1000 systems with a pair of KZPSA SCSI controllers
    and two BA350s where the data disks are installed.

    The only DECsafe service defined consists of an internet alias
    address, a mount of the database disks and, finally, the Oracle
    startup. The data disks are mirrored using LSM.
    On the console of the system that manages the service, the following
    message appeared:

    Critical ASE error: hard device error on /dev/rz26g from "hostname"

    This message should only mean that a disk has failed, and the service
    should still be available. Instead, not only was the service not
    relocated automatically, but it was also not possible to stop the
    service or relocate it manually, because it was in the wrong state.
    After the failed disk was replaced, everything returned to working
    properly.
    Do you have any ideas?

    Thanks in advance,
    						enzo

T.R	Title	User	Personal Name	Date	Lines
1848.1		usr505.zko.dec.com::Marshall	Rob Marshall	`Tue Jan 28 1997 14:50`	24
    > This message should only mean that a disk has failed, and the service
    > should still be available. Instead, not only was the service not
    > relocated automatically, but it was also not possible to stop the
    > service or relocate it manually, because it was in the wrong state.
    > After the failed disk was replaced, everything returned to working
    > properly.
    > Do you have any ideas?

    Hi,

    I'm not really sure what you mean by this. For one, a disk failure
    should not cause the service to relocate; perhaps this is just a
    misunderstanding of how ASE deals with different kinds of errors??

    Plus, when using LSM this should not have prevented you from
    stopping/starting the service, unless the failure is such that LSM
    hangs when trying to talk to the disk, which could cause the
    lsm_dg_action script to time out. But that doesn't seem to be the
    case here.

    Also, I'm not sure I understand how you tried to relocate the
    service. It should always be possible to set a service offline and
    then online it.

    Basically, I would need more detail about exactly what happened, and
    exactly what you did, before I could try to tell you why you saw this
    problem.

    Rob
1848.2	A bit more info	ROMOIS::CIARAMELLA		`Wed Jan 29 1997 17:57`	26
    Rob,

    Thanks for your answer. I am sorry, but I do not have more detail.
    In simpler words, what happened is that a member of a mirrored set
    failed, and after this event the disk service was no longer
    available.

    This installation has been running for about one year. During the
    acceptance test phase, service availability following a mirror set
    failure was tested (by removing one of the members of the mirror
    set), so in normal operation this functionality does work.

    The following was tried: we tried to understand the error, but it is
    not reported in the messages list; we then tried to stop/start the
    service and to relocate it to the other cluster node, with no
    success.

    A slightly more detailed description of the service: it is composed
    of three parts:

    - a cluster internet alias node
    - the mount of the database disks
    - the startup of the Oracle applications (which are on the shared
      disks)

    Regards,
    enzo
1848.3	ASE_PARTIAL_MIRRORING ??	BACHUS::DEVOS	Manu Devos DEC/SI Brussels 856-7539	`Fri Jan 31 1997 03:23`	15
    Hi,

    Is the ASE_PARTIAL_MIRRORING variable not set to OFF? In this case,
    the service should continue at the time of the error, simply sending
    you a mail about the device failure. But if later (I repeat, later)
    you stop/start (or relocate or reboot) the service, then the above
    variable prevents you from starting the service when ASE discovers
    the "PARTIAL MIRROR" state of one of the volumes of the service.

    Read the manual...

    Regards,
    Manu.
1848.4		USCTR1::ASCHER	Dave Ascher	`Fri Jan 31 1997 05:45`	13
    Manu,

    Certainly this is a design bug if it works as you describe. How can
    it be acceptable for the service to be unable to restart when ASE is
    there, when the volumes would be available without ASE? ASE is
    supposed to enhance availability, not degrade it.

    Are you sure?

    d
1848.5		SMURF::MARSHALL	Rob Marshall - USEG	`Fri Jan 31 1997 14:28`	45
    Hi,

    Yes, Manu is sure. If you have ASE_PARTIAL_MIRRORING set to 'off',
    ASE will not start a service unless all plexes are available. This is
    to try to prevent data corruption, or stale data.

    Assume the following situation:

        +-------+ scsi0        scsi0 +-------+
        | hostA |------+-------------| hostB |
        |       |-------------+------|       |
        +-------+ scsi1|      |scsi1 +-------+
                       |      |
                       | vol0 |
                    +------------+
                    | pl0    pl1 |
                    +------------+

    OK, vol0 is an LSM volume consisting of plexes pl0 (attached to
    scsi0) and pl1 (attached to scsi1). Let's assume that
    ASE_PARTIAL_MIRRORING is set to "on" on both machines.
    ASE_PARTIAL_MIRRORING="on" says that if, during startup of the
    service, ASE notices that not all of the plexes are available, it
    will still start the service.

    Now assume that hostA initially has the service with vol0. While
    hostA is running the service, scsi1 on hostB breaks. ASE does nothing
    because hostB isn't offering a service. Shortly after that, however,
    scsi0 on hostA breaks. No biggy, one plex is still available (pl1) so
    everything keeps on truckin'. This goes on for a while (with lots of
    data being written to pl1).

    Suddenly hostA crashes. ASE relocates the service to hostB, which
    then tries to start the service. hostB discovers that one plex is not
    available (in this case pl1, since scsi1 on hostB is broken), but
    since ASE_PARTIAL_MIRRORING is set to "on" it starts the service
    anyway without pl1. The users get the stale data off pl0. Now you
    have a real mess :-)

    That's why ASE_PARTIAL_MIRRORING should be used wisely.

    Rob Marshall
    USEG
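    The scenario in 1848.5 can be replayed as a small model. This is only
    a sketch in Python, not ASE or LSM code: the Plex and Host classes,
    the generation counter and the run_scenario helper are all invented
    for illustration, and the start rule is simply the one described in
    the note (with the variable "off", refuse to start unless every plex
    is reachable).

```python
from dataclasses import dataclass, field


@dataclass
class Plex:
    name: str
    generation: int = 0  # hypothetical counter: bumped by every write that lands here


@dataclass
class Host:
    name: str
    partial_mirroring: bool  # stands in for ASE_PARTIAL_MIRRORING on/off
    reachable: set = field(default_factory=set)  # plex names this host can still reach


def write(host, plexes):
    """Apply one write to every plex the host can still reach."""
    for plex in plexes:
        if plex.name in host.reachable:
            plex.generation += 1


def try_start(host, plexes):
    """Return the plexes the service would start on, or None if the start is refused."""
    visible = [p for p in plexes if p.name in host.reachable]
    if not visible:
        return None
    if len(visible) < len(plexes) and not host.partial_mirroring:
        return None
    return visible


def run_scenario(partial_mirroring):
    pl0, pl1 = Plex("pl0"), Plex("pl1")
    vol0 = [pl0, pl1]
    host_a = Host("hostA", partial_mirroring, {"pl0", "pl1"})
    host_b = Host("hostB", partial_mirroring, {"pl0", "pl1"})

    write(host_a, vol0)                 # both plexes in sync
    host_b.reachable.discard("pl1")     # scsi1 on hostB breaks
    host_a.reachable.discard("pl0")     # scsi0 on hostA breaks
    for _ in range(3):
        write(host_a, vol0)             # data now reaches pl1 only

    # hostA crashes; the service is relocated to hostB.
    started_on = try_start(host_b, vol0)
    if started_on is None:
        return "refused to start"
    newest = max(p.generation for p in vol0)
    if max(p.generation for p in started_on) < newest:
        return "started on stale data"
    return "started on current data"


if __name__ == "__main__":
    for setting in (True, False):
        print("partial mirroring", "on" if setting else "off", "->", run_scenario(setting))
```

    With the variable modelled as "on", hostB starts the service on pl0,
    which missed every write made after scsi0 on hostA broke. With it
    "off", the start is refused, which is the availability cost
    discussed in the following notes.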
1848.6	This situation is referred as the "triple I/O Failure" ...	BACHUS::DEVOS	Manu Devos DEC/SI Brussels 856-7539	`Tue Feb 04 1997 03:19`	0
1848.7		USCTR1::ASCHER	Dave Ascher	`Tue Feb 04 1997 08:15`	28
    re: <<< Note 1848.5 by SMURF::MARSHALL "Rob Marshall - USEG" >>>

    I misunderstood Manu's response wherein he stated "Is the
    ASE_PARTIAL_MIRRORING variable not set to OFF?" and subsequently
    described what should happen if ASE_PARTIAL_MIRRORING IS set to OFF
    (or at least is NOT set to ON).

    Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT set to
    ON then the failure of a single mirrored plex will cause failover to
    be initiated? That's what we all thought this parameter (with
    value = ON) was invented to 'fix'. Generally, the failure of a single
    mirrored plex is not perceived by customers to be a good reason for
    interrupting service.

    Clearly there is at least the one scenario that you describe that
    would lead to serious consequences; however, this seems to be only
    one of many possible holes in the DECsafe design. I tell customers
    that DECsafe is not bulletproof, but that we can effectively minimize
    loss of application availability using it. Single plex failures are
    certainly a lot higher on the list of potential problems that
    customers believe they may encounter than the triple failure
    scenario... hence we opt for ASE_PARTIAL_MIRRORING ON.

    d
1848.8		usr405.zko.dec.com::Marshall	Rob Marshall	`Tue Feb 04 1997 21:06`	16
    Hi,

    > Is it not also the case that if ASE_PARTIAL_MIRRORING is NOT
    > set to ON then the failure of a single mirrored plex will
    > cause failover to be initiated? That's what we all thought
    > this parameter (with value = ON) was invented to 'fix'.

    I think the answer to your question is: ASE_PARTIAL_MIRRORING has
    nothing to do with whether or not a service gets relocated. It has
    to do with whether or not a service should be started when plexes
    are missing from a mirrored volume.

    ASE will not relocate a service if a plex fails.

    Rob
1848.9		USCTR1::ASCHER	Dave Ascher	`Thu Feb 06 1997 08:47`	27
    > ASE will not relocate a service if a plex fails.

    You must mean that in your opinion ASE will not relocate a service
    if a plex fails... or only if one plex fails?

    See note 309 (my note from 2 1/2 years ago).

    I have also just tested, at another customer site, what happens when
    a pair of dual-redundant HSZ40s gets powered off when
    ASE_PARTIAL_MIRRORING is OFF and we have mirrors on another pair of
    dual-redundant HSZ40s. The short story is that the service got moved
    to the other node (which had ASE_PARTIAL_MIRRORING=ON).

    If you would like an opportunity to see how DECsafe works in the
    real world, I would be delighted to help set up a visit to one of our
    customers who are attempting to use it with SAP R/3 to run their most
    important business-critical applications, and who want very much to
    have the most available system that they can have.

    thanks,

    d
1848.10		USCTR1::ASCHER	Dave Ascher	`Thu Feb 06 1997 08:54`	6
    You might also see note 568.8... Doug Franks's explanation certainly
    implies that ASE_PARTIAL_MIRRORING makes the difference between
    whether or not a single failing plex will cause the service to fail.

    d
1848.11		USCTR1::ASCHER	Dave Ascher	`Thu Feb 06 1997 09:02`	3
    There is also Manu's note 957.1 and Doug's note 921.1.
1848.12	an attempt to unmuddy the waters	USCTR1::ASCHER	Dave Ascher	`Sat Mar 01 1997 18:16`	29
    In an attempt to help future searchers through this conference:

    ASE_PARTIAL_MIRRORING=ON is there to allow ASE to start a service
    even though not all of the disks may be available. If each volume has
    a good mirror plex and if ASE_PARTIAL_MIRRORING=ON, the service will
    start (or at least make the attempt). Otherwise, if any of the disks
    are not available, the service will not start.

    No matter what the setting of ASE_PARTIAL_MIRRORING (on, ON, off,
    OFF, undefined, abc, xyz), whenever any disk in the service fails ASE
    is supposed to check whether that disk is part of any of the volumes
    that it knows about and, if so, whether there is a good mirror plex
    for the plexes on that disk. If so, nothing is supposed to happen.
    If either no good plex exists or the disk is not part of an LSM
    volume, then - I will let Engineering describe what should happen.

    Some of the behaviors that I have been observing are bugs: somewhere
    between LSM, ASE, and the HSZ40, something is not cooperating in
    identifying what is going on out there with the disks. I hope that
    the bugs will be identified and repaired some time soon.

    I hope that makes things clear.

    d
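    The two rules summarised in 1848.12 can be written down as a short
    Python sketch. None of this is ASE code: the function names, the data
    layout (one disk per plex) and the disk names are assumptions made
    for illustration. The case the note leaves to Engineering is returned
    as "unspecified" rather than guessed at.

```python
def may_start(volumes, partial_mirroring_on):
    """Start rule: volumes maps a volume name to a list of booleans,
    one per plex, where True means the plex is available."""
    for plexes in volumes.values():
        if all(plexes):
            continue  # fully mirrored volume, always acceptable
        # Some plex is missing: acceptable only with the variable ON
        # and at least one good plex left in this volume.
        if not (partial_mirroring_on and any(plexes)):
            return False
    return True


def on_disk_failure(failed_disk, volumes, already_failed):
    """Failure rule, independent of ASE_PARTIAL_MIRRORING.
    volumes maps a volume name to {plex name: disk name}
    (assumption: each plex lives on exactly one disk)."""
    affected = [plexes for plexes in volumes.values()
                if failed_disk in plexes.values()]
    if not affected:
        return "unspecified"  # disk is not part of a known LSM volume
    down = set(already_failed) | {failed_disk}
    for plexes in affected:
        if all(disk in down for disk in plexes.values()):
            return "unspecified"  # no good mirror plex remains
    return "no action"  # a good mirror plex covers the failed disk


if __name__ == "__main__":
    # Hypothetical disk names, used only for this example.
    layout = {"vol0": {"pl0": "rz26", "pl1": "rz34"}}
    print(on_disk_failure("rz26", layout, set()))
    print(may_start({"vol0": [True, False]}, partial_mirroring_on=False))
```

    Keeping the two functions separate mirrors the point made in 1848.8:
    the variable only enters the start decision, not the reaction to a
    failing disk.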