
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

2099.0. "ASE BEHAVIOUR IN CASE OF SCSI BUS FAILURE" by SOSGPX::FIORINI () Thu May 29 1997 11:10

    Hi all,
     
    I have installed a DECsafe configuration with two ALPSRV 5/400 systems
    for one of the biggest Customers in Italy.
    The operating system is Digital UNIX 4.0B, with DECsafe 1.4.
    The hardware configuration is the following:
    
    
    SYSTEM 1                                       SYSTEM2
    +----------+            +-----+               +----------+
    |A1000  |K |            |BA356|               |A1000  |K |
    |5/400  |Z |            |     |               |5/400  |Z |
    |       |P |            |     |               |       |P |
    |       |S |            |     |               |       |S |
    |       |A |            |     |               |       |A |
    +----------+            +-----+               +----------+
             |                | |                          |
             |                | |                          |
             +----------------+ +--------------------------+
    
    
    In the BA356 there are two disks (the application disk and its
    mirror).
    The mirroring is done with LSM.
    
    During the system acceptance test the Customer ran some tests to
    see how ASE behaves in case of system failures.
    
    Assume that SYSTEM 1 is running the service and SYSTEM 2 is in
    stand-by.
    If the SCSI cable is disconnected from the KZPSA of SYSTEM 1 (the SCSI
    bus is still terminated via the Y cable), the AM notifies the HSM that
    the ping over the SCSI bus has timed out.
    THE SERVICE IS NOT REALLOCATED TO SYSTEM 2, WHICH CAN STILL ACCESS THE
    SHARED DEVICES, AND THE APPLICATION HANGS.
    If the cable is reconnected (after two minutes), the AM notifies the HSM
    that the ping over the SCSI bus is OK, and the application is
    automatically restarted.
     
    My conclusion is that ASE does not react correctly to a SCSI bus
    failure.
     
    Does anyone know whether this unacceptable behaviour can be changed?
    
    Thanks to everybody who can help me.
    
    Regards 
    
    Moreno Fiorini

T.R	Title	User	Personal Name	Date	Lines
2099.1	any disk access failures ???	BACHUS::DEVOS	Manu Devos NSIS Brussels 856-7539	`Thu May 29 1997 18:08`	16
	My first reaction to your note is that you have a single point of failure in your drawing, as there is only one SCSI bus.
	Now, concerning the described behaviour, you must provide us with more information. As you correctly mentioned, the AM detects when the "other" member is no longer responding to SCSI "pings". BUT this is NOT enough to cause a failover. After all, maybe this bus is not used by any service running on that host. A failover is only started if an access to shared data fails, and only if that data is not accessible from another disk (i.e. from another plex of the LSM volume).
	So, you should tell us whether you also receive a notification that ASE is not able to access a specific disk.
	Regards, Manu
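	The rule described in .1 can be written down as a small Python sketch. This is only a model of the reply's wording, not ASE code: the function name, its parameters and the plex names are hypothetical, and the sketch assumes that a failed I/O on every plex of the volume is what triggers the failover.

```python
def should_fail_over(scsi_ping_ok: bool, plexes: set, failed_plexes: set) -> bool:
    """Model of the rule in .1 (hypothetical helper, not part of ASE).

    scsi_ping_ok is accepted only to make the point that, according
    to .1, a lost SCSI ping does not decide anything by itself.
    A failover is started only when I/O has failed on every plex of
    the LSM volume, so no copy of the data is reachable locally.
    """
    del scsi_ping_ok  # deliberately ignored by the rule
    return bool(plexes) and plexes <= failed_plexes


if __name__ == "__main__":
    volume = {"plex-01", "plex-02"}  # hypothetical plex names
    # Ping lost, but no I/O has failed yet: no failover.
    print(should_fail_over(False, volume, set()))
    # One plex failed, the mirror still answers: no failover.
    print(should_fail_over(False, volume, {"plex-01"}))
    # I/O failed on both plexes: failover.
    print(should_fail_over(False, volume, set(volume)))
```

	Under this model, the situation in .0 (ping timed out, no failed I/O reported) would not lead to a failover, which matches the request for an explicit disk access error.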
2099.2	ASE AND SCSI BUS PARTITION	SOSGPX::FIORINI		`Tue Jun 03 1997 06:35`	38
	Hi Manu,
	thanks for the answer to my note. As you have noticed, there is a single point of failure, because each system has only one SCSI interface, and the two shared disks (one is used as the mirror) are on that interface. I cannot change the configuration now, but I have informed the Customer, and I think the configuration will be changed shortly.
	In the current configuration, only one service (which uses both disks) has been defined, and the first system provides the service. The second system is in "stand by" and will take over the service if the first system fails. If the SCSI cable is disconnected from the system that provides the service, that system is not able to access any data on the shared bus. The AM detects the failure and notifies that there is a SCSI bus partition. There is no entry in the errorlog regarding access to the disks (the logging severity level has been set to log notices, warnings and errors).
	The Customer's application is run via the start action script. In it there are two lines that call two other scripts in order to start the database (Oracle) and the application. Once the application is started, it accesses the data on the shared disks continuously.
	I think that ASE does not reallocate the service because it does not ping the disks itself, and does not know that the application is unable to reach the data. Is there any way to solve the problem?
	Regards, Moreno
2099.3	How is the service defined?	HERON::BLOMBERG	Trapped inside the universe	`Tue Jun 03 1997 08:23`	1

2099.4	.0: It seems that ASE1.4 can't handle SCSI single point failure	EPS::NGUYEN	Without fools there would be no wisdom.	`Tue Jun 03 1997 10:17`	18
	Hi there,
	I have the same problem: my application service does not fail over when the SCSI bus is disconnected. My software versions are the same as those in .0, and I use an HSZ40 instead of the BA box. I configured the application with favored members "system1,system2", and it does NOT fail over to the more highly favored member automatically.
	To trigger the failover in the old version, ASE 1.3, I used to "cd" into the directory on the shared disk and do an "ls", but this no longer seems to work: the application cannot get to the data on the shared disk and just hangs there without failing over to the other system.
	Any recommendations/suggestions are highly appreciated.
	Regards, Gina Nguyen
2099.5		COMICS::CORNEJ	What's an Architect?	`Tue Jun 03 1997 11:24`	12
	>In order to activate the fail over in the old version of ASE1.3, I used
	>to "cd" into the directory on the shared disk and "ls", but it seems does
	>NOT work anymore, the application can not to data on the shared disk and
	>just hang there without failing over to the other system.
	The service will not fail over if you "cd" to the filesystem on the shared disk (look in daemon.log - it will show the umount failing because the device is still busy).
	Jc
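	The "device is still busy" condition mentioned in .5 arises when a process has its current directory inside the filesystem being unmounted. A minimal Python sketch of that check follows; the helper name and the `/shared/data` paths are hypothetical, and this only illustrates the idea, it is not how ASE or umount decide.

```python
import os


def cwd_blocks_unmount(cwd: str, mount_point: str) -> bool:
    """Return True if cwd is the mount point or lies beneath it.

    A shell left in such a directory keeps the filesystem busy, so
    an unmount attempted during service relocation would fail.
    """
    cwd = os.path.realpath(cwd)
    mount_point = os.path.realpath(mount_point)
    # commonpath compares whole path components, so
    # /shared/database is not treated as inside /shared/data.
    return os.path.commonpath([cwd, mount_point]) == mount_point


if __name__ == "__main__":
    # Check the current shell's directory against a hypothetical
    # mount point before relocating a service by hand.
    print(cwd_blocks_unmount(os.getcwd(), "/shared/data"))
```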
2099.6	SCSI full partition changed in 1.4 ?	BACHUS::DEVOS	Manu Devos NSIS Brussels 856-7539	`Wed Jun 04 1997 16:34`	20
	Hi Jc,
	> The service will not fail over if you "cd" to the filesystem on the
	> shared disk (look in daemon.log - it will show the umount failing
	> because the device is still busy).
	I think .4 meant that, after the SCSI bus cable had been disconnected, a "cd" and "ls" were needed to cause an I/O on the disconnected disk, and that I/O caused the failover. This simply confirms my answer: a cable disconnection is not sufficient to cause a failover; an I/O is also needed.
	But the interesting information in .0, .3 and .4 seems to be that this no longer works with version 1.4. Is there any change in version 1.4 concerning full SCSI bus partition compared with version 1.3?
	Manu.
2099.7	:-)	COMICS::CORNEJ	What's an Architect?	`Thu Jun 05 1997 07:45`	5
	Ooops! Sorry about that. It is what comes from doing what I wrote in .4 too often that day myself:-) Jc
2099.8	ScSi fail over for less than 1-minute	EPS::NGUYEN	Without fools there would be no wisdom.	`Thu Jun 05 1997 09:29`	23
	Hello there,
	Thank you very much for the discussion of this case. What .6 wrote is precisely what I meant. Since ASE has already discovered that there is a failure on the SCSI bus, it won't hurt to "cd" into the directory (after all, it is just an empty mount point for the shared disk). It is only under normal conditions that this doesn't work, as .5 pointed out.
	After some more extensive testing, I have found that the system will fail over if any I/O happens soon enough (within less than 2 minutes or so) around the time of the SCSI failure. The longer the delay, the harder it is for the system to do anything. In other words, after about 1 minute one should expect the system to hang.
	However, I believe that customers who buy our product expect it to work unless it is stated otherwise. What if the SCSI bus fails at night, or at some other time when there is not much I/O traffic?
	Any suggestions from the product team?
	Gina Nguyen
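	One way to avoid depending on application I/O happening near the time of the failure is to generate a small I/O on the shared filesystem at regular intervals. The Python sketch below is only an illustration under assumptions: it is not an ASE feature, the function name, the `.io_probe` file name and the timeout are hypothetical, and what to do on a failed probe (for example, relocating the service by hand) is left to the site. The write runs in a helper thread so that a hung disk shows up as a timeout instead of blocking the caller.

```python
import os
import threading


def probe_write(directory: str, timeout_s: float = 10.0) -> str:
    """Do one small synchronous write; return 'ok', 'error' or 'timeout'."""
    result = {}

    def worker() -> None:
        path = os.path.join(directory, ".io_probe")  # hypothetical file name
        try:
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the write down to the device
            finally:
                os.close(fd)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    # Daemon thread: if the I/O never returns, it will not keep the
    # interpreter alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return "timeout"  # the write is still blocked
    return result.get("status", "error")
```

	A caller could run `probe_write` on the service's mount point every few tens of seconds and raise an alarm on "error" or "timeout". Whether such a probe is enough to make ASE 1.4 relocate the service is exactly the open question in this thread.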
2099.9		KITCHE::schott	Eric R. Schott USG Product Management	`Thu Jun 05 1997 13:15`	9
	Hi,
	If you have behavior that you think is incorrect (or different between releases), I suggest you file a QAR or CLD/IPMT.
	The system should not hang, so I think this is a serious problem. To get the attention this deserves, you should escalate as required by your customer.