
Conference smurf::ase

Title:	ase

Moderator:	SMURF::GROSSO

Created:	Thu Jul 29 1993
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2114
Total number of notes:	7347

1982.0. "ASE not detecting simulated failure?" by NETRIX::"[email protected]" (John McDonald) Thu Apr 03 1997 16:27

I have a customer that's trying to demo an ase config to their customer. It's
2 8400's running ase v1.4 with 6 shared SCSI busses on KZPSA's. They've
set up a Sybase service that accesses all 6 of the shared busses, which have
HSZ's on them. No LSM mirroring.

They are trying to demo ase's failover capability, and one of the tests
they're doing is to disconnect the SCSI cable from one of the KZPSA's
to simulate a KZPSA failure. However, when they pull the cable, nothing
happens. No errors show up in the daemon.log file and the service
doesn't fail over. They mentioned that the service may not actually be
doing anything (there aren't any clients yet), so I suggested that they
pull the cable and then enter a 'disklabel -r' command for one device on
the shared bus they disconnected, to simulate a disk access. Still nothing
happens.

They waited several minutes after pulling the cable and still nothing
showed up. What could cause ase NOT to see a failure?

John McDonald
Atlanta CSC

[Posted by WWW Notes gateway]

T.R	Title	User	Personal Name	Date	Lines
1982.1	Humph ..... worked on my systems	NETRIX::"[email protected]"	Gregory P. Myrdal	`Thu Apr 03 1997 16:45`	36
	John,

	Not sure what to say. I agree that what you did should have worked. Note, however, that the aseagent registers itself to get I/O errors. Thus, if nothing is going on, the service will not be moved. But the disklabel reads from the physical disk, so I gave it a try on my system (running post V1.4) and it worked ok for me.

	Did they do a disklabel command to a disk within the HSZ that ASE is not aware of? I.e., if you gave drive rz17 to ASE, do a disklabel on it. The agent will be registered for I/O errors to this drive. You could also just create a filesystem and make a change to a file on it.

	Following is the output of my test from daemon.log after doing a disklabel -r rz17.

	-- Greg

	Apr 3 16:34:07 greg2 ASE: fgreg1 Agent *ALERT: device access failure on /dev/rz17a from fgreg1
	Apr 3 16:34:10 greg2 ASE: fgreg1 Agent Error: can't unreserve device
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Warning: AM can't ping /dev/rz17a
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Warning: can't reach device '/dev/rz17a'
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Info: exec'ing with pipe: /var/ase/sbin/ase_run_sh 15583
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent *ALERT: possible device failure: /dev/rz17a
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Error: can't unreserve device /dev/rz17a
	Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Notice: can't unreserve disk's devices, stopping it anyway

	[Posted by WWW Notes gateway]
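When checking whether ASE has noticed a failure, it can help to pull only the alert lines out of daemon.log. The following is a minimal sketch, not an ASE tool: the line layout is inferred solely from the eight log lines quoted in the reply above, the name `ase_alerts` is hypothetical, and the meaning of the `fgreg1` field (labelled `source` here) is an assumption.

```python
import re

# Layout assumed from the quoted log lines only:
#   <Mon> <day> <hh:mm:ss> <host> ASE: <source> Agent <severity>: <message>
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+ASE:\s+"
    r"(?P<source>\S+)\s+Agent\s+(?P<severity>\*?\w+):\s+(?P<message>.*)$"
)


def ase_alerts(lines):
    """Return (timestamp, message) pairs for lines whose severity is '*ALERT'."""
    alerts = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("severity") == "*ALERT":
            alerts.append((match.group("stamp"), match.group("message")))
    return alerts


if __name__ == "__main__":
    # Three of the lines quoted in the reply above.
    sample = [
        "Apr 3 16:34:07 greg2 ASE: fgreg1 Agent *ALERT: device access failure on /dev/rz17a from fgreg1",
        "Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Warning: AM can't ping /dev/rz17a",
        "Apr 3 16:34:13 greg2 ASE: fgreg1 Agent *ALERT: possible device failure: /dev/rz17a",
    ]
    for stamp, message in ase_alerts(sample):
        print(stamp, "|", message)
```

Lines that do not match the assumed layout are skipped silently, so an empty result does not prove that nothing was logged.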
1982.2	I'll give it a shot...	NETRIX::"[email protected]"	John McDonald	`Thu Apr 03 1997 18:55`	17
	Greg,

	thanx for the reply. I'm not able to get direct access to the system, so I have to rely on what I'm told. I'll double check tomorrow that the device they did the disklabel on really was part of a service.

	BTW - I want to double check something. I'm under the impression that as long as ase can ping other members over at least 1 SCSI bus, it won't generate an alert, even if the other 5 break. That's the behavior I've seen in the past and that's what they saw here.

	Once again, Thanx.

	John McDonald
	Atlanta CSC

	[Posted by WWW Notes gateway]
1982.3	rz40 was part of a service	NETRIX::"[email protected]"	Clair Garman	`Thu Apr 03 1997 19:05`	16
	I am the customer (DEC employee at AOL) for whom John posted the note.

	Sybase is using raw disks. rz40b, rz40c, and rz40d are raw partitions being used by one disk service. We altered the default partitions. The service is running on dec02.

	A disklabel to rz40 works fine. We run a script that issues the disklabel command to rz40 in a constant loop, then disconnect the KZPSA cable to that bus. The disklabel command stalls - no output. The daemon.log and DECevent show no notice of the disconnection. I aborted the disklabel command and tried a dd command from rz40. It stalled as well.

	Clair Garman

	[Posted by WWW Notes gateway]
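A probe that stalls silently is hard to tell apart from one that is merely slow, so it can be useful to put a deadline on each attempt and record how it ended. The sketch below is an illustration only, not the script used at the site: the `probe` helper and its return labels are hypothetical, and the 30-second deadline in the comment is an arbitrary choice.

```python
import subprocess
import sys


def probe(cmd, timeout_s):
    """Run cmd once and classify the outcome as 'ok', 'error', or 'stalled'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish before the deadline.
        return "stalled"
    except OSError:
        # The command could not be started at all (for example, not found).
        return "error"
    return "ok" if result.returncode == 0 else "error"


if __name__ == "__main__":
    # Stand-in commands so the demo runs anywhere; on the cluster member the
    # probe from the thread would be something like:
    #   probe(["disklabel", "-r", "rz40"], 30.0)
    print(probe([sys.executable, "-c", "pass"], 10.0))
    print(probe([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

One caveat: `subprocess.run` kills the child on timeout and then waits for it to exit. A process blocked inside the kernel on a dead device may not exit promptly, in which case the wrapper itself can block.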
1982.4	Problem solved.	NETRIX::"[email protected]"	John McDonald	`Fri Apr 04 1997 11:57`	13
	Problem solved. It turns out that they weren't waiting long enough for ase to detect the failure - it took almost 2 minutes for the error to show up. Since the system is going to be demo'd to a customer, I suggested that they consider modifying the timeout values using /etc/hsm.conf, with the usual warning about possible false alerts showing up.

	Thanx for the replies.

	John McDonald
	Atlanta CSC

	[Posted by WWW Notes gateway]
1982.5		XIRTLU::schott	Eric R. Schott USG Product Management	`Fri Apr 04 1997 12:23`	8
	Hi,

	The timeout problem may be in the CAM driver, not in ASE. You may find changing /etc/hsm.conf won't fix this. You may need to qar/IPMT this...

	I would not close it quite yet...
1982.6		dust.zk3.dec.com::Marshall	Rob Marshall USEG	`Fri Apr 04 1997 13:18`	11
	Hi,

	Eric is right, the timeouts are in the CAM layer, and there is nothing in hsm.conf that you can change that will help. Plus, there are changes being made (not sure, but they may be in PTmin - 4.0c) that will fail a device that is not answering much more quickly (somewhere around 15 seconds). But, don't quote me on the version for this change.

	Rob
1982.7	Confusion	NETRIX::"[email protected]"	John McDonald	`Fri Apr 04 1997 16:55`	12
	Eric & Rob,

	I'm confused. Are you saying that the changes in /etc/hsm.conf will have no effect at all, or that they won't have any significant effect in this case? The reason I'm confused is that I've used hsm.conf before, and it can make a difference. Also, according to the source, HSM replaces its internal values with those specified by hsm.conf.

	John McDonald
	Atlanta CSC

	[Posted by WWW Notes gateway]
1982.8	things are improving?	namix.fno.dec.com::jpt	FIS and Chips	`Mon Apr 07 1997 04:20`	12
	As previous replies state, the problem may not be the ASE timeout itself, but the underlying SCSI CAM driver layer, which does not seem to notice the error soon enough. And until CAM sees the problem, ASE can't do anything at all to solve it!!!

	I'm glad to hear that someone has put some effort into this, as a similar problem was first reported almost two years ago, and again one year later with both LSM and ASE. This will solve some issues we've been fighting in a couple of customer cases.

	-jari
1982.9		SMURF::KNIGHT	Fred Knight	`Tue Apr 08 1997 13:39`	32
	The exact failure code path followed in the CAM driver is very dependent on exactly what the failure is. Removing a device, for example, may be similar to disconnecting a cable, but then again, it may not. It depends on what else is going on out on the SCSI bus at the time, on what adapter is being used, and on a number of other items.

	Consider a device that is removed from an idle bus and had NEVER been used. When you first access it, we will notice it fairly quickly.

	Then take a device that is being used, and remove it immediately after a command has been sent to it. We sent a command, so we wait for the command to complete. For some devices it is legal to take 60 seconds to complete some commands. So, if after 60 seconds it isn't done, we abort the command and try again (and we do this several times). You then end up with a several-minute detection time for the removal of that particular device.

	The basic problem is that failure detection is not predictable. The goal of our future work is to make it more predictable. It will never be 100%, but it will be more predictable than it is today.

	Why will it never be 100%? Consider a device that is broken in such a way that it accepts commands but NEVER executes them. I think it unlikely that a device would break in such a way, but if it does, it will take us a long time to figure it out.

	Fred Knight
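The arithmetic behind the "several minute" figure can be sketched as follows. This is an illustration only: the 60-second per-command limit comes from the note above, but the number of attempts is not stated there ("several times"), so the values tried below are assumptions, and the function name is hypothetical.

```python
def worst_case_detection_s(command_timeout_s, attempts):
    """Upper bound on detection time when every attempt runs to its full timeout.

    attempts counts the initial command plus each retry after an abort.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return command_timeout_s * attempts


if __name__ == "__main__":
    command_timeout_s = 60  # from the note: some commands may legally take 60 s
    for attempts in (2, 3, 4):  # assumed values; the note only says "several"
        total = worst_case_detection_s(command_timeout_s, attempts)
        print(f"{attempts} attempts -> up to {total} s ({total / 60:.0f} min)")
```

This models only the case where each attempt waits out its full timeout. As the note says, the path actually taken depends on the adapter, the bus activity, and the kind of failure.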
