Conference aosg::lsm

Title:	LSM

Moderator:	SMURF::SHIDERLY

Created:	Mon Jan 17 1994
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	803
Total number of notes:	2852

764.0. "Failed LSM disk freezes UNIX" by BPSOF::TELEKI (Laszlo Teleki) Tue Mar 11 1997 12:04

One of our customers has two A4100s in ASE configuration. The O/S is Digital
UNIX v3.2G, the application is SAP on Oracle. The database is installed on
mirrored LSM (1.2A) volumes. These volumes are created from RZ29B pairs hanging
on KZPSAs (A10).

The following strange situation occurred:

An RZ29B failed in a mirrored volume made up of two RZ29Bs. This disk error
resulted in a system hang. (No process could be started, but the system could
be pinged from the network, and probably from the SCSI buses as well, because
the other ASE member saw the services running on the frozen machine as online.)
The only thing I could do was press the halt button. (Here I should've forced a
crash dump but I forgot to do so.) After a little struggling I was able to
replace the failing disk and restart both systems.

I've read in the notesfiles that during reads/writes LSM waits for the
underlying SCSI driver to give back status. If no status comes back, then LSM
hangs. I think the SCSI driver should've timed out so that LSM could push the
bad plex out of the volume and carry on normal operation with a reduced volume.
This whole thing should have been invisible to the users (except for a warning
message in the syslog).
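
The behavior described above (time out the I/O, detach the bad plex, carry on
with the remaining one) can be illustrated with a toy sketch in Python. This is
only a model of the idea, not how LSM or the SCSI driver are implemented; the
Plex and MirroredVolume classes, the timeout value, and the warning text are
all hypothetical.

```python
import concurrent.futures as cf
import time


class Plex:
    """Hypothetical stand-in for one side of a mirror (not an LSM object)."""

    def __init__(self, name, delay=0.0):
        self.name = name
        self.delay = delay  # seconds the simulated disk takes to return status
        self.blocks = {}

    def write(self, block, data):
        time.sleep(self.delay)  # a failing disk is modeled as a very slow one
        self.blocks[block] = data


class MirroredVolume:
    """Toy mirror that detaches any plex that misses the I/O deadline."""

    def __init__(self, plexes, timeout):
        self.active = list(plexes)
        self.detached = []
        self.timeout = timeout  # assumed per-I/O deadline, in seconds
        self.pool = cf.ThreadPoolExecutor(max_workers=len(self.active))

    def write(self, block, data):
        # Issue the write to every active plex in parallel.
        futures = {self.pool.submit(p.write, block, data): p for p in self.active}
        done, not_done = cf.wait(futures, timeout=self.timeout)
        # A plex fails if it returned no status in time or raised an error.
        failed = list(not_done) + [f for f in done if f.exception() is not None]
        for f in failed:
            plex = futures[f]
            self.active.remove(plex)
            self.detached.append(plex)
            print(f"warning: plex {plex.name} detached from volume")
        if not self.active:
            raise IOError("no healthy plex left in volume")
        return len(self.active)  # number of plexes that still hold the data
```

With one healthy plex and one whose delay exceeds the timeout, the first write
detaches the slow plex and later writes continue on the remaining one, which is
the outcome I would have expected from the real system.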

The customer says he paid for ASE and LSM to have a highly available computing
environment and to avoid situations like the one above (rightly so). He wants
us to give him a position statement about it and a guarantee that it will not
happen again.

I know it is quite hard to say anything (especially without a crash dump) but
any suggestion would be highly appreciated. Thank you in advance.

Regards,

Laszlo

T.R	Title	User	Personal Name	Date	Lines
764.1		LEXSS1::GINGER	Ron Ginger	Tue Mar 18 1997 10:19	16
	My customer had a similar situation. One of the Y cables was bad in such a way that one system was prevented from reaching a shared bus. It would get into a reset/retry mode which would hang the other system. We were able to force a crash dump on the system that was hung, and all analysis showed it seemed to be fine. We had this as an IPMT, but were never able to solve it. When the cable was proven to be bad the case was closed.

	There are ways for one member of an ASE pair to keep resetting the bus such that the other member will not get any work done. It won't crash, and it won't log any errors, and if the failing machine ever stops resetting the bus it will resume work. It could easily just drop that one plex from LSM but it never tries. I gave up trying to get anyone in engineering interested in solving this.
764.2	Re: .1	NETRIX::"[email protected]"		Tue Mar 18 1997 12:57	8
	> I gave up trying to get anyone in engineering interested in solving
	> this.

	If you haven't already, I would suggest filing a QAR on gorge. See http://www-notes.lkg.dec.com/aosg/lsm/165.0 for more details.

	[Posted by WWW Notes gateway]