Title: | LSM |
Moderator: | SMURF::SHIDERLY |
Created: | Mon Jan 17 1994 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 803 |
Total number of notes: | 2852 |
One of our customers has two A4100s in an ASE configuration. The O/S is Digital UNIX v3.2G, and the application is SAP on Oracle. The database is installed on mirrored LSM (1.2A) volumes. These volumes are built from RZ29B pairs attached to KZPSAs (A10).

The following strange situation occurred: an RZ29B failed in a mirrored volume made up of two RZ29Bs. This disk error resulted in a system hang. (No process could be started, but the system could be pinged from the network, and probably from the SCSI buses as well, because the other ASE member saw the services running on the frozen machine as online.) The only thing I could do was to press the halt button. (Here I should have forced a crash dump, but I forgot to do so.) After a little struggling I was able to replace the failing disk and restart both systems.

I've read in the notesfiles that during reads/writes LSM waits for the underlying SCSI driver to return status. If no status comes back, then LSM hangs. I think the SCSI driver should have timed out so that LSM could push the bad plex out of the volume and carry on normal operation with a reduced volume. This whole thing should have been invisible to the users (except for a warning message in the syslog).

The customer says he paid for ASE and LSM to have a highly available computing environment and to avoid situations like the one above (and he is right). He wants us to give him a position statement about it and a guarantee that it will not recur. I know it is quite hard to say anything (especially without a crash dump), but any suggestion would be highly appreciated.

Thank you in advance.

Regards,
Laszlo
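To make the expected behaviour concrete, here is a minimal Python sketch of the policy described above: each mirror write waits a bounded time for status, and a plex that does not answer is detached so the volume carries on in reduced mode. This is not LSM or Digital UNIX driver code. `Plex`, `MirroredVolume`, `io_timeout` and the plex names are hypothetical, and the "hung disk" is simulated with a thread that blocks on an event.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class Plex:
    """Hypothetical stand-in for one side of a mirror (not an LSM API)."""

    def __init__(self, name, hang=None):
        self.name = name
        self.blocks = {}
        self.hang = hang  # if set, write() blocks until this event is set

    def write(self, block, data):
        if self.hang is not None:
            # Simulates a driver that never returns status for the I/O.
            self.hang.wait()
            return
        self.blocks[block] = data


class MirroredVolume:
    """Writes to every active plex; detaches any plex that times out or fails."""

    def __init__(self, plexes, io_timeout):
        self.active = list(plexes)
        self.detached = []
        self.io_timeout = io_timeout  # assumed per-I/O limit, in seconds
        self.pool = ThreadPoolExecutor(max_workers=len(plexes))

    def write(self, block, data):
        pending = [(p, self.pool.submit(p.write, block, data)) for p in self.active]
        for plex, future in pending:
            try:
                future.result(timeout=self.io_timeout)
            except (FutureTimeout, OSError):
                # No status in time (or an I/O error): drop this plex and
                # keep serving the volume from the remaining one.
                self.active.remove(plex)
                self.detached.append(plex)
        if not self.active:
            raise OSError("all plexes failed; volume is unusable")


def demo():
    release = threading.Event()
    good = Plex("plex-01")
    bad = Plex("plex-02", hang=release)
    vol = MirroredVolume([good, bad], io_timeout=0.2)
    try:
        vol.write(0, b"sap-data")   # plex-02 times out and is detached
        vol.write(1, b"more-data")  # served by the surviving plex only
    finally:
        release.set()               # let the simulated hung I/O finish
        vol.pool.shutdown()
    return [p.name for p in vol.active], [p.name for p in vol.detached], good.blocks
```

Calling `demo()` shows the volume continuing on one plex after the other stops answering. A real driver would also have to deal with the command still outstanding on the bus, which this sketch ignores.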
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
764.1 | | LEXSS1::GINGER | Ron Ginger | Tue Mar 18 1997 10:19 | 16 |
My customer had a similar situation. One of the Y cables was bad in such a way that one system was prevented from reaching a shared bus. It would get into a reset/retry mode, which would hang the other system. We were able to force a crash dump on the system that was hung, and all analysis showed that it seemed to be fine. We had this as an IPMT, but were never able to solve it. When the cable was proven to be bad, the case was closed. There are ways for one member of an ASE pair to keep resetting the bus such that the other member will not get any work done. It won't crash, and it won't log any errors, and if the failing machine ever stops resetting the bus, it will resume work. It could easily just drop that one plex from LSM, but it never tries. I gave up trying to get anyone in engineering interested in solving this. | |||||
764.2 | Re: .1 | NETRIX::"[email protected]" | | Tue Mar 18 1997 12:57 | 8 |
> I gave up trying to get anyone in engineering interested in solving this. If you haven't already, I would suggest filing a QAR on gorge. See http://www-notes.lkg.dec.com/aosg/lsm/165.0 for more details. [Posted by WWW Notes gateway] |