T.R | Title | User | Personal Name | Date | Lines |
---|
1906.1 | | CSC32::KIRK | | Fri Feb 28 1997 13:17 | 23 |
| Dick,
I have also seen the same thing happen with v3.2g/ase1.3 with patches
and two 8200s.
sysA                    sysB
kzpsa0------hsz40-----kzpsa0
kzpsa1------hsz40-----kzpsa1
The disks are LSM mirrored across the SCSI buses.
We have seen this once when pulling the power on the hsz40 shelf. We
see this every time the y-cable is pulled off the kzpsa to try to
simulate a scsi bus failure.
What we see in the daemon.log file is lsm_lv_action times out and then
ase tries to shutdown and relocate the service. During the shutdown
lsm_dg_action times out and the system is rebooted which fails the
service over to the other system.
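The sequence in the log can be summarized as a small decision sketch. This is only a model of what we observed, not ASE's code: lsm_lv_action and lsm_dg_action are the script names from daemon.log, while Outcome, failover_path and the step strings are hypothetical names made up for illustration.

```python
from enum import Enum
from typing import List


class Outcome(Enum):
    OK = "ok"
    TIMEOUT = "timeout"


def failover_path(lv_action: Outcome, dg_action: Outcome) -> List[str]:
    """Model the observed escalation; hypothetical, not ASE's actual logic."""
    steps = ["lsm_lv_action: " + lv_action.value]
    if lv_action is Outcome.OK:
        # Assumption: a check that returns in time leaves the service in place.
        return steps
    # The timed-out check makes ASE stop the service so it can be relocated.
    steps.append("stop service for relocation")
    steps.append("lsm_dg_action: " + dg_action.value)
    if dg_action is Outcome.OK:
        # Assumption: a clean stop allows relocation without a reboot.
        steps.append("relocate service")
    else:
        # What we saw: the stop also times out, so the member is rebooted.
        steps.append("reboot member")
        steps.append("service fails over to other member")
    return steps


print(failover_path(Outcome.TIMEOUT, Outcome.TIMEOUT))
```

With both scripts timing out, the path ends in a reboot followed by the failover, which matches what we see when the y-cable is pulled.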
|
1906.2 | .0 and .1 are different, need daemon.log for .0 | NETRIX::"[email protected]" | Gregory P. Myrdal | Mon Mar 03 1997 11:02 | 31 |
| It appears to me that the problems in notes .0 and .1 are
different.
.1 The scsi cable was pulled off a system in which I/O was
going to a drive off of this bus. Since the cable was
pulled, the scsi bus is now in an unterminated state. We are
at the mercy of the layers below us (i.e. cam, device drivers,
hardware, etc). The scsi engineers tell me that you get
unexpected results when you are dealing with an unterminated
bus. In this case we timed out when we were trying to
determine if we should relocate the system by asking if
all disks within this service were mirrored. This requires
access to the drives on the scsi bus which hung up. ASE did
the correct thing by eventually relocating the service (via
a force method) to the other system to keep it running.
.0 This is a better test of hardware failure as it does not
unterminate the scsi bus. ASE should not have failed this
service over in a correctly configured environment. If
you include (or email me) the daemon.log during the time
in which you turn off the power from the hsz40 I might be
able to give you an idea what is going on or whether we have a
problem. Please make sure informational logging is turned
on first.
What happens if the customer does not power the hsz40 back on
for a long time?
-- Greg
[Posted by WWW Notes gateway]
|
1906.3 | | USCTR1::ASCHER | Dave Ascher | Mon Mar 03 1997 15:17 | 44 |
| re: <<< Note 1906.2 by NETRIX::"[email protected]" "Greory P. Myrdal" >>>
-< .0 and .1 are different, need daemon.log for .0 >-
It appears to me that the problems in notes .0 and .1 are
different.
Yes, they are...
.1 The scsi cable was pulled off a system in which I/O was
going to a drive off of this bus. Since the cable was
pulled, the scsi bus is now in an unterminated state. We are
at the mercy of the layers below us (i.e. cam, device drivers,
hardware, etc). The scsi engineers tell me that you get
unexpected results when you are dealing with an unterminated
bus. In this case we timed out when we were trying to
determine if we should relocate the system by asking if
all disks within this service were mirrored. This requires
access to the drives on the scsi bus which hung up. ASE did
the correct thing by eventually relocating the service (via
a force method) to the other system to keep it running.
I agree that you are 'at the mercy of the layers below', and the
scenarios are also different due to the fact that in .0 there is
still a path between the two systems over this scsi while in .1
there is not.
However, the problem is that your current logic is not robust
enough to deal with this situation. Without ASE, a system can keep
working fine with the cable pulled out of a KZPSA. With ASE, the
same should be true. If it is a matter of longer timeouts required
on the lsm_lv_action script or timeouts on the vold show diskgroup
and voldisk list commands within that script, then that's what
needs to be done. ASE should not be forcing a failover when one is
not necessary...
Assuming that this once worked, perhaps changes in the behavior of
the HSZ or LSM have not been responded to by ASE yet?
btw we also tried this test with a terminator stuck onto the
kzpsa. That made no difference.
An IPMT is on the way.
dave
|
1906.4 | clarification of .0 | KYOSS1::GREEN | | Mon Mar 03 1997 15:37 | 7 |
| The problem reported in .0 was pulling power on HSZ box.
We did this twice. The first time we left HSZ down and NO
FAILOVER.
During the second test (different pair of HSZs, same firmware),
the power was re-applied to the HSZs (possibly prematurely) and
the service failed over.
dick
|
1906.5 | Timeouts do not help .... pulling scsi cables were NEVER a supported failure case | NETRIX::"[email protected]" | Gregory P. Myrdal | Mon Mar 03 1997 16:56 | 59 |
| Note .3 reads:
I agree that you are 'at the mercy of the layers below', and the
scenarios are also different due to the fact that in .0 there is
still a path between the two systems over this scsi while in .1
there is not.
However, the problem is that your current logic is not robust
enough to deal with this situation. Without ASE, a system can keep
working fine with the cable pulled out of a KZPSA. With ASE, the
same should be true. If it is a matter of longer timeouts required
If you do not agree with the fact that ASE decides to reboot the system
that is fine. The reason does not always lie in the hands of ASE. For
example, a common case for this is the umount command failing in this
situation. If we had a forced umount we could have kept the system
available and relocated the service. ASE engineering has worked for about
2 years to get a forced umount to avoid things like this.
We are continuing to work harder with the base such that we can act
correctly when we get a failure. This process is always slower than
any of us like.
same should be true. If it is a matter of longer timeouts required
on the lsm_lv_action script or timeouts on the vold show diskgroup
and voldisk list commands within that script, then that's what
needs to be done. ASE should not be forcing a failover when one is
not necessary...
Ah, no, this is not a matter of timeouts. I already tried that. It might
actually work, but not in all cases. I am not a scsi engineer, so
when I asked them about this they could not tell me exactly how long the
timeout should be. It is nondeterministic.
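For illustration, wrapping a disk query in a fixed timeout looks roughly like the sketch below. It is not the lsm_lv_action script: run_with_timeout is a hypothetical helper, and a sleeping Python child process stands in for a command that hangs on the bus. Whatever limit is passed in is a guess, which is the problem described above.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], limit_s: float) -> Optional[int]:
    """Run cmd; return its exit status, or None if it exceeded limit_s."""
    try:
        return subprocess.run(cmd, timeout=limit_s).returncode
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return None


# Stand-in for a query that hangs on the bus: a child that sleeps 5 seconds.
hung_query = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(hung_query, 0.5))
```

A limit that is long enough on one run may be too short on the next, so a helper like this only bounds the wait. It does not tell you whether the bus will recover.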
Assuming that this once worked, perhaps changes in the behavior of
the HSZ or LSM have not been responded to by ASE yet?
It is not clear to me what once worked. If this is a regression in
behavior (that we support) then please enter a QAR. This will be
fixed in the next release.
If something like this worked in the past it's not because of our changes.
It would have been because of the base. Once the QAR is entered we can
determine what should be done with it (ie. which group owns it).
btw we also tried this test with a terminator stuck onto the
kzpsa. That made no difference.
I heard about this. Someone would have to explain this to a scsi/cam
engineer and I am sure they could tell you what is going on at that
layer. Of course, putting a terminator back into the kzpsa is not a
real life example of a hardware failure.
An IPMT is on the way.
Thank you.
-- Greg
|
1906.6 | | USCTR1::ASCHER | Dave Ascher | Mon Mar 03 1997 18:01 | 28 |
| Of course, putting a terminator back into the kzpsa is not a
real life example of a hardware failure.
I don't want to waste a lot of time trying to verify that ASE
is able to help systems survive problems that it cannot actually
help with... or worse, finding that it makes systems less
available than they would be without ASE.
How can I find out what you guys consider 'legitimate' failure
conditions so we can use those as a base for our testing in
the field? If pulling the scsi cable out of a KZPSA is not
a good simulation of a kzpsa failure (or of a cable failure)
then what is? what do you use for testing?
Clearly there are all kinds of conditions that can arise that can
make it impossible for the system to survive and for which
rebooting is the only possible alternative for attempting to get
the application available on another node. I imagine there are
failure modes in a kzpsa that would play havoc with scsi - and
others that would play havoc with PCI. This particular scenario
doesn't seem all that complex or obscure - in fact it was the very
first failure that I observed on a real site over 2 years ago when
one of our 'suits' tripped over a bundle of cables and they got
pulled out of their scsi interface cards. Fortunately, the
connectors were not secured and fortunately there was LSM
mirroring. Also fortunately, I guess, there was no ASE.
d
|
1906.7 | Try our QA group | NETRIX::"[email protected]" | Gregory P. Myrdal | Tue Mar 04 1997 12:29 | 12 |
|
To get information about what our test group does please contact
someone in that group. If someone from that group reads this notes
file, maybe you can post a pointer to tests. Note: they may actually
pull cables for some test cases, however, keep in mind when they do
this they are looking for specific results (which may cause the
system to reboot).
Cheers,
-- Greg
|
1906.8 | Same kind of test ... same problems | NNTPD::"[email protected]" | Jose Ignacio Lopez | Tue May 06 1997 06:03 | 12 |
| Hello,
Using the same configuration and the same tests, we got the same results.
The customer needs to test a single SCSI failure in a redundant scenario
(2 SCSIs, mirrored with LSM); when only one SCSI is pulled, ASE shouldn't
fail over the service to the other machine.
Is there any way to avoid the timeout in the lsm_lv_action script?
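The expectation above can be stated as a small check: a service needs to move only if some volume loses every mirror copy. A minimal sketch, assuming the volume-to-bus layout is known in advance (ASE has to query the disks to learn it, which is where the timeout occurs); failover_needed, the volume names and the bus names are hypothetical.

```python
from typing import Dict, Set


def failover_needed(volumes: Dict[str, Set[str]], failed_bus: str) -> bool:
    """True if any volume has no mirror copy left once failed_bus is gone.

    volumes maps a volume name to the set of SCSI buses holding a copy.
    """
    return any(not (buses - {failed_bus}) for buses in volumes.values())


# Two volumes, each mirrored across both buses (hypothetical names).
mirrored = {"vol01": {"scsi0", "scsi1"}, "vol02": {"scsi0", "scsi1"}}
print(failover_needed(mirrored, "scsi0"))
```

With every volume mirrored across both buses, losing either single bus leaves a copy of each volume, so by this check no failover would be required.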
Thanks
Jose Ignacio
|
1906.9 | use a DWZZA | SMURF::MYRDAL | | Thu May 08 1997 10:19 | 5 |
| Put a DWZZA on the scsi bus and turn it off. This will cause a path
failure.
-- Greg
|