T.R | Title | User | Personal Name | Date | Lines |
---|
1982.1 | Humph ..... worked on my systems | NETRIX::"[email protected]" | Gregory P. Myrdal | Thu Apr 03 1997 17:45 | 36 |
| John,
Not sure what to say. I agree that what you did should have worked. Note,
however, that the aseagent registers itself to get I/O errors. Thus, if
no I/O is going on, the service will not be moved. But disklabel does
read from the physical disk, so I gave it a try on my system (running
post V1.4) and it worked ok for me.
Did they do a disklabel command to a disk within the HSZ that ASE is
not aware of? I.e., if you gave drive rz17 to ASE, do a disklabel on
it. The agent will be registered for I/O errors to this drive.
You could also just create a filesystem and make a change to a file
on it.
Following is the output of my test from daemon.log after doing a
disklabel -r rz17.
-- Greg
Apr 3 16:34:07 greg2 ASE: fgreg1 Agent ***ALERT: device access failure on
/dev/rz17a from fgreg1
Apr 3 16:34:10 greg2 ASE: fgreg1 Agent Error: can't unreserve device
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Warning: AM can't ping /dev/rz17a
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Warning: can't reach device
'/dev/rz17a'
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Info: exec'ing with pipe:
/var/ase/sbin/ase_run_sh 15583
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent ***ALERT: possible device failure:
/dev/rz17a
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Error: can't unreserve device
/dev/rz17a
Apr 3 16:34:13 greg2 ASE: fgreg1 Agent Notice: can't unreserve disk's
devices, stopping it anyway
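
For anyone repeating this test, a small filter can pull just the agent alerts out of daemon.log. This is only a sketch: it assumes each log entry sits on one line (the wrapping above may come from posting the note), and the function name `ase_alerts` is made up for illustration.

```python
import re

# Matches entries such as:
#   Apr 3 16:34:07 greg2 ASE: fgreg1 Agent ***ALERT: device access failure on ...
# The pattern is inferred from the sample output above, not from ASE documentation.
ALERT = re.compile(r"ASE: (\S+) Agent \*\*\*ALERT: (.*)")


def ase_alerts(lines):
    """Return (member, message) pairs for every agent ALERT entry in lines."""
    found = []
    for line in lines:
        match = ALERT.search(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found
```

Feeding it the lines of daemon.log (for example `ase_alerts(open(path))`, with `path` pointing at your log) gives a list you can compare against the time the cable was pulled.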
[Posted by WWW Notes gateway]
|
1982.2 | I'll give it a shot... | NETRIX::"[email protected]" | John McDonald | Thu Apr 03 1997 19:55 | 17 |
| Greg,
thanx for the reply. I'm not able to get direct access to the system, so
I have to rely on what I'm told. I'll double check tomorrow that
the device they did the disklabel on really was part of a service.
BTW - I want to double check something. I'm under the impression that
as long as ASE can ping other members over at least 1 SCSI bus, it won't
generate an alert, even if the other 5 break. That's the behavior I've
seen in the past and that's what they saw here.
Once again, Thanx.
John McDonald
Atlanta CSC
|
1982.3 | rz40 was part of a service | NETRIX::"[email protected]" | Clair Garman | Thu Apr 03 1997 20:05 | 16 |
| I am the customer (DEC employee at AOL) for whom John posted the note.
Sybase is using raw disks. rz40b, rz40c, rz40d are raw partitions
being used by one disk service. We altered the default partitions.
The service is running on dec02. A disklabel to rz40 works fine.
We run a script that performs a constant disklabel command to rz40
and disconnect the KZPSA cable to that bus. The disklabel command
stalls - no output. The daemon.log and DECevent show no notice of
the disconnection.
I aborted the disklabel command and tried a dd command from rz40.
It stalled as well.
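
A test like this is easier to judge if the probe script records how long it takes before a command first fails or stalls. The following is a rough sketch of such a probe, not the script that was actually run here; the function name, the timeout values and the polling interval are all assumptions. One caveat: a process blocked in kernel I/O may not die when killed, in which case the probe itself can hang just as the disklabel did.

```python
import subprocess
import time


def time_until_failure(cmd, per_try_timeout_s=10.0, interval_s=1.0, max_wait_s=300.0):
    """Run cmd repeatedly until it fails or stalls.

    Returns (elapsed_seconds, reason) for the first failure, or None if
    every run succeeded until max_wait_s had passed.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait_s:
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=per_try_timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The command produced no result within the per-try limit.
            return time.monotonic() - start, "stalled"
        if result.returncode != 0:
            return time.monotonic() - start, f"exit {result.returncode}"
        time.sleep(interval_s)
    return None


# Hypothetical usage, started just before the cable is pulled:
#   time_until_failure(["disklabel", "-r", "rz40"])
```

Comparing the elapsed time from the probe with the first alert in daemon.log shows whether the delay is in the command noticing the failure or in ASE reporting it.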
Clair Garman
|
1982.4 | Problem solved. | NETRIX::"[email protected]" | John McDonald | Fri Apr 04 1997 12:57 | 13 |
| Problem solved. It turns out that they weren't waiting long enough for
ASE to detect the failure; it took almost 2 minutes for the error
to show up. Since the system is going to be demo'd to a customer,
I suggested that they consider modifying the timeout values using
/etc/hsm.conf, with the usual warning about possible false alerts
showing up.
Thanx for the replies.
John McDonald
Atlanta CSC
|
1982.5 | | XIRTLU::schott | Eric R. Schott USG Product Management | Fri Apr 04 1997 13:23 | 8 |
| Hi
The timeout problem may be in the CAM driver, not in ASE. You may
find changing /etc/hsm.conf won't fix this. You may need to qar/IPMT
this...
I would not close it quite yet...
|
1982.6 | | dust.zk3.dec.com::Marshall | Rob Marshall USEG | Fri Apr 04 1997 14:18 | 11 |
| Hi,
Eric is right, the timeouts are in the CAM layer, and there is
nothing in hsm.conf that you can change that will help. Plus,
there are changes being made (not sure, but they *may* be in
PTmin - 4.0c) that will fail a device that is not answering
much more quickly (somewhere around 15 seconds). But, don't
quote me on the version for this change.
Rob
|
1982.7 | Confusion | NETRIX::"[email protected]" | John McDonald | Fri Apr 04 1997 17:55 | 12 |
| Eric & Rob,
I'm confused. Are you saying that the changes in /etc/hsm.conf will have
no effect at all, or that they won't have any significant effect in this
case? The reason I'm confused is that I've used hsm.conf before, and it
can make a difference. Also, according to the source, HSM replaces its
internal values with those specified by hsm.conf.
John McDonald
Atlanta CSC
|
1982.8 | things are improving? | namix.fno.dec.com::jpt | FIS and Chips | Mon Apr 07 1997 05:20 | 12 |
|
As previous replies state, the problem may not be the ASE timeout
itself, but the underlying SCSI CAM driver layer, which seems not to
notice the error soon enough. And before CAM sees the problem, ASE
can do absolutely nothing to solve it!!!
I'm glad to hear that someone has put some effort into this, as a
similar problem was first reported almost two years ago, and
again one year later with both LSM and ASE. This will solve some
issues we've been fighting against in a couple of customer cases.
-jari
|
1982.9 | | SMURF::KNIGHT | Fred Knight | Tue Apr 08 1997 14:39 | 32 |
| The exact failure code path followed in the CAM driver is
very dependent on exactly what the failure is. Removing
a device for example may be similar to disconnecting a
cable, but then again, it may not. It depends on what
else is going on out on the SCSI bus at the time, it
depends on what adapter is being used, and a number of
other items.
Consider if a device is removed from an idle bus and the
device had NEVER been used. When you first access it
we will notice it fairly quickly. Then take a device that
is being used, and you remove the device immediately after
a command has been sent to the device. We sent a command,
so we wait for the command to complete. In some devices
it is legal to take 60 seconds to complete some commands.
So, if after 60 seconds it isn't done, we abort the command
and try again (and we do this several times). So, you
then end up with a several minute detection time for the
removal of that particular device.
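
To put rough numbers on this, here is a sketch of the arithmetic, assuming every attempt runs to the full 60-second limit before it is aborted. The attempt counts are hypothetical (the description only says "several"), and the function name is made up for illustration.

```python
def worst_case_detection_seconds(command_timeout_s, attempts):
    """Total wait if each attempt runs to its full timeout before being aborted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return command_timeout_s * attempts


# 60 seconds is the per-command limit mentioned above; the attempt
# counts below are assumed values, not taken from the CAM driver.
for attempts in (2, 3, 4):
    total = worst_case_detection_seconds(60, attempts)
    print(f"{attempts} attempts x 60 s = {total} s ({total / 60:.0f} min)")
```

Under these assumptions even two attempts add up to two minutes, which is in line with the delay reported earlier in this topic.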
The basic problem is that failure detection is not predictable.
The goal of our future work is to make it more predictable.
It will never be 100%, but it will be more predictable than
it is today.
Why will it never be 100%? Consider a device that is broken
in such a way that it accepts commands but NEVER executes them.
I think it unlikely that a device would break in such a way,
but if it does, it will take us a long time to figure it out.
Fred Knight
|