T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1825.1 | | KITCHE::schott | Eric R. Schott USG Product Management | Tue Jan 14 1997 19:48 | 7 |
1825.2 | | SMURF::KNIGHT | Fred Knight | Thu Jan 23 1997 14:47 | 7 |
1825.3 | | USCTR1::ASCHER | Dave Ascher | Thu Jan 23 1997 23:50 | 3 |
| re: hmmm is there some new new new firmware that might address this?
d
|
1825.4 | | SMURF::MARSHALL | Rob Marshall - USEG | Fri Jan 24 1997 17:57 | 25 |
| Hi,
Taking a *quick* look at the code, it appears that the agent tries to
determine which disks are bad by first stat'ing the special device file
and then trying to ping it. If that fails, the device should go on a
list of bad devices that are then passed as such to lsm_dg_action
(there is a -b option for bad devices). But, I'm not sure if the time
that it takes to do all this is "charged" to lsm_dg_action, so I'm not
real clear on why you are getting a message that lsm_dg_action timed
out. Unless the above tests somehow returned no bad devices, and
lsm_dg_action was trying to get to the pulled device.
Is your logging set to Informational? If so, you should see some
messages that look something like: "can't stat() device file /dev/xxx",
if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
failed. If you are not seeing either, then it would appear that the
pulled disk is not being seen as failed.
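Very roughly, the check amounts to something like the sketch below. This is NOT the
actual agent code (that's compiled) - just shell pseudo-code of the logic, and the
device names, the dd "ping", and the logger calls are stand-ins of mine:

    #!/bin/sh
    # Sketch of the bad-device check described above (not the real agent).
    BAD=""
    for dev in /dev/rz10c /dev/rz11c        # devices used by the service (example names)
    do
        if [ ! -c "$dev" ]; then
            logger -p daemon.info "can't stat() device file $dev"
            BAD="$BAD -b $dev"
            continue
        fi
        # "ping" the disk by reading one block through the raw device
        raw=`echo $dev | sed 's;^/dev/;/dev/r;'`
        dd if=$raw of=/dev/null bs=512 count=1 >/dev/null 2>&1
        if [ $? -ne 0 ]; then
            logger -p daemon.info "can't reach device '$dev'"
            BAD="$BAD -b $dev"
        fi
    done
    # whatever is flagged bad is then handed to lsm_dg_action via -b, e.g.
    #   lsm_dg_action ... $BAD ...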
If you could turn Informational logging on, and try this test, and give
us a sample of the daemon.log output, it might help see what is
happening here. Then maybe I could tell you if you should open a CLD
on this, or not.
Rob Marshall
USEG
|
1825.5 | can you check the logs I sent to the csc? | USCTR1::ASCHER | Dave Ascher | Sat Jan 25 1997 10:49 | 35 |
| > Taking a *quick* look at the code, it appears that the agent tries to
> determine which disks are bad by first stat'ing the special device file
> and then trying to ping it. If that fails, the device should go on a
> list of bad devices that are then passed as such to lsm_dg_action
> (there is a -b option for bad devices).
what is lsm_dg_action supposed to do with the list of bad devices?
> But, I'm not sure if the time
> that it takes to do all this is "charged" to lsm_dg_action, so I'm not
> real clear on why you are getting a message that lsm_dg_action timed
> out. Unless the above tests somehow returned no bad devices, and
> lsm_dg_action was trying to get to the pulled device.
> Is your logging set to Informational? If so, you should see some
> messages that look something like: "can't stat() device file /dev/xxx",
> if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
> failed. If you are not seeing either, then it would appear that the
> pulled disk is not being seen as failed.
> If you could turn Informational logging on, and try this test, and give
> us a sample of the daemon.log output, it might help see what is
> happening here. Then maybe I could tell you if you should open a CLD
> on this, or not.
I'm afraid that I am not at the customer site at the moment... I will
try to get more information when I am back there or at one of two other
sites I will be visiting this week - all of which have exhibited this
behavior. There is a log with the IPMT I filed (#c970113-534) but I
don't know how to access the IPMT system.
thanks,
d
|
1825.6 | | usr505.zko.dec.com::Marshall | Rob Marshall | Tue Jan 28 1997 14:28 | 8 |
| Hi,
lsm_dg_action goes through the list of devices associated with the service and
does things like: voldisk define disk... So, if the disk is bad, it takes it
off the list of devices in the service, and should not try to access it at all.
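Schematically, something like this - a sketch of my reading of the script, not the
script itself, and the device and disk-group names are invented:

    # sketch only: how the bad (-b) devices get dropped from the service's list
    BAD="rz10"                      # devices passed in with -b
    for d in rz10 rz11 rz12         # devices associated with the service (examples)
    do
        case " $BAD " in
        *" $d "*) continue ;;       # bad disk: dropped, never accessed again
        esac
        voldisk define $d
        voldisk online $d
    done
    # ...and then it goes on to import the disk group, start the volumes, etc.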
Rob
|
1825.7 | comments? | USCTR1::ASCHER | Dave Ascher | Tue Jan 28 1997 21:19 | 29 |
| Rob,
thanks... I'm beginning to get a funny feeling about this - I
have assumed for a long time that if I tell asemgr about one
of the disks in the disk group, since it tells me that it now
knows about the whole disk group, there is nothing to be gained
by the tedious exercise of entering each and every lsm volume
associated with the storage. I am now getting the feeling that
only the devices explicitly associated with the storage of
a volume are going to be pinged for good health - and that
only a failover of such a disk will exercise the check for
a mirror logic, etc...
I have had an inconsistent set of experiences about what happens
when I pull a disk or a cable or shutdown a shelf or a
controller... I now think that the experiences might have been
different because sometimes I was using a disk that was on
the volume that ASE knew about and other times it was not.
Is this in fact the case? I have NOT been telling asemgr about
more than one of the volumes because my start/stop scripts
take care of the mount/umounts anyway... one of my customers
had 54 volumes - it's a very tedious task to get all that into
asemgr if there is not good reason to do so. On the other hand,
if it will make the disk stuff work the way it oughta, I'd
do it.
d
|
1825.8 | ??? | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Fri Jan 31 1997 03:46 | 14 |
| Hi,
> I have NOT been telling asemgr about more than one of the volumes because
> my start/stop scripts take care of the mount/umounts anyway.
My understanding was that ASE itself is taking note of the LSM devices
of a diskgroup once at least one volume of this diskgroup is involved
in an NFS/DISK service. If that is not the case, I don't see how
lsm_dg_action could "voldisk online rzxx" and thus if not all disks of
a diskgroup are placed online, the LSM volumes could not start...
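In other words, I assume something like the following has to happen for every disk
in the diskgroup, not only for the disk(s) explicitly named in the service (my guess
only - disk and group names invented):

    # my assumption of what lsm_dg_action has to do before the volumes can start
    for d in rz10 rz11 rz12          # ALL disks in the diskgroup (examples)
    do
        voldisk online $d
    done
    voldg import dg1                 # the import needs the whole group reachable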
So, if my understanding is not OK, can Engineering clarify this ...
Regards, Manu.
|
1825.9 | RED ALERT | USCTR1::ASCHER | Dave Ascher | Fri Jan 31 1997 06:12 | 72 |
| Manu,
It is my guess (I have not been able to sit down long enough
with the ase internal scripts to figure this out) that the
part of ASE that figures out what it has to do to move the
whole disk group is not the same as the part that figures out
which disks it needs to check the health of. I assume that
the thinking was something like - you might have disks in the
group that are not necessary for the service, so their health
should not be monitored. Not unreasonable, but not the first
model that would have come to my mind.
The question is, what does it take to get ASE to monitor the
health of the disks making up a volume??? What does one have to do
to make this work? I have almost always had "trouble" when I
tested pulling a disk or a cable, but have somehow explained it
away to myself. With mirrors, which our customers always have, it
is nice that the service does not failover when only one plex is
lost - but I have not been able to consistently see that ASE
notices anything is wrong until I try to access the storage. I
have always been puzzled about how come the ping has picked up the
failure in a few seconds - a minute maybe.
I have ONCE seen a message in the daemon.log saying something about
the single volume that I told ASE about being okay because it
has a mirror.
I have now installed the latest vold, voldisk, am.o, am_scsi.o,
lsm_dg_action and defined all 54 volumes to ASE. Mountpoint NONE.
Now when I pull out a disk, or turn off the HSZ, ASE seems to do
not much of anything. When I try to access the volume (ls or
touch, for example) the process hangs for seconds - or up to
several minutes. When I finally give up and put the disk back in,
the process completes... ASE makes some lame log entries about
scsi reservation resets and/or failures but nothing whatever about
the fact that there is a possible failure of a volume that needs
to be checked.
I have also verified what happens when I turn the service offline,
online the disks and import the diskgroup manually. I then
pull out a disk (there is a mirror for it) and attempt to access
it and experience no hanging... volwatch reports a problem
(which does not seem to occur when ASE is in the picture)...
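For reference, the manual sequence I mean is roughly this (disk and group names are
examples, and the exact option spellings are from memory):

    # service turned off line first, via asemgr
    for d in rz10 rz11 rz12          # the disks in the group (examples)
    do
        voldisk online $d
    done
    voldg import dg1
    volume -g dg1 startall           # start the volumes in the group
    # ...then mount the filesystems by hand, as my start scripts normally do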
This is a very distressing and exhausting situation - one that I
can't believe is still occurring after all this time. The
handling of failures to access storage is one of the basic design
pillars of ASE and it ought to be tested well enough so that it
works reliably and so that we don't have to verify over and over
again at each customer site - and/or it ought to be made much
clearer in the docs what the magic words are that will make it
work.
I am dealing with customers who are implementing SAP - putting
their company's entire business systems on this stuff. While
it is a big plus that they won't have to wait for hours for
mcs to show up after a cpu bombs, it is a big minus that when
a disk goes (definitely a more common occurrence) they can't
rely upon ASE to detect the failure and do the right thing.
At this point with the latest patches they are getting hung
processes which can only get cleared by replacing the disk...
not a wonderful situation for the system with the company gonads
sitting on it.
How can I get some quick action on this situation without involving
too many VPs from too many companies?
Anybody home in Engineering?
d
|
1825.10 | | usr404.zko.dec.com::Marshall | Rob Marshall | Fri Jan 31 1997 09:07 | 66 |
| Hi Dave,
Maybe we need to back off for a minute and clarify a couple of terms:
ASE - Available Server Environment
LSM - Logical Storage Manager
As you may be able to guess by the names, ASE is not designed to monitor, or
manage, storage. Its job is to provide a means to create services that are
not tied to a specific system. This is done to try to improve the availability
of the services clients use. When a system is deemed no longer able to offer
the service, ASE will try to relocate the service to a node that CAN offer
the service.
LSM has been integrated to a certain degree within ASE because it was clear
that system managers would want to use some tool to manage and monitor their
storage (since ASE does not do it). It is LSM's job to try to improve the
accessibility of the data that is on the physical storage media. It uses
mirroring, striping, etc. to accomplish this.
ASE pings the devices when starting a service in an attempt to help the LSM
script so that it does not try to talk to devices that are bad. What LSM then
does with what is left is LSM's decision, and ASE tries to determine what the
health of the SERVICE is based on the return values from the scripts.
LSM also does not monitor devices. It notices (hopefully) when a device has
failed and makes a decision, based on what is left of the volume, as to whether
or not it can still provide access to the data. In most cases, only one
plex is affected, and the other plex in a volume can still be used to access
the data. An error should be returned, and the plex should be disabled.
It is the system manager's job to monitor the health of his system and the
devices. To help him, tools have been developed like console manager and
system watchdog. And that may be what you are really looking for to help
you provide the solution your customer needs.(?)
To make this clear:
ASE does NOT monitor the health of storage. It monitors the health of network
connections and notices when members are very sick, or die.
LSM does NOT monitor the health of storage. It notices when errors occur as it
tries to access a device, and, if the data is mirrored, uses other devices that
should have the same data on them to satisfy the request. It provides redundancy
of data to allow for more highly available data. Even volwatch will only tell
you when a problem is noticed; it does not examine the storage on a constant
basis (at least I don't believe it does...).
System watchdog can be used to monitor the health of different components,
including storage.
Console manager can ensure that messages are routed to some person(s) who needs
to react to them.
ASE is NOT a replacement for good system management. If anything it makes
system management more complex, not less labor intensive. System managers
must still know how to deal with disk failures, network failures, system
crashes, etc. ASE just tries to ensure that the services are available for
the users, as best it can. This should minimize the down time of a service.
Also, engineering is home, and very busy. It has been repeated many times
in this, and other conferences, that notes files are not a support mechanism.
If you have a support issue, i.e. if LSM is hanging, please open a CLD.
Rob Marshall
USEG
|
1825.11 | I'm NEVER home... I'm delivering YOUR product | USCTR1::ASCHER | Dave Ascher | Fri Jan 31 1997 10:00 | 149 |
| Rob,
I appear to have raised some hackles - this was not my intent. I
appreciate the work you guys do, but I don't find the CLD/IPMT
process to be at all useful when I am working at a customer site
through the night far from home under tight deadlines with a
complex configuration. I have never actually received any useful
help through those processes in two years of doing SAP R/3
DECsafe projects. But that's another story.
> Maybe we need to back off for a minute and clarify a couple of terms:
> ASE - Available Server Environment
> LSM - Logical Storage Manager
> As you may be able to guess by the names, ASE is not designed to monitor, or
> manage, storage. Its job is to provide a means to create services that are
> not tied to a specific system. This is done to try to improve the availability
> of the services clients use. When a system is deemed no longer able to offer
> the service, ASE will try to relocate the service to a node that CAN offer
> the service.
So far, this is very clear and I have no problem with it.
> LSM has been integrated to a certain degree within ASE because it was clear
> that system managers would want to use some tool to manage and monitor their
> storage (since ASE does not do it). It is LSM's job to try to improve the
> accessibility of the data that is on the physical storage media. It uses
> mirroring, striping, etc. to accomplish this.
I can't agree with this. LSM-awareness was put into ASE when
it became clear that the 'service' was not 'available' when
there were storage problems - also ASE had to be able to move
the storage resources around (just like it does with the network
alias resource) to a system that can provide the service.
> ASE pings the devices when starting a service in an attempt to help the LSM
> script so that it does not try to talk to devices that are bad. What LSM then
> does with what is left is LSM's decision, and ASE tries to determine what the
> health of the SERVICE is based on the return values from the scripts.
Again - ASE apparently does ping devices during service startup
but due to the timeouts on SCSI this seems to typically have
the effect of causing the service to not be able to start at
all. So the 'service' becomes less available than it would have
been without ASE. ASE is supposed to figure out if the loss
of 'this' disk will mean the loss of a volume (if the disk is
part of an LSM volume). "Loss of a disk" could also be loss
of a controller or cable - in which case failover to another
node would make the service available again. LSM can't do that...
only ASE with LSM knowledge (and ASE_PARTIAL_MIRRORING) can
do that.
> LSM also does not monitor devices. It notices (hopefully) when a device has
> failed and makes a decision, based on what is left of the volume, as to whether
> or not it can still provide access to the data. In most cases, only one
> plex is affected, and the other plex in a volume can still be used to access
> the data. An error should be returned, and the plex should be disabled.
I can only think of ONE time when I observed ASE making a
determination that there was another good plex (an LSM concept,
yes?) and that the volume could still be used. There was a)
no indication that the bad plex was disabled b) no indication
that it checked more than the single volume which it was told
about.
Now that I have updated am.o, am_scsi.o, voldisk, vold, and
lsm_dg_action (3.2G) AND told ASE about all 54 volumes, it does
not indicate at all that it is going to check for a remaining
good plex.
> It is the system manager's job to monitor the health of his system and the
> devices. To help him, tools have been developed like console manager and
> system watchdog. And that may be what you are really looking for to help
> you provide the solution your customer needs.(?)
I'm looking for DECsafe/ASE, alias TruCluster Available Server,
to do what it takes to ensure that availability is enhanced.
It has been some time since I listened to Eric present the
ASE vision, but I must say that I remain convinced that at
least in the early versions there was talk of monitoring of
the storage to deal with loss of the ability to access a disk
from one system when it could still be accessed from another.
> To make this clear:
> ASE does NOT monitor the health of storage. It monitors the health of network
> connections and notices when members are very sick, or die.
Apparently you are correct about it not doing so. I don't see
it doing it. I also don't see it doing what it is definitely supposed
to do (as per your description above) when a failure to access a storage
component is detected.
> LSM does NOT monitor the health of storage. It notices when errors occur as it
> tries to access a device, and, if the data is mirrored, uses other devices that
> should have the same data on them to satisfy the request. It provides redundancy
> of data to allow for more highly available data. Even volwatch will only tell
> you when a problem is noticed; it does not examine the storage on a constant
> basis (at least I don't believe it does...).
....
> ASE is NOT a replacement for good system management. If anything it makes
> system management more complex, not less labor intensive. System managers
> must still know how to deal with disk failures, network failures, system
> crashes, etc. ASE just tries to ensure that the services are available for
> the users, as best it can. This should minimize the down time of a service.
ASE should not introduce behaviors that reduce the availability of
'services' to users. What I am seeing is that with the newest
patches when a disk (which has a mirror) is pulled, the
application attempting to access the volume is hung for an
indeterminate and generally unacceptably long time. When the same
volumes are tested with the service turned offline, volumes and
disk groups online and imported by hand, LSM does what it is supposed
to do - keep the failure of the disk out of the way of the
application trying to access the volume.
This says to me that the behavior of ASE is somewhere between
"highly undesireable" and "totally unacceptable". I am sure
it is not what was intended by its creators - but that I have
bumped into either a bug or a set of assumptions that the creators
had that my dozen or so pretty big SAP installations do not meet.
I would like to be able to continue to assure customers that
using ASE will enhance their applications' availability. If
I have screwed something up, or if there is a communications
issue (human communications) that can be cleared up, then we can
have a bunch of much happier customers and I can get some rest.
If there is a bug, I think we would all like to identify it
and get it fixed.
> Also, engineering is home, and very busy. It has been repeated many times
> in this, and other conferences, that notes files are not a support mechanism.
> If you have a support issue, i.e. if LSM is hanging, please open a CLD.
My apologies for implying that my desperate pleas for help were
being ignored. Just gotta wonder why nobody would be interested in
what I (and Manu) were asking about - we've been working with this
stuff and this group for a long time and might actually have
something worthwhile to contribute from time to time. On the other
hand we might just be rambling after a 24 hour straight day spent
at a major customer who is going 'live' with SAP in 48 hrs,
struggling to make everything work sensibly.
|
1825.12 | Go to the future ... | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Tue Feb 04 1997 04:10 | 20 |
| Dave,
I think HSZ40 is the real problem here...
I have NEVER encountered the situation you describe with a non-HSZ40 config.
I agree with your concern about the fact that we, in the field, are
NEVER playing with ASE alone, but with the whole config, soft & hard, with
LSM, ADVFS, DECNSR (Oracle, SAP, Triton, SMS, Clinicom...), and as we say in
French, "La solidite d'une chaine depend de son plus faible maillon", which I
would translate as "A chain is only as strong as its weakest link".
Thus, maybe (are you a taker, Eric?), a new notes file ASE_IN_THE_FIELD should
be created, which would be monitored by the ASE-LSM-ADVFS-DECNSR engineers?
|
1825.13 | | XIRTLU::schott | Eric R. Schott USG Product Management | Tue Feb 04 1997 08:10 | 24 |
| Hi
I agree with your concern about full system testing. The hi-test
program is supposed to provide a method for better full system
testing to be done. I think if you have input on what they should
be testing, you should discuss this with Kevin Dadoly.
As to running into problems with products working together...I think
these need to be raised with the product groups involved. I understand
this may mean interacting in multiple notes conferences, but I don't
see that creating a new conference is going to solve the problem (I
have no problem with another conference, I just don't think it will
get the interactions you are requesting).
It would be good to get a write-up of the applications folks have
integrated with ASE (so that others might learn from this)...it would
also help folks in teams like hi-test adjust their testing plans to
include the most common products. I think some of the writeups/scripts
should be posted to this conference for ASE/clusters in addition to
the conference for the product involved (if there is a conference).
regards
Eric
|