
Conference smurf::ase

Title: ase
Moderator: SMURF::GROSSO
Created: Thu Jul 29 1993
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2114
Total number of notes: 7347

1825.0. "DECsafe v.13 Unix 3.2g" by USCTR1::ASCHER (Dave Ascher) Tue Jan 14 1997 15:06

1825.1. by KITCHE::schott (Eric R. Schott USG Product Management) Tue Jan 14 1997 19:48
1825.2. by SMURF::KNIGHT (Fred Knight) Thu Jan 23 1997 14:47
1825.3. by USCTR1::ASCHER (Dave Ascher) Thu Jan 23 1997 23:50
re: hmmm is there some new new new firmware that might address this?
    
    d
1825.4. by SMURF::MARSHALL (Rob Marshall - USEG) Fri Jan 24 1997 17:57
    Hi,
    
    Taking a *quick* look at the code, it appears that the agent tries to
    determine which disks are bad by first stat'ing the special device file
    and then trying to ping the device.  If either check fails, the device
    should go on a list of bad devices that are then passed as such to
    lsm_dg_action (there is a -b option for bad devices).  But I'm not sure
    if the time that it takes to do all this is "charged" to lsm_dg_action,
    so I'm not really clear on why you are getting a message that
    lsm_dg_action timed out.  Unless the above tests somehow returned no
    bad devices, and lsm_dg_action was trying to get to the pulled device.
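
    Roughly, the logic amounts to something like this in shell terms (just
    an illustration, not the agent's actual code - the agent is a compiled
    program, and the device names here are made up):

        #!/bin/ksh
        # Sketch: stat each special file, then "ping" the device with a
        # one-block raw read; collect the failures.
        BAD=""
        for dev in rz8 rz9 rz10
        do
            if [ ! -c /dev/r${dev}c ]; then
                echo "can't stat() device file /dev/r${dev}c"
                BAD="$BAD $dev"
                continue
            fi
            if ! dd if=/dev/r${dev}c of=/dev/null bs=512 count=1 2>/dev/null
            then
                echo "can't reach device '/dev/r${dev}c'"
                BAD="$BAD $dev"
            fi
        done
        # anything in $BAD would be handed to lsm_dg_action via its -b option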
    
    Is your logging set to Informational?  If so, you should see some
    messages that look something like: "can't stat() device file /dev/xxx",
    if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
    failed.  If you are not seeing either, then it would appear that the 
    pulled disk is not being seen as failed.
    
    If you could turn Informational logging on, try this test, and give
    us a sample of the daemon.log output, it might help us see what is
    happening here.  Then maybe I could tell you whether you should open
    a CLD on this, or not.
    
    Rob Marshall
    USEG
1825.5. "can you check the logs I sent to the csc?" by USCTR1::ASCHER (Dave Ascher) Sat Jan 25 1997 10:49
    Taking a *quick* look at the code, it appears that the agent tries to
    determine which disks are bad by first stat'ing the special device file
    and then trying to ping the device.  If either check fails, the device
    should go on a list of bad devices that are then passed as such to
    lsm_dg_action (there is a -b option for bad devices).
 
 what is lsm_dg_action supposed to do with the list of bad devices?
 
    But I'm not sure if the time that it takes to do all this is "charged"
    to lsm_dg_action, so I'm not really clear on why you are getting a
    message that lsm_dg_action timed out.  Unless the above tests somehow
    returned no bad devices, and lsm_dg_action was trying to get to the
    pulled device.
    
    Is your logging set to Informational?  If so, you should see some
    messages that look something like: "can't stat() device file /dev/xxx",
    if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
    failed.  If you are not seeing either, then it would appear that the 
    pulled disk is not being seen as failed.
    
    If you could turn Informational logging on, try this test, and give
    us a sample of the daemon.log output, it might help us see what is
    happening here.  Then maybe I could tell you whether you should open
    a CLD on this, or not.

  I'm afraid that I am not at the customer site at the moment... I will
 try to get more information when I am back there, or at one of 2 other
 sites I will be visiting this week - all of which have exhibited this
 behavior. There is a log with the IPMT I filed (#c970113-534) but I
 don't know how to access the IPMT system.
    

 thanks,
 
 d
1825.6. by usr505.zko.dec.com::Marshall (Rob Marshall) Tue Jan 28 1997 14:28
Hi,

lsm_dg_action goes through the list of devices associated with the service and
does things like: voldisk define disk...  So, if the disk is bad, it takes it
off the list of devices in the service, and should not try to access it at all.
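
In pseudo-shell, the flow would be something like this (an illustration only,
not the actual script; the variable names are made up):

    # Sketch of lsm_dg_action: $BAD holds what came in via the -b option.
    for disk in $SERVICE_DISKS
    do
        case " $BAD " in
        *" $disk "*) continue ;;    # known bad: drop it, never touch it
        esac
        voldisk online $disk        # e.g. "voldisk define/online disk..."
    done
    voldg import $SERVICE_DG        # then import the service's disk group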

Rob

1825.7. "comments?" by USCTR1::ASCHER (Dave Ascher) Tue Jan 28 1997 21:19
Rob,
    
    thanks... I'm beginning to get a funny feeling about this - I
    have assumed for a long time that if I tell asemgr about one
    of the disks in the disk group, since it tells me that it now
    knows about the whole disk group, there is nothing to be gained
    by the tedious exercise of entering each and every lsm volume
    associated with the storage. I am now getting the feeling that
    only the devices explicitly associated with the storage of
    a volume are going to be pinged for good health - and that
    only a failure of such a disk will exercise the check-for-a-mirror
    logic, etc...

    I have had an inconsistent set of experiences about what happens
    when I pull a disk or a cable, or shut down a shelf or a
    controller... I now think that the experiences might have been
    different because sometimes I was using a disk that was on
    the volume that ASE knows about and other times it was not.
        
    Is this in fact the case? I have NOT been telling asemgr about
    more than one of the volumes because my start/stop scripts
    take care of the mount/umounts anyway... one of my customers
    had 54 volumes - it's a very tedious task to get all that into
    asemgr if there is no good reason to do so. On the other hand,
    if it will make the disk stuff work the way it oughta, I'd
    do it.
    
    d
    
1825.8. "???" by BACHUS::DEVOS (Manu Devos DEC/SI Brussels 856-7539) Fri Jan 31 1997 03:46
    Hi,
    
    > I have NOT been telling asemgr about more than one of the volumes because
    > my start/stop scripts take care of the mount/umounts anyway.
    
    My understanding was that ASE itself takes note of the LSM devices
    of a diskgroup once at least one volume of this diskgroup is involved
    in an NFS/DISK service. If that is not the case, I don't see how
    lsm_dg_action could "voldisk online rzxx", and thus, if not all disks
    of a diskgroup are placed online, the LSM volumes could not start...
    
    So, if my understanding is not correct, can Engineering clarify this?
    
    Regards, Manu.
1825.9. "RED ALERT" by USCTR1::ASCHER (Dave Ascher) Fri Jan 31 1997 06:12
Manu,
    
    It is my guess (I have not been able to sit down long enough
    with the ase internal scripts to figure this out) that the
    part of ASE that figures out what it has to do to move the
    whole disk group is not the same as the part that figures out
    which disks it needs to check the health of. I assume that
    the thinking was something like - you might have disks in the
    group that are not necessary for the service, so their health
    should not be monitored. Not unreasonable, but not the first
    model that would have come to my mind.
    
    The question is, what does it take to get ASE to monitor the
    health of the disks making up a volume??? What does one have to do
    to make this work? I have almost always had "trouble" when I
    tested pulling a disk or a cable, but have somehow explained it
    away to myself.  With mirrors, which our customers always have, it
    is nice that the service does not fail over when only one plex is
    lost - but I have not been able to consistently see that ASE
    notices anything is wrong until I try to access the storage. I
    have always been puzzled about how come the ping picked up the
    failure in a few seconds - a minute maybe.
    
    I have ONCE seen a message in the daemon.log saying something about
    the single volume that I told ASE about being okay because it
    has a mirror.
    
    I have now installed the latest vold, voldisk, am.o, am_scsi.o, and
    lsm_dg_action and defined all 54 volumes to ASE (mountpoint NONE).
    Now when I pull out a disk, or turn off the HSZ, ASE seems to not
    do much of anything. When I try to access the volume (ls or
    touch, for example) the process hangs for seconds - or up to
    several minutes. When I finally give up and put the disk back in,
    the process completes... ASE makes some lame log entries about
    scsi reservation resets and/or failures but nothing whatever about
    the fact that there is a possible failure of a volume that needs
    to be checked.
    
    I have also verified what happens when I turn the service offline,
    online the disks, and import the diskgroup manually. I then
    pull out a disk (there is a mirror for it) and attempt to access
    it, and experience no hanging... volwatch reports a problem
    (which does not seem to happen when ASE is in the picture)...
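
    (For the record, the by-hand sequence is roughly the following - the
    disk, disk group, and volume names are invented for the example, and
    the service itself is taken offline through the asemgr menus:

        voldisk online rz8 rz9          # bring the member disks online
        voldg import sapdg              # import the disk group by hand
        volrecover -g sapdg -sb         # start the volumes in the background
        mount /dev/vol/sapdg/vol01 /sapdata

    - and it is in THAT state that pulling a mirrored disk does not hang
    the accessing process.)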
    
    This is a very distressing and exhausting situation - one that I
    can't believe is still occurring after all of this time. The
    handling of failures to access storage is one of the basic design
    pillars of ASE, and it ought to be tested well enough that it
    works reliably, so that we don't have to verify it over and over
    again at each customer site. And/or it ought to be made much
    clearer in the docs what the magic words are that will make it
    work.
    
    I am dealing with customers who are implementing SAP - putting
    their company's entire business systems on this stuff. While
    it is a big plus that they won't have to wait for hours for
    mcs to show up after a cpu bombs, it is a big minus that when
    a disk goes (definitely a more common occurrence) they can't
    rely upon ASE to detect the failure and do the right thing.
    At this point, with the latest patches, they are getting hung
    processes which can only get cleared by replacing the disk...
    not a wonderful situation for the system with the company gonads
    sitting on it.
    
    How can I get some quick action on this situation without involving
    too many VPs from too many companies?
    
    Anybody home in Engineering?
    
    d
    
    
1825.10. by usr404.zko.dec.com::Marshall (Rob Marshall) Fri Jan 31 1997 09:07
Hi Dave,

Maybe we need to back off for a minute and clarify a couple of terms:

ASE - Available Server Environment
LSM - Logical Storage Manager

As you may be able to guess from the names, ASE is not designed to monitor,
or manage, storage.  Its job is to provide a means to create services that are
not tied to a specific system.  This is done to try to improve the availability
of the services clients use.  When a system is deemed no longer able to offer
the service, ASE will try to relocate the service to a node that CAN offer
the service.

LSM has been integrated to a certain degree within ASE because it was clear
that system managers would want to use some tool to manage and monitor their
storage (since ASE does not do it).  It is LSM's job to try to improve the
accessibility of the data that is on the physical storage media.  It uses
mirroring, striping, etc. to accomplish this.

ASE pings the devices when starting a service in an attempt to help the LSM
script so that it does not try to talk to devices that are bad.  What LSM then
does with what is left is LSM's decision, and ASE tries to determine what the 
health of the SERVICE is based on the return values from the scripts.

LSM also does not monitor devices.  It notices (hopefully) when a device has
failed and makes a decision, based on what is left of the volume, as to
whether or not it can still provide access to the data.  In most cases, only
one plex is affected, and the other plex in a volume can still be used to
access the data.  An error should be returned, and the plex should be disabled.
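
(As a quick sanity check of what LSM thinks after a failure, something like
"volprint -ht" on the disk group should show the object states - the disk
group name here is invented:

    volprint -g sapdg -ht    # list volume/plex/subdisk states

A plex that LSM has given up on should show a state other than ACTIVE in
that output.)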

It is the system manager's job to monitor the health of his system and the
devices.  To help him, tools like console manager and system watchdog have
been developed.  And that may be what you are really looking for to help
you provide the solution your customer needs(?).

To make this clear:

ASE does NOT monitor the health of storage.  It monitors the health of network
connections and notices when members are very sick, or die.

LSM does NOT monitor the health of storage.  It notices when errors occur as
it tries to access a device, and, if the data is mirrored, uses other devices
that should have the same data on them to satisfy the request.  It provides
redundancy of data to allow for more highly available data.  Even volwatch
will only tell you when a problem is noticed; it does not examine the storage
on a constant basis (at least I don't believe it does...).

System watchdog can be used to monitor the health of different components, 
including storage.

Console manager can ensure that messages are routed to some person(s) who need
to react to them.

ASE is NOT a replacement for good system management.  If anything it makes 
system management more complex, not less labor intensive.  System managers
must still know how to deal with disk failures, network failures, system
crashes, etc.  ASE just tries to ensure that the services are available for
the users, as best it can.  This should minimize the down time of a service.

Also, engineering is home, and very busy.  It has been repeated many times
in this and other conferences that notes files are not a support mechanism.
If you have a support issue, i.e. if LSM is hanging, please open a CLD.

Rob Marshall
USEG
1825.11. "I'm NEVER home... I'm delivering YOUR product" by USCTR1::ASCHER (Dave Ascher) Fri Jan 31 1997 10:00
Rob,
    
    I appear to have raised some hackles - this was not my intent. I
    appreciate the work you guys do, but I don't find the CLD/IPMT
    process to be at all useful when I am working at a customer site
    through the night, far from home, under tight deadlines, with a
    complex configuration. I have never actually received any useful
    help through those processes in two years of doing SAP R/3
    DECsafe projects.  But that's another story.
    
Maybe we need to back off for a minute and clarify a couple of terms:

ASE - Available Server Environment
LSM - Logical Storage Manager

As you may be able to guess from the names, ASE is not designed to monitor,
or manage, storage.  Its job is to provide a means to create services that are
not tied to a specific system.  This is done to try to improve the availability
of the services clients use.  When a system is deemed no longer able to offer
the service, ASE will try to relocate the service to a node that CAN offer
the service.

    So far, this is very clear and I have no problem with it.
    
LSM has been integrated to a certain degree within ASE because it was clear
that system managers would want to use some tool to manage and monitor their
storage (since ASE does not do it).  It is LSM's job to try to improve the
accessibility of the data that is on the physical storage media.  It uses
mirroring, striping, etc. to accomplish this.

    I can't agree with this. LSM-awareness was put into ASE when
    it became clear that the 'service' was not 'available' when
    there were storage problems - also, ASE had to be able to move
    the storage resources around (just like it does with the network
    alias resource) to a system that can provide the service.
    
ASE pings the devices when starting a service in an attempt to help the LSM
script so that it does not try to talk to devices that are bad.  What LSM then
does with what is left is LSM's decision, and ASE tries to determine what the 
health of the SERVICE is based on the return values from the scripts.

    Again - ASE apparently does ping devices during service startup,
    but due to the timeouts on SCSI this seems to typically have
    the effect of causing the service to not be able to start at
    all. So the 'service' becomes less available than it would have
    been without ASE. ASE is supposed to figure out if the loss
    of 'this' disk will mean the loss of a volume (if the disk is
    part of an LSM volume). "Loss of a disk" could also be loss
    of a controller or cable - in which case failover to another
    node would make the service available again. LSM can't do that...
    only ASE with LSM knowledge (and ASE_PARTIAL_MIRRORING) can
    do that.
    
    
LSM also does not monitor devices.  It notices (hopefully) when a device has
failed and makes a decision, based on what is left of the volume, as to
whether or not it can still provide access to the data.  In most cases, only
one plex is affected, and the other plex in a volume can still be used to
access the data.  An error should be returned, and the plex should be disabled.

    I can only think of ONE time when I observed ASE making a
    determination that there was another good plex (an LSM concept,
    yes?) and that the volume could still be used. There was a)
    no indication that the bad plex was disabled, b) no indication
    that it checked more than the single volume which it was told
    about.
    
    Now that I have updated am.o, am_scsi.o, voldisk, vold, and
    lsm_dg_action (3.2G), AND told ase about all 54 volumes, it does
    not indicate at all that it is going to check for a remaining
    good plex.
    
It is the system manager's job to monitor the health of his system and the
devices.  To help him, tools like console manager and system watchdog have
been developed.  And that may be what you are really looking for to help
you provide the solution your customer needs(?).

    I'm looking for DECsafe/ASE, a.k.a. TruCluster/Available Server,
    to do what it takes to ensure that availability is enhanced.
    It has been some time since I listened to Eric present the
    ASE vision, but I must say that I remain convinced that at
    least in the early versions there was talk of monitoring the
    storage, to deal with loss of the ability to access a disk
    from one system when it could still be accessed from another.
    
To make this clear:

ASE does NOT monitor the health of storage.  It monitors the health of network
connections and notices when members are very sick, or die.

    Apparently you are correct about it not doing so. I don't see
    it doing it. I also don't see it doing what it is definitely
    supposed to do (as per your description above) when a failure to
    access a storage component is detected.
    
LSM does NOT monitor the health of storage.  It notices when errors occur as
it tries to access a device, and, if the data is mirrored, uses other devices
that should have the same data on them to satisfy the request.  It provides
redundancy of data to allow for more highly available data.  Even volwatch
will only tell you when a problem is noticed; it does not examine the storage
on a constant basis (at least I don't believe it does...).
    
    ....
    
ASE is NOT a replacement for good system management.  If anything it makes 
system management more complex, not less labor intensive.  System managers
must still know how to deal with disk failures, network failures, system
crashes, etc.  ASE just tries to ensure that the services are available for
the users, as best it can.  This should minimize the down time of a service.

    ASE should not introduce behaviors that reduce the availability of
    'services' to users. What I am seeing is that with the newest
    patches, when a disk (which has a mirror) is pulled, the
    application attempting to access the volume is hung for an
    indeterminate and generally unacceptably long time. When the same
    volumes are tested with the service turned offline, and the disks
    and disk group brought online and imported by hand, LSM does what
    it is supposed to do - keeps the failure of the disk out of the
    way of the application trying to access the volume.
    
    This says to me that the behavior of ASE is somewhere between
    "highly undesirable" and "totally unacceptable". I am sure
    it is not what was intended by its creators - rather, I have
    bumped into either a bug, or a set of assumptions the creators
    had that my dozen or so pretty big SAP installations do not meet.
   
    I would like to be able to continue to assure customers that
    using ASE will enhance their applications' availability. If
    I have screwed something up, or if there is a communications
    issue (human communications) that can be cleared up, then we can
    have a bunch of much happier customers and I can get some rest.
    If there is a bug, I think we would all like to identify it
    and get it fixed.
        
Also, engineering is home, and very busy.  It has been repeated many times
in this and other conferences that notes files are not a support mechanism.
If you have a support issue, i.e. if LSM is hanging, please open a CLD.

    My apologies for implying that my desperate pleas for help were
    being ignored. Just gotta wonder why nobody would be interested in
    what I (and Manu) were asking about - we've been working with this
    stuff and this group for a long time and might actually have
    something worthwhile to contribute from time to time. On the other
    hand we might just be rambling after a 24 hour straight day spent
    at a major customer who is going 'live' with SAP in 48 hrs,
    struggling to make everything work sensibly.
    
        
     
1825.12. "Go to the future ..." by BACHUS::DEVOS (Manu Devos DEC/SI Brussels 856-7539) Tue Feb 04 1997 04:10
Dave,

I think HSZ40 is the real problem here...

I NEVER encountered the situation you described with a NON-HSZ40 config.

I agree with your concern about the fact that we, in the field, are NEVER
playing with ASE alone, but with a whole config, soft & hard, with LSM,
ADVFS, DECNSR (oracle-sap-triton-sms clinicom...), and as we say in French,
"La solidite d'une chaine depend de son plus faible maillon", which I will
translate as "A chain is only as strong as its weakest link".

Thus, maybe (are you a taker, Eric?), a new notesfile ASE_IN_THE_FIELD
should be created, which would be monitored by the ASE-LSM-ADVFS-DECNSR
engineers?
1825.13. by XIRTLU::schott (Eric R. Schott USG Product Management) Tue Feb 04 1997 08:10
Hi

  I agree with your concern about full system testing.  The hi-test
program is supposed to provide a method for better full-system
testing.  I think if you have inputs on what they should
be testing, you should discuss this with Kevin Dadoly.

  As to running into problems with products working together...I think
these need to be raised with the product groups involved.  I understand
this may mean interacting in multiple notes conferences, but I don't
see that creating a new conference is going to solve the problem (I
have no problem with another conference, I just don't think it will
get the interactions you are requesting).

It would be good to get a write-up of the applications folks have
integrated with ASE (so that others might learn from this)...it would
also help folks in teams like hi-test adjust their testing plans to
include the most common products.  I think some of the writeups/scripts
should be posted to this conference for ASE/clusters, in addition to
the conference for the product involved (if there is a conference).

regards

Eric