T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
1825.1 | | KITCHE::schott | Eric R. Schott USG Product Management | Tue Jan 14 1997 19:48 | 7 |
1825.2 | | SMURF::KNIGHT | Fred Knight | Thu Jan 23 1997 14:47 | 7 |
1825.3 | | USCTR1::ASCHER | Dave Ascher | Thu Jan 23 1997 23:50 | 3 |
| re: hmmm is there some new new new firmware that might address this?
d
|
1825.4 | | SMURF::MARSHALL | Rob Marshall - USEG | Fri Jan 24 1997 17:57 | 25 |
| Hi,
Taking a *quick* look at the code, it appears that the agent tries to
determine which disks are bad by first stat'ing the special device file
and then trying to ping it. If that fails, the device should go on a
list of bad devices that are then passed as such to lsm_dg_action
(there is a -b option for bad devices). But, I'm not sure if the time
that it takes to do all this is "charged" to lsm_dg_action, so I'm not
real clear on why you are getting a message that lsm_dg_action timed
out. Unless the above tests somehow returned no bad devices, and
lsm_dg_action was trying to get to the pulled device.
Is your logging set to Informational? If so, you should see some
messages that look something like: "can't stat() device file /dev/xxx",
if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
failed. If you are not seeing either, then it would appear that the
pulled disk is not being seen as failed.
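Very roughly, the check amounts to something like the sketch below. This is NOT the
actual agent code (that's compiled) - just shell pseudo-code of the logic, and the
device names, the dd "ping", and the logger calls are stand-ins of mine:

    #!/bin/sh
    # Sketch of the bad-device check described above (not the real agent).
    BAD=""
    for dev in /dev/rz10c /dev/rz11c        # devices used by the service (example names)
    do
        if [ ! -c "$dev" ]; then
            logger -p daemon.info "can't stat() device file $dev"
            BAD="$BAD -b $dev"
            continue
        fi
        # "ping" the disk by reading one block through the raw device
        raw=`echo $dev | sed 's;^/dev/;/dev/r;'`
        dd if=$raw of=/dev/null bs=512 count=1 >/dev/null 2>&1
        if [ $? -ne 0 ]; then
            logger -p daemon.info "can't reach device '$dev'"
            BAD="$BAD -b $dev"
        fi
    done
    # whatever is flagged bad is then handed to lsm_dg_action via -b, e.g.
    #   lsm_dg_action ... $BAD ...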
If you could turn Informational logging on, and try this test, and give
us a sample of the daemon.log output, it might help see what is
happening here. Then maybe I could tell you if you should open a CLD
on this, or not.
Rob Marshall
USEG
|
1825.5 | can you check the logs I sent to the csc? | USCTR1::ASCHER | Dave Ascher | Sat Jan 25 1997 10:49 | 35 |
| > Taking a *quick* look at the code, it appears that the agent tries to
> determine which disks are bad by first stat'ing the special device file
> and then trying to ping it. If that fails, the device should go on a
> list of bad devices that are then passed as such to lsm_dg_action
> (there is a -b option for bad devices).
what is lsm_dg_action supposed to do with the list of bad devices?
> But, I'm not sure if the time
> that it takes to do all this is "charged" to lsm_dg_action, so I'm not
> real clear on why you are getting a message that lsm_dg_action timed
> out. Unless the above tests somehow returned no bad devices, and
> lsm_dg_action was trying to get to the pulled device.
> Is your logging set to Informational? If so, you should see some
> messages that look something like: "can't stat() device file /dev/xxx",
> if the stat failed, or: "can't reach device '/dev/xxx'", if the ping
> failed. If you are not seeing either, then it would appear that the
> pulled disk is not being seen as failed.
> If you could turn Informational logging on, and try this test, and give
> us a sample of the daemon.log output, it might help see what is
> happening here. Then maybe I could tell you if you should open a CLD
> on this, or not.
I'm afraid that I am not at the customer site at the moment... I will
try to get more information when I am back there or at one of two other
sites I will be visiting this week - all of which have exhibited this
behavior. There is a log with the IPMT I filed (#c970113-534) but I
don't know how to access the IPMT system.
thanks,
d
|
1825.6 | | usr505.zko.dec.com::Marshall | Rob Marshall | Tue Jan 28 1997 14:28 | 8 |
| Hi,
lsm_dg_action goes through the list of devices associated with the service and
does things like: voldisk define disk... So, if the disk is bad, it takes it
off the list of devices in the service, and should not try to access it at all.
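Schematically, something like this - a sketch of my reading of the script, not the
script itself, and the device and disk-group names are invented:

    # sketch only: how the bad (-b) devices get dropped from the service's list
    BAD="rz10"                      # devices passed in with -b
    for d in rz10 rz11 rz12         # devices associated with the service (examples)
    do
        case " $BAD " in
        *" $d "*) continue ;;       # bad disk: dropped, never accessed again
        esac
        voldisk define $d
        voldisk online $d
    done
    # ...and then it goes on to import the disk group, start the volumes, etc.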
Rob
|
1825.7 | comments? | USCTR1::ASCHER | Dave Ascher | Tue Jan 28 1997 21:19 | 29 |
| Rob,
thanks... I'm beginning to get a funny feeling about this - I
have assumed for a long time that if I tell asemgr about one
of the disks in the disk group, since it tells me that it now
knows about the whole disk group, there is nothing to be gained
by the tedious exercise of entering each and every lsm volume
associated with the storage. I am now getting the feeling that
only the devices explicitly associated with the storage of
a volume are going to be pinged for good health - and that
only a failover of such a disk will exercise the check for
a mirror logic, etc...
I have had an inconsistent set of experiences about what happens
when I pull a disk or a cable or shutdown a shelf or a
controller... I now think that the experiences might have been
different because sometimes I was using a disk that was on
the volume that ASE knew about and other times it was not.
Is this in fact the case? I have NOT been telling asemgr about
more than one of the volumes because my start/stop scripts
take care of the mount/umounts anyway... one of my customers
had 54 volumes - it's a very tedious task to get all that into
asemgr if there is not good reason to do so. On the other hand,
if it will make the disk stuff work the way it oughta, I'd
do it.
d
|
1825.8 | ??? | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Fri Jan 31 1997 03:46 | 14 |
| Hi,
> I have NOT been telling asemgr about more than one of the volumes because
> my start/stop scripts take care of the mount/umounts anyway.
My understanding was that ASE itself is taking note of the LSM devices
of a diskgroup once at least one volume of this diskgroup is involved
in an NFS/DISK service. If that is not the case, I don't see how
lsm_dg_action could "voldisk online rzxx" and thus if not all disks of
a diskgroup are placed online, the LSM volumes could not start...
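In other words, I assume something like the following has to happen for every disk
in the diskgroup, not only for the disk(s) explicitly named in the service (my guess
only - disk and group names invented):

    # my assumption of what lsm_dg_action has to do before the volumes can start
    for d in rz10 rz11 rz12          # ALL disks in the diskgroup (examples)
    do
        voldisk online $d
    done
    voldg import dg1                 # the import needs the whole group reachable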
So, if my understanding is not OK, can Engineering clarify this ...
Regards, Manu.
|
1825.9 | RED ALERT | USCTR1::ASCHER | Dave Ascher | Fri Jan 31 1997 06:12 | 72 |
| Manu,
It is my guess (I have not been able to sit down long enough
with the ase internal scripts to figure this out) that the
part of ASE that figures out what it has to do to move the
whole disk group is not the same as the part that figures out
which disks it needs to check the health of. I assume that
the thinking was something like - you might have disks in the
group that are not necessary for the service, so their health
should not be monitored. Not unreasonable, but not the first
model that would have come to my mind.
The question is, what does it take to get ASE to monitor the
health of the disks making up a volume??? What does one have to do
to make this work? I have almost always had "trouble" when I
tested pulling a disk or a cable, but have somehow explained it
away to myself. With mirrors, which our customers always have, it
is nice that the service does not failover when only one plex is
lost - but I have not been able to consistently see that ASE
notices anything is wrong until I try to access the storage. I
have always been puzzled about how come the ping has picked up the
failure in a few seconds - a minute maybe.
I have ONCE seen a message in the daemon.log saying something about
the single volume that I told ASE about being okay because it
has a mirror.
I have now installed the latest vold, voldisk, am.o, am_scsi.o,
lsm_dg_action and defined all 54 volumes to ASE. Mountpoint NONE.
Now when I pull out a disk, or turn off the HSZ, ASE seems to do
not much of anything. When I try to access the volume (ls or
touch, for example) the process hangs for seconds - or up to
several minutes. When I finally give up and put the disk back in,
the process completes... ASE makes some lame log entries about
scsi reservation resets and/or failures but nothing whatever about
the fact that there is a possible failure of a volume that needs
to be checked.
I have also verified what happens when I turn the service offline,
online the disks and import the diskgroup manually. I then
pull out a disk (there is a mirror for it) and attempt to access
it and experience no hanging... volwatch reports a problem
(which does not seem to occur when ASE is in the picture)...
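For reference, the manual sequence I mean is roughly this (disk and group names are
examples, and the exact option spellings are from memory):

    # service turned off line first, via asemgr
    for d in rz10 rz11 rz12          # the disks in the group (examples)
    do
        voldisk online $d
    done
    voldg import dg1
    volume -g dg1 startall           # start the volumes in the group
    # ...then mount the filesystems by hand, as my start scripts normally do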
This is a very distressing and exhausting situation - one that I
can't believe is still occurring after all this time. The
handling of failures to access storage is one of the basic design
pillars of ASE and it ought to be tested well enough so that it
works reliably and so that we don't have to verify over and over
again at each customer site - and/or it ought to be made much
clearer in the docs what the magic words are that will make it
work.
I am dealing with customers who are implementing SAP - putting
their company's entire business systems on this stuff. While
it is a big plus that they won't have to wait for hours for
mcs to show up after a cpu bombs, it is a big minus that when
a disk goes (definitely a more common occurrence) they can't
rely upon ASE to detect the failure and do the right thing.
At this point with the latest patches they are getting hung
processes which can only get cleared by replacing the disk...
not a wonderful situation for the system with the company gonads
sitting on it.
How can I get some quick action on this situation without involving
too many VPs from too many companies?
Anybody home in Engineering?
d
|
1825.10 | | usr404.zko.dec.com::Marshall | Rob Marshall | Fri Jan 31 1997 09:07 | 66 |
| Hi Dave,
Maybe we need to back off for a minute and clarify a couple of terms:
ASE - Available Server Environment
LSM - Logical Storage Manager
As you may be able to guess by the names, ASE is not designed to monitor, or
manage, storage. Its job is to provide a means to create services that are
not tied to a specific system. This is done to try to improve the availability
of the services clients use. When a system is deemed no longer able to offer
the service, ASE will try to relocate the service to a node that CAN offer
the service.
LSM has been integrated to a certain degree within ASE because it was clear
that system managers would want to use some tool to manage and monitor their
storage (since ASE does not do it). It is LSM's job to try to improve the
accessibility of the data that is on the physical storage media. It uses
mirroring, striping, etc. to accomplish this.
ASE pings the devices when starting a service in an attempt to help the LSM
script so that it does not try to talk to devices that are bad. What LSM then
does with what is left is LSM's decision, and ASE tries to determine what the
health of the SERVICE is based on the return values from the scripts.
LSM also does not monitor devices. It notices (hopefully) when a device has
failed and makes a decision, based on what is left of the volume, as to whether
or not it can still provide access to the data. In most cases, only one
plex is affected, and the other plex in a volume can still be used to access
the data. An error should be returned, and the plex should be disabled.
It is the system manager's job to monitor the health of his system and the
devices. To help him, tools have been developed like console manager and
system watchdog. And that may be what you are really looking for to help
you provide the solution your customer needs.(?)
To make this clear:
ASE does NOT monitor the health of storage. It monitors the health of network
connections and notices when members are very sick, or die.
LSM does NOT monitor the health of storage. It notices when errors occur as it
tries to access a device, and, if the data is mirrored, uses other devices that
should have the same data on them to satisfy the request. It provides redundancy
of data to allow for more highly available data. Even volwatch will only tell
you when a problem is noticed; it does not examine the storage on a constant
basis (at least I don't believe it does...).
System watchdog can be used to monitor the health of different components,
including storage.
Console manager can ensure that messages are routed to some person(s) who needs
to react to them.
ASE is NOT a replacement for good system management. If anything it makes
system management more complex, not less labor intensive. System managers
must still know how to deal with disk failures, network failures, system
crashes, etc. ASE just tries to ensure that the services are available for
the users, as best it can. This should minimize the down time of a service.
Also, engineering is home, and very busy. It has been repeated many times
in this, and other conferences, that notes files are not a support mechanism.
If you have a support issue, i.e. if LSM is hanging, please open a CLD.
Rob Marshall
USEG
|
1825.11 | I'm NEVER home... I'm delivering YOUR product | USCTR1::ASCHER | Dave Ascher | Fri Jan 31 1997 10:00 | 149 |
| Rob,
I appear to have raised some hackles - this was not my intent. I
appreciate the work you guys do, but I don't find the CLD/IPMT
process to be at all useful when I am working at a customer site
through the night far from home under tight deadlines with a
complex configuration. I have never actually received any useful
help through those processes in two years of doing SAP R/3
DECsafe projects. But that's another story.
> Maybe we need to back off for a minute and clarify a couple of terms:
> ASE - Available Server Environment
> LSM - Logical Storage Manager
> As you may be able to guess by the names, ASE is not designed to monitor, or
> manage, storage. Its job is to provide a means to create services that are
> not tied to a specific system. This is done to try to improve the availability
> of the services clients use. When a system is deemed no longer able to offer
> the service, ASE will try to relocate the service to a node that CAN offer
> the service.
So far, this is very clear and I have no problem with it.
> LSM has been integrated to a certain degree within ASE because it was clear
> that system managers would want to use some tool to manage and monitor their
> storage (since ASE does not do it). It is LSM's job to try to improve the
> accessibility of the data that is on the physical storage media. It uses
> mirroring, striping, etc. to accomplish this.
I can't agree with this. LSM-awareness was put into ASE when
it became clear that the 'service' was not 'available' when
there were storage problems - also ASE had to be able to move
the storage resources around (just like it does with the network
alias resource) to a system that can provide the service.
> ASE pings the devices when starting a service in an attempt to help the LSM
> script so that it does not try to talk to devices that are bad. What LSM then
> does with what is left is LSM's decision, and ASE tries to determine what the
> health of the SERVICE is based on the return values from the scripts.
Again - ASE apparently does ping devices during service startup
but due to the timeouts on SCSI this seems to typically have
the effect of causing the service to not be able to start at
all. So the 'service' becomes less available than it would have
been without ASE. ASE is supposed to figure out if the loss
of 'this' disk will mean the loss of a volume (if the disk is
part of an LSM volume). "Loss of a disk" could also be loss
of a controller or cable - in which case failover to another
node would make the service available again. LSM can't do that...
only ASE with LSM knowledge (and ASE_PARTIAL_MIRRORING) can
do that.
> LSM also does not monitor devices. It notices (hopefully) when a device has
> failed and makes a decision, based on what is left of the volume, as to whether
> or not it can still provide access to the data. In most cases, only one
> plex is affected, and the other plex in a volume can still be used to access
> the data. An error should be returned, and the plex should be disabled.
I can only think of ONE time when I observed ASE making a
determination that there was another good plex (an LSM concept,
yes?) and that the volume could still be used. There was a)
no indication that the bad plex was disabled b) no indication
that it checked more than the single volume which it was told
about.
Now that I have updated am.o, am_scsi.o, voldisk, vold, and
lsm_dg_action (3.2G) AND told ASE about all 54 volumes, it does
not indicate at all that it is going to check for a remaining
good plex.
> It is the system manager's job to monitor the health of his system and the
> devices. To help him, tools have been developed like console manager and
> system watchdog. And that may be what you are really looking for to help
> you provide the solution your customer needs.(?)
I'm looking for DECsafe/ASE, alias TruCluster Available Server,
to do what it takes to ensure that availability is enhanced.
It has been some time since I listened to Eric present the
ASE vision, but I must say that I remain convinced that at
least in the early versions there was talk of monitoring of
the storage to deal with loss of the ability to access a disk
from one system when it could still be accessed from another.
> To make this clear:
> ASE does NOT monitor the health of storage. It monitors the health of network
> connections and notices when members are very sick, or die.
Apparently you are correct about it not doing so. I don't see
it doing it. I also don't see it doing what it is definitely supposed
to do (as per your description above) when a failure to access a storage
component is detected.
> LSM does NOT monitor the health of storage. It notices when errors occur as it
> tries to access a device, and, if the data is mirrored, uses other devices that
> should have the same data on them to satisfy the request. It provides redundancy
> of data to allow for more highly available data. Even volwatch will only tell
> you when a problem is noticed; it does not examine the storage on a constant
> basis (at least I don't believe it does...).
....
> ASE is NOT a replacement for good system management. If anything it makes
> system management more complex, not less labor intensive. System managers
> must still know how to deal with disk failures, network failures, system
> crashes, etc. ASE just tries to ensure that the services are available for
> the users, as best it can. This should minimize the down time of a service.
ASE should not introduce behaviors that reduce the availability of
'services' to users. What I am seeing is that with the newest
patches when a disk (which has a mirror) is pulled, the
application attempting to access the volume is hung for an
indeterminate and generally unacceptably long time. When the same
volumes are tested with the service turned offline, volumes and
disk groups online and imported by hand, LSM does what it is supposed
to do - keep the failure of the disk out of the way of the
application trying to access the volume.
This says to me that the behavior of ASE is somewhere between
"highly undesireable" and "totally unacceptable". I am sure
it is not what was intended by its creators - but that I have
bumped into either a bug or a set of assumptions that the creators
had that my dozen or so pretty big SAP installations do not meet.
I would like to be able to continue to assure customers that
using ASE will enhance their applications' availability. If
I have screwed something up, or if there is a communications
issue (human communications) that can be cleared up, then we can
have a bunch of much happier customers and I can get some rest.
If there is a bug, I think we would all like to identify it
and get it fixed.
> Also, engineering is home, and very busy. It has been repeated many times
> in this, and other conferences, that notes files are not a support mechanism.
> If you have a support issue, i.e. if LSM is hanging, please open a CLD.
My apologies for implying that my desperate pleas for help were
being ignored. Just gotta wonder why nobody would be interested in
what I (and Manu) were asking about - we've been working with this
stuff and this group for a long time and might actually have
something worthwhile to contribute from time to time. On the other
hand we might just be rambling after a 24 hour straight day spent
at a major customer who is going 'live' with SAP in 48 hrs,
struggling to make everything work sensibly.
|
1825.12 | Go to the future ... | BACHUS::DEVOS | Manu Devos DEC/SI Brussels 856-7539 | Tue Feb 04 1997 04:10 | 20 |
| Dave,
I think HSZ40 is the real problem here...
I have NEVER encountered the situation you describe with a non-HSZ40 config.
I agree with your concern about the fact that we, in the field, are
NEVER playing with ASE alone, but with the whole config, soft & hard, with
LSM, ADVFS, DECNSR (Oracle, SAP, Triton, SMS, Clinicom...), and as we say in
French, "La solidite d'une chaine depend de son plus faible maillon", which I
would translate as "A chain is only as strong as its weakest link".
Thus, maybe (are you a taker, Eric?), a new notes file ASE_IN_THE_FIELD should
be created, which would be monitored by the ASE-LSM-ADVFS-DECNSR engineers?
|
1825.13 | | XIRTLU::schott | Eric R. Schott USG Product Management | Tue Feb 04 1997 08:10 | 24 |
| Hi
I agree with your concern about full system testing. The hi-test
program is supposed to provide a method for better full system
testing to be done. I think if you have input on what they should
be testing, you should discuss this with Kevin Dadoly.
As to running into problems with products working together...I think
these need to be raised with the product groups involved. I understand
this may mean interacting in multiple notes conferences, but I don't
see that creating a new conference is going to solve the problem (I
have no problem with another conference, I just don't think it will
get the interactions you are requesting).
It would be good to get a write-up of the applications folks have
integrated with ASE (so that others might learn from this)...it would
also help folks in teams like hi-test adjust their testing plans to
include the most common products. I think some of the writeups/scripts
should be posted to this conference for ASE/clusters in addition to
the conference for the product involved (if there is a conference).
regards
Eric
|