Conference azur::mcc

Title:	DECmcc user notes file. Does not replace IPMT.
Notice:	Use IPMT for problems. Newsletter location in note 6187
Moderator:	TAEC::BEROUD

Created:	Mon Aug 21 1989
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	6497
Total number of notes:	27359

4858.0. "Many exception alarms?" by BIGUN::MAYNE (`AXP!': Bill the Cat) Sat Apr 10 1993 22:57

I'm using the mcc_df script and an alarm (as described in the script manual) to
check for free disk space.

A lot of exception alarms are firing. The alarm information says "The script has
timed out". This is obviously not true, because I've put some logging
information in the script (piping the output from df through tee to a file) and
the information is being printed from the script within a second or two. (I've
bumped up the timeout time to 10 seconds.)

The alarms are set to go off every minute (and they will go off, because I've
deliberately used lots of disk space). I'm running them on three different
systems, locally and using rsh. The script works fine every time I use it from
the ULTRIX command line.

How can I look "inside" MCC to find out why the exception alarms are firing?
What causes timeout exceptions when the scripts aren't timing out?

PJDM

T.R	Title	User	Personal Name	Date	Lines
4858.1	What causes timeout exceptions when the scripts aren't timing out?	MOLAR::ROBERTS	Keith Roberts - Network Management Applications	`Mon Apr 12 1993 09:56`	37
PJDM >> What causes timeout exceptions when the scripts aren't timing out?

Inside the Script AM is a timer which watches over executing scripts. The timer
is set to your Script Instance's Time-Out value or, if that's not specified, the
Default-Timeout value (show mcc 0 script_am all char). If the timer runs out,
the process that is executing the script is deleted and the 'Script has timed
out' exception is returned.

>> I'm using the mcc_df script and an alarm ... A lot of exception alarms
>> are firing [saying] "The script has timed out". This is obviously not true,
>> because I've put some logging information in the script

This is weird, because Alarms will wait forever for the Script to complete. It
is up to the Script AM to limit the execution time. Unless you are using Global
Wildcards, which I don't think you are.

If you write a Rule using a global wildcard, Alarms will poll every global
entity as specified in your Domain. Alarms has a timer which is only turned on
when processing global wildcards. The timer allows a maximum of 1 minute for
each global entity to respond. This timer was added because the entities are
polled one at a time; if one entity (or access module) never responded, the
Rule would get stuck.

>> How can I look "inside" MCC to find out why the exception alarms
>> are firing?

The Script AM will only return the 'timeout' exception if the Timer has stopped
the Script. Alarms is returning this exception to you.

I'll try Rules on the Disk Script and see if I can make any sense out of this.

/keith
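The timer behaviour described in .1 can be sketched roughly as follows. This is only an illustration in Python, not the Script AM's own code: `run_script`, `ScriptTimedOut` and `DEFAULT_TIMEOUT` are hypothetical names, and the 10-second default is an assumed value. The sketch uses the instance timeout if one is given, otherwise the default, and kills the child process when the timer runs out.

```python
import subprocess

# Hypothetical stand-in for the Script AM's Default-Timeout characteristic.
DEFAULT_TIMEOUT = 10.0


class ScriptTimedOut(Exception):
    """Hypothetical stand-in for the 'Script has timed out' exception."""


def run_script(argv, instance_timeout=None):
    """Run argv and return its output; raise ScriptTimedOut if the timer expires."""
    # Use the instance's own timeout if set, otherwise fall back to the default.
    timeout = instance_timeout if instance_timeout is not None else DEFAULT_TIMEOUT
    try:
        # subprocess.run kills and reaps the child if the timeout expires.
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        raise ScriptTimedOut("The script has timed out") from None
    return done.stdout
```

Note that in this sketch a timeout always means the child was killed. That is the behaviour .1 expects, and it is what makes the reports in this topic (timeout raised, script still finishing) look odd.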
4858.2	SCRIPT AM timeouts ...	EEMELI::KINNARI		`Wed Apr 14 1993 12:30`	28
Hi,

I've seen similar behaviour (and actually a few related problems too).

For example, if I start a graph operation on mcc_df and select 4 attributes, it
might (or might not) work for a while. After a few polls some (or all)
attributes stop working (the guy is only standing in the corner). But the most
interesting thing happens after that: if I do a show status from IMPM on
mcc_df, the script returns data (the display updates the arguments) but control
is not released (i.e. the only thing you can push is the stop button and the
cursor is a watch). I assume that will time out after a few minutes.

A similar thing happens with my own scripts. (I have built a telnet emulator
which connects to CASE muxes and returns status information.) I have defined 10
separate alarm rules for the scripts, and if I enable them at the same time,
usually 3-5 of them time out. But I'm absolutely sure that the scripts return
arguments.

My problems cannot be unique, because this happens to me in both T1.3 and V1.3,
on every Ultrix system where I have installed the products.

Rgds, //pasi
4858.3	Historian and Graph could make lots of calls to the Script AM	MOLAR::ROBERTS	Keith Roberts - Network Management Applications	`Wed Apr 14 1993 15:56`	21
Help me here .. I'm grasping at straws .. 8}

If you Graph multiple attributes (from the same partition), the Graph Widget
makes multiple calls to the MM, even though all the attributes of a partition
are returned in one call. If you were Graphing 5 Attributes from the Script AM,
then 5 calls are made, 5 sub-processes are created, ...

I'm thinking there could be something wrong with the Thread Terminator (the
thing that times out the executing Scripts). With a whole bunch of requests to
the Script AM, the Thread Terminator could be screwing up and terminating the
wrong Scripts .. or anything could be possible (I guess if what I am saying
were true, I'd expect an ACCVIO instead).

The output files created by the Script AM are based on the process ID and
thread ID, so I doubt that there is a conflict there.

Please remind me .. are these problems ONLY occurring on Ultrix? Or are people
seeing them on VMS too?

/keith
4858.4		BIGUN::MAYNE	`AXP!': Bill the Cat	`Thu Apr 15 1993 01:25`	53
I can't help but agree with .2 and .3: it seems to be resource/timing related.

If I enable two or three rules about 20ish seconds apart (the rules poll every
minute) they work fine. If I enable two rules at the same time, one of them
consistently falls over. If I enable two rules separately, then another one
close to one of the first two, it falls over.

If I choose four rules and do a show status, sometimes all four work, sometimes
one of them falls over and I get an exception dialogue box with a timeout value
of something like "+0-00:01:00.000I0.000" or "+0-00:00:05.000I-----" (different
values for different rules/scripts). What is the I0.000 or I----- meant to
represent?

Watching the log file from the script (using tail -f) and the show status
window provides food for thought: the exception dialogue box giving the timeout
exception appears before the script finishes executing, but does not kill the
script: the script finishes writing all the information to its log file.

Below is the script I am using, altered from the mcc_df script.

I'm using THRBASE112. MCC has THRBASE111: is that any more (or less) reliable,
because I don't remember these problems then.

PJDM

#!/bin/csh -f
#
# Modified from the mcc_df script provided in the DECmcc kit.
# This one takes a device name and hostname as parameters and performs the
# act on the given host.
#
#! argument 1 has the file system name (optionally)--this is used to filter
#! argument 2 is an optional system to rsh to

set log=${1}${2}

set on_host
if ($2 != "") then
  set on_host="rsh $2"
endif

echo BEGIN `date` $1 $2>>/tmp/$log
if($1 != "") then
  $on_host df -i | \
  nawk 'BEGIN { N = 1 } {gsub("%", " "); if(N > 2) print; N = N+1; }' | grep $1|tee -a /tmp/$log
else
  $on_host df -i | nawk 'BEGIN { N = 1 } {gsub("%", " "); if(N > 2) print; N = N+1; }'|tee -a /tmp/$log
endif
echo END `date` $1 $2>>/tmp/$log

exit
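For readers who don't speak nawk, the filter in the script above does two things: it drops the first two lines of the df -i output (the headers) and replaces every "%" with a space; the optional grep then keeps only the lines that mention the requested file system. A rough Python equivalent is sketched below. It is an illustration only: `filter_df` is a hypothetical name, and plain substring matching stands in for grep's regular expressions.

```python
def filter_df(text, filesystem=""):
    """Mimic the nawk/grep pipeline of the script on captured df -i output."""
    rows = []
    for n, line in enumerate(text.splitlines(), start=1):
        # gsub("%", " "): replace every percent sign with a space.
        line = line.replace("%", " ")
        # if (N > 2) print: skip the two header lines.
        # grep $1 (approximated here by a substring test); "" matches all lines.
        if n > 2 and filesystem in line:
            rows.append(line)
    return rows
```

Calling `filter_df(output)` corresponds to the else branch of the script, and `filter_df(output, "/usr")` to the branch that greps for the first argument.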
4858.5	Ultrix only	EEMELI::KINNARI		`Thu Apr 15 1993 04:39`	9
I have seen script problems only in the Ultrix environment.

One thing which is quite frustrating about this script problem is the fact
that my alarm rule is disabled after the timeout ...

Rgds, //pasi
4858.6	Only Ultrix .. thats a starting point	MOLAR::ROBERTS	Keith Roberts - Network Management Applications	`Thu Apr 15 1993 12:04`	17
RE: .5

> I have seen script problems only in the Ultrix environment.

Good .. I'll poke around there.

> One thing which is quite frustrating about this script problem is
> the fact that my alarm rule is disabled after the timeout ...

That's weird .. The Rule should only become disabled if the MM (the Script AM)
returns an Exception with a Problem Persistence defined as 'Permanent'. I'll
have to check to see what is being returned from the Script AM.

Thanks

/keith
4858.7	Just did some Ultrix Script AM testing .. Not happy with the results 8(	MOLAR::ROBERTS	Keith Roberts - Network Management Applications	`Fri Apr 16 1993 12:25`	32
Aha! I have discovered that the Script AM can only process three (3)
concurrent Show requests !! 8(

I selected Graphing all the attributes from the mcc_df script; I think there
were 6 or 7 of them. A little while later all the Little men grabbed their
heads with the Timeout Error. I selected 2 attributes to Graph and it worked
fine. I selected 3 attributes to Graph and they timed out again.

I did the same test with 3 FCL windows ... 2 windows with Show Status worked;
adding the 3rd caused them all to hang.

The Thread-Terminator is doing its job. The 'spawned' jobs are not returning
(or something), so the Terminator is timing out the data request.

I don't know what the problem is 8( ... The Script AM forks a process to
execute the Script. The parent process waits on a signal that the child has
completed .. it just never comes out of that wait.

I did determine that the Scripts are running and returning their data. The
Script AM (parent) just isn't waking up to process the data.

----------

Any Ultrix people out there that can help me out here? Could using Signals and
CMA threads be causing the problems ??

Thanks

/keith
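One general way a parent that expects one signal per child can end up waiting forever: ordinary UNIX signals are not queued, so several children exiting close together may leave only a single SIGCHLD pending. The Python sketch below illustrates that effect. It is not a diagnosis of the Script AM: `count_signals_vs_children` is a hypothetical name, and the half-second sleep is a crude assumption about how long the children need to exit.

```python
import os
import signal
import time


def count_signals_vs_children(n_children=3):
    """Fork children that exit at once; return (signals_seen, children_reaped)."""
    chld = {signal.SIGCHLD}
    # Block SIGCHLD in this thread so that it stays pending instead of being handled.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, chld)
    try:
        pids = []
        for _ in range(n_children):
            pid = os.fork()
            if pid == 0:
                os._exit(0)  # child: exit immediately
            pids.append(pid)
        time.sleep(0.5)  # crude: give every child time to exit

        # Count the SIGCHLDs that are actually waiting for us.
        signals_seen = 0
        while signal.SIGCHLD in signal.sigpending():
            signal.sigwait(chld)  # consumes one pending SIGCHLD
            signals_seen += 1

        # Reap every child explicitly, whether or not a signal announced it.
        reaped = 0
        for pid in pids:
            os.waitpid(pid, 0)
            reaped += 1
        return signals_seen, reaped
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

On systems where signals do not queue, signals_seen is usually smaller than children_reaped, so code that wakes exactly one waiter per signal would leave some waiters asleep.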
4858.8		TOOK::SWIST	Jim Swist LKG2-2/T2 DTN 226-7102	`Tue Apr 20 1993 09:45`	37
As a general comment, signals and CMA are not exactly a marriage made in
heaven. The following appears to work:

Fork/exec your subprocess (make sure you use the cma_fork wrapper, although
this will happen automatically if you include the normal mcc_interface_def.h
file).

You need to run a daemon thread to listen for child termination signals - you
can't wait from the original thread if you have more than one (fork)
outstanding, since signal delivery knows nothing about threads and you will get
your wires crossed.

The daemon thread does a sigwait() - this is a cma call that will stall (that
thread only) for the specified signal (SIGCHLD in this case). When you get
notified, figure out which main thread it is for by doing a wait3() system
service, which will return the pid of the terminating child, then wake up the
main thread via CMA condition variables.

YOU THEN HAVE TO LOOP BACK ON THE WAIT3 until you get no more pids returned,
then loop back on the sigwait(). The loops are because Ultrix signals do not
stack; you have to reap** all the exiting children you can before exiting the
handler.

Confused? There are variations on the above theme in several places throughout
MCC - I'd guess it's taken many tries to get them to work in all cases.

Have fun.

** This word is used because one such piece of code is kindly called the
"Night of the Living Dead Zombie Killer Grim Reaper". Turns out if you are
doing subprocess startups and you don't care if and when those subprocesses
end, you still need some variation on the above code, otherwise your system
clogs with zombie processes.
