T.R | Title | User | Personal Name | Date | Lines |
---|
4858.1 | What causes timeout exceptions when the scripts aren't timing out? | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Mon Apr 12 1993 09:56 | 37 |
|
PJDM
>> What causes timeout exceptions when the scripts aren't timing out?
Inside the Script AM is a timer which watches over executing scripts.
The timer is set to your Script Instance's Time-Out value or, if that's
not specified, to the Default-Timeout value (show mcc 0 script_am all char).
If the timer runs out, the process that is executing the script is deleted
and the 'Script has timed out' exception is returned.
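A minimal sketch of that watchdog behaviour, in Python rather than the
Script AM's own code. The function name, its parameters, and the 60-second
default are assumptions for illustration only; the real limits come from the
Time-Out and Default-Timeout attributes.

```python
import subprocess


def run_script(argv, instance_timeout=None, default_timeout=60.0):
    """Run a script under a watchdog timer (hypothetical helper)."""
    # The per-instance value wins; otherwise fall back to the default.
    limit = instance_timeout if instance_timeout is not None else default_timeout
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    try:
        out, _ = proc.communicate(timeout=limit)
    except subprocess.TimeoutExpired:
        proc.kill()         # delete the process that is executing the script
        proc.communicate()  # reap it so that no zombie is left behind
        raise TimeoutError("The script has timed out")
    return out
```

A caller would see either the script's output or the timeout exception,
never both.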
>> I'm using the mcc_df script and an alarm ... A lot of exception alarms
>> are firing [saying] "The script has timed out". This is obviously not true,
>> because I've put some logging information in the script
This is weird because Alarms will wait forever for the Script to complete.
It is up to the Script AM to limit the execution time.
Unless you are using Global Wildcards, which I don't think you are.
If you write a Rule using a global wildcard, Alarms will poll every
global entity as specified in your Domain. Alarms has a timer which is
only turned on when processing global wildcards. The timer allows a
maximum of 1 minute for each global entity to respond. This timer was
added because the entities are polled one at a time; if one entity (or
access module) never responded the Rule would get stuck.
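The one-at-a-time polling with a per-entity cap might look roughly like
this. It is a sketch only: poll_entities, the poll callback, and the result
tuples are hypothetical names, not Alarms internals. Only the one-minute
limit comes from the description above.

```python
import threading


def poll_entities(entities, poll, per_entity_timeout=60.0):
    """Poll entities one at a time, giving up on any that exceeds the cap."""
    results = {}
    for entity in entities:
        box = {}

        def worker(e=entity, b=box):
            b["value"] = poll(e)   # error handling for poll() is omitted

        t = threading.Thread(target=worker, daemon=True)
        t.start()
        t.join(per_entity_timeout)  # wait at most the cap for this entity
        if t.is_alive():
            # The entity never answered; move on so the rule is not stuck.
            results[entity] = ("timeout", None)
        else:
            results[entity] = ("ok", box.get("value"))
    return results
```

Without the cap, a single silent entity would block every entity after it
in the list.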
>> How can I look "inside" MCC to find out why the exception alarms
>> are firing?
The Script AM will only return the 'timeout' exception if the Timer has
stopped the Script. Alarms is simply passing this exception on to you.
I'll try Rules on the Disk Script and see if I can make any sense
out of this.
/keith
|
4858.2 | SCRIPT AM timeouts ... | EEMELI::KINNARI | | Wed Apr 14 1993 12:30 | 28 |
|
Hi,
I've seen similar behaviour (and actually a few related problems too).
For example, if I start a graph operation on mcc_df and select 4
attributes, it might (or might not) work for a while. After a few polls
some (or all) attributes stop working (the guy is only standing in
the corner). But the most interesting thing happens after that:
if I do a show status from IMPM on mcc_df, the script returns data
(the display updates the arguments) but control is not released
(i.e. the only thing you can push is the stop button and the cursor
is a watch). I assume that will time out after
a few minutes.
A similar thing happens with my own scripts. (I have built
a telnet emulator which connects to CASE muxes and returns status
information.) I have defined 10 separate alarm rules for the scripts,
and if I enable them all at the same time, usually 3-5 of them time out.
But I'm absolutely sure that the scripts return arguments.
My problems cannot be unique, because this happens to me
with both T1.3 and V1.3 on every Ultrix system where I have installed the products.
Rgds,
//pasi
|
4858.3 | Historian and Graph could make lots of calls to the Script AM | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Wed Apr 14 1993 15:56 | 21 |
| Help me here .. I'm grasping at straws .. 8}
If you Graph multiple attributes (from the same partition), the
Graph Widget makes multiple calls to the MM, even though all the
attributes of a partition are returned in one call.
If you were Graphing 5 Attributes from the Script AM, then 5 calls are
made, 5 sub-processes are created, ... I'm thinking there could be
something wrong with the Thread Terminator (the thing that times out
the executing Scripts). With a whole bunch of requests to the Script
AM, the Thread Terminator could be screwing up and terminating the
wrong Scripts .. or anything could be possible (guess if what I am
saying was true, I'd expect an ACCVIO instead).
The output files created by the script AM are based on the process ID
and thread ID, so I doubt that there is a conflict there.
Please remind me .. are these problems *ONLY* occurring on Ultrix? Or
are people seeing them on VMS too ?
/keith
|
4858.4 | | BIGUN::MAYNE | `AXP!': Bill the Cat | Thu Apr 15 1993 01:25 | 53 |
| I can't help but agree with .2 and .3: it seems to be resource/timing related.
If I enable two or three rules about 20ish seconds apart (the rules poll every
minute) they work fine. If I enable two rules at the same time, one of them
consistently falls over. If I enable two rules separately, then another one
close to one of the first two, it falls over.
If I choose four rules and do a show status, sometimes all four work, sometimes
one of them falls over and I get an exception dialogue box with a timeout value
of something like "+0-00:01:00.000I0.000" or "+0-00:00:05.000I-----" (different
values for different rules/scripts). What is the I0.000 or I----- meant to
represent?
Watching the log file from the script (using tail -f) and the show status
window provides food for thought: the exception dialogue box giving the timeout
exception appears before the script finishes executing, but does not kill the
script: the script finishes writing all the information to its log file.
Below is the script I am using, altered from the mcc_df script.
I'm using THRBASE112. MCC has THRBASE111: is that any more (or less) reliable,
because I don't remember these problems then.
PJDM
#!/bin/csh -f
#
# Modified from the mcc_df script provided in the DECmcc kit.
# This one takes a device name and hostname as parameters and performs the
# act on the given host.
#
#! argument 1 has the file system name (optionally)--this is used to filter
#! argument 2 is an optional system to rsh to
set log=${1}${2}
set on_host
if ($2 != "") then
set on_host="rsh $2"
endif
echo BEGIN `date` $1 $2>>/tmp/$log
if($1 != "") then
$on_host df -i | \
nawk 'BEGIN { N = 1 } {gsub("%", " "); if(N > 2) print; N = N+1; }' | grep $1|tee -a /tmp/$log
else
$on_host df -i | nawk 'BEGIN { N = 1 } {gsub("%", " "); if(N > 2) print; N = N+1; }'|tee -a /tmp/$log
endif
echo END `date` $1 $2>>/tmp/$log
exit
|
4858.5 | Ultrix only | EEMELI::KINNARI | | Thu Apr 15 1993 04:39 | 9 |
|
I have seen script problems only in Ultrix environment.
One thing which is quite frustrating about this script problem is
the fact that my alarm rule is disabled after a timeout ...
Rgds,
//pasi
|
4858.6 | Only Ultrix .. thats a starting point | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Thu Apr 15 1993 12:04 | 17 |
| RE: .5
> I have seen script problems only in Ultrix environment.
Good .. I'll poke around there
> One thing which is quite frustrated in this script problem is
> the fact that in my alarm rule will be disabled after timeout ...
That's weird .. The Rule should only become disabled if the MM
(the Script AM) returns an Exception with a Problem Persistence
defined as 'Permanent'. I'll have to check to see what is
being returned from the Script AM.
Thanks /keith
|
4858.7 | Just did some Ultrix Script AM testing .. Not happy with the results 8( | MOLAR::ROBERTS | Keith Roberts - Network Management Applications | Fri Apr 16 1993 12:25 | 32 |
| Aha!
I have discovered that the Script AM cannot handle three (3)
concurrent Show requests !! 8(
I selected Graphing all the attributes from the mcc_df script;
I think there were 6 or 7 of them. A little while later all the
Little men grabbed their heads with the Timeout Error.
I selected 2 attributes to Graph and it worked fine.
I selected 3 attributes to Graph and they timed out again.
I did the same test with 3 FCL windows ... 2 windows with Show Status
worked, adding the 3rd caused them all to hang.
The Thread-Terminator is doing its job. The 'spawned' jobs are not
returning (or something) so the Terminator is timing out the data request.
I don't know what the problem is 8( ... The Script AM forks a process
to execute the Script. The parent process waits on a signal that the
child has completed .. it just never comes out of that wait.
I did determine that the Scripts are running and returning their data.
The Script AM (parent) just isn't waking up to process the data.
----------
Any Ultrix people out there that can help me out here? Could using
Signals and CMA threads be causing the problems ??
Thanks /keith
|
4858.8 | | TOOK::SWIST | Jim Swist LKG2-2/T2 DTN 226-7102 | Tue Apr 20 1993 09:45 | 37 |
| As a general comment, signals and CMA are not exactly a marriage made
in heaven.
The following appears to work:
fork/exec your subprocess (make sure you use the cma_fork wrapper,
although this will happen automatically if you include the normal
mcc_interface_def.h file)
you need to run a daemon thread to listen for child termination
signals - you can't wait from the original thread if you have
more than one (fork) outstanding since signal delivery knows nothing
about threads and you will get your wires crossed.
the daemon thread does a sigwait() - this is a cma call that will
stall (that thread only) for the specified signal (SIGCHLD in this
case). When you get notified, figure out which main thread it
is for by doing a wait3() system service which will return the pid
of the terminating child, then wake up the main thread via
CMA condition variables. YOU THEN HAVE TO LOOP BACK ON THE WAIT3 until
you get no more pids returned, then loop back on the sigwait().
The loops are because Ultrix signals do not stack, you have to
reap** all the exiting children you can before exiting the handler.
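The pattern above can be sketched as follows. This is Python using
POSIX-style calls, not CMA or the MCC code: threading.Condition stands in
for a CMA condition variable, os.posix_spawnp stands in for the fork/exec
step, and the names start_reaper, run_child, _done and _cond are made up
for illustration.

```python
import os
import signal
import threading

_done = {}                     # pid -> wait status of children already reaped
_cond = threading.Condition()  # stand-in for a CMA condition variable


def _reaper():
    """Daemon thread: the only place in the process that waits for SIGCHLD."""
    while True:
        signal.sigwait({signal.SIGCHLD})    # stalls this thread only
        while True:                         # signals do not stack: drain all
            try:
                pid, status, _rusage = os.wait3(os.WNOHANG)
            except ChildProcessError:       # no children exist at all
                break
            if pid == 0:                    # children exist, none has exited
                break
            with _cond:
                _done[pid] = status
                _cond.notify_all()          # wake the thread that owns this pid


def start_reaper():
    """Call once from the main thread before starting any other thread."""
    # A system may discard a default-ignored signal even while it is blocked,
    # so install a do-nothing handler; sigwait() still consumes the signal.
    signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    # Block SIGCHLD here so that every thread created later inherits the mask.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    threading.Thread(target=_reaper, daemon=True).start()


def run_child(argv):
    """Start a child, then sleep until the reaper reports that it has ended."""
    pid = os.posix_spawnp(argv[0], argv, os.environ)
    with _cond:
        _cond.wait_for(lambda: pid in _done)
        return os.WEXITSTATUS(_done.pop(pid))
```

Each main thread calls run_child() and blocks only on the condition
variable, so it never competes with the other threads for the signal.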
Confused? There are variations of the above theme in several places
throughout MCC - I'd guess it's taken *many* tries to get them to
work in all cases.
Have fun.
** This word is used because one such piece of code is kindly called
the "Night of the Living Dead Zombie Killer Grim Reaper". Turns out
if you are doing subprocess startups and you don't care if and when
those subprocesses end, you *still* need some variation on the above
code, otherwise your system clogs with zombie processes.
|