T.R | Title | User | Personal Name | Date | Lines |
---|
1149.1 | More info - more confusion | SNOC02::MISNETWORK | Take a byte | Tue Jun 18 1991 00:01 | 86 |
| More info.
I know that last night my alarms worked when an event happened on one of my
circuits, but again, today it is very much broken.
The MCC_DNA4_EVL log showed the following -
$ set proc/priv=(all,nobypass)
$ manage/enter/presen=mcc_dna4_evl
Network object MCC_DNA4_EVL is declared, Status = 52854793
Waiting for the event message from EVL.....
The connection with EVL is established.
** Unable to connect to NMCC **
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
.
.
.
Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420
%SYSTEM-F-LINKABORT, network partner aborted logical link
TASSONE job terminated at 18-JUN-1991 11:46:34.63
I tried the DISABLE/ENABLE trick with the local sink monitor without any
success, again the log as follows -
$ manage/enter/presen=mcc_dna4_evl
Network object MCC_DNA4_EVL is declared, Status = 52854793
Waiting for the event message from EVL.....
I tried the DISABLE/ENABLE trick a second time with the following results -
MCC> disab node4 sprnet local sink monitor
Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:31
Disable completed successfully.
MCC> enabl node4 sprnet local sink monitor
Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:34
Internal error in DECnet Phase IV AM.
VMS Error = %SYSTEM-F-DUPLNAM, duplicate name
MCC> enabl node4 sprnet local sink monitor
Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:59:59
Enable completed successfully.
Tried zeroing my counters with the following results -
MCC> getevent node4 * any event
%%%%%%%%%%% OPCOM 18-JUN-1991 12:01:01.00 %%%%%%%%%%%
Message from user DECNET on SPRNET
DECnet event 0.9, counters zeroed
From node 59.1 (SPRNET), 18-JUN-1991 12:01:00.02
Node 59.1 (SPRNET)
%%%%%%%%%%% OPCOM 18-JUN-1991 12:01:01.79 %%%%%%%%%%%
Message from user AUDIT$SERVER on SPRNET
Security alarm (SECURITY) and security audit (SECURITY) on SPRNET, system id: 65534
Auditable event: Network login failure
Event time: 18-JUN-1991 12:01:01.77
PID: 00000164
Username: ILLEGAL
Remote nodename: SPRNET Remote node id: 60417
Remote username: TASSONE
Status: %LOGIN-F-NOSUCHUSER, no such user
NCP showed following -
MCC_DNA4_EVL 0 00000163
TASK 0 ILLEGAL
HELP!!! What is happening here. My once beloved uncomplaining fully operational
DECmcc is sick !
Cheers,
Louis
|
1149.2 | | TOOK::JEAN_LEE | | Tue Jun 18 1991 14:55 | 110 |
|
Hi Louis,
Thanks for entering these reports. Let me answer them sequentially.
1.
> $ manage/enter/presen=mcc_dna4_evl
> Network object MCC_DNA4_EVL is declared, Status = 52854793
> Waiting for the event message from EVL....
> but nothing happens, I see the events reaching my system with
> REPL/ENA=NET. I disabled/enabled my local sink monitor twice before it
> started to work again. This is causing some pain as I try to keep a log
> of all DECnet outages with the following command file, but it is not
> reliable .....
We have also experienced this. Toggling the state of the sink
usually clears the problem. We will investigate further whether this
is an expected behaviour of EVL or not.
2.
> Waiting for the event message from EVL.....
> The connection with EVL is established.
> ** Unable to connect to NMCC **
> Ready to read the next event message...
> Failed to send event = 409 to MCC event manager, INSEVTPOOLMEM
> Ready to read the next event message...
> Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
> Ready to read the next event message...
> Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
> Ready to read the next event message...
> Failed to send event = 410 to MCC event manager, INSEVTPOOLMEM
> Ready to read the next event message...
> Failed to send event = 407 to MCC event manager, INSEVTPOOLMEM
> Ready to read the next event message...
> Failed to receive an event from EVL, status = 8420
> %SYSTEM-F-LINKABORT, network partner aborted logical link
This means that MCC event manager is running out of its virtual memory.
This problem needs further investigation. I will report the findings
in a future note.
3.
================================================================================
Note 1149.1 EVENTS problem 1 of 1
SNOC02::MISNETWORK "Take a byte" 86 lines 17-JUN-1991 23:01
-< More info - more confusion >-
--------------------------------------------------------------------------------
> Ready to read the next event message...
> Ready to read the next event message...
> Failed to receive an event from EVL, status = 8420
> %SYSTEM-F-LINKABORT, network partner aborted logical link
A broken logical link between EVL and the event sink can have many
causes: a node reachability change, a circuit state change, a line
problem, etc., just like any connectivity between two nodes. When this
happens, I would check the system EVL.LOG right away (not the
mcc_dna4_evl.log) to find out the cause. Depending
on the cause, restarting the sink or EVL immediately may not always be
the right answer. MCC does not control the connectivity between
EVL and MCC sink, except using ENABLE or DISABLE to start or abort
the sink process. If the latter is the case, the log will tell you so.
4.
> I tried the DISABLE/ENABLE trick with the local sink monitor without any
> success, again the log as follows -
> $ manage/enter/presen=mcc_dna4_evl
> Network object MCC_DNA4_EVL is declared, Status = 52854793
> Waiting for the event message from EVL.....
> I tried the DISABLE/ENABLE trick a second time with the following results -
MCC> enable node4 sprnet local sink monitor
Node4 59.1 Local Sink Monitor
AT 18-JUN-1991 11:51:34
> Internal error in DECnet Phase IV AM.
> VMS Error = %SYSTEM-F-DUPLNAM, duplicate name
This means the sink monitor process is not completely gone yet.
Sometimes it takes a while for VMS to kill a process.
I would make sure the process mcc_dna4_evl is actually gone before I
enable it.
5.
> Tried zeroing my counters with the following results -
> MCC> getevent node4 * any event
> %%%%%%%%%%% OPCOM 18-JUN-1991 12:01:01.00 %%%%%%%%%%%
> Message from user DECNET on SPRNET
> DECnet event 0.9, counters zeroed
> From node 59.1 (SPRNET), 18-JUN-1991 12:01:00.02
> Node 59.1 (SPRNET)
In the above OPCOM message, this event occurred on sprnet
and is from node sprnet. In MCC's model, this event is considered an
event of node4 sprnet remote node sprnet.
Thus, you need to use this command to get the event:
MCC> getevent node4 sprnet remote node sprnet any event
|
1149.3 | Thanks for the info | SNOC02::MISNETWORK | Take a byte | Tue Jun 18 1991 20:54 | 22 |
| Thanks for the thorough reply. Good to see there are answers to some
of my problems, if not total solutions.
I checked my EVL.LOG and only found 2, one was fine but the latest
version showed the following -
$ RUN SYS$SYSTEM:EVL
%EVL-E-OPENMON, error creating logical link to monitor process
SPRNET::"TASK=mcc_dna4_evl"
-SYSTEM-F-INVLOGIN, login information invalid at remote node
%EVL-E-WRITEMON, error writing event record to monitor process
mcc_dna4_evl
-SYSTEM-F-FILNOTACC, file not accessed on channel
Must have been when I was turning the lights on and off. The log times
do not correspond to the MCC_DNA4_EVL log, so I will have to remember
next time to check the EVL.LOG when I get the network abort message.
Looking forward to your findings,
Cheers,
Louis
|
1149.4 | Need more info for INSEVTPOOLMEM | TOOK::T_HUPPER | The rest, as they say, is history. | Tue Jun 25 1991 14:06 | 33 |
| The inquiry into the INSEVTPOOLMEM error needs further input from you.
Are you receiving MCC_S_EVENT_LOST in your com file log when the sink
is reporting INSEVTPOOLMEM? This should be the case. If not, then
something is either not being reported, or the event pool is so full
that lost events cannot be delivered.
A good way to create a big problem in the current type of event pool is
to "stop" (exit handlers don't run) a DECmcc process that is receiving
events while other DECmcc processes are still running. The event pool
will still contain the abandoned mcc_event_get request structures.
These abandoned requests will still receive all matching events, but
will not read the event out of the event pool and free its memory. If
this is the case, the only way to free the memory is to exit from all
DECmcc processes on the system and restart them. There must be a point
in time when there are NO DECmcc processes running. Then the next
DECmcc process to perform an event operation will cause the event pool
to be recreated in its empty state. Are you stopping any DECmcc
processes on the system (any users) while leaving others running?
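The failure mode described above can be sketched in portable C. This is only a toy model, not the MCC kernel: the pool size, the request table, and the names `pool_init`, `request_events`, `put_event`, and `get_event` are all hypothetical, and `PUT_POOL_FULL` merely stands in for INSEVTPOOLMEM. It shows how a request whose owner never reads again keeps absorbing events until the shared pool is exhausted, even though a live Getter is reading promptly.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS   8   /* hypothetical pool capacity, in queued events */
#define MAX_REQUESTS 4   /* hypothetical limit on outstanding requests   */

/* PUT_POOL_FULL stands in for the INSEVTPOOLMEM condition. */
enum put_status { PUT_OK, PUT_POOL_FULL };

struct event_pool {
    int in_use[MAX_REQUESTS];  /* 1 if a request structure is registered  */
    int queued[MAX_REQUESTS];  /* events delivered to it but not yet read */
    int free_slots;            /* pool memory still available             */
};

void pool_init(struct event_pool *p)
{
    int i;

    assert(p != NULL);
    for (i = 0; i < MAX_REQUESTS; i++) {
        p->in_use[i] = 0;
        p->queued[i] = 0;
    }
    p->free_slots = POOL_SLOTS;
}

/* Register a request; returns its index, or -1 if the table is full. */
int request_events(struct event_pool *p)
{
    int i;

    for (i = 0; i < MAX_REQUESTS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Deliver one event to every registered request, live or abandoned. */
enum put_status put_event(struct event_pool *p)
{
    int i, wanted = 0;

    for (i = 0; i < MAX_REQUESTS; i++)
        wanted += p->in_use[i];
    if (wanted > p->free_slots)
        return PUT_POOL_FULL;
    for (i = 0; i < MAX_REQUESTS; i++) {
        if (p->in_use[i]) {
            p->queued[i]++;
            p->free_slots--;
        }
    }
    return PUT_OK;
}

/* A live Getter reads one event and frees its slot; returns 1 if it read one. */
int get_event(struct event_pool *p, int r)
{
    assert(r >= 0 && r < MAX_REQUESTS);
    if (!p->in_use[r] || p->queued[r] == 0)
        return 0;
    p->queued[r]--;
    p->free_slots++;
    return 1;
}
```

With one live and one abandoned request and the 8-slot pool assumed here, seven puts succeed and the eighth reports the pool as full, because the abandoned request holds seven of the eight slots. In this model the memory comes back only when the pool is reinitialised, which mirrors the "no DECmcc processes running" condition above.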
I would assume that a reboot of the system would also clean out the
event pool nicely. How long after the reboot did the sink report that
the pool had INSEVTPOOLMEM? How many events are correctly received
before lost events or no events are received? I would assume that
events are correctly received for a while, then lost events are
received, then no events are received. This would be the case if the
events were simply arriving too fast to be processed by the DECmcc
system. The event pool happens to be the most limited queue in the
events subsystem, so that is where the problem is reported. What is
the arrival rate of events in the event sink that are to be processed
by DECmcc? Also, what type of machine are you using, so we can get an
estimate of reasonable event throughput?
Ted Hupper
|
1149.5 | INSEVTPOOLMEM error gone | SNOC01::MISNETWORK | They call me LAT | Mon Jul 01 1991 00:00 | 8 |
| The INSEVTPOOLMEM error seems to have gone away, so I will not pursue
it at this stage. Things have been working pretty well, but I haven't
had a chance to check all the logs, so I will start doing that again.
Thanks for the advice,
Cheers,
Louis
|
1149.6 | still a prob | JETSAM::WOODCOCK | | Mon Jul 01 1991 10:29 | 62 |
| If it's OK, I'd like to pick up and follow through on this problem. I see this
INSEVTPOOLMEM almost daily, with MCC_DNA4_EVL going south after a dozen or
two. This is hampering my confidence in using EVENTS.
> The inquiry into the INSEVTPOOLMEM error needs further input from you.
> Are you receiving MCC_S_EVENT_LOST in your com file log when the sink
> is reporting INSEVTPOOLMEM? This should be the case. If not, then
> something is either not being reported, or the event pool is so full
> that lost events cannot be delivered.
I'm sure I'm not starting the process exactly like the base note, but I don't
believe I've ever seen a MCC_S_EVENT_LOST error.
> A good way to create a big problem in the current type of event pool is
> to "stop" (exit handlers don't run) a DECmcc process that is receiving
> events while other DECmcc processes are still running. The event pool
> will still contain the abandonned mcc_event_get request structures.
> These abandonned requests will still receive all matching events, but
> will not read the event out of the event pool and free its memory. If
> this is the case, the only way to free the memory is to exit from all
> DECmcc processes on the system and restart them. There must be a point
> in time when there are NO DECmcc processes running. Then the next
> DECmcc process to perform an event operation will cause the event pool
> to be recreated in its empty state. Are you stopping any DECmcc
> processes on the system (any users) while leaving others running?
Usually the only reason we stop processes is because they don't work. Once
the INSEVTPOOLMEM kills MCC_DNA4_EVL we of course have to restart it. This
is typically in the morning when we check for proper processes. As for
other MCC processes running, there probably are. It is unrealistic to stop ALL
MCC processes when we restart MCC_DNA4_EVL and the associated alarms. There
will ALWAYS be other alarm processes, recording, and exporting to take place.
We can't be restarting all MCC processes in the future when this occurs.
> I would assume that a reboot of the system would also clean out the
> event pool nicely.
It seems to, yes.
> How long after the reboot did the sink report that the pool had
> INSEVTPOOLMEM? How many events are correctly received before lost
> events or no events are received? I would assume that events are
> correctly received for a while, then lost events are received, then no
> events are received. This would be the case if the events were simply
> arriving too fast to be processed by the DECmcc system. The event pool
> happens to be the most limited queue in the the
> events subsystem, so that is where the problem is reported. What is
> the arrival rate of events in the event sink that are to be processed
> by DECmcc? Also, what type of machine are you using, so we can get a
> estimate of reasonable event throughput?
I'm not sure how long after a reboot. Restarting MCC_DNA4_EVL seems to work
for several hours though. I haven't seen anything in the present logs which
indicate lost events. Events coming in can be anything from 1 an hour to
5-10 per second. It depends on what is happening on the net. I'm now running
on an 8810 w/384M (it feels good to breathe again, the 3520 now only does the
display work). To say the least I should have enough fire power, and I'm gonna
let all sorts of MCC stuff **RIP** and bring us to the levels we should have
been at months ago.
best regards,
brad...
|
1149.7 | Please be careful how you kill background MCC processes | TOOK::GUERTIN | I do this for a living -- really | Mon Jul 01 1991 12:52 | 40 |
| If you are seeing INSEVTPOOLMEM when you look at the DNA4 EVL log file,
then I can understand it. It should (I assume) have some text around
it, like "The DNA4 Event Monitor just got a INSEVTPOOLMEM from the MCC
Event Manager!". On the other hand, if you are seeing this signalled
as a VMS message, then something doesn't make sense. That CVR should
always be trapped by the caller of the mcc_event_put() MCC kernel
routine.
In order to clean up a request of an event, the Requestor of an MCC
event must cancel the request. However, the Requestor cannot always
cancel, for example, if the user hits Control-Y, the Requestor may not
get control. We therefore have an Exit Handler in the Event Manager
to capture any remaining outstanding requests. On image exit, the
Event Manager cleans up whatever the Event Requestors could not. But
if someone does a $ STOP on an MCC process which is requesting events,
even the Exit Handlers do not get called. There is little we can do at
this point (being a user-mode event system). The Event Sinks generally
only PUT events, so stopping them (with a $ STOP) rarely (if ever)
would cause outstanding Requests to be left in the Event Pool.
Ideally, Event Sinks should be stopped by issuing some sort of
MCC> DISABLE <whatever> SINK command, which will cause a clean rundown
of the Event Sink. Check the Documentation for the exact command
syntax for the Sink you want to stop.
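The exit-handler cleanup described above can be sketched with the standard C `atexit()` facility. This is an analogy under stated assumptions, not the Event Manager's code: the request table and the names `post_request`, `cancel_request`, `rundown`, and `event_manager_init` are hypothetical. The point it illustrates is the same one made above: a handler registered this way runs on a normal image exit, but never gets control if the process is killed outright, so whatever it would have cancelled is left behind.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_REQUESTS 4   /* hypothetical size of the per-process request table */

/* 1 = this process still has a request outstanding in the shared pool. */
static int outstanding[MAX_REQUESTS];

/* Post a request; returns its index, or -1 if the table is full. */
int post_request(void)
{
    int i;

    for (i = 0; i < MAX_REQUESTS; i++) {
        if (!outstanding[i]) {
            outstanding[i] = 1;
            return i;
        }
    }
    return -1;
}

/* The well-behaved path: the Requestor cancels its own request. */
void cancel_request(int id)
{
    assert(id >= 0 && id < MAX_REQUESTS);
    outstanding[id] = 0;
}

int outstanding_count(void)
{
    int i, n = 0;

    for (i = 0; i < MAX_REQUESTS; i++)
        n += outstanding[i];
    return n;
}

/* Exit handler: cancel whatever the Requestors did not cancel themselves. */
void rundown(void)
{
    int i;

    for (i = 0; i < MAX_REQUESTS; i++)
        outstanding[i] = 0;
}

/* Register the handler once, at initialisation; returns 0 on success. */
int event_manager_init(void)
{
    return atexit(rundown);
}
```

A process that returns from `main` or calls `exit()` runs `rundown`; one that is terminated without running its handlers does not, which is the situation a `$ STOP` creates for a user-mode event system.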
There are some MCC processes which run in the background (no user
interface), but also do GETEVENTs. These need to be aborted WITHOUT
stopping them (e.g., DO NOT use the DCL $ STOP command). An example
might be MCC Alarms running in batch. If you DO abort a background MCC
Alarms process that way, it will almost always cause garbage (mostly invalid
request information) to be left in the Event Pool. The Putters (e.g.,
DNA4 Event Sinks) would see these as valid requests for events, and post
events to the Event Manager. After a while, the Events will flood the
Event Pool, and you have to take fairly drastic measures (killing all
processes using MCC) to clean things up.
Do you have to kill background MCC Alarms processes? If so, how do
you kill them?
-Matt.
|
1149.8 | more info/questions | JETSAM::WOODCOCK | | Mon Jul 01 1991 15:07 | 54 |
|
> If you are seeing INSEVTPOOLMEM when you look at the DNA4 EVL log file,
> then I can understand it. It should (I assume) have some text around
> it, like "The DNA4 Event Monitor just got a INSEVTPOOLMEM from the MCC
> Event Manager!". On the other hand, if you are seeing this signalled
> as a VMS message, then something doesn't make sense. That CVR should
> always be trapped by the caller of the mcc_event_put() MCC kernel
> routine.
The INSEVTPOOLMEM error is indeed seen in the MCC_DNA4_EVL.LOG.
> In order to clean up a request of an event, the Requestor of an MCC
> event must cancel the request. However, the Requestor cannot always
> cancel, for example, if the user hits Control-Y, the Requestor may not
> get control. We therefore have an Exit Handler in the Event Manager
> to capture any remaining outstanding requests. On image exit, the
> Event Manager cleans up whatever the Event Requestors could not. But
> if someone does a $ STOP on an MCC process which is requesting events,
> even the Exit Handlers do not get called. There is little we can do at
> this point (being a user-mode event system). The Event Sinks generally
> only PUT events, so stopping them (with a $ STOP) rarely (if ever)
> would cause outstanding Requests to be left in the Event Pool.
> Ideally, Event Sinks should be stopped but issuing some sort of
> MCC> DISABLE <whatever> SINK command, which will cause a clean rundown
> of the Event Sink. Check the Documentation for the exact command
> syntax for the Sink you want to stop.
Actually, MCC_STARTUP_DNA4_EVL I think does this as a first step. In any
event, the errors and subsequent failure of MCC_DNA4_EVL don't come when
someone has STOPped a process. It is usually in the middle of the night
sometime. Could a STOP process cause problems later?
> There are some MCC processes which run in the background (no user
> interface), but also do GETEVENTs. These need to be aborted WITHOUT
> stopping them (e.g., DO NOT use the DCL $ STOP command). An example
> might be MCC Alarms running in batch. If you DO abort a background MCC
> Alarms process, would almost always cause garbage (mostly invalid
> request information) to be left in the Event Pool. The Putters (e.g.,
> DNA4 Event Sinks) would see these as valid reqests for events, and post
> events to the Event Manager. After awhile, the Events will flood the
> Event Pool, and you have to take fairly drastic measures (killing all
> processes using MCC) to clean things up.
> Do you have to kill background MCC Alarms processes? If so, how do
> you kill them?
The only time we STOP MCC ALARMS processes is when they don't work. Sorry, I'm
a bit puzzled, if we are running ALARMS in batch what other options other than
STOP do we have to initiate a restart of the alarms? Or should the order of
things go, DISABLE SINK, STOP alarms process, ENABLE SINK, START alarms process?
thanks,
brad...
|
1149.9 | There are no easy answers for this problem | TOOK::GUERTIN | I do this for a living -- really | Mon Jul 01 1991 17:07 | 40 |
| > Actually, MCC_STARTUP_DNA4_EVL I think does this as a first step. In any
> event, the errors and subsequent failure of MCC_DNA4_EVL doesn't come when
> someone has STOPped a process. It is usually in the middle of the night
> sometime. Could a STOP process cause problems later?
Yes. Once you STOP a process which is doing GETEVENTs, you have
initiated a stale request, which could eventually clog up the Event
Pool. It may take minutes, hours, or days, depending on how often the events
(which never get picked up) come into the Event Pool.
> The only time we STOP MCC ALARMS processes is when they don't work. Sorry, I'm
> a bit puzzled, if we are running ALARMS in batch what other options other than
> STOP do we have to initiate a restart of the alarms? Or should the order of
> things go, DISABLE SINK, STOP alarms process, ENABLE SINK, START alarms
> process?
I'm sorrier than you are! There is no elegant solution to this
problem. The fact of the matter is that in the release notes, we state
(for users of the MCC Kernel routines) that the MCC processes should
not be STOPped. End users are now realizing that it is useful to have
Alarms running in batch, but don't know of a clean way to stop the
batch process. Hence, shooting it in the head seems to do the trick.
There are two possibilities for this awkward situation. I recommend
running Alarms from a window (you can iconize it). If you want to kill
Alarms, then just Control-Y out. Everything should cleanup correctly.
The other possibility is to do a "Forced Exit" of the Alarms process.
This is more difficult, because there is no way at DCL level to do
this; you need to write your own program (I have one that I can post as
a reply if you want it). Also, since it causes the process to
essentially call the Exit routine in the middle of execution, you may
cause the process to go into resource waits (for example, if the
process was in a Disable Control-Y window of execution, and you Forced
an Exit).
If Alarms is not working, then we need to figure out why BEFORE killing
the Alarms process. If we find the originator of the problems, you
should never need to stop the Alarms process. I think by solving one
of your problems, you are creating bigger problems.
-Matt.
|
1149.10 | If not Batch, then what? | NSSG::R_SPENCE | Nets don't fail me now... | Tue Jul 02 1991 11:04 | 11 |
| DECmcc engineering recommends running alarms in batch.
No one is going to run production alarms in a window. Can't reboot the
workstation... can't even log out to let someone else use it...
Sounds like the re-engineering of alarms to a detached process
controlled from DECmcc is a priority.
What do we tell customers?
s/rob
|
1149.11 | manageable batch alarms soon?? | JETSAM::WOODCOCK | | Tue Jul 02 1991 12:29 | 16 |
| I have to agree. Alarms from a window is not viable, for the reasons Rob
mentioned and also because alarms run 24 hours a day. I'm uncomfortable leaving
sys logged in all day/night, especially considering I've set host to the
main system and this link could potentially drop occasionally, creating the same
problem we're trying to avoid. Manageable alarms within batch has been LONG
stated as an area needed for change. Are there any updates as to when this
may change? As far as killing processes I'll try to walk more lightly but
what can I say. Stopping all MCC processes or rebooting a multi-application
clustered 8810 aren't pretty options. Also I'm not convinced this is the
root to all the evil, but only an irritant worsening the situation. FYI, this
problem with the pool is probably more widespread among EVL users than known
because others have indicated they have seen it also. Considering how many are
actually using EVL for monitoring it may be a high percentage seeing the error.
cheers,
brad...
|
1149.12 | We said THAT!?!?! | TOOK::GUERTIN | I do this for a living -- really | Tue Jul 02 1991 12:47 | 24 |
| Rob,
As a member of DECmcc engineering, I'm amazed and disappointed that
this fell through the cracks. There is no patch that I can think of.
I talked to Anil Navkal (Alarms PL) just yesterday, and thought he
told me that they did NOT explicitly state that the user should run
Alarms in batch.
The problem is that Alarms does not have ANY detached process support.
If it did, then we would not be in this predicament. (This is not
a complaint about the Alarms-FM. The MCC-Kernel needs to provide
generic detached process management routines.) Other MMs have
implemented their own private detached process support.
The fact of the matter remains that you cannot kill the Alarms process
by doing a DCL STOP on the process while Alarms is requesting Events.
I don't know what a DELETE/ENTRY does to a process, if it is the same
as a STOP, then you MUST NOT do that either.
Is it possible to have a command procedure disable all the Alarm Event
rules running in batch?
-Matt.
|
1149.13 | No can do ... | TOOK::ORENSTEIN | | Tue Jul 02 1991 14:42 | 13 |
| I too have been thinking about this problem, and I agree that
ALARMS will be better off when it is detached.
Matt, unfortunately rules are enabled within a process. So
a user on DCL can not see that rules are being run in batch.
And that user on DCL can not disable the rules that are
running in batch.
In fact, ALARMS is designed so that once the rule is enabled,
another process could delete the rule from the MIR, and it
would keep running in the first process as if nothing happened.
aud...
|
1149.14 | using DELETE not STOP | JETSAM::WOODCOCK | | Tue Jul 02 1991 15:33 | 7 |
| Hi Matt,
For clarity, I always DELETE/ENTRY to stop the process. I never use STOP
PROCESS/ID=... I too, don't know if there is a difference. But I always
use DELETE because it's usually easier to type :-).
brad...
|
1149.15 | Try this instead... | TOOK::GUERTIN | I do this for a living -- really | Tue Jul 02 1991 16:10 | 102 |
| The following is a VAX C program which will attempt to send a Force
Exit to another process. You need privileges to send a Force Exit
to a process that you do not own.
If you need to abort an MCC process and cannot do it interactively,
then please try using "FORCEX" before attempting to use the STOP or
DELETE/ENTRY commands. (At least until we find a better solution.)
-Matt.
This program is not supported by NME, MCC, or DEC in general. No
one is liable or responsible for this program in any way, shape or
form. Use at your own risk. Etc,etc. <insert usual caveats here>
--------------------------CUT HERE---------------------------------
/* FORCEX.C -- Force Another Process to Exit
(by calling the $FORCEX system routine).
$ CC FORCEX.C
$ LINK FORCEX.OBJ, SYS$INPUT:/OPT ! Type in image lib interactively.
SYS$SHARE:VAXCRTL.EXE/SHARE
^Z ! Control-Z out of input mode.
$ COPY FORCEX.EXE ! Copy it to where you want it.
from a privileged account,
define it as a Foreign command:
$ FORCEX:==$SYS$DISK:[]FORCEX.EXE ! Use actual disk location.
$ FORCEX <pid1> [<pid2> ... <pidn>] ! Use PID or Process name (quoted).
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <descrip.h>
#include <ssdef.h>

int remove_quotes( p_string ) /* Remove double quotes */
char *p_string;
{
    int i;

    /* Shift the string left by one to drop the leading quote */
    for (i=0;*(p_string+i) != '\0';i++)
        *(p_string+i) = *(p_string+i+1);
    /* Drop the trailing quote, if there is one */
    if ((i > 1) && (*(p_string+i-2) == '"'))
        *(p_string+i-2) = '\0';
    return (strlen( p_string ));
}

main( argc, argv )
int argc;
char *argv[];
{
    int exit_code = SS$_FORCEDEXIT;
    int use_pid;
    int sstat;
    int pid;
    char *procnam_str;
    struct dsc$descriptor procnam_dsc = {0, DSC$K_DTYPE_T, DSC$K_CLASS_S, 0};
    int arg_count = 0;
    int quotes = 0; /* boolean flag: 1 = quotes specified, 0 = no quotes */
    unsigned short msg_len = 0; /* $GETMSG returns the length as a word */
    char msg_txt[256];
    struct dsc$descriptor msg_dsc = {256, DSC$K_DTYPE_T, DSC$K_CLASS_S, msg_txt};

    procnam_str = malloc( 256 );
    do
    {
        arg_count++;
        if (argc < 2)
        {
            printf("Enter a PID in hex (or a Process Name) : ");
            scanf("%255s",procnam_str ); /* bounded by the 256-byte buffer */
            argc = 1;
        }
        else
            procnam_str = argv[arg_count];
        procnam_dsc.dsc$w_length = strlen( procnam_str );
        procnam_dsc.dsc$a_pointer = procnam_str;
        /* Quoted strings are always treated as Names */
        quotes = (*procnam_str == '"');
        /* $FORCEX takes the exit code by value, not by reference */
        if (!quotes && (ots$cvt_tz_l(&procnam_dsc, &pid, 4, 0) == SS$_NORMAL))
            sstat = sys$forcex( &pid, 0, exit_code );
        else
        {
            if ((quotes) && (procnam_dsc.dsc$w_length > 1))
                procnam_dsc.dsc$w_length = remove_quotes( procnam_str );
            sstat = sys$forcex( 0, &procnam_dsc, exit_code );
        }
        if (sstat == SS$_NORMAL)
            printf("\nForced Exit successfully requested for %s\n", procnam_str );
        else
        {
            printf("\nForced Exit request failed for %s\n", procnam_str);
            sys$getmsg( sstat, &msg_len, &msg_dsc, 1, 0 );
            msg_txt[msg_len] ='\0';
            printf("Reason: %s\n",msg_txt);
        }
    } while (arg_count < argc-1);
}
|
1149.16 | SET MODE=HACK | WAKEME::ANIL | | Wed Jul 03 1991 10:21 | 37 |
| Thanks Matt. Will everyone out there give a good round of applause
to Matt for writing the real code! :-)
While you guys are busy compiling Matt's program you may want to try
the following to get you out of the "how-to-stop-MCC-that-is_running-
in-the-background".
The command procedure has all the comments. My first thought was to
make it a lot more fancy and be driven by some rule firing that will
stop the batch job. But for now I prefer it to be very simple. A little
effort on the user's part will solve the problem. In V1.2 we may try to
be a little more user friendly :-), no promises though!!
$ manage/enter
! Enable mcc 0 Alarms rule foo_1, in domain blaha
! :
! Enable all your rules here
! :
! Enable mcc 0 Alarms rule foo_n, in domain blaha
!
! The following command will wait for whatever delta-time you specify.
! If you want to stop the Background process, check the PID of the
! spawned process. The name of the process is <username>_1.
! The PID of this process is generally 1 more than the batch job's PID;
! say it's x. Now to stop the background MCC, do your favorite stop/id
! for the PID x. The spawned process will be killed. The parent process
! will now resume next mcc command which just happens to be a graceful
! exit. You may want to do SHOW MCC 0 Alarms RULE * all att before
! the exit command.
!
spawn wait 22:00:00
exit
|
1149.17 | works good | JETSAM::WOODCOCK | | Wed Jul 03 1991 16:41 | 11 |
| Hi Matt,
Thanks for the program. I've got it compiled and tested. It seems to do
the trick and hopefully it helps and/or resolves this problem.
best regards,
brad...
PS. Anil, nice creative hack as Option B :-)
|
1149.18 | For a future version... | MARVIN::COBB | Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917 | Fri Jul 05 1991 09:31 | 26 |
| Processes will always get stopped for many reasons. You shouldn't ever rely
on user-mode exit handlers or ^Y interception to clean up a shared resource.
There are two fairly obvious fixes I can think of for a future version:
1) Use a kernel mode exit handler. Of course this requires writing
privileged, inner mode code and using things like protected sharable images.
2) Take stock and tidy up frequently. For example every time a process
connects to the global section have it look around and tidy up the mess
caused by a process going away unexpectedly. Or do it from a timer. The
main problem here is working out who is still attached. Fortunately there
is an easy solution to that using locks.
You can get as complex as you like using locks but a simple solution should
work: every process that uses the global section writes its PID somewhere in
the section where everyone else can find it. It also takes out an exclusive
lock called MCC$<pid>. If another process needs to know whether the first
process is still around (and, more importantly, still using the global
section!) it tries to acquire lock MCC$<pid>. If it succeeds the process
has stopped using the section and its mess should be tidied away.
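The lock-based presence check described above can be sketched on a POSIX system with `flock()`. This is a swapped-in analogy, not the VMS lock manager: a lock file takes the place of the MCC$<pid> lock, and the names `hold_presence_lock` and `owner_still_present` are hypothetical. What carries over is the key property that the operating system drops the lock when the holder goes away, however it dies, so a successful lock attempt by someone else means the owner is gone.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take the presence lock. The caller keeps the returned descriptor open
 * for as long as it uses the shared section. Returns -1 on failure.
 */
int hold_presence_lock(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Probe the lock from another descriptor.
 * Returns 1 if it is still held (owner present), 0 if it could be taken
 * (owner gone, so its mess can be tidied away), -1 on any other error.
 */
int owner_still_present(const char *path)
{
    int fd, held;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        flock(fd, LOCK_UN);   /* we only wanted to probe, so release it */
        held = 0;
    } else {
        held = (errno == EWOULDBLOCK) ? 1 : -1;
    }
    close(fd);
    return held;
}
```

The sketch says nothing about how the lock name is chosen; a scheme that derives it from the PID alone still has to decide what happens if that identifier is ever reused.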
Either of those solutions could work. Or, of course, something much more
specific to the alarms module. Whichever way it is done I think this needs
to be a high priority to fix for V1.2.
Graham
|
1149.19 | The future is ... "Portability"! | TOOK::GUERTIN | I do this for a living -- really | Mon Jul 08 1991 09:50 | 55 |
| RE:.18
Graham,
Yes, the solutions you suggest are doable. The problems are:
1) Using a kernel mode exit handler. This is analogous to cracking
open a peanut with a thermonuclear device. Yes, it will work,
yes, it is overkill, yes, there are simpler (and more portable)
solutions which stay in user-mode.
2) Various garbage collection schemes. Counting on things such as
PIDs to identify a process will work until the same PID gets
re-used. If you look at the N-process to N-process communication
behavior of MCC events (for example, Sinks are generally very long
running processes which mainly do Puts, while foreground MCC processes
tend not to run very long, and do Gets), then you will notice that it may
be several hours or days between when the process goes away and
another process needs to check its PID. I do not believe there
is a guarantee in the VMS architecture that PIDs will not
be reused, or at which intervals they could be re-used. If you
know of any statements (such as, "PIDs are always unique and never
reused between reboots"), then please let me know. Also, remember
that we are not just talking about processes, we are also talking
about threads. For example, if a thread issues a Get, and then
is destroyed, or hangs, the event request remains in the event pool.
Instead, there appears to be a handful of creative, yet simple
solutions, which provide the same end result. Some examples:
1) Implement a "sweeper". Sweepers are threads which run in any
process which calls the MCC Event Manager. They are started
up on Event Manager initialization, and periodically scan the
Event Pool for garbage. Unfortunately, this is an "active"
as opposed to a "passive" solution, and requires the system
to do more work based upon the load.
2) When the Putter puts an Event, and notices that the Getter
hasn't picked up events in a timely fashion, he issues a
"challenge". If the Getter accepts the challenge, then
the Request is validated.
3) Each Getter has a quota of the number of events it can
have queued up in the event pool. If the quota is reached,
the events are "lost"; after a period of no Getter activity,
the request itself becomes invalid.
We have several others, including various combinations of the above
schemes.
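As an illustration of scheme 2 only, a toy Python model of the challenge might look like the following. Everything in it is an assumption made for the example: the Request class, the responder callable (which returns False to model a Getter that never answers), and the 60-second staleness limit. It is not Event Manager code.

```python
# Toy model of the "challenge" scheme; every name here is hypothetical.

class Request:
    def __init__(self, responder, now):
        self.responder = responder    # callable: True if the Getter answers
        self.last_pickup = now        # last time the Getter showed activity
        self.queue = []               # events waiting for the Getter


def put(requests, key, event, now, stale_after=60.0):
    """Queue an event, challenging the Getter first if it looks idle.

    Returns "queued" or "no request".
    """
    req = requests.get(key)
    if req is None:
        return "no request"
    if now - req.last_pickup > stale_after:
        if not req.responder():
            del requests[key]         # challenge failed: request is garbage
            return "no request"
        req.last_pickup = now         # challenge accepted: request validated
    req.queue.append(event)
    return "queued"
```

The cost of a challenge is paid only by a Putter that meets an idle Getter, which is what makes this a "passive" scheme compared with the sweeper.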
I appreciate your interest, and your taking the time to propose
plausible solutions. However, the real issue is not the lack of
solutions, but the lack of time and people resources to implement them.
The solution we have finally come up with requires a minimum of both,
but it still must be worked into the schedule and traded against other
tasks (which means some other piece of functionality or some other bug
fix will NOT get into the product in the next release). For V1.1, we
reluctantly settled for exit handler cleanup -- although that solution
isn't very portable either :-).
|
1149.20 | Help is coming - "Real Soon" | TOOK::T_HUPPER | The rest, as they say, is history. | Mon Jul 08 1991 12:46 | 13 |
| RE:.18, .19
Just so everybody can feel better about "the Event Manager that can't
clean up after itself", we have time allocated for the V1.2 release to
implement some/all of the functionality that Matt outlined in .19. The
internal Event Pool cleanup mechanism has always been an integral part
of the Event Manager, but until now, there has been NO time to
implement it. The tradeoffs we've had to make in many areas of DECmcc
in order to get ANY product out the door have been severe. We are
allocating more time now to filling in some of the areas previously
traded off.
Ted
|
1149.21 | | MARVIN::COBB | Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917 | Mon Jul 08 1991 12:49 | 11 |
| You are right that there are many possible solutions (by the way, the "lock"
approach can be made immune to re-using the same PID but it rapidly becomes
complex). Personally I would probably use the kernel mode exit handler
approach, but then I have been writing VMS inner mode code for almost 10
years!
I take your point that any solution will cost some other feature but I
wanted to add my voice to the outcry that a user mode exit handler is not an
adequate solution for V1.2.
Graham
|
1149.22 | INSEVTPOOLMEM is back | JETSAM::WOODCOCK | | Thu Jul 18 1991 12:59 | 18 |
| I have come back to the original problem, INSEVTPOOLMEM. I have once again
received this error today. I have been extremely careful to use 'FORCEX'
but the error has reappeared. Usually I can simply restart MCC_DNA4_EVL
and all works well for awhile but not today. Restarting it brought back
the same error within minutes. Should I always reset all MCC processes
when I receive this error? Please say no; that is a painful workaround.
As a side note I have been working on MCC and EVL being more robust. As
a consequence I forced EVL to go away many times yesterday which produced
a fatal link abort error in MCC_DNA4_EVL. Could this have been the prelude
to this error coming on again? It shouldn't be because EVL goes away on its
own often and can't be avoided thru normal operations. Would it help to
restart only MCC_DNA4_EVL each work day? BTW, I think I have a hack to
keep MCC_DNA4_EVL running even when EVL drops out. I'll be looking for
opinions on it but I'll post it in the appropriate note.
regards,
brad...
|
1149.23 | Not processing sinked events | AUNTB::BRILEY | Are you a rock or leaf in the wind | Wed Jul 24 1991 10:42 | 7 |
| Did anyone ever find out what caused the initial problem that
Louis reported? That is, MCC_DNA4_EVL not receiving/processing
sinked events.
Thanks,
Rob
|
1149.24 | Event Mgr cleanup for killed processes? | TAEC::MCDONALD | | Mon Feb 17 1992 05:16 | 21 |
| re .20
>Just so everybody can feel better about "the Event Manager that can't
>clean up after itself", we have time allocated for the V1.2 release
>to implement some/all of the functionality that Matt outlined in .19.
I am using mcc Component Version = T1.2.4 on Ultrix.
Has the functionality discussed in notes .19 & .20 been implemented
in the newer Event Manager?
I have a background process which does an mcc_event_get for infinity.
If this process gets killed (kill on Ultrix), then other
processes doing mcc_event_puts still receive a status of
Normal (as if another process has received the event, when
in fact there are no other processes waiting for the event).
If the background process does an mcc_event_get cancel before
exiting then this does not happen.
Is there a way to correct this (the mcc_event_put receives
MCC_S_NOEVENTREQ when the process is no longer there) ?
thanks, Carol
|
1149.25 | Use mcc_kill rather than kill | TOOK::MINTZ | Erik Mintz, DECmcc Development, dtn 226-5033 | Mon Feb 17 1992 08:40 | 6 |
| This does appear to be a problem (and I have seen the relevant QAR).
However, we DO NOT recommend killing DECmcc processes on ULTRIX
using "kill". That is why we provide mcc_kill to terminate them.
-- Erik
|
1149.26 | what's the difference? | TAEC::MCDONALD | | Mon Feb 17 1992 10:59 | 3 |
| what does mcc_kill do differently from "kill"?
Anyway a process might exit for other reasons before doing a cancel.
|
1149.27 | mcc_kill allows a clean shut down | TOOK::MINTZ | Erik Mintz, DECmcc Development, dtn 226-5033 | Mon Feb 17 1992 11:16 | 7 |
| > what does mcc_kill do differently from "kill"?
It sends an MCC event that allows a process to shut itself down.
There are known clean-up problems when a process is abruptly
terminated.
|
1149.28 | Event manager cleanup has been implemented in V1.2 | TOOK::T_HUPPER | The rest, as they say, is history. | Tue Feb 18 1992 11:08 | 61 |
| RE .24:
New functionality for V1.2:
The event manager DOES clean up when processes die. It does NOT do so
immediately. The purpose is to avoid filling up the event memory pool
with events for GETs of processes that have been killed/stopped. The
purpose is not to ensure to the PUT that a GET actually processed the
event. That is impossible for the (low-level) event manager to do. It
has no control over what happens to an event after it leaves the event
manager.
The cleanup that is done when a process doing mcc_event_get calls dies
is based on a timer and the queue of the mcc_event_get filling up. The
algorithm is as follows:
If the event queue (settable with the MCC_EVENT_EDQ_SIZE_LIMIT
environment variable, default is 200) for the GET fills up, after a
timeout (settable with the environment variable
MCC_EVENT_EDQ_TIME_LIMIT, default is 60 seconds) AND another event is
PUT to this queue, the entire contents of the queue is converted to
lost events. If another event is PUT to this queue after another
timeout (settable with the environment variable
MCC_EVENT_LOST_TIME_LIMIT, default is 600 seconds) expires, the GET
structures are removed from the event manager. No further PUTs will
see this deleted GET (they will now receive MCC_S_NOEVENTREQ).
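The sequence above can be modelled roughly as follows. This is a toy Python sketch, not the event manager's code. The environment variable names, defaults, and status names come from this note; the GetRequest class, the put function, and these details are assumptions made for illustration: the first timeout is measured from the moment the queue filled, the second from the moment the queue was converted to lost events, a PUT to a full queue is simply counted as lost, and the PUT that triggers removal still returns MCC_S_NORMAL.

```python
import os

# Status names come from the note; the string values are placeholders.
MCC_S_NORMAL = "MCC_S_NORMAL"
MCC_S_NOEVENTREQ = "MCC_S_NOEVENTREQ"


def limits(env=os.environ):
    """Read the three settable limits, falling back to the stated defaults."""
    return (
        int(env.get("MCC_EVENT_EDQ_SIZE_LIMIT", 200)),     # events
        float(env.get("MCC_EVENT_EDQ_TIME_LIMIT", 60)),    # seconds
        float(env.get("MCC_EVENT_LOST_TIME_LIMIT", 600)),  # seconds
    )


class GetRequest:
    """Hypothetical stand-in for the GET structures in the event pool."""

    def __init__(self, size_limit, edq_time, lost_time):
        self.size_limit = size_limit
        self.edq_time = edq_time
        self.lost_time = lost_time
        self.queue = []           # events waiting for the GET
        self.lost = 0             # count of events recorded as lost
        self.full_since = None    # when the queue reached its size limit
        self.lost_since = None    # when the queue was converted to lost


def put(requests, key, event, now):
    """PUT one event at time `now` (seconds); returns a status name."""
    req = requests.get(key)
    if req is None:
        return MCC_S_NOEVENTREQ
    if req.lost_since is not None:
        if now - req.lost_since >= req.lost_time:
            del requests[key]     # second timeout: remove the GET
        else:
            req.lost += 1
        return MCC_S_NORMAL
    if len(req.queue) < req.size_limit:
        req.queue.append(event)
        if len(req.queue) == req.size_limit:
            req.full_since = now
        return MCC_S_NORMAL
    if now - req.full_since >= req.edq_time:
        # First timeout: convert the whole queue, plus this event, to lost.
        req.lost += len(req.queue) + 1
        req.queue.clear()
        req.lost_since = now
    else:
        req.lost += 1             # queue full, timeout not yet expired
    return MCC_S_NORMAL
```

A real GET that picks up events would reset full_since and lost_since; that path is left out to keep the sketch short.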
If the event pool has filled to a threshold level (not settable), it is
not necessary to have any PUTs enqueued for the dead GET to have the
above sequence take place. All GETs in the event pool are checked
against the timeouts. Any GETs past the timeouts are deleted along
with their posted events.
The purpose of the above sweeping operation is to prevent the event
manager pool from being put out of commission by dead GETs. Note that
because of the timeouts and/or requirement to reach a threshold of
fullness, we cannot give instantaneous accuracy on whether or not the
event actually went to a GET process.
After a process with outstanding GETs dies, and before the GET
structures are removed from the event pool, PUTs that match those GETs
will return MCC_S_NORMAL. After the cleanup, they will receive
MCC_S_NOEVENTREQ. The difference in these CVRs is whether or not the
event was queued to a GET, not whether the event was acted upon by a
real process.
If you need to know whether an event was acted upon, then you need a
transaction processing model. As the event manager is only providing a
one-way distribution of data, a single event posting cannot provide
this capability. An end-to-end receipt is required. A return event
could provide that receipt, but the model is becoming complicated.
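For what it is worth, a minimal sketch of such a return-event receipt, written in Python with two in-process queues standing in for the event manager, could look like this. The function names, the (event_id, payload) tuple, and the None shutdown sentinel are all hypothetical and exist only for the example.

```python
import queue
import threading


def getter(events, receipts):
    """Act on events until a None sentinel arrives, acknowledging each one."""
    while True:
        item = events.get()
        if item is None:
            return
        event_id, payload = item
        # ... act on the payload here ...
        receipts.put(event_id)        # the "return event" is the receipt


def put_with_receipt(events, receipts, event_id, payload, timeout):
    """Post an event and wait up to `timeout` seconds for its receipt."""
    events.put((event_id, payload))
    try:
        return receipts.get(timeout=timeout) == event_id
    except queue.Empty:
        return False                  # nobody acted on the event in time
```

The putter now has to pick a timeout and decide what a missing receipt means, which is the extra complication referred to above.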
If knowing as quickly as possible whether a GET process has died
(perhaps so that an automatic restart of the GET process can be done
(but why did it die?)) is really important, we would have to test the
existence of the GET process for each matching GET for each PUT of an
event. Given that the event manager cannot guarantee action on an
event and needs to have high performance, we did not implement this
test.
Ted
|