T.R | Title | User | Personal Name | Date | Lines |
---|
807.1 | mcc_dna4_evl dropped | JETSAM::WOODCOCK | | Mon Mar 18 1991 17:41 | 27 |
| The MCC_DNA4_EVL process died this afternoon with the following log:
Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420
%SYSTEM-F-LINKABORT, network partner aborted logical link
DECMCC job terminated at 18-MAR-1991 15:07:24.98
The following is a portion of the EVL.LOG. I believe that it broke down at
the (%EVL-F-NETASN, unable to assign a channel to NET) line. EVL restarted,
but of course MCC_DNA4_EVL didn't, because it must be restarted manually.
Could EVL need tuning?
$ RUN SYS$SYSTEM:EVL
%EVL-E-OPENMON, error creating logical link to monitor process NOCMAN::"TASK=MCC_DNA4_EVL"
-SYSTEM-F-NOSUCHOBJ, network object is unknown at remote node
%EVL-E-WRITEMON, error writing event record to monitor process MCC_DNA4_EVL
-SYSTEM-F-FILNOTACC, file not accessed on channel
%EVL-F-NETASN, unable to assign a channel to NET
-SYSTEM-F-PATHLOST, path to network partner node lost
$ PURGE/KEEP=3 EVL.LOG
$ LOGOUT/BRIEF
DECNET job terminated at 18-MAR-1991 15:07:22.94
|
807.2 | We'll take a look.... | TOOK::CAREY | | Wed Mar 20 1991 16:37 | 12 |
|
Brad,
Gee, thanks. We love the problems you bring us. :-)
I have *no idea* what could be going on.
I'll get some data on the evl.log information and see if we can come up
with a scenario for your breakdown.
-Jim
|
807.3 | some clues | JETSAM::WOODCOCK | | Thu Mar 21 1991 14:58 | 44 |
| It seems that two of these problems may be related. I reproduced problem
-2- today (alarms started in batch, notification enabled simultaneously) and
was again given "invalid lock id". When exiting the map, the DECterm window
vanished. I restarted the map (all jobs already running), but notifies
failed to work. I checked MCC_DNA4_EVL.LOG and found:
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793
I then stopped the batch queue, and that's when MCC_DNA4_EVL died (problem
-3-). I don't seem to be able to recreate problem -3- unless problem -2-
has been encountered first. A recheck of MCC_DNA4_EVL.LOG now shows a new
error line:
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793
%LIB-F-SECINTFAI, secondary interlock failure in queue
SYSTEM job terminated at 21-MAR-1991 14:22:55.04
Problem -2- appears to be a real nasty one: when it occurs, the processes
continue to appear to run. But in reality the user must stop all alarm jobs,
stop the MCC_DNA4_EVL process (maybe even EVL), then restart MCC_DNA4_EVL,
then restart the alarm jobs, then bring the map back up with notifications
(with WAIT statements between every step so nothing has a chance to bump
into anything else!!).
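That recovery sequence, sketched as DCL for reference. This is only an
illustration: the queue name, wait intervals, and the assumption that the
alarms run in a batch queue named MCC_ALARMS_BATCH are made up, not the
real site values.
$! Hypothetical recovery sequence (names and intervals are examples only)
$ stop/queue MCC_ALARMS_BATCH          ! 1. stop the alarm batch queue
$ wait 00:01:00
$ stop MCC_DNA4_EVL                    ! 2. stop the sink monitor process
$ wait 00:01:00
$ @mcc_common:mcc_dna4_evl.com         ! 3. restart MCC_DNA4_EVL
$ wait 00:01:00
$ start/queue MCC_ALARMS_BATCH         ! 4. restart the alarm jobs
$!                                       5. bring the map back up by hand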
So Jim/Jean, being the nice guy that I am, I've just reduced the number of
problems down to a mere 3. It just keeps getting easier every day :-).
brad...
|
807.4 | This COULD be an Event Manager problem | TOOK::GUERTIN | I do this for a living -- really | Mon Mar 25 1991 14:22 | 37 |
| Brad,
There were two potential problems with the Event Manager that we had
for V1.1. The first was a small window where we try to acquire the
same lock at two different points within the same process.
This window was so small that we only saw it on a MIPS machine while
porting to Ultrix. The effect of this problem is that you could see
a "lock conversion" error. This problem was fixed for the next release by
combining locks and moving around the acquire/release statements.
The second problem you just discovered. In order to save on VMS
resources, we used RTL interlock calls instead of locks for enqueuing
and dequeuing entries in the Event Pool. Apparently, in a multiple-CPU
environment (assuming that is what you have), the "Secondary Interlock
Failure" is easier to get than we expected (we have a high retry
count). This problem was also fixed for the next release: to make the
code portable, we had to use locks more efficiently, which removed the
need for calls to the RTL interlock routines.
I guess what I am saying is this: if it is the Event Manager (and it
may not be), then most, if not all, of your problems should go away in
the next VMS release (whenever that is). The workaround would be to
spread out the load on the Event Manager over a longer period, in order
to reduce the strain on system resources (locks and CPUs). If this
workaround is not acceptable, we would have to work through management
to get you a special MCC kernel (not an easy thing to do). But even that
is no guarantee that your problems will go away; the Event Manager is
just one possible cause of your problems.
One way of checking whether the Event Manager is detecting the problem and
bubbling it up to the Event Sink is to define the logical name
MCC_EVENT_LOG to 1 ($ DEFINE MCC_EVENT_LOG 1) in the same process, and see
if you get any internal error messages about lock conversions or interlock
failures. If no internal error messages get displayed, then the Event
Manager is (probably) innocent.
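The check above, spelled out as DCL. The log file name in the SEARCH is an
assumption based on the earlier replies; substitute whatever file your sink
process actually writes.
$ define mcc_event_log 1               ! enable Event Manager logging
$ @mcc_common:mcc_dna4_evl.com         ! run the sink in this same process
$ search mcc_dna4_evl.log "lock conversion","interlock"
If SEARCH reports no matches for either string, the Event Manager is
probably not the culprit.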
-Matt.
|
807.5 | it can wait | JETSAM::WOODCOCK | | Wed Apr 24 1991 19:44 | 19 |
| I got sidetracked for a while, but I just got a chance to re-review
this note. If you folks think the lock problem will be resolved in
the next release, I can wait. The workaround isn't pretty, but it
seems to work most times. I'll give you a ring next release if it's
still there. BTW, in poking through VPA reports there were a couple of
mentions of lock problems, so this seems to confirm your thoughts.
As far as EVL goes, I'll probably try to spend some time getting a
better understanding of it and try to make it more robust through tuning,
although from all I've heard it's the nature of this beast to drop
out and come back up. MCC should put some effort into ensuring that
MCC_DNA4_EVL is automated to come back up with it. Whole operational
businesses may depend on these two working in harmony, and their
interaction is **very** important to many net managers even if they
don't know it today. It's the future they will move to.
best regards,
brad...
|
807.6 | attempt at stability | JETSAM::WOODCOCK | | Thu Jul 18 1991 15:01 | 21 |
| Hello,
As promised/threatened, I took a look at trying to make MCC and EVL
more stable partners. I know you folks are tight on time, and this problem
is *critical* for ease of operations. My first approach was with EVL,
which got me nowhere; it seems the EVL experts are few and far between, or
very shy. Therefore I looked toward an MCC solution. I have edited
MCC_COMMON:MCC_DNA4_EVL.COM into a bona fide hack. Basically, when EVL
drops out (which happens often when hit with streams of events),
MCC_DNA4_EVL drops with a fatal link abort message. I simply capture the
status code, test to see if it is a link abort, then loop back up and
restart. It seems to work OK but is not time-tested as yet. I would like
some feedback to ensure this doesn't negatively impact anything. I also
wanted to see what would happen if EVL had not yet returned: MCC_DNA4_EVL
seemed to wait for EVL after it was restarted, then made a link. This has
worked through up to 5 minutes of EVL down time so far. I'll post the hack
as the next reply. But of course I do not intend to support this very
delicate code, and NO rights are reserved :-).
cheers,
brad...
|
807.7 | MCC_COMMON:MCC_DNA4_EVL.COM | JETSAM::WOODCOCK | | Thu Jul 18 1991 15:11 | 18 |
| $! This procedure replaces the original MCC_COMMON:MCC_DNA4_EVL.COM and
$! is intended to allow this process to restart when the EVL process fails
$! and causes a LINKABORT error which would normally EXIT this procedure.
$!
$!
$ set verify
$ start:
$ on warning then goto status
$ manage/enter/presen=mcc_dna4_evl
$ status:
$ sts = $status       ! save it first; later commands overwrite $STATUS
$ show symbol sts
$ wait 00:00:20
$!
$! Check to see if the error was caused by LINKABORT (SS$_LINKABORT,
$! %X000020E4 = 8420 decimal) and restart if true
$!
$ if sts .eqs. "%X000020E4" then goto start
$!
$ exit
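One possible refinement of the loop above (untested; the counter name and
the limit of 25 are illustrative, not anything the procedure needs): cap
the number of automatic restarts so a persistent failure can't spin
forever.
$! Sketch: give up after 25 consecutive link aborts
$ restarts = 0
$ start:
$ restarts = restarts + 1
$ if restarts .gt. 25 then exit
$ on warning then goto status
$ manage/enter/presen=mcc_dna4_evl
$ status:
$ sts = $status       ! save before other commands overwrite $STATUS
$ wait 00:00:20
$ if sts .eqs. "%X000020E4" then goto start
$ exit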
|