T.R | Title | User | Personal Name | Date | Lines |
---|
807.1 | mcc_dna4_evl dropped | JETSAM::WOODCOCK | | Mon Mar 18 1991 17:41 | 27 |
| The MCC_DNA4_EVL process died this afternoon with the following log:
Ready to read the next event message...
Ready to read the next event message...
Failed to receive an event from EVL, status = 8420
%SYSTEM-F-LINKABORT, network partner aborted logical link
DECMCC job terminated at 18-MAR-1991 15:07:24.98
The following is a portion of the EVL.LOG. I believe that it broke down at
the (%EVL-F-NETASN, unable to assign a channel to NET) line. EVL restarted,
but of course MCC_DNA4_EVL didn't, because it must be restarted manually.
Could EVL need tuning?
$ RUN SYS$SYSTEM:EVL
%EVL-E-OPENMON, error creating logical link to monitor process NOCMAN::"TASK=MCC_DNA4_EVL"
-SYSTEM-F-NOSUCHOBJ, network object is unknown at remote node
%EVL-E-WRITEMON, error writing event record to monitor process MCC_DNA4_EVL
-SYSTEM-F-FILNOTACC, file not accessed on channel
%EVL-F-NETASN, unable to assign a channel to NET
-SYSTEM-F-PATHLOST, path to network partner node lost
$ PURGE/KEEP=3 EVL.LOG
$ LOGOUT/BRIEF
DECNET job terminated at 18-MAR-1991 15:07:22.94
|
807.2 | We'll take a look.... | TOOK::CAREY | | Wed Mar 20 1991 16:37 | 12 |
|
Brad,
Gee, thanks. We love the problems you bring us. :-)
I have *no idea* what could be going on.
I'll get some data on the evl.log information and see if we can come up
with a scenario for your breakdown.
-Jim
|
807.3 | some clues | JETSAM::WOODCOCK | | Thu Mar 21 1991 14:58 | 44 |
| It seems that two of these problems may be related. I reproduced problem
-2- today (alarms started in batch, notification enabled simultaneously) and
was again given "invalid lock id". When exiting the map, the DECterm window
vanished. I restarted the map (all jobs already running), but notifies
failed to work. I checked MCC_DNA4_EVL.LOG and found:
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793
I then stopped the batch queue, and that's when MCC_DNA4_EVL died (problem
-3-). I don't seem to be able to recreate problem -3- unless problem -2-
has been encountered first. A recheck of MCC_DNA4_EVL.LOG now shows a new
error line:
Ready to read the next event message...
Ready to read the next event message...
Ready to read the next event message...
A fatal error occurred when sending event = 418 to MCC event manager!
The EVL sink is terminated!
OPS_DNA4_STOP_SINK_MONITOR Failed at step 5, status = 52877226
STOP_SINK_MONITOR is terminated, thread id = 65539, status=52854793
%LIB-F-SECINTFAI, secondary interlock failure in queue
SYSTEM job terminated at 21-MAR-1991 14:22:55.04
Problem -2- appears to be a real nasty one: when it occurs, the processes
continue to appear to run. But in reality the user must stop all alarm jobs,
stop the MCC_DNA4_EVL process (maybe even EVL), then restart MCC_DNA4_EVL,
then restart the alarm jobs, then bring the map back up with notifications
(with WAIT statements between every step so nothing has a chance to bump
into anything else!!).
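That recovery sequence, sketched as DCL for reference. This is only an
illustration: the queue name, wait intervals, and the assumption that the
alarms run in a batch queue named MCC_ALARMS_BATCH are made up, not the
real site values.
$! Hypothetical recovery sequence (names and intervals are examples only)
$ stop/queue MCC_ALARMS_BATCH          ! 1. stop the alarm batch queue
$ wait 00:01:00
$ stop MCC_DNA4_EVL                    ! 2. stop the sink monitor process
$ wait 00:01:00
$ @mcc_common:mcc_dna4_evl.com         ! 3. restart MCC_DNA4_EVL
$ wait 00:01:00
$ start/queue MCC_ALARMS_BATCH         ! 4. restart the alarm jobs
$!                                       5. bring the map back up by hand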
So Jim/Jean, being the nice guy that I am, I've just reduced the number of
problems down to a mere 3. It just keeps getting easier every day :-).
brad...
|
807.4 | This COULD be an Event Manager problem | TOOK::GUERTIN | I do this for a living -- really | Mon Mar 25 1991 14:22 | 37 |
| Brad,
There were two potential problems with the Event Manager that we had
for V1.1. The first was a small window where we try to acquire the
same lock at two different points within the same process.
This window was so small that we only saw it on a MIPS machine while
porting to Ultrix. The effect of this problem is that you could see
a "lock conversion" error. This problem was fixed for the next release by
combining locks and moving around the acquire/release statements.
The second problem you just discovered. In order to save on VMS
resources, we used RTL interlock calls instead of locks for enqueuing
and dequeuing entries in the Event Pool. Apparently, in a multiple-CPU
environment (assuming that is what you have), the "Secondary Interlock
Failure" is easier to get than we expected (we have a high retry
count). This problem was also fixed for the next release: to make the
code portable, we had to use locks more efficiently, which removed the
need for calls to the RTL interlock routines.
I guess what I am saying is this: if it is the Event Manager (and it
may not be), then most, if not all, of your problems should go away in
the next VMS release (whenever that is). The workaround would be to
spread out the load on the Event Manager over a longer period, in order
to reduce the strain on system resources (locks and CPUs). If this
workaround is not acceptable, we would have to work through management
to get you a special MCC kernel (not an easy thing to do). But even that
is no guarantee that your problems will go away; the Event Manager is
just one possible cause of your problems.
One way of checking whether the Event Manager is detecting the problem and
bubbling it up to the Event Sink is to define the logical name
MCC_EVENT_LOG to 1 ($ DEFINE MCC_EVENT_LOG 1) in the same process, and see
if you get any internal error messages about lock conversions or interlock
failures. If no internal error messages get displayed, then the Event
Manager is (probably) innocent.
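The check above, spelled out as DCL. The log file name in the SEARCH is an
assumption based on the earlier replies; substitute whatever file your sink
process actually writes.
$ define mcc_event_log 1               ! enable Event Manager logging
$ @mcc_common:mcc_dna4_evl.com         ! run the sink in this same process
$ search mcc_dna4_evl.log "lock conversion","interlock"
If SEARCH reports no matches for either string, the Event Manager is
probably not the culprit.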
-Matt.
|
807.5 | it can wait | JETSAM::WOODCOCK | | Wed Apr 24 1991 19:44 | 19 |
| I got sidetracked for a while, but I just got a chance to re-review
this note. If you folks think the lock problem will be resolved in
the next release, I can wait. The workaround isn't pretty, but it
seems to work most times. I'll give you a ring next release if it's
still there. BTW, in poking through VPA reports there were a couple of
mentions of lock problems, so this seems to confirm your thoughts.
As far as EVL goes, I'll probably try to spend some time getting a
better understanding of it and try to make it more robust through tuning,
although from all I've heard it's the nature of this beast to drop
out and come back up. MCC should put some effort into ensuring that
MCC_DNA4_EVL is automated to come back up with it. Whole operational
businesses may depend on these two working in harmony, and their
interaction is **very** important to many net managers even if they
don't know it today. It's the future they will move to.
best regards,
brad...
|
807.6 | attempt at stability | JETSAM::WOODCOCK | | Thu Jul 18 1991 15:01 | 21 |
| Hello,
As promised/threatened, I took a look at trying to make MCC and EVL
more stable partners. I know you folks are tight on time, and this problem
is *critical* for ease of operations. My first approach was with EVL,
which got me nowhere; it seems the EVL experts are few and far between, or
very shy. Therefore I looked toward an MCC solution. I have edited
MCC_COMMON:MCC_DNA4_EVL.COM into a bona fide hack. Basically, when EVL
drops out (which happens often when hit with streams of events),
MCC_DNA4_EVL drops with a fatal link abort message. I simply capture the
status code, test to see if it is a link abort, then loop back up and
restart. It seems to work OK but is not time-tested as yet. I would like
some feedback to ensure this doesn't negatively impact anything. I also
wanted to see what would happen if EVL had not yet returned: MCC_DNA4_EVL
seemed to wait for EVL after it was restarted, then made a link. This has
worked through up to 5 minutes of EVL down time so far. I'll post the hack
as the next reply. But of course I do not intend to support this very
delicate code, and NO rights are reserved :-).
cheers,
brad...
|
807.7 | MCC_COMMON:MCC_DNA4_EVL.COM | JETSAM::WOODCOCK | | Thu Jul 18 1991 15:11 | 18 |
| $! This procedure replaces the original MCC_COMMON:MCC_DNA4_EVL.COM and
$! is intended to allow this process to restart when the EVL process fails
$! and causes a LINKABORT error which would normally EXIT this procedure.
$!
$!
$ set verify
$ start:
$ on warning then goto status
$ manage/enter/presen=mcc_dna4_evl
$ status:
$ sts = $status       ! save it first; later commands overwrite $STATUS
$ show symbol sts
$ wait 00:00:20
$!
$! Check to see if the error was caused by LINKABORT (SS$_LINKABORT,
$! %X000020E4 = 8420 decimal) and restart if true
$!
$ if sts .eqs. "%X000020E4" then goto start
$!
$ exit
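One possible refinement of the loop above (untested; the counter name and
the limit of 25 are illustrative, not anything the procedure needs): cap
the number of automatic restarts so a persistent failure can't spin
forever.
$! Sketch: give up after 25 consecutive link aborts
$ restarts = 0
$ start:
$ restarts = restarts + 1
$ if restarts .gt. 25 then exit
$ on warning then goto status
$ manage/enter/presen=mcc_dna4_evl
$ status:
$ sts = $status       ! save before other commands overwrite $STATUS
$ wait 00:00:20
$ if sts .eqs. "%X000020E4" then goto start
$ exit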
|