Conference azur::mcc

Title: DECmcc user notes file. Does not replace IPMT.
Notice: Use IPMT for problems. Newsletter location in note 6187
Moderator: TAEC::BEROUD
Created: Mon Aug 21 1989
Last Modified: Wed Jun 04 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 6497
Total number of notes: 27359

5117.0. "Notification performance -> problem" by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Wed May 26 1993 05:54

     Hi all,


     We're running acceptance tests at the Swedish PTT and have
     run into serious performance problems.
     
     The configuration is:
     
     VS4000/90, 128 Mb, 2*RZ58 as system-disk and MCC-disk
     respectively.  DNS, DECmcc v1.3.
     
     Process-quotas set up to or above what MCC-AUDIT suggests.
     
     The test that fails is to prove that the configuration has
     sufficient power and memory  to support 5 operators.
     
     The test-setup is as follows:
     
     1.  A process blocking 32 MB ( SYS$LCKPAG(32Mb) and
     SYS$HIBER ).  Working-set set up to hold it without paging.
     
     2.  DETmcc (Asset-SW developed in Sweden) with 400 alarm-rules
     enabled.  The rules poll every 10 hours, so no
     CPU-power is consumed, but memory is.  
     
     3.  A command-procedure, on another node, is generating 1
     DECnet event per second (ZERO EXECUTOR).  The events are
     sinked to the test-node.
     
     4.  5 DECmcc operators are logged in, running IMPM with the
     notification services window.  Default notification request
     enabled (Notify domain xxx).
     
     
     At this stage everything works OK.  Response time is good,
     with approx. 50-60% CPU-power left on the system.  Memory is
     sufficient, approx. 10-15 Mb on the free-list.
     
     When the operators create a notification request for
     'NODE4 node REMOTE NODE node COUNTERS ZEROED', the CPU goes
     up to 100% busy and they get the events from the
     remote node almost as expected.  The notifications come in
     groups of two or three at a time.  This doesn't cause any
     objections.
     
     After a couple of minutes (with 100% CPU-busy) all operators
     hang: almost no CPU is consumed (less than 5% CPU busy) and no
     other activity can be seen on the system (using MONITOR).
     Other processes, not running DECmcc, are not affected at
     all.
     
     This may continue for 60-90 seconds, until the hanging
     processes are 'released' and notifications continue, trying
     to catch up.  Sometimes we get a message saying that
     "one or more events were lost for NODE4 node ..." (I think,
     if I remember right, that this is a CVR from the event
     manager).  When the processes are released, the CPU goes up
     to 100% busy again.
     
     
     We tried having 5 operators issue 'GETEVENT NODE4 node
     REMOTE node COUNTERS ZEROED, at every 00:00:00.1' and this
     didn't cause any problems.  CPU was less than 40% busy, which I
     interpret as meaning that the event manager is not the bottleneck,
     but notification services/FM is.

     We ran MONITOR ALL_CLASSES during the tests, and found
     nothing strange during the hangings.

     We observed that notifications can't keep up with the rate
     at which the events are coming in.  In another test we did the
     following:

     1. one operator having a notify request for 'counters zeroed'
     on node Y
     
     2.  Node Y generating 100 'counters zeroed' events at a rate
     of 1/sec.
     
     3. No other user activity on the system.
     
     The time to process all 100 events and have them displayed
     in the notifications window was approx. 120 seconds.  This
     implies that the notifications FM can't sustain 1 event/sec for
     long and that the event manager will at some point
     have a full queue of events to be delivered.
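
The numbers from this single-operator test can be turned into a rough backlog estimate. The sketch below is a back-of-the-envelope model in Python, not anything taken from DECmcc: it assumes a constant arrival rate of 1 event/sec, a constant processing rate of 100 events per 120 seconds, and a single consumer. The function name and its parameters are made up for illustration.

```python
def backlog_over_time(arrival_rate, service_rate, duration_s, step_s=1.0):
    """Queue length after each time step for one consumer with constant rates.

    All rates are in events per second. This is a simplified fluid model and
    an assumption for illustration, not how the event manager is implemented.
    """
    backlog = 0.0
    history = []
    for _ in range(int(duration_s / step_s)):
        backlog += arrival_rate * step_s                 # events arriving in this step
        backlog -= min(backlog, service_rate * step_s)   # events processed in this step
        history.append(backlog)
    return history


arrival_rate = 1.0          # 1 event/sec, as in the test
service_rate = 100 / 120    # 100 events displayed in approx. 120 seconds

history = backlog_over_time(arrival_rate, service_rate, duration_s=100)
backlog_at_end = history[-1]                    # events still queued when arrivals stop
drain_time = backlog_at_end / service_rate      # seconds needed to clear that backlog

print(round(backlog_at_end, 1))   # about 16.7 events
print(round(drain_time, 1))       # about 20 seconds
```

Under these assumptions the backlog grows by roughly one event every six seconds, so how long a sustained flood can be absorbed depends on how many events the pool can hold.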
     
     What if we have a prolonged flood of events from the
     network into the MCC-system with notifications enabled?
     
     How many events can the event manager handle? At what rate?
     Can the event-queue be increased?



     We're going to do some more tests to try to find out more
     precisely under what conditions these hangs appear, and
     I'll be back with further info.


     I'd appreciate any comments.  
     
     We'll try to convince the customer that the test should be
     done some other way and that this test is far away from a
     real situation, but until then... 
     
     These hangings may ruin the whole deal...

	/Nils
T.R  Title  User  Personal Name  Date  Lines
5117.1. by CHEEKO::DITMARS (Pete) Wed May 26 1993 12:56 (26 lines)
Your FCL getevent test is slightly flawed:  you want to
specify a FOR DURATION instead of an AT EVERY.

GETEVENT NODE4 node
     REMOTE node COUNTERS ZEROED, for duration 24:00:00

Try that and see what happens.

Also, try running the same notification requests 
(not getevents) from the same number of FCL processes 
and direct the output to the null device.

notify domain blah blah blah , to file nl:

This will take the notif PM out of the loop but still 
test the FM.

Another thing to try is to close (File->Close Window not
just iconify) the Notif PM window(s) and just let the 
notifications arrive at the map (and beep).

A question: in your test of running 5 IMPMs, how many 
of your IMPMs are pumping their displays to other
workstations?  If all 5 IMPM sessions are running
with their displays on that single workstation, try
sending the displays elsewhere.
5117.2. "Can you change the size of the Event Manager ?" by TAEC::LAVILLAT Wed May 26 1993 13:04 (9 lines)
Would it also help to increase the size of the Event Manager pool ?

Is it possible on VMS (i.e. does MCC_EVENT_POOL_SIZE have an effect)?

My 2 cents.

Pierre.

5117.3. "Re .1" by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Thu May 27 1993 03:18 (56 lines)
Re .1

>>> Your FCL getevent test is slightly flawed:  you want to
>>> specify a FOR DURATION instead of an AT EVERY.
>>>
>>>   GETEVENT NODE4 node
>>>     REMOTE node COUNTERS ZEROED, for duration 24:00:00
>>>
>>> Try that and see what happens.

I guess that this wouldn't affect the performance.

Does NOTIF-FM do a 'GETEVENT xxx, FOR DURATION yyy' when you 
issue a notification request or does it issue the GETEVENT 
without any scheduling after processing a received event?
I thought it was the latter, so I specified AT EVERY to try to
do as NOTIF-FM.

>>> Also, try running the same notification requests 
>>> (not getevents) from the same number of FCL processes 
>>> and direct the output to the null device.
>>>
>>> notify domain blah blah blah , to file nl:
>>>
>>> This will take the notif PM out of the loop but still 
>>> test the FM.

I'll do it.


>>> Another thing to try is to close (File->Close Window not
>>> just iconify) the Notif PM window(s) and just let the 
>>> notifications arrive at the map (and beep).

I guess that this is like the above test:
    'notify domain blah blah blah , to file nl:'

>>> A question: in your test of running 5 IMPMs, how many 
>>> of your IMPMs are pumping their displays to other
>>> workstations?  If all 5 IMPM sessions are running
>>> with their displays on that single workstation, try
>>> sending the displays elsewhere.

We're not displaying on the workstation itself.  They have 5 
VXT2000s with 14 Mb of memory each, on which we're displaying 
the IMPMs.

We've tried everything from 5 IMPMs on 5 VXTs to 5 IMPMs on 
1 VXT and we get hangs in every case.  

Memory on the VXTs is not a bottleneck; there's 2-3 Mb of 
free memory when we have __lots__ of windows on the screen.

       /Nils


5117.4. by CHEEKO::DITMARS (Pete) Thu May 27 1993 10:30 (54 lines)
>I guess that this wouldn't affect the performance.

It may not.

>Does NOTIF-FM do a 'GETEVENT xxx, FOR DURATION yyy' when you 
>issue a notification request or does it issue the GETEVENT 
>without any scheduling after processing a received event?
>I thought it was the latter, so I specified AT EVERY to try to
>do as NOTIF-FM.

It's the former.  The scope of interest (FOR) gets passed
through by the AM to the event manager (via a parameter to 
mcc_event_get), which forces it to hang onto events that arrive 
between calls.  The scheduling time (AT) only affects the scheduling 
of the mcc_call to the AM.  It's possible (though I'm not sure how 
likely) that using AT would allow the event manager to drop events 
that would otherwise get stuck in the pool.
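
To make the distinction concrete, here is a toy Python model. It is not the DECmcc event manager and does not use any MCC API; the function `delivered_events`, its parameters, and the timings are invented for illustration. It assumes the pessimistic case raised above, in which events that arrive outside a get call are dropped when there is no standing scope of interest.

```python
def delivered_events(arrivals, windows, hold_between_calls):
    """Return the arrival times a consumer would receive in this toy model.

    arrivals: event arrival times in seconds.
    windows: ordered (start, end) intervals during which the consumer is
        waiting inside a get call.
    hold_between_calls: True models a standing scope of interest, where
        events arriving between calls are retained for the next call.
        False models the assumed worst case where they are dropped.
    """
    if not windows:
        return []
    first_start = windows[0][0]
    last_end = windows[-1][1]
    received = []
    for t in arrivals:
        in_window = any(start <= t < end for start, end in windows)
        retained = hold_between_calls and first_start <= t < last_end
        if in_window or retained:
            received.append(t)
    return received


arrivals = [0.0, 1.0, 2.0, 3.0, 4.0]                 # one event per second
windows = [(0.0, 0.5), (1.5, 2.5), (3.5, 4.5)]       # consumer busy between calls

with_duration = delivered_events(arrivals, windows, hold_between_calls=True)
without_duration = delivered_events(arrivals, windows, hold_between_calls=False)

print(with_duration)      # [0.0, 1.0, 2.0, 3.0, 4.0]
print(without_duration)   # [0.0, 2.0, 4.0]
```

In this model the retained events are exactly the ones that would pile up in the pool while a consumer is slow, which is why the two kinds of request may load the event manager differently.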

>>>> Another thing to try is to close (File->Close Window not
>>>> just iconify) the Notif PM window(s) and just let the 
>>>> notifications arrive at the map (and beep).
>
>I guess that this is like the above test:
>    'notify domain blah blah blah , to file nl:'

Kinda, but not quite.  There's still a fair amount of processing that
the IMPM does with each notification... just far fewer X calls.

>We're not displaying on the workstation itself.  They have 5 

Hmmmm....  Same behavior if all 5 IMPMs pump their displays to 
the workstation?  (not that this is a scenario that a customer
would use...)

I do remember, in one test scenario, we saw a kind of "reverse traffic 
jam" behavior where the IMPM/Notif PM kind of slowed way down and then, 
by starting another PM, things sped back up again.  During that
testing, we found a major memory leak.  When we fixed the 
memory leak, the weird "traffic jam" behavior went away.  I never 
really understood what could have caused performance to improve when
two processes were running as opposed to one, but I had this vague
feeling that the code being executed was more likely to be in physical 
memory when two processes were using it than when one was.  That, coupled
with the fact that the virtual address usage of the processes was
growing rapidly, I guess.  Or, it could be that the event manager was
unable to handle the throughput when the consumers were "slow".

>     The test-setup is as follows:
>     
>     1.  A process blocking 32 MB ( SYS$LCKPAG(32Mb) and
>     SYS$HIBER ).  Working-set set up to hold it without paging.

I'm not sure I understand the point of this.  Is it effectively taking
32Mb of physical (and virtual) memory away from the rest of the system?
5117.5. "Block 32 Mb - make sure enough memory for expansion." by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Fri May 28 1993 03:58 (17 lines)
re .4

>>> I'm not sure I understand the point of this.  Is it effectively taking
>>> 32Mb of physical (and virtual) memory away from the rest of the system?

The blocking of 32 Mb is just to prove that 128 Mb is enough, and that they 
have some room for expansion regarding memory usage.

It is taking 32 Mb out of memory.  The System Services manual says:

 "The Lock Pages in Memory service locks a page or range
  of pages in memory. The specified virtual pages are forced
  into the working set and then locked in memory. A locked
  page is not swapped out of memory if the working set of the
  process is swapped out. These pages are not candidates for
  page replacement and in this sense are locked in the working
  set as well."
5117.6. "Hangs if using FCL with 'TO FILE NL:' as well" by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Tue Jun 01 1993 03:42 (15 lines)
We tried FCL with 'NOTIFY ..., TO FILE NL:' and it did hang for 30-60 seconds.
The first time I tested, it worked ok with 400+ events, but the second time it 
failed...

We also tried IMPM with the notification window closed with no success.

I guess that this rules out X-traffic overhead as the problem.


Anyone tried to reproduce the problem?   We get the hangs every time
when running 1 event/sec for more than 2 minutes with 4 or more operators. 
We even had hangs with two operators, but that is not reproducible.


      /Nils
5117.7. "What LOG-logicals to use?" by STKHLM::BERGGREN (Nils Berggren EIS/Project dpmt, Sweden DTN 876-8287) Wed Jun 02 1993 03:09 (14 lines)
Are the LOG-logicals (MCC_EVENT_LOG and MCC_EVENT_TRACE) still supported?
If yes, what are the proper values to set?  Is it still 1 and 180 respectively?

Is the logical MCC_NOTIFICATION_FM_LOG the right name for NOTIFICATION-FM and
what values should be used?

Does it make any sense to use them to figure out when and where it is 
hanging, and what's going on?

Any other hints on what to do to figure out what's going on (or rather what is
*NOT* going on...)

   
    /Nils
5117.8. "Some event pool logicals" by TOOK::T_HUPPER (The rest, as they say, is history.) Fri Jun 11 1993 18:03 (11 lines)
    The event logicals MCC_EVENT_LOG and MCC_EVENT_TRACE are still usable,
    values 1 and 180 (hex) respectively.  These will cause the dumping of
    event puts and gets.  This may be a LARGE amount of output.  Make sure
    that all processes using the event pool use these logicals to get the
    whole picture.
    
    The event logical MCC_EVENT_POOL_SIZE can be used to set the size of
    the event pool in bytes (evenly divisible by 1024).  Minimum is 262144,
    default is 524288, max is 8192000.
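
The limits above can be checked mechanically before a value is defined. This is a minimal Python sketch that only encodes the three numbers and the divisibility rule stated in this note; the constant and function names are illustrative, and it does not query or set anything on a real system.

```python
# Limits as stated in this note (bytes).
MIN_POOL_SIZE = 262144
DEFAULT_POOL_SIZE = 524288
MAX_POOL_SIZE = 8192000


def check_pool_size(size_bytes):
    """Return a list of problems with a candidate pool size.

    An empty list means the value fits the limits quoted in this note.
    """
    problems = []
    if size_bytes % 1024 != 0:
        problems.append("not evenly divisible by 1024")
    if size_bytes < MIN_POOL_SIZE:
        problems.append("below minimum of %d" % MIN_POOL_SIZE)
    if size_bytes > MAX_POOL_SIZE:
        problems.append("above maximum of %d" % MAX_POOL_SIZE)
    return problems


print(check_pool_size(DEFAULT_POOL_SIZE))   # []
print(check_pool_size(1048576))             # []
print(check_pool_size(300000))              # ['not evenly divisible by 1024']
```

For example, doubling the default to 1048576 bytes stays within the stated range and is a multiple of 1024.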
    
    Ted
5117.9. "Notification window missing events" by NEWVAX::BUCHMAN (UNIX refugee in a VMS world) Wed May 31 1995 00:36 (18 lines)
    A member of our test team is noticing that an operator's notification
    window can start to miss events if it has been up for a long time.
    Is this a normal occurrence? Should the notifications window not be
    left up for long periods?
    			Thanks,
    				Jim B.
    
    Tester's message follows:
    
    > When MCC has been up for hours on end, the notification window can slow
    > down.  This is indicated when there are gaps between the alarm ids
    > shown in the notification window.
    >
    > (Note: all alarms are logged to the mcc_notification.log file.)
    >
    > To fix the problem,
    > you can clear out all of the alarms in the notification window
    > periodically, or restart MCC.
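
The gap check the tester describes can be done mechanically. The following Python sketch assumes that alarm ids are integers that normally increase by one, which is an assumption and not something stated above; the function name and the sample ids are made up for illustration.

```python
def find_id_gaps(displayed_ids):
    """Return (first_missing, last_missing) ranges between displayed alarm ids.

    Assumes ids are integers that normally increase by one per alarm.
    """
    ids = sorted(set(displayed_ids))
    gaps = []
    for prev, cur in zip(ids, ids[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


sample_ids = [101, 102, 103, 106, 107, 110]   # hypothetical ids read off the window
print(find_id_gaps(sample_ids))               # [(104, 105), (108, 109)]
```

Comparing the missing ranges against the alarms recorded in the log file would show whether the events were lost or only not displayed.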
5117.10. "Any ideas?" by NEWVAX::BUCHMAN (UNIX refugee in a VMS world) Mon Jun 12 1995 22:29 (8 lines)
    > Note 5117.9 by NEWVAX::BUCHMAN "UNIX refugee in a VMS world" >>>
    >               -< Notification window missing events >-
    
    Can anyone give us an opinion here? We have increased the pool size,
    and that helps a bit; but can we do anything else? BTW, all of our
    events come in via SNMP traps.
    				Thanks,
    					Jim B.