Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

5910.0. "Notification stops - serious problem at BT." by BAHTAT::BOND () Tue Mar 15 1994 05:47

    Hi,
    
    My customer, British Telecom, is having serious problems with
    notification services.  This has been reported to the UK TSC and brief
    details were given by Bipin Mistry in note 5866.  I offer here as much
    detail as we currently have in order to try to get the problem solved
    because the customer is getting to the end of his tether (not
    surprisingly).
    
    The customer has a large domain hierarchy and creates a "notify domain
    top entity=(collector *) event=(any event)" which expands to all
    subdomains.  This is done from the root account.  His main users also
    have the same notify command but specify a starting domain which is not
    quite at the top of the domain tree, i.e. his main users do not see the
    total hierarchy.  We see a number of different problems which we think
    are all related to notification services.
    
    Sometimes, notification stops working for some or all of the main users
    but it carries on for the root users.  Sometimes it stops for everybody
    including root (but more often just for the main users).  It doesn't
    produce any errors or core dumps when it stops, you just don't get
    anything displayed in the window and map colour changes stop.
    
    Sometimes after a reboot, notification will not begin at all for the
    main users although it always starts for root.  The notify requests
    window indicates it is created and enabled successfully (takes about 20
    minutes!) but no events get displayed whereas they do for root.
    
    Bringing in a third and fourth username (different from root and 'main
    users') often gives the error "No matches found for entities or rules
    requested" even though they are using the same domain hierarchy and
    notify command as 'main users'.  Obviously they get no notifications.
    
    Looking in the MCC_ECO conference, I saw that notes 77 and 52 reference
    a bug in notification_fm that gives the above error message.  But it
    says it is against mcc 1.2.3 and we are running 1.3.  Actually, to be
    exact, we are running the special iconic_map that says " 07/09/93
    Special executable for DEC_WM_TAKE_FOCUS problem." that was released to
    BT to try to fix another problem but I believe notification is vanilla
    1.3.  COULD IT BE THAT THIS OLD FIX IS NOT IN THIS VERSION OF THE MAP??
    We have also been given a new version of mcc_evc_sink to fix the
    problem whereby the sink sometimes stops receiving decnet events.
    
    The problem seems to have got worse since we upgraded the disks on the
    system (from 4 RZ25s to 2 RZ26s and an RZ58) and re-loaded all the
    partitions.  The system now feels much faster.  I wonder if that is
    exploiting a timing window in the code somewhere?  We also operate with
    a large DNS cache so DECmcc resolves its name lookups as quickly as
    possible.  The problem is now occurring at least three times a day and
    quite often, notifications will not start at all for the 'main users'
    as I described above.  (We get them to start eventually by just
    rebooting again and again.)
    
    We are running on ULTRIX 4.3A, DECnet OSI V5.1, 5000-240 with R4000
    upgrade, 416MB memory.  We have doubled all the values in sem.h in case
    it was a semaphore problem but this doesn't seem to have helped. 
    max_nofiles is set to 1024 and we have set maxusers to 128 and maxuprc
    to 512 to make sure the system generally has plenty of resources
    allocated.  IS THERE SOMETHING ELSE WE COULD BE RUNNING OUT OF?
    What about shared memory segments?  smmax is quite high at 50000
    because we run with the maximum 8192000 sized event pool and a 20MB DNS
    cache.  As far as I know, nothing else requires shared memory other
    than mcc and dns and these are the only things that run on the system.
    
    Is there anything we can do to encourage notification services to tell
    us why it stops, or perhaps better, why it sometimes doesn't start at
    all?  Any debug flags we can set?  Perhaps doing this will change the
    timing and eliminate the problem!
    
    We really must get to the bottom of this problem as soon as possible as
    the customer is about to lose his rag.  (He has already escalated this
    to members of the UK board of management...)
    
    Thanks,
    
    Chris Bond.
    
T.R     Title     User     Personal Name     Date     Lines
5910.1. "a few hints" by TAEC::LAVILLAT () Tue Mar 15 1994 08:59 (23 lines)
  Chris,

  I speak only for myself and this is not an official response from the
  MCC team (I am not part of it :-) ).

  First, I would suggest that BT file a CLD against MCC for this behavior.
  Since MCC on ULTRIX is in maintenance mode, the only way to have bugs
  corrected is to enter CLDs.

  Second, you mention that BT is running MCC V1.3 on ULTRIX 4.3A using R4000
  machines. Be aware that neither ULTRIX 4.3A nor R4000 machines are supported
  by MCC V1.3. Have a look at the MCC V1.3 SSA, and you will probably not see
  any mention of ULTRIX 4.3A support, nor of 5260 family machines.

  Third, tell BT to start the notif fm after doing a
  'setenv MCC_NOTIFICATION_FM_LOG 0x88' which will produce (hopefully) a lot 
  of useful (or garbage :-( ) information.

  Regards.

  Pierre.

5910.2. "Thanks - where does it log to?" by BAHTAT::BOND () Tue Mar 15 1994 11:04 (17 lines)
    Hi Pierre,
    
    Thanks for the quick response.  Notifications is normally enabled
    directly from the iconic map PM.  If I set the environment variable,
    does it write to a file or will it try to write to the terminal?  If
    so, I guess I will have to enable notifications via an FCL window which
    may change the behaviour.
    
    Also, I believe that mcc 1.3 does support R4000 and ULTRIX 4.3A.  I had
    mail from Joe O'Connor in NSM engineering sometime back saying testing
    was complete.  I think the spd and ssa have just not been updated :-)
    
    The only real support issue in our configuration (as far as I know) is
    that we are running DECnet/OSI V5.1 rather than V5.1A but that is
    because mcc won't play with the new dnsclerk in 5.1A.
    
    chris
5910.3. "What about UDM" by SCCA::dave (Ahh, but fortunately, I have the key to escape reality.) Tue Mar 15 1994 13:26 (17 lines)
What you seem to have skipped completely is any mention of UDM and the fact
that the UDM users are the ones that see this problem.  Have you isolated this
problem since the CLD was entered, or have you determined that UDM has nothing
to do with it?

What happens if UDM is not run?  Does the problem persist?

------------------------------------------------------------------------------------
From the CLD:

    The customer has 2 users set up to receive notification alarms: root and
    udmman. What appears to be happening is that over a period of time the
    udmman account stops receiving notification. However, what is extremely
    strange is that subsequent users who log into the udmman account receive
    notification messages. So what I suspect is happening is that the udmman
    users are running out of a resource, or their internal notification stack
    size is not large enough and is not being flushed out.

5910.4. "You can redirect the output" by TAEC::LAVILLAT () Wed Mar 16 1994 03:21 (29 lines)
Re .2:

>    
>    Thanks for the quick response.  Notifications is normally enabled
>    directly from the iconic map PM.  If I set the environment variable,
>    does it write to a file or will it try to write to the terminal?  If
>    so, I guess I will have to enable notifications via an FCL window which
>    may change the behaviour.
>    
	The log goes to the terminal (tty) of the user who started the NOTIF FM.
	If you want to redirect the output, do a 'ps' command to determine
	the enrollment ID of the notif fm, kill it, and restart it via
	'/usr/mcc/mmexe/mcc_notification_fm <enroll_id> N > <filename> &'
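
As a rough sketch only, the redirect procedure could be wrapped in a small Python helper. The executable path, the 'N' argument and the MCC_NOTIFICATION_FM_LOG value 0x88 are taken from this thread; the function names build_restart and restart_with_log are hypothetical, and the enrollment ID must still be read by hand from the 'ps' listing before the old process is killed. Only stdout is redirected, as in the command above; whether the FM also writes to stderr is not known here.

```python
import os
import subprocess

NOTIF_FM = "/usr/mcc/mmexe/mcc_notification_fm"  # path quoted in this thread


def build_restart(enroll_id, log_level="0x88"):
    """Return (argv, env) for restarting the notification FM with logging on.

    enroll_id is assumed to have been read by hand from 'ps'; this sketch
    does not try to discover it or to kill the old process.
    """
    argv = [NOTIF_FM, str(enroll_id), "N"]  # 'N' copied verbatim from the note
    env = dict(os.environ)
    env["MCC_NOTIFICATION_FM_LOG"] = log_level  # same effect as the setenv in .1
    return argv, env


def restart_with_log(enroll_id, log_path):
    """Start the FM in the background with stdout redirected to log_path."""
    argv, env = build_restart(enroll_id)
    with open(log_path, "w") as log_file:
        # Roughly: mcc_notification_fm <enroll_id> N > <filename> &
        return subprocess.Popen(argv, env=env, stdout=log_file)
```

Keeping the command construction separate from the launch makes it possible to print and check the exact command line before touching a live system.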

>    Also, I believe that mcc 1.3 does support R4000 and ULTRIX 4.3A.  I had
>    mail from Joe O'Connor in NSM engineering sometime back saying testing
>    was complete.  I think the spd and ssa have just not been updated :-)
>    
	OK, so you have an official but unpublished statement...

>    The only real support issue in our configuration (as far as I know) is
>    that we are running DECnet/OSI V5.1 rather than V5.1A but that is
>    because mcc won't play with the new dnsclerk in 5.1A.
>    
	We know this problem :-(

	Regards.

	Pierre.
5910.5. "What about mbufs ?" by BIKINI::KRAUSE (European NewProductEngineer for MCC) Wed Mar 16 1994 03:57 (11 lines)
There is a problem in DECnet/OSI V5.1 that causes a loss of mbufs when
protocols are started and stopped frequently. It shows up drastically
with alarm rules polling bridges, but other modules might suffer from
this as well.

Do a netstat -m and watch the mbufs value. 

The DECnet/OSI V5.1 patch that solves the problem is dli_bind.o, but 
while you're at it you should install net_common.o and if_ln.o as well.

*Robert
5910.6. "Thanks - and more info" by BAHTAT::BOND () Wed Mar 23 1994 06:31 (34 lines)
    Thanks to all the noters for the responses so far.  The problem is
    still happening and we are beginning to narrow down whether it is to
    do with notification or the collection_am.
    
    re: .3, We cannot remove UDM from the equation because it is used to
    manage the terminal servers.  However, note that there are no
    notification requests that expand to getevents to the remote_station or
    unix_system access modules.  ALL notifications come through the
    collection_am.  Obviously remote_stations are the targets for some of
    the collector events.  The 'udmman' account is just the non-privileged
    user account that the bulk of users come through.  The root users still
    have a very similar set of notifications enabled - it is not that the
    udmman users are the only ones that receive events targeted to udm
    type entities; root users do as well, and sometimes notification stops
    for root as well.
    
    re: .4, Pierre, we haven't tried tracing yet but thanks for all the
    information telling us how to use it.
    
    re: .5 I believe in the past we have seen this mbuf problem; once we
    found that all incoming connects were rejected and outgoing 'dlogins'
    failed with "no buffer space available".  But we haven't seen anything
    like this since.  What are we looking for when watching mbufs; a
    continuously increasing value?
    
    The UK CSC advised me NOT to apply the 5.1 kernel patches because the
    system at BT is running 4.3A and they believed that the 5.1 patches
    were developed for 4.3 and hence would break a 4.3A kernel.  They
    further thought that the 4.3A kernel would have all those changes
    anyway.  We are obviously happy to receive more advice here but again,
    because it is very much a live system, we are not in a position to
    experiment with patches that might be applicable but might not be!  We
    are running the 5.1 patched DNS executables.
    
5910.7. "" by BIKINI::KRAUSE (European NewProductEngineer for MCC) Mon Mar 28 1994 05:20 (7 lines)
>    like this since.  What are we looking for when watching mbufs; a
>    continuously increasing value?

Yes, that's what I meant. Depending on the number of mbufs available on 
the system it may well take a few days before they are all used up.
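
One way to make "continuously increasing" concrete is to note the mbufs-in-use figure from 'netstat -m' at regular intervals and check the series. The sketch below is an illustration only: the function name looks_like_leak, the four-sample minimum and the sample figures in the usage note are assumptions, not values taken from BT's system.

```python
def looks_like_leak(samples, min_samples=4):
    """Return True if the mbuf count never drops and ends above its start.

    samples: mbufs-in-use figures noted by hand from 'netstat -m', oldest
    first, taken at roughly regular intervals.
    min_samples: assumed lower bound on how many readings are needed before
    drawing any conclusion.
    """
    if len(samples) < min_samples:
        return False  # too few readings to say anything
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and samples[-1] > samples[0]
```

For example, looks_like_leak([210, 240, 240, 310, 395]) returns True, while a series that rises and falls, such as [210, 260, 215, 250, 220], returns False. Since it may take a few days for the mbufs to run out, readings spread over hours rather than minutes are more telling.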

*Robert