Conference azur::mcc

Title:	DECmcc user notes file. Does not replace IPMT.
Notice:	Use IPMT for problems. Newsletter location in note 6187
Moderator:	TAEC::BEROUD

Created:	Mon Aug 21 1989
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	6497
Total number of notes:	27359

5910.0. "Notification stops - serious problem at BT." by BAHTAT::BOND () Tue Mar 15 1994 05:47

    Hi,
    
    My customer, British Telecom, is having serious problems with
    notification services.  This has been reported to the UK TSC and brief
    details were given by Bipin Mistry in note 5866.  I offer here as much
    detail as we currently have in order to try to get the problem solved
    because the customer is getting to the end of his tether (not
    surprisingly).
    
    The customer has a large domain hierarchy and creates a "notify domain
    top entity=(collector *) event=(any event)" which expands to all
    subdomains.  This is done from the root account.  His main users also
    have the same notify command but specify a starting domain which is not
    quite at the top of the domain tree, ie his main users do not see the
    total hierarchy.  We see a number of different problems which we think
    are all related to notification services.
    
    Sometimes, notification stops working for some or all of the main users
    but it carries on for the root users.  Sometimes it stops for everybody
    including root (but more often just for the main users).  It doesn't
    produce any errors or core dumps when it stops, you just don't get
    anything displayed in the window and map colour changes stop.
    
    Sometimes after a reboot, notification will not begin at all for the
    main users although it always starts for root.  The notify requests
    window indicates it is created and enabled successfully (takes about 20
    minutes!) but no events get displayed whereas they do for root.
    
    Bringing in a third and fourth username (different from root and 'main
    users') often gives the error "No matches found for entities or rules
    requested" even though they are using the same domain hierarchy and
    notify command as 'main users'.  Obviously they get no notifications.
    
    Looking in the MCC_ECO conference, I saw that notes 77 and 52 reference
    a bug in notification_fm that gives the above error message.  But it
    says it is against mcc 1.2.3 and we are running 1.3.  Actually, to be
    exact, we are running the special iconic_map that says " 07/09/93
    Special executable for DEC_WM_TAKE_FOCUS problem." that was released to
    BT to try to fix another problem but I believe notification is vanilla
    1.3.  COULD IT BE THAT THIS OLD FIX IS NOT IN THIS VERSION OF THE MAP??
    We have also been given a new version of mcc_evc_sink to fix the
    problem whereby the sink sometimes stops receiving decnet events.
    
    The problem seems to have got worse since we upgraded the disks on the
    system (from 4 RZ25s to 2 RZ26s and an RZ58) and re-loaded all the
    partitions.  The system now feels much faster.  I wonder if that is
    exploiting a timing window in the code somewhere?  We also operate with
    a large DNS cache so DECmcc resolves its name lookups as quickly as
    possible.  The problem is now occurring at least three times a day and
    quite often, notifications will not start at all for the 'main users'
    as I described above.  (We get them to start eventually by just
    rebooting again and again.)
    
    We are running on ULTRIX 4.3A, DECnet OSI V5.1, 5000-240 with R4000
    upgrade, 416MB memory.  We have doubled all the values in sem.h in case
    it was a semaphore problem but this doesn't seem to have helped. 
    max_nofiles is set to 1024 and we have set maxusers to 128 and maxuprc
    to 512 to make sure the system generally has plenty of resources
    allocated.  IS THERE SOMETHING ELSE WE COULD BE RUNNING OUT OF?
    What about shared memory segments?  smmax is quite high at 50000
    because we run with the maximum 8192000 sized event pool and a 20MB DNS
    cache.  As far as I know, nothing else requires shared memory other
    than mcc and dns and these are the only things that run on the system.
    
    Is there anything we can do to encourage notification services to tell
    us why it stops, or perhaps better, why it sometimes doesn't start at
    all?  Any debug flags we can set?  Perhaps doing this will change the
    timing and eliminate the problem!
    
    We really must get to the bottom of this problem as soon as possible as
    the customer is about to lose his rag.  (He has already escalated this
    to members of the UK board of management...)
    
    Thanks,
    
    Chris Bond.

T.R	Title	User	Personal Name	Date	Lines
5910.1	a few hints	TAEC::LAVILLAT		`Tue Mar 15 1994 08:59`	23
	Chris,

	I speak only for myself and this is not an official response from
	the MCC team (I am not part of it :-) ).

	First, I would suggest that BT file a CLD against MCC for this
	behavior.  Since MCC on ULTRIX is in maintenance mode, the only way
	to have bugs corrected is to enter CLDs.

	Second, you mention that BT is running MCC V1.3 on ULTRIX 4.3A using
	R4000 machines.  Be aware that neither ULTRIX 4.3A nor R4000 machines
	are supported by MCC V1.3.  Have a look at the MCC V1.3 SSA, and you
	will probably not see any mention of ULTRIX 4.3A support, nor of 5260
	family machines.

	Third, tell BT to start the notif fm after doing a
	'setenv MCC_NOTIFICATION_FM_LOG 0x88', which will produce (hopefully)
	a lot of useful (or garbage :-( ) information.

	Regards.
	Pierre.
5910.2	Thanks - where does it log to?	BAHTAT::BOND		`Tue Mar 15 1994 11:04`	17
	Hi Pierre,

	Thanks for the quick response.  Notification is normally enabled
	directly from the iconic map PM.  If I set the environment variable,
	does it write to a file or will it try to write to the terminal?  If
	so, I guess I will have to enable notifications via an FCL window,
	which may change the behaviour.

	Also, I believe that mcc 1.3 does support R4000 and ULTRIX 4.3A.  I
	had mail from Joe O'Connor in NSM engineering some time back saying
	testing was complete.  I think the spd and ssa have just not been
	updated :-)

	The only real support issue in our configuration (as far as I know)
	is that we are running DECnet/OSI V5.1 rather than V5.1A, but that is
	because mcc won't play with the new dnsclerk in 5.1A.

	chris
5910.3	What about UDM	SCCA::dave	Ahh, but fortunately, I have the key to escape reality.	`Tue Mar 15 1994 13:26`	17
	What you seem to have skipped completely is any mention of UDM and
	the fact that the UDM users are the ones that see this problem.  Have
	you isolated this problem since the CLD was entered, or have you
	determined that UDM has nothing to do with it?

	What happens if UDM is not run?  Does the problem persist?

	----------------------------------------------------------------------
	From the CLD:

	The customer has 2 users set up to receive notification alarms: root
	and udmman.  What appears to be happening is that over a period of
	time the udmman account stops receiving notification.  However, what
	is extremely strange is that subsequent users who log into the udmman
	account receive notification messages.  So what I suspect is
	happening is that the udmman users are running out of a resource, or
	their internal notification stack size is not large enough and is not
	being flushed out.
5910.4	You can redirect the output	TAEC::LAVILLAT		`Wed Mar 16 1994 03:21`	29
	Re .2:

	> Thanks for the quick response.  Notifications is normally enabled
	> directly from the iconic map PM.  If I set the environment variable,
	> does it write to a file or will it try to write to the terminal?  If
	> so, I guess I will have to enable notifications via an FCL window which
	> may change the behaviour.

	The log goes to the stty of the user who has started the NOTIF FM.

	If you want to redirect the output, do a 'ps' command to determine
	the enrollment ID of the notif fm, kill it, and restart it via

	'/usr/mcc/mmexe/mcc_notification_fm <enroll_id> N > <filename> &'

	> Also, I believe that mcc 1.3 does support R4000 and ULTRIX 4.3A.  I had
	> mail from Joe O'Connor in NSM engineering sometime back saying testing
	> was complete.  I think the spd and ssa have just not been updated :-)

	Ok so, if you have a 'not published official' statement...

	> The only real support issue in our configuration (as far as I know) is
	> that we are running DECnet/OSI V5.1 rather than V5.1A but that is
	> because mcc won't play with the new dnsclerk in 5.1A.

	We know this problem :-(

	Regards.
	Pierre.
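
The restart step described in .4 can be sketched as a small Bourne-shell helper. This is only an illustration: `notif_fm_restart_cmd` is a hypothetical name, the helper prints the command rather than running it, and the enrollment ID and log file name are placeholders that must come from your own `ps` listing. The executable path, the literal `N` argument and the variable value `0x88` are copied from the replies above; the `VAR=value; export VAR` form is simply the Bourne-shell counterpart of csh `setenv`.

```shell
# Dry-run sketch: print the restart command instead of executing it,
# so that nothing is killed or started by accident.
# $1 = enrollment ID of the notif fm (placeholder, read it from `ps`)
# $2 = file that should receive the log output (placeholder)
notif_fm_restart_cmd() {
    enroll_id=$1
    logfile=$2
    # Path and the "N" argument are taken verbatim from reply .4.
    printf '%s %s N > %s &\n' \
        /usr/mcc/mmexe/mcc_notification_fm "$enroll_id" "$logfile"
}

# Bourne-shell equivalent of: setenv MCC_NOTIFICATION_FM_LOG 0x88
# The variable must be set in the shell that later starts the notif fm.
MCC_NOTIFICATION_FM_LOG=0x88
export MCC_NOTIFICATION_FM_LOG
```

Calling `notif_fm_restart_cmd <enroll_id> <filename>` with real values prints a line that can be reviewed first and then pasted into the same shell, after the old notif fm process has been killed.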
5910.5	What about mbufs ?	BIKINI::KRAUSE	European NewProductEngineer for MCC	`Wed Mar 16 1994 03:57`	11
	There is a problem in DECnet/OSI V5.1 that causes a loss of mbufs
	when protocols are started and stopped frequently.  It shows up
	drastically with alarm rules polling bridges, but other modules might
	suffer from this as well.

	Do a netstat -m and watch the mbufs value.

	The DECnet/OSI V5.1 patch that solves the problem is dli_bind.o, but
	while you're at it you should install net_common.o and if_ln.o as
	well.

	*Robert
5910.6	Thanks - and more info	BAHTAT::BOND		`Wed Mar 23 1994 06:31`	34
	Thanks to all the noters for the responses so far.  The problem is
	still happening and we are beginning to narrow down whether it is to
	do with notification or the collection_am.

	re: .3, We cannot remove UDM from the equation because it is used to
	manage the terminal servers.  However, note that there are no
	notification requests that expand to getevents to the remote_station
	or unix_system access modules.  ALL notifications come through the
	collection_am.  Obviously remote_stations are the targets for some of
	the collector events.  The 'udmman' account is just the
	non-privileged user account that the bulk of users come through.  The
	root users still have a very similar set of notifications enabled -
	it is not that the udmman users are the only ones that receive events
	targeted to udm type entities; root users do as well, and sometimes
	notification stops for root as well.

	re: .4, Pierre, we haven't tried tracing yet but thanks for all the
	information telling us how to use it.

	re: .5 I believe in the past we have seen this mbuf problem; once we
	found that all incoming connects were rejected and outgoing 'dlogins'
	failed with "no buffer space available".  But we haven't seen
	anything like this since.  What are we looking for when watching
	mbufs; a continuously increasing value?

	The UK CSC advised me NOT to apply the 5.1 kernel patches because the
	system at BT is running 4.3A and they believed that the 5.1 patches
	were developed for 4.3 and hence would break a 4.3A kernel.  They
	further thought that the 4.3A kernel would have all those changes
	anyway.  We are obviously happy to receive more advice here but
	again, because it is very much a live system, we are not in a
	position to experiment with patches that might be applicable but
	might not be!  We are running the 5.1 patched DNS executables.
5910.7		BIKINI::KRAUSE	European NewProductEngineer for MCC	`Mon Mar 28 1994 04:20`	7
	> like this since.  What are we looking for when watching mbufs; a
	> continuously increasing value?

	Yes, that's what I meant.  Depending on the number of mbufs available
	on the system it may well take a few days before they are all used
	up.

	*Robert
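
The "continuously increasing value" check from .5 and .7 can be sketched as a small Bourne-shell function. It assumes the operator has already noted the mbuf count from successive `netstat -m` runs, one number per line, oldest first; the layout of `netstat -m` output is not shown in this thread, so no parsing of it is attempted. `mbuf_trend` is a hypothetical name, and the test is deliberately simple: it only reports "increasing" when every sample is larger than the one before.

```shell
# Reads one mbuf count per line on stdin (oldest sample first).
# Prints "increasing" if each sample is strictly larger than the
# previous one, "not increasing" if any sample fails to grow, and
# "not enough samples" when fewer than two samples were supplied.
mbuf_trend() {
    awk '
        NR > 1 && $1 <= prev { bad = 1 }   # a sample that did not grow
        { prev = $1 }                       # remember for the next line
        END {
            if (NR < 2)   print "not enough samples"
            else if (bad) print "not increasing"
            else          print "increasing"
        }'
}
```

For example, `printf '100\n120\n150\n' | mbuf_trend` reports `increasing`. On a busy system the count may fluctuate between samples, so a strict check like this can miss a slow leak; samples taken hours apart, over the few days mentioned in .7, would give a clearer picture.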