Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

1400.0. "NML_nnnn processes; Node terminated link before link confirmation" by CUJO::HILL (Dan Hill - Network Management - (Customer Resident)) Tue Aug 27 1991 13:53

    We have been seeing multiple LINE RECEIVE FAILURE errors on all nodes
    in our LAVc.  To monitor this, I wrote an alarm rule for each node in
    our cluster:
    
    Example:
    CREATE MCC 0 ALARMS RULE NOMAD_LINE_RECEIVE_FAILURE -
      EXPRESSION = (CHANGE_OF(NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE,*,*) ,-
                     AT EVERY 00:03:00) ,-
      PROCEDURE  = MCC_COMMON:MCC_ALARMS_LOG_ALARMS.COM ,-
      EXCEPTION HANDLER = MCC_COMMON:MCC_ALARMS_LOG_EXCEPTION.COM ,-
      CATEGORY   = "Receive problems" ,-
      DESCRIPTION= "Line Receive Failures Detected." ,-
      QUEUE      = "MCC_ALARMS_BATCH" ,-
      PARAMETER  = "NODE_ALARMS.LOG" ,-
      PERCEIVED SEVERITY = MINOR ,-
      IN DOMAIN  = .m.seg20.MAINTENANCE
    
    I brought up DECmcc FCL and executed the command procedure to enable
    the alarms I had just created.  I also brought up the Iconic Map to
    view the alarms as they fired.
    
    Our cluster went down the tubes as I did this.  Everything ground to a
    halt as NML_nnnn processes (4 per node) kicked into gear and sucked up
    CPU time.
       Example:   NML_8232, NML_8233, NML_8235, NML_8237
    
    CRITICAL alarms fired in DECmcc on the Iconic Map window for each node.
    The alarm message was:
    
        Node terminated link before link confirmation
    
    The alarm rule that I had initially enabled was now disabled for each
    node.  I checked max links and alias max links on all nodes: 32 on all
    of them, except 64 on our cluster boot node and 128 on my network
    management node.  Exec counters showed none of the nodes maxed out.
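
    For reference, max links and alias max links appear under NCP's SHOW
    EXECUTOR CHARACTERISTICS, and the exec counters under SHOW EXECUTOR
    COUNTERS:

        NCP> SHOW EXECUTOR CHARACTERISTICS
        NCP> SHOW EXECUTOR COUNTERS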
    
    I also noticed that server processes began to drop out as connections
    were broken.
    
    When I ENABLED the rules again via the Iconic Map, two additional
    NML_nnnn processes were created on each node, further degrading system
    performance.  I had to exit DECmcc to prevent the total demise of our
    cluster.  Everything came back to life as the NML_nnnn processes timed
    out.
    
    I also noticed that the DECmcc Iconic Map window had an ACCVIO message
    in it:  access violation, reason mask=00, virtual address=00000004,
    PC=003CD716, PSL=03C00004
    
    ------------------------------
    Can someone provide a fix for this problem?
    
    -Thanks,
     Dan

1400.1. "Alarms uses the Show Directive ..." by NANOVX::ROBERTS (Keith Roberts - DECmcc Toolkit Team) Tue Aug 27 1991 16:50

The Alarms Rule Evaluator converts your Rule Expression into a Show
directive (or Getevent directive for the Occurs function).

If your rule is:

(CHANGE_OF(NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE,*,*), AT EVERY 00:03:00)

Try:

SHOW NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE AT EVERY 00:03:00

This is exactly the command we generate.  I can't imagine why your
system got bogged down by the NML processes.

/keith

1400.2. "Additional NML_nnnn problems using NODE4_AM rules." by CUJO::HILL (Dan Hill-Net.Mgt.-Customer Resident) Tue Aug 27 1991 17:43

    Some additional info:
    I've created rules for additional nodes to monitor reachability of
    nodes in a specific area using the following syntax:
    
    CREATE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node2 STATE = UNREACHABLE ,-
    		  AT EVERY 00:05:00) ,-
                       .
                       .
                       .
    
    NOTE that node1 is the area router for node2.
    
    Two NML_nnnn processes were started for each end node on the area
    router (node1).  I enabled rules for 14 nodes, so the area router had
    28 NML_nnnn processes running.
    
    ---------------
    Thanks for the info in .1.  I'll continue to research this issue,
    but until I resolve it, I can't use the NODE4_AM rules I've created
    since the result is a massive system impact.
    
    -dh

1400.3. "No answers, some suggestions, some questions" by TOOK::CAREY  Wed Aug 28 1991 15:17

    
    Re: .0 -- I'm not sure why you're seeing 4 NML server processes per
    node.  Due to our connection checking, I would expect 2.  Sorry, but 2
    is the minimum.  
    
    How many nodes are there in your LAVc?  I expect that it is less than
    20, but tell me if I'm wrong.
    
    How many Node4 alarms are you running?  To how many different Node4
    entities?  (When I ask about Node4, I'm referring to Node4 and all of 
    its children, such as Node4 Line, etc.)
    
    The request you're making is pretty simple, and even though the
    processes are going to be there, I don't expect them to be active for
    more than a second or two during every polling cycle.
    
    I'm also curious about how much trouble you're having with the Lines
    themselves; since this is an in-band management system, management
    requests travel over the same troubled lines as your data and are at
    just as much risk, which can cause excessive overhead.  Do you know
    anything more about the underlying problem?
    
    Your LAVc shouldn't be collapsing under the load induced by MCC; more
    information about the kinds of load it is under might help us to
    understand better.
    
    Re: .2 -- Try either slightly different polling intervals for different
    alarms, or use a command procedure to ENABLE them at, say, fifteen-
    second intervals.  Since you are running all of these rules from the
    same process on the same node, to the same remote node, skewing the
    requests a little will allow them to re-use the already started NML
    servers.  You will probably still use something more than two processes
    to service the requests, but I expect this to reduce the count to five
    or six.  When you enable all the rules at once, we get into a situation
    where the DNA4 AM is trying to start all of these connections
    concurrently.  THAT is what costs the process slots.
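
    For example, using the rule shape from .2 (node and rule names are only
    illustrative), offset each rule's polling interval a little so the
    polls drift apart instead of landing together:

    CREATE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node2 STATE = UNREACHABLE ,-
                  AT EVERY 00:05:00) ,-
                       .
                       .
    CREATE MCC 0 ALARMS RULE node3_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node3 STATE = UNREACHABLE ,-
                  AT EVERY 00:05:15) ,-
                       .
                       .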
    
    Incidentally, the first connection will take two processes, the rest
    should only require 1, unless you are doing a lot of requests to
    different Node4s in the meantime.
    
    We are looking at ways to reduce this load still more -- hopefully
    we'll be able to supply you with some options in V1.2.
    
    -Jim Carey
    
    

1400.4. "Can you solve the problem with less polling?" by DELNI::R_PAQUET  Thu Aug 29 1991 08:56

    
    
    For the reachability problem, why don't you use DECnet events to
    determine reachability, and then alarm on these events.  This will
    eliminate the polling for reachability.
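
    A sketch of what that rule might look like, using the OCCURS function
    mentioned in .1 (I haven't checked the exact event-class name against
    the DNA4 AM documentation, so treat the placeholder accordingly; node
    names follow .2):

    CREATE MCC 0 ALARMS RULE node2_REACHABILITY_EVENT -
    EXPRESSION  =(OCCURS(NODE4 node1 <node-reachability-change event>)) ,-
                       .
                       .
                       .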
    
    For the line receive failures, I'd guess that you are seeing the same
    failures, but counted by each individual system.  Rather than alarm on
    every system in the LAVc, why not just pick one as representative (like
    the boot node, as it is the most critical) for this alarm?

1400.5. "More on NML; Questions on Performance & Max Alarms allowed" by CUJO::HILL (Dan Hill-Net.Mgt.-Customer Resident) Fri Sep 06 1991 02:12

    Upon further testing, here is what I found.  3 NML_nnnn processes are
    created for every rule enabled for Phase IV nodes.  These processes
    convert to SERVER_nnnn processes after a few seconds to a minute or
    more (depending on how bogged down the target processor is).  Two of
    the SERVER_nnnn processes eventually timeout and go away leaving a
    single process.  (It is a bit more complex depending on types of rules,
    but this explanation suffices for now).
    
    If you want to use the same NML_nnnn/SERVER_nnnn process for all
    alarms, you must wait a few seconds before enabling the next alarm
    rule.  This "staggering" can be accomplished using the command
    MCC> SPAWN WAIT 00:00:15
    between each ENABLE command.
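
    For example, the enabling command procedure (run from FCL, keeping
    whatever qualifiers your ENABLE commands already use; the rule names
    here are the ones from earlier in this topic) becomes:

    ENABLE MCC 0 ALARMS RULE NOMAD_LINE_RECEIVE_FAILURE
    SPAWN WAIT 00:00:15
    ENABLE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE
    SPAWN WAIT 00:00:15
        .
        .
        .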
    ------------------------------------------------------------------------
    Enabling rules while avoiding the consumption of vast amounts of target
    node resources can be a real juggling act.  Still, after I selectively
    enabled 40+ rules to monitor everything from NODE4 LINE RECEIVE
    COUNTERS to BRIDGE SPANNING TREE changes, my VAXstation 3100/76 with
    32 MB of memory incurred a noticeable performance hit as alarms began
    to fire.
    
    >>>>> *Does anyone have any info on maximum number of rules that can be 
           enabled?  
    	  *What is the minimum polling time allowed without impacting
    	   performance?
          *What is the minimum suggested VAXstation configuration for
           monitoring 600 nodes with 3 rules each (1200 alarm rules)
    	   with polling times less than 5 minutes each?

1400.6. by NSSG::R_SPENCE (Nets don't fail me now...) Fri Sep 06 1991 15:55

    You can also put a start time on each enable command.
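
    For example (rule names are illustrative, and the exact scheduling
    syntax is from memory, so check the FCL help):

    ENABLE MCC 0 ALARMS RULE RULE_A, AT START=18:00:00
    ENABLE MCC 0 ALARMS RULE RULE_B, AT START=18:00:15

    Staggering the start times gets the same effect as the SPAWN WAIT in
    .5 without tying up your FCL session.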
    
    s/rob