Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

1400.0. "NML_nnnn processes; Node terminated link before link confirmation" by CUJO::HILL (Dan Hill - Network Management - (Customer Resident)) Tue Aug 27 1991 13:53

    We have been seeing multiple LINE RECEIVE FAILURE errors on all nodes
    in our LAVc.  To monitor this, I wrote an alarm rule for each node in
    our cluster:
    
    Example:
    CREATE MCC 0 ALARMS RULE NOMAD_LINE_RECEIVE_FAILURE -
      EXPRESSION = (CHANGE_OF(NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE,*,*) ,-
                     AT EVERY 00:03:00) ,-
      PROCEDURE  = MCC_COMMON:MCC_ALARMS_LOG_ALARMS.COM ,-
      EXCEPTION HANDLER = MCC_COMMON:MCC_ALARMS_LOG_EXCEPTION.COM ,-
      CATEGORY   = "Receive problems" ,-
      DESCRIPTION= "Line Receive Failures Detected." ,-
      QUEUE      = "MCC_ALARMS_BATCH" ,-
      PARAMETER  = "NODE_ALARMS.LOG" ,-
      PERCEIVED SEVERITY = MINOR ,-
      IN DOMAIN  = .m.seg20.MAINTENANCE
    
    I brought up DECmcc FCL and executed the command procedure to enable
    the alarms I had just created.  I also brought up the Iconic Map to
    view the alarms as they fired.
    
    Our cluster went down the tubes as I did this.  Everything ground to a
    halt as NML_nnnn processes (4 per node) kicked into gear and sucked up
    CPU time.
       Example:   NML_8232, NML_8233, NML_8235, NML_8237
    
    CRITICAL alarms fired in DECmcc on the Iconic Map window for each node.
    The alarm message was:
    
        Node terminated link before link confirmation
    
    The alarm rule that I had initially enabled was now disabled for each
    node.  I checked max links and alias max links on all nodes: 32 on all
    of them, except 64 on our cluster boot node and 128 on my network
    management node.  Exec counters showed none of the nodes maxed out.
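
    For reference, max links and alias max links appear under NCP's SHOW
    EXECUTOR CHARACTERISTICS, and the exec counters under SHOW EXECUTOR
    COUNTERS:

        NCP> SHOW EXECUTOR CHARACTERISTICS
        NCP> SHOW EXECUTOR COUNTERS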
    
    I also noticed that server processes began to drop out as connections
    were broken.
    
    When I ENABLED the rules again via the Iconic Map, two additional
    NML_nnnn processes were created on each node, further degrading system
    performance.  I had to exit DECmcc to prevent the total demise of our
    cluster.  Everything came back to life as the NML_nnnn processes timed
    out.
    
    I also noticed that the DECmcc Iconic Map window had an ACCVIO message
    in it:  access violation, reason mask=00, virtual address=00000004,
    PC=003CD716, PSL=03C00004
    
    ------------------------------
    Can someone provide a fix for this problem?
    
    -Thanks,
     Dan

1400.1. "Alarms uses the Show Directive ..." by NANOVX::ROBERTS (Keith Roberts - DECmcc Toolkit Team) Tue Aug 27 1991 16:50

The Alarms Rule Evaluator converts your Rule Expression into a Show
directive (or Getevent directive for the Occurs function).

If your rule is:

(CHANGE_OF(NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE,*,*), AT EVERY 00:03:00)

Try:

SHOW NODE4 NOMAD LINE SVA-0 RECEIVE FAILURE AT EVERY 00:03:00

This is exactly the command we generate.  I can't imagine why your
system got bogged down by the NML processes.

/keith

1400.2. "Additional NML_nnnn problems using NODE4_AM rules." by CUJO::HILL (Dan Hill-Net.Mgt.-Customer Resident) Tue Aug 27 1991 17:43

    Some additional info:
    I've created rules for additional nodes to monitor reachability of
    nodes in a specific area using the following syntax:
    
    CREATE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node2 STATE = UNREACHABLE ,-
    		  AT EVERY 00:05:00) ,-
                       .
                       .
                       .
    
    NOTE that node1 is the area router for node2.
    
    Two NML_nnnn processes were started for each end node on the area
    router (node1).  I enabled rules for 14 nodes, so the area router had
    28 NML_nnnn processes running.
    
    ---------------
    Thanks for the info in .1.  I'll continue to research this issue,
    but until I resolve it, I can't use the NODE4_AM rules I've created
    since the result is a massive system impact.
    
    -dh

1400.3. "No answers, some suggestions, some questions" by TOOK::CAREY  Wed Aug 28 1991 15:17

    
    Re: .0 -- I'm not sure why you're seeing 4 NML server processes per
    node.  Due to our connection checking, I would expect 2.  Sorry, but 2
    is the minimum.  
    
    How many nodes are there in your LAVc?  I expect that it is less than
    20, but tell me if I'm wrong.
    
    How many Node4 alarms are you running?  To how many different Node4
    entities?  (When I ask about Node4, I'm referring to Node4 and all of 
    its children, such as Node4 Line, etc.)
    
    The request you're making is pretty simple, and even though the
    processes are going to be there, I don't expect them to be active for
    more than a second or two during every polling cycle.
    
    I'm also curious about how much trouble you're having with the Lines
    themselves; since this is an in-band management system, management
    requests travel over the same troubled lines as your data and are at
    just as much risk, which can cause excessive overhead.  Do you know
    anything more about the underlying problem?
    
    Your LAVc shouldn't be collapsing under the load induced by MCC; more
    information about the kinds of load it is under might help us to
    understand better.
    
    Re: .2 -- Try either slightly different polling intervals for different
    alarms, or use a command procedure to ENABLE them at, say, fifteen-
    second intervals.  Since you are running all of these rules from the
    same process on the same node, to the same remote node, skewing the
    requests a little will allow them to re-use the already started NML
    servers.  You will probably still use something more than two processes
    to service the requests, but I expect this to reduce the count to five
    or six.  When you enable all the rules at once, we get into a situation
    where the DNA4 AM is trying to start all of these connections
    concurrently.  THAT is what costs the process slots.
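
    For example, using the rule shape from .2 (node and rule names are only
    illustrative), offset each rule's polling interval a little so the
    polls drift apart instead of landing together:

    CREATE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node2 STATE = UNREACHABLE ,-
                  AT EVERY 00:05:00) ,-
                       .
                       .
    CREATE MCC 0 ALARMS RULE node3_REMOTE_NODE_STATE -
    EXPRESSION  =(NODE4 node1 REMOTE NODE node3 STATE = UNREACHABLE ,-
                  AT EVERY 00:05:15) ,-
                       .
                       .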
    
    Incidentally, the first connection will take two processes, the rest
    should only require 1, unless you are doing a lot of requests to
    different Node4s in the meantime.
    
    We are looking at ways to reduce this load still more -- hopefully
    we'll be able to supply you with some options in V1.2.
    
    -Jim Carey
    
    

1400.4. "Can you solve the problem with less polling?" by DELNI::R_PAQUET  Thu Aug 29 1991 08:56

    
    
    For the reachability problem, why don't you use DECnet events to
    determine reachability, and then alarm on these events.  This will
    eliminate the polling for reachability.
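
    A sketch of what that rule might look like, using the OCCURS function
    mentioned in .1 (I haven't checked the exact event-class name against
    the DNA4 AM documentation, so treat the placeholder accordingly; node
    names follow .2):

    CREATE MCC 0 ALARMS RULE node2_REACHABILITY_EVENT -
    EXPRESSION  =(OCCURS(NODE4 node1 <node-reachability-change event>)) ,-
                       .
                       .
                       .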
    
    For the line receive failures, I'd guess that you are seeing the same
    failures, but counted by each individual system.  Rather than alarm on
    every system in the LAVc, why not just pick one as representative (like
    the boot node, as it is the most critical) for this alarm?

1400.5. "More on NML; Questions on Performance & Max Alarms allowed" by CUJO::HILL (Dan Hill-Net.Mgt.-Customer Resident) Fri Sep 06 1991 02:12

    Upon further testing, here is what I found.  3 NML_nnnn processes are
    created for every rule enabled for Phase IV nodes.  These processes
    convert to SERVER_nnnn processes after a few seconds to a minute or
    more (depending on how bogged down the target processor is).  Two of
    the SERVER_nnnn processes eventually timeout and go away leaving a
    single process.  (It is a bit more complex depending on types of rules,
    but this explanation suffices for now).
    
    If you want to use the same NML_nnnn/SERVER_nnnn process for all
    alarms, you must wait a few seconds before enabling the next alarm
    rule.  This "staggering" can be accomplished using the command
    MCC> SPAWN WAIT 00:00:15
    between each ENABLE command.
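
    For example, the enabling command procedure (run from FCL, keeping
    whatever qualifiers your ENABLE commands already use; the rule names
    here are the ones from earlier in this topic) becomes:

    ENABLE MCC 0 ALARMS RULE NOMAD_LINE_RECEIVE_FAILURE
    SPAWN WAIT 00:00:15
    ENABLE MCC 0 ALARMS RULE node2_REMOTE_NODE_STATE
    SPAWN WAIT 00:00:15
        .
        .
        .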
    ------------------------------------------------------------------------
    Enabling rules while avoiding the consumption of vast amounts of target
    node resources can be a real juggling act.  Still, after I selectively
    enabled 40+ rules to monitor everything from NODE4 LINE RECEIVE
    COUNTERS to BRIDGE SPANNING TREE changes, my VAXstation 3100/76 with
    32 MB of memory incurred a noticeable performance hit as alarms began
    to fire.
    
    >>>>> *Does anyone have any info on maximum number of rules that can be 
           enabled?  
    	  *What is the minimum polling time allowed without impacting
    	   performance?
          *What is the minimum suggested VAXstation configuration for
           monitoring 600 nodes with 3 rules each (1200 alarm rules)
    	   with polling times less than 5 minutes each?

1400.6. by NSSG::R_SPENCE (Nets don't fail me now...) Fri Sep 06 1991 15:55

    You can also put a start time on each enable command.
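
    For example (rule names are illustrative, and the exact scheduling
    syntax is from memory, so check the FCL help):

    ENABLE MCC 0 ALARMS RULE RULE_A, AT START=18:00:00
    ENABLE MCC 0 ALARMS RULE RULE_B, AT START=18:00:15

    Staggering the start times gets the same effect as the SPAWN WAIT in
    .5 without tying up your FCL session.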
    
    s/rob