
Conference azur::mcc

Title:DECmcc user notes file. Does not replace IPMT.
Notice:Use IPMT for problems. Newsletter location in note 6187
Moderator:TAEC::BEROUD
Created:Mon Aug 21 1989
Last Modified:Wed Jun 04 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:6497
Total number of notes:27359

681.0. "alarms in .com's" by JETSAM::WOODCOCK () Thu Jan 31 1991 11:53

Hi,

I had submitted a QAR regarding the startup of rules via com procedures. I
believe it is #196 in the new database. Because it was resubmitted into the
new database by someone other than myself I can't update it, hence this note.
I had read a reply stating that work had been done on the DECnet AM and on
the internal timings of MCC, and that improvement was expected. There is no
improvement; in fact, it looks worse.

I had sent Jim C. an update yesterday describing what I was seeing while
starting event alarms via com procedures and monitoring the icons, but I now
believe the situation is worse than I indicated to him. Further investigation
using FCL points to some disturbing results, considering that most users will
probably use procedures to start their monitoring and/or historical data
collection processes.

The log below shows a procedure starting up alarms and then taking a snapshot
of the status of those alarms. Seven of the eleven alarms started with an
error! Considering this is only 10% (or less) of what we'll be monitoring, my
confidence level is a little low. The error also appears to indicate the LAST
ERROR and not the LAST POLL ERROR (please verify this thought). If so, I can
NEVER tell whether this problem persists or goes away after startup, because
once there is an error MCC continues to display it forever.
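
For reference, the startup procedure amounts to something like this (a
minimal sketch only; the actual .com files aren't reproduced here, and the
snapshot step is just inferred from the start_poll_demon output below):

    $ ! start the rules, then snapshot their status (hypothetical sketch)
    $ MANAGE/ENTERPRISE
    enable mcc 0 alarms rule BBPK01_SYN-0, in domain .pko-24
    enable mcc 0 alarms rule BBPK01_SYN-1, in domain .pko-24
    ! ...one enable per rule, as in the log below...
    show mcc 0 alarms rule BBPK01_SYN-0 all status, in domain .pko-24
    exit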

Along with the Internal DECnet errors you will see No Such Entity errors.
This is a new problem. When these rules are started manually they all work 
fine. 

Observation: The last time we looked at this, the router was thought to be
the *main* culprit because it handles only one NML link. But this log also
shows the first rule to a node failing, indicating there may also be a
problem with MCC's internal timing of these commands.

regards,
brad...


-----------------------------------------------------------------------------
$ manage/enterprise
DECmcc (V1.1.0)

!
enable mcc 0 alarms rule BBPK01_SYN-0, in domain .pko-24

MCC 0 ALARMS RULE BBPK01_SYN-0 
AT 31-JAN-1991 10:47:14 

Normal operation has begun.
!
enable mcc 0 alarms rule BBPK01_SYN-1, in domain .pko-24

MCC 0 ALARMS RULE BBPK01_SYN-1 
AT 31-JAN-1991 10:47:17 

Normal operation has begun.
!at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK02_SYN-0, in domain .pko-24

MCC 0 ALARMS RULE BBPK02_SYN-0 
AT 31-JAN-1991 10:47:22 

Normal operation has begun.
!
enable mcc 0 alarms rule BBPK02_SYN-1, in domain .pko-24

MCC 0 ALARMS RULE BBPK02_SYN-1 
AT 31-JAN-1991 10:47:23 

Normal operation has begun.
!at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK02_SYN-2, in domain .pko-24

MCC 0 ALARMS RULE BBPK02_SYN-2 
AT 31-JAN-1991 10:47:24 

Normal operation has begun.
! at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK03_SYN-0, in domain .pko-24

MCC 0 ALARMS RULE BBPK03_SYN-0 
AT 31-JAN-1991 10:47:25 

Normal operation has begun.
!
enable mcc 0 alarms rule BBPK03_SYN-1, in domain .pko-24

MCC 0 ALARMS RULE BBPK03_SYN-1 
AT 31-JAN-1991 10:47:26 

Normal operation has begun.
!at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK03_SYN-2, in domain .pko-24

MCC 0 ALARMS RULE BBPK03_SYN-2 
AT 31-JAN-1991 10:47:28 

Normal operation has begun.
!at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK04_SYN-0, in domain .pko-24

MCC 0 ALARMS RULE BBPK04_SYN-0 
AT 31-JAN-1991 10:47:29 

Normal operation has begun.
!
enable mcc 0 alarms rule BBPK04_SYN-1, in domain .pko-24

MCC 0 ALARMS RULE BBPK04_SYN-1 
AT 31-JAN-1991 10:47:30 

Normal operation has begun.
!at start=(+00:01:30)
!
enable mcc 0 alarms rule BBPK04_SYN-2, in domain .pko-24

MCC 0 ALARMS RULE BBPK04_SYN-2 
AT 31-JAN-1991 10:47:30 

Normal operation has begun.
!at start=(+00:01:30)
!
@[alarms.com]start_poll_demon

Domain NOCMAN_NS:.pko-24 Rule BBPK01_SYN-0 
AT 31-JAN-1991 10:47:32 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:22.23
              Result of Last Evaluation = Error
                        Error Condition = "Internal error in DECnet Phase IV 
                                          AM. 
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK01_SYN-1 
AT 31-JAN-1991 10:47:34 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:25.38
              Result of Last Evaluation = False

Domain NOCMAN_NS:.pko-24 Rule BBPK02_SYN-0 
AT 31-JAN-1991 10:47:35 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:26.41
              Result of Last Evaluation = False

Domain NOCMAN_NS:.pko-24 Rule BBPK02_SYN-1 
AT 31-JAN-1991 10:47:35 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:24.19
              Result of Last Evaluation = Error
                        Error Condition = "Internal error in DECnet Phase IV 
                                          AM. 
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK02_SYN-2 
AT 31-JAN-1991 10:47:36 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:27.08
              Result of Last Evaluation = Error
                        Error Condition = "No such entity: Node4 BBPK02 Circuit 
                                          SYN-2  
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK03_SYN-0 
AT 31-JAN-1991 10:47:37 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:28.95
              Result of Last Evaluation = False

Domain NOCMAN_NS:.pko-24 Rule BBPK03_SYN-1 
AT 31-JAN-1991 10:47:38 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:27.25
              Result of Last Evaluation = Error
                        Error Condition = "Internal error in DECnet Phase IV 
                                          AM. 
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK03_SYN-2 
AT 31-JAN-1991 10:47:39 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:33.05
              Result of Last Evaluation = Error
                        Error Condition = "Internal error in DECnet Phase IV 
                                          AM. 
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK04_SYN-0 
AT 31-JAN-1991 10:47:39 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:31.09
              Result of Last Evaluation = Error
                        Error Condition = "No such entity: Node4 BBPK04 Circuit 
                                          SYN-0  
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK04_SYN-1 
AT 31-JAN-1991 10:47:40 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:31.36
              Result of Last Evaluation = Error
                        Error Condition = "Internal error in DECnet Phase IV 
                                          AM. 
                                          "

Domain NOCMAN_NS:.pko-24 Rule BBPK04_SYN-2 
AT 31-JAN-1991 10:47:41 Status

Examination of attributes shows:
                                  State = Enabled
                               Substate = Running
                Time of Last Evaluation = 31-JAN-1991 10:47:34.28
              Result of Last Evaluation = False

681.1. "First answer -- see result of last evaluation" by TOOK::ORENSTEIN () Thu Jan 31 1991 12:56
    Brad,
    
    This will answer one of your questions:
    
    >>>The error also indicates the LAST ERROR and not the LAST POLL ERROR 
    >>> (please verify this thought). Therefore I can NEVER tell if this 
    >>> problem persists or goes away after the startup because once there 
    >>> is an error MCC continues to display this error forever.
    
    The "error condition" dispays the last error encountered.
    The "result of last evaluation" will be TRUE, FALSE or ERROR.
    
    The latter will be the indicate if the problem has gone away.
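    
    In FCL you can watch this by re-showing the rule's status and checking
    the "Result of Last Evaluation" field each time (a hypothetical
    example; the rule and domain names are borrowed from .0):
    
        show mcc 0 alarms rule BBPK01_SYN-0 all status, in domain .pko-24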
    
    aud...                                  
681.2. "thanks, i'll monitor more" by JETSAM::WOODCOCK () Thu Jan 31 1991 13:51
    The "error condition" dispays the last error encountered.
    The "result of last evaluation" will be TRUE, FALSE or ERROR.
    
> Thanks for pointing out the obvious which I had overlooked.
> I will retry this procedure and have the status monitored at
> relatively close intervals and report back as to whether this
> is a startup or a continuous problem as soon as I get a chance.
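
Something like the following ought to do for the close-interval checks (a
rough sketch only; the rule and domain names are from .0):

    $ ! watch_rules.com -- re-check alarm rule status every five minutes
    $ LOOP:
    $ MANAGE/ENTERPRISE
    show mcc 0 alarms rule BBPK01_SYN-0 all status, in domain .pko-24
    exit
    $ WAIT 00:05:00
    $ GOTO LOOP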

brad...
681.3. "DNA4_AM changes should help" by TOOK::CAREY () Fri Feb 01 1991 12:06
    
    Brad,
    
    You've sent this to me in mail as well, and I replied to your mail this
    morning.
    
    For Event Alarms, I expect you will see better behavior with a new AM
    we'll make available to you as soon as we've looked it over.  Part of
    our standard validation is to check out the node instance and verify
    that it isn't a cluster alias.  We were doing this even for event
    alarms, and if we had trouble connecting, we were almost guaranteed
    to return an "internal error in DECnet Phase IV."
    
    It doesn't make sense for us to access the specified node AT ALL for
    a GETEVENT request.  The Node4 need not even be up on the network when
    an event request is put out against it, and it certainly isn't
    important that the person collecting events be able to access the node.
    
    So, we've modified our validation to skip this node4 verification on 
    GETEVENT requests.  If we can figure out the DECnet address of a
    specified Node4, we will allow a request for event information to be
    made outstanding against it.  That should take care of that set of 
    problems (assuming that these were event alarms).
    
    Were the "no such entity" errors also on event alarms?  I haven't seen
    anything like this, and have no idea what is going on.  I'd like to
    see the alarms rules you were using.  We'll see what we can do in this
    area.
    
    As soon as possible, I'll get you a new AM with this fix in it so that
    we can check it out.
    
    I just reviewed QAR 196 (yep, that's the one) to make sure we were 
    talking about the same QAR.  This QAR is about a change-of rule on
    circuit substate.
    
    First, while you may use change-of and occurs rules for the same
    purpose, they cause significantly different activity in DECmcc.
    
    Change-of polls the node4 in question.  Occurs just hangs around
    waiting for event delivery.
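    
    Roughly, the two flavors are set up like this (a from-memory sketch of
    the rule syntax, not exact; the entity names are borrowed from the
    rules in .0):
    
        ! polled: compare the circuit substate on each 10-minute pass
        create mcc 0 alarms rule BBPK01_SYN-0, in domain .pko-24, -
            expression = (change_of(node4 bbpk01 circuit syn-0 substate), -
                          at every 00:10:00)
    
        ! event-driven: just wait for the event to be delivered
        create mcc 0 alarms rule BBPK01_SYN-0_EV, in domain .pko-24, -
            expression = (occurs(node4 bbpk01 circuit syn-0 circuit down))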
    
    The problem still stands that alarm rules aimed at the same DECrouter
    and enabled chronologically "close" together might collide with each
    other and cause one to fail.  Where the IFT kit was prone to return
    "internal errors", EFT and later kits are a lot better about returning
    the connection failure as an "insufficient resources" error.  For V1.1,
    the only thing you can do is separate the requests by a few seconds
    (say 30-60 for a loaded network).  We're considering getting smarter
    about this for V1.2.
    
    There is some latency in busy DECrouters when cleaning up links that
    has caused us some problems, even after the EFT DNA4_AM learned to
    manage single-link systems.  It seems that DECrouters consider routing
    their main job, and just do management when there is "spare" time
    available.  I can't say that I disagree with that philosophy, so we're
    doing our best to be accommodating.
    
    Recent changes to the AM that should affect your change-of
    capabilities:
    
    	- Caching of node information to significantly reduce link
    	  overhead, especially to commonly referenced systems.
    
    	- Enhancements to the single link code, especially an ugly little
    	  error in a low level routine that could cause hundreds of
    	  unnecessary links to be created and dropped.  That little logic 
    	  error significantly increased the odds of hitting the DECrouter
    	  latency problem, and often impacted Circuit Status requests.
    
    Before that, the EFT kit included single-link management logic that
    should significantly increase your chances of being successful with the
    change-of function anyway.
    
    I'll get you a new AM as soon as I can, and we'll see if it alleviates
    these errors.  I think you will find that it does.
    
    We'll also be able to look at the no such entity error with you, to see
    where it is originating.  Because we manage GETEVENT differently with
    this AM, we may not encounter this problem (whatever it was) with the
    revised logic.
    
    So, here's the deal:
    
    	- you get me your alarm rules.
    	- I'll get you the latest and greatest DNA4_AM to try out.
    
    Thanks for your time,
    
    -Jim Carey
    
681.4. "ready when you are" by JETSAM::WOODCOCK () Fri Feb 01 1991 15:02
Hi Jim,

I'm ready for further testing whenever you are. I will send you a couple
of com files which create the alarms for your scrutiny (both change_of and
occurs). I'm glad to see attention on this one. Let me know when you have
a build available. This is on the top of the list.

thanks,
brad...    

681.5. "One down, one deferred...." by TOOK::CAREY () Mon Feb 04 1991 18:34
    
    Brad,
    
    The "occurs" problems are all cleaned up.  Thanks for checking that
    out for us.  You can collect events from those DECrouters to your
    heart's content, and I think that is the monitoring method of choice
    for two reasons:
    
    	- Less CPU and network overhead watching for network trouble so
    	  the network gets to do more of what it was made for, moving data.
    
    	- You won't have a large window between polling cycles where
    	  you can entirely miss critical problems because DECnet is
    	  working so hard to get around them.
    
    For the "change-of" function, it looks like we'll have to settle for
    a workaround.  When enabling those alarm rules to poll every ten
    minutes, start them up about thirty seconds apart.  You indicated that
    you'll actually be starting them up about one and a half minutes apart
    to avoid race conditions as much as possible.
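    
    In procedure form that is just a pause between enables (a minimal
    sketch using a plain DCL WAIT; rule names from .0).  Within a single
    FCL session, the at start=(+00:00:30) style of deferral seen commented
    out in the .0 log should space things out the same way:
    
        $ MANAGE/ENTERPRISE
        enable mcc 0 alarms rule BBPK01_SYN-0, in domain .pko-24
        exit
        $ WAIT 00:00:30
        $ MANAGE/ENTERPRISE
        enable mcc 0 alarms rule BBPK01_SYN-1, in domain .pko-24
        exit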
    
    If there is an existing link to the DECrouter, we are good at
    recognizing and reporting that.  However, you are setting up a
    condition wherein up to four requests are initiated at the same time
    against the poor old single-link DECrouter.
    
    It looks like the DECrouter gets pretty well overwhelmed under these
    conditions, and the DECnet AM isn't very pretty handling the errors
    when they are reported.  Debugging race conditions like this is
    challenging, and the impact on critical code is high.
    
    For V1.1, you'll see a restriction in the release notes that you should
    be careful polling single-link systems (like the whole DECrouter
    family), and that you should space out polled requests just as you must
    single-thread other activity to these limited systems.
    
    For V1.2, we'd like to enhance the cache we've already built to allow
    us to perform some link-queueing so that we can synchronize multiple
    threads requesting service from the same limited system, and make this
    a little more transparent to you.
    
    Thanks for jumping right into this one, and turning around your testing
    so quickly for us.
    
    -Jim Carey
    
    
681.6. "a bit more" by JETSAM::WOODCOCK () Mon Feb 04 1991 19:15
Jim,
    
>    	- Less CPU and network overhead watching for network trouble so
>    	  the network gets to do more of what it was made for, moving data.

Definitely preferred...
    
>    	- You won't have a large window between polling cycles where
>    	  you can entirely miss critical problems because DECnet is
>    	  working so hard to get around them.
 
There are cases where this method works out well, for example during off
hours when "we" can't actually watch events in real time anyway. Change_of
is convenient because it lets us know when a circuit went down and when it
came back, at reasonable intervals. If we used occurs we'd receive mail
every time a circuit bounced. Although I'll probably also look into having
MCC examine the events only at intervals, and see how that works out.
    
>    It looks like the DECrouter gets pretty well overwhelmed under these
>    conditions, and the DECnet AM isn't very pretty handling the errors
>    when they are reported.  Debugging race conditions like this is
>    challenging, and the impact on critical code is high.
    
If I remember right there were instances where the "FIRST" rule to a router
also failed. This makes me wonder whether both were a bit overwhelmed,
router and MCC (especially with that nasty "unknown entity" error). By
sticking in the delays we not only allow the router to catch up, but MCC
is also able to kick back. With one procedure all is well, but I'm curious
to see what happens when I fire up 10 of these procedures at once for
across-the-network monitoring.

brad...

ps. Your new AM "WAS" noticeably quicker when using IMPM child lookups and
    commands [thanks, :-)]
681.7. "further testing" by JETSAM::WOODCOCK () Tue Feb 05 1991 15:34
Hi again,

I have set up 10 procedures similar to the one we just worked through. The
enable commands for any particular router (multiple circuits) were delayed by
1 minute each. I submitted all 10 of these as separate jobs through a
procedure which starts them one after another. There are probably a total of
70 or so rules. Checking the log files after startup shows a failure on 7
alarms (10%), including one "no such entity"; the rest are "internal DECnet
AM" errors. It seems MCC couldn't keep up. Do you think I have reached the
limitations of my platform (a 3520 with 32 MB) or of MCC? Would putting small
delays into the jobs being submitted help? It seems to me MCC should be able
to handle this load, although I'm not sure how many polls the system itself
can handle.
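
The submitting procedure is essentially this (a sketch; the file names are
made up):

    $ ! submit the ten per-batch alarm procedures one after another
    $ COUNT = 1
    $ SUBMIT_LOOP:
    $ SUBMIT/QUEUE=SYS$BATCH ALARMS_BATCH'COUNT'.COM
    $ ! WAIT 00:00:30    <- the small delay I could add between submits
    $ COUNT = COUNT + 1
    $ IF COUNT .LE. 10 THEN GOTO SUBMIT_LOOP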

any ideas?
brad
681.8. "Maybe DNS is the bottleneck?" by TOOK::GUERTIN (E = mcc) Tue Feb 05 1991 20:27
    Brad,
    
    Seems to me that MCC per se should be able to handle this load without
    "breaking a sweat".  I'm guessing that the bottleneck is somewhere
    else.  Perhaps DNS?
    
    Is it possible to do testing on something that doesn't use DNS so much?
    For example, use the name or address of a bridge which is registered
    (the Bridge AM caches the information after the initial DNS call).  I'm
    not sure whether the DNA4 AM calls DNS when you use just DNA4
    addresses, but I seem to recall Jim Carey saying that it does not.  So
    you may want to try using DNA4 addresses as well.  If MCC suddenly
    "perks up", then the problem is probably overloading of the DNS server.
    
    By the way, is the DNS server on the same node?
    
    We expect to do more thorough performance and endurance testing (and
    tuning) in the near future (I hope).
    
    -Matt.
681.9. "dns is local" by JETSAM::WOODCOCK () Wed Feb 06 1991 09:49
Matt,

DNS is on the same system. Also, I don't have the environment to do
this testing with bridges. As far as whether this is a DNS or MCC
problem goes, they are one and the same from my standpoint.

I have conversed via mail with Jim recently, and at this point I hope at
least to identify where the breakdown occurs so I can better understand
where the limitations are. I think I may be breaking some new ground here
on poller performance (but I'm not certain). We have tools which have in
the past polled up to 350 nodes. Those polls are broken into 8 batches,
but I'm not sure how the timing is done at startup. Once started they poll
at an unrestricted rate, asynchronously. What I have set up in MCC, by
contrast, starts these polls in 10 batches, synchronously, at a restricted
rate of one poll every 10 minutes.

Hopefully at some point we can dedicate some time to getting a better
understanding of the performance limitations and *what* the limiting
factor is. For now I'll tweak a bit more until it works satisfactorily,
until we move to an event-based solution. Addresses may be a good place to
start.

brad...

681.10. "DNS hasn't been a suspect. Everything else thoug...." by TOOK::CAREY () Wed Feb 06 1991 11:19
    
    Brad and I have been trying to work the issues surrounding this problem
    for a while.  The DECnet AM will only try the namespace if it can't
    resolve the node name using the local database.  Using the address
    directly guarantees that we don't use the namespace, but I don't expect
    that DNS is involved in this problem at all, even when the regular node
    name is used.
    
    Brad, you might ensure that that is the case by verifying that the
    nodes you are using are defined in your local DECnet database.  Check 
    for them as Remote Node entities on your local MCC node.
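    
    A quick way to check from DCL, using standard NCP commands:
    
        $ RUN SYS$SYSTEM:NCP
        NCP> SHOW NODE BBPK01 SUMMARY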
    
    I think DNS is probably not involved.
    
    Some of our prime concerns:
    
    - Four nearly simultaneous link requests from a single process to
    DECnet.  Likely, all four are doing a connect initiate at the same
    time, not something that DECnet sees every day.
    
    - Four nearly simultaneous link requests for a DECrouter.  These
    systems only allow a single management link, and are good at rejecting
    attempts once a link is up.  Do they get in trouble if bombarded with
    many requests while trying to set up context?
    
    - Four threads in the DECnet AM all doing the same thing at the same
    time.  We think we have reentrant code.  It LOOKS reentrant.
    
    Mitigating any or all of these possibilities for V1.1 will be difficult.
    
    We know we can alleviate the strain by using prudence in setting up the
    rules.  This may be our best answer for V1.1 given the risk of major
    changes at this stage of development.
    
    -Jim Carey
    
    
    
681.11. "FYI: DECnet-VAX probably not the problem" by MARVIN::COBB (Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917) Mon Feb 11 1991 05:54
>     - Four nearly simultaneous link requests from a single process to
>     DECnet.  Likely, all four are doing a connect initiate at the same
>     time, not something that DECnet sees every day.

VAX P.S.I.  does this all the time.  Particularly during our stress testing:
we  connect  and  disconnect a *lot* of links (at least a couple of hundred)
from the same process very frequently.  We have never seen any problems with
DECnet-VAX when doing this (lots of problems in our code, however!).

Graham
681.12. by SUBURB::SMYTHI (Ian Smyth 830-3869) Tue Feb 12 1991 08:11
    
>    - Four nearly simultaneous link requests for a DECrouter.  These
>    systems only allow a single management link, and are good at rejecting
>    attempts once a link is up.  Do they get in trouble if bombarded with
>    many requests while trying to set up context?


	I don't understand this. I can get multiple NML links 
up to any DECrouter 2000. I thought that the single management link
was only a restriction on the DECSA box.    
    

Ian
681.13. "Thanks for your ideas, we're making some headway too...." by TOOK::CAREY () Tue Feb 12 1991 09:52
    
    Re: -2 -- thanks for the information about the DEMSA box.  In that
    case, "Sure, the DECmcc DNA4 AM supports 'em all."
    
    Re: -1 -- my mistake.  Of all of the DECrouter boxes, the DECrouter
    2000 is the only one I've encountered that does support multiple links.
    The others very happily reject additional requests.  In my aching head,
    the DECrouters all tend to get lumped together.  I'll try to do better.
    Specifically, the DECrouter 200, 250, and DECSAs all support only a
    single management link.  The DECrouter 2000 supports many.
    
    FYI - it was testing against a DECrouter 250 that showed us the router
    might need up to a second to recover its context after the management
    link is dropped.
    
    I'm also glad to hear that I should discount the possibility that we're
    having a problem interacting with DECnet.
    
    Finally, we've made some headway against what we're doing wrong.  We
    use the phase4address for our transactions whenever possible, instead
    of leaving DECnet to perform the translation.  This allows us to fall
    back to resolving the address from the DNS namespace if the DECnet
    database doesn't know it, and gets the address into our operations
    early, to maximize our ability to compare response entities and
    maintain consistency through MCC.  On VMS, we try to resolve the
    address against the local database using NMLSHR, to save time and to
    simplify the process and user requirements of DECmcc.
    
    Some interaction of these requests during our local processing is where
    the damage seems to occur.  We haven't successfully reproduced it
    anywhere except on one 3520 dual-processor machine, but on that machine
    we have more trouble if the rules are polling locally than if we go
    remote.  Shutting off the NMLSHR back door gives us consistently good
    results.  It could be that our NMLSHR queueing isn't completely
    reentrant, or it could be more directly related to the hardware
    platform we're on.  Anyway, we're pretty confident that we've
    eliminated about 90% of the possibilities.
    
    I'll post it here when we have some real results.
    
    Thanks for your inputs.
    
    -Jim Carey