T.R | Title | User | Personal Name | Date | Lines |
--------------------------------------------------------------------------------
681.1 | First answer -- see result of last evaluation | TOOK::ORENSTEIN | | Thu Jan 31 1991 12:56 | 15 |
| Brad,
This will answer one of your questions:
>>>The error also indicates the LAST ERROR and not the LAST POLL ERROR
>>> (please verify this thought). Therefore I can NEVER tell if this
>>> problem persists or goes away after the startup because once there
>>> is an error MCC continues to display this error forever.
The "error condition" dispays the last error encountered.
The "result of last evaluation" will be TRUE, FALSE or ERROR.
The latter will be the indicate if the problem has gone away.
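For example, you could check the rule from FCL every so often with something
along these lines (just a sketch -- the rule name is made up, and the exact
attribute spelling may differ slightly on your kit):

    ! Show the rule's status attributes, including Result of Last
    ! Evaluation and Error Condition.
    SHOW MCC 0 ALARMS RULE brad_ckt_rule ALL STATUS

If Result of Last Evaluation has gone back to TRUE or FALSE, the startup
errors have cleared; if it still shows ERROR, the problem is persisting.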
aud...
|
681.2 | thanks, i'll monitor more | JETSAM::WOODCOCK | | Thu Jan 31 1991 13:51 | 9 |
| The "error condition" dispays the last error encountered.
The "result of last evaluation" will be TRUE, FALSE or ERROR.
> Thanks for pointing out the obvious which I had overlooked.
> I will retry this procedure and have the status monitored at
> relatively close intervals and report back as to whether this
> is a startup or a continuous problem as soon as I get a chance.
> brad...
|
681.3 | DNA4_AM changes should help | TOOK::CAREY | | Fri Feb 01 1991 12:06 | 90 |
|
Brad,
You've sent this to me in mail as well, and I replied to your mail this
morning.
For Event Alarms, I expect you will see better behavior with a new AM
we'll make available to you as soon as we've looked it over. Part of
our standard validation is to check out the node instance and verify
that it isn't a cluster alias. We were doing this even for event
alarms, and if we had trouble connecting, we were almost guaranteed
to return an "internal error in DECnet Phase IV."
It doesn't make sense for us to access the specified node AT ALL for
a GETEVENT request. The Node4 need not even be up on the network when
an event request is put out against it, and it certainly isn't
important that the person collecting events be able to access the node.
So, we've modified our validation to skip this node4 verification on
GETEVENT requests. If we can figure out the DECnet address of a
specified Node4, we will allow a request for event information to be
made outstanding against it. That should take care of that set of
problems (assuming that these were event alarms).
Were the "no such entity" errors also on event alarms? I haven't seen
anything like this, and have no idea what is going on. I'd like to
see the alarms rules you were using. We'll see what we can do in this
area.
As soon as possible, I'll get you a new AM with this fix in it so that
we can check it out.
I just reviewed QAR 196 (yep, that's the one) to make sure we were
talking about the same QAR. This QAR is about a change-of rule on
circuit substate.
First, while change-of and occurs rules may be used for the same
purpose by you, they cause significantly different activity in DECmcc.
Change-of polls the node4 in question. Occurs just hangs around
waiting for event delivery.
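In rule-expression terms, the difference looks roughly like this (a sketch
from memory -- the entity names are made up, and the exact expression syntax
may differ from what's in your .com files):

    ! change-of: a polled rule.  MCC opens a link to the node4 at each
    ! interval and compares the attribute with its previous value.
    CHANGE_OF(NODE4 rtr_a CIRCUIT syn-0 SUBSTATE), AT EVERY 00:10:00

    ! occurs: an event rule.  No polling; the rule just waits for the
    ! event to arrive.
    OCCURS(NODE4 rtr_a CIRCUIT syn-0 CIRCUIT DOWN)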
The problem still stands that alarm rules that poll a DECrouter at nearly
the same time might collide with each other and cause one to fail. Where
the IFT kit was prone to return "internal errors", EFT and later kits
are a lot better about returning the connection failure as an
"insufficent resources" error. For V1.1, the only thing you can do is
separate the requests by a few seconds (say 30-60 for a loaded
network). We're considering getting smarter about this for V1.2.
There is some latency in busy DECrouters when cleaning up links that
has caused us some problems, even after the EFT DNA4_AM learned to
manage single link systems. It seems that DECrouters consider routing
their main job, and just do management when there is "spare" time
available. I can't say that I disagree with that philosophy, so we're
doing our best to be accommodating.
Recent changes to the AM that should improve your change-of
capabilities:
- Caching of node information to significantly reduce link
overhead, especially to commonly referenced systems.
- Enhancements to the single link code, especially an ugly little
error in a low level routine that could cause hundreds of
unnecessary links to be created and dropped. That little logic
error significantly increased the odds of hitting the DECrouter
latency problem, and often impacted Circuit Status requests.
Before that, the EFT kit included single-link management logic that
should significantly increase your chances of being successful with the
change-of function anyway.
I'll get you a new AM as soon as I can, and we'll see if it alleviates
these errors. I think you will find that it does.
We'll also be able to look at the no such entity error with you, to see
where it is originating. Because we manage GETEVENT differently with
this AM, we may not encounter this problem (whatever it was) with the
revised logic.
So, here's the deal:
- you get me your alarm rules.
- I'll get you the latest and greatest DNA4_AM to try out.
Thanks for your time,
-Jim Carey
|
681.4 | ready when you are | JETSAM::WOODCOCK | | Fri Feb 01 1991 15:02 | 10 |
| Hi Jim,
I'm ready for further testing whenever you are. I will send you a couple
of com files which create the alarms for your scrutiny (both change_of and
occurs). I'm glad to see attention on this one. Let me know when you have
a build available. This is on the top of the list.
thanks,
brad...
|
681.5 | One down, one deferred.... | TOOK::CAREY | | Mon Feb 04 1991 18:34 | 48 |
|
Brad,
The "occurs" problems are all cleaned up. Thanks for checking that
out for us. You can collect events from those DECrouters to your
heart's content, and I think that is the monitoring method of choice
for two reasons:
- Less CPU and network overhead watching for network trouble so
the network gets to do more of what it was made for, moving data.
- You won't have a large window between polling cycles where
you can entirely miss critical problems because DECnet is
working so hard to get around them.
For the "change-of" function, it looks like we'll have to settle for
a workaround. When enabling those alarms rules to poll every ten
minutes, start them up about thirty seconds apart. You indicated that
you'll actually be starting them up about one and a half minutes apart
to avoid race conditions as much as possible.
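From a single command file, the staggering is just a matter of sprinkling
waits between the enables, along these lines (a sketch only -- the
ENABLE_RTR_*.COM wrappers and the MCC_RULES logical are made up; they stand
in for whatever procedures you use to issue the actual ENABLE directives):

    $! Stagger the enables so a single-link DECrouter only ever sees
    $! one connect attempt at a time.
    $ @MCC_RULES:ENABLE_RTR_A.COM     ! change-of rules for the first router
    $ WAIT 00:01:30                   ! ~90 seconds of breathing room
    $ @MCC_RULES:ENABLE_RTR_B.COM
    $ WAIT 00:01:30
    $ @MCC_RULES:ENABLE_RTR_C.COM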
If there is an existing link to the DECrouter, we are good at
recognizing and reporting that. However, you are setting up a
condition wherein up to four requests are being initiated to the poor
old single link DECrouter at the same time.
It looks like the DECrouter gets pretty well overwhelmed under these
conditions, and the DECnet AM isn't very pretty handling the errors
when they are reported. Debugging race conditions like this is
challenging, and the impact on critical code is high.
For V1.1, you'll see a restriction in the release notes that you should
be careful polling single link systems (like the whole DECrouter
family), and to space out polled requests just as you must single-link any
other activity to these limited systems.
For V1.2, we'd like to enhance the cache we've already built to allow
us to perform some link-queueing so that we can synchronize multiple
threads requesting service from the same limited system, and make this
a little more transparent to you.
Thanks for jumping right into this one, and turning around your testing
so quickly for us.
-Jim Carey
This will help you get even further away from the hardware.
|
681.6 | a bit more | JETSAM::WOODCOCK | | Mon Feb 04 1991 19:15 | 36 |
| Jim,
> - Less CPU and network overhead watching for network trouble so
> the network gets to do more of what it was made for, moving data.
Definitely preferred...
> - You won't have a large window between polling cycles where
> you can entirely miss critical problems because DECnet is
> working so hard to get around them.
There are cases where this method works out well, during off hours for
example, when "we" can't actually watch events in real time anyway. The
change_of is convenient because it lets us know when a circuit went down
and when it came back at reasonable intervals. If we use occurs
we'd receive mail every time a circuit bounced. Although I'll probably also
look into having MCC look at the events only at intervals, and
see how that works out.
> It looks like the DECrouter gets pretty well overwhelmed under these
> conditions, and the DECnet AM isn't very pretty handling the errors
> when they are reported. Debugging race conditions like this is
> challenging, and the impact on critical code is high.
If I remember right there were instances where the "FIRST" rule to a router
also failed. This makes me wonder whether both were a bit overwhelmed,
router and MCC (especially with that nasty "unknown entity" error). By
sticking in the delays we not only have allowed the router to catch up, but
MCC is also able to kick back. With one procedure all is well, but I'm curious
to see what happens when I fire up 10 of these procedures at once for
across-the-network monitoring.
brad...
ps. Your new AM "WAS" noticeably quicker when using IMPM child lookups and
commands [thanks, :-)]
|
681.7 | further testing | JETSAM::WOODCOCK | | Tue Feb 05 1991 15:34 | 15 |
| Hi again,
I have set up 10 procedures similar to the one we just worked thru. The enable
commands for any particular router (multiple circuits) were delayed by 1
minute each. I submitted all 10 of these as separate jobs thru a procedure
which starts them one after another. There are probably a total of 70 or so
rules. Checking the log files after startup shows a failure on 7 alarms (10%)
including one "no such entity" and the rest are "internal decnet am errors".
It seems MCC couldn't keep up. Do you think I have reached the limitations
of my platform (3520 w32M) or MCC? Would putting small delays into the jobs
being submitted help? Seems to me MCC should be able to handle this load.
Although I'm not sure how many polls the system itself can handle.
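For reference, the master procedure that kicks them off is nothing fancy,
roughly like this (file and queue names made up):

    $! Submit the ten alarm procedures one after another.  Each
    $! ALARMS_n.COM enables its own rules with one-minute delays
    $! between routers, as described above.
    $ count = 1
    $ loop:
    $   SUBMIT/QUEUE=SYS$BATCH ALARMS_'count'.COM
    $   count = count + 1
    $   IF count .LE. 10 THEN GOTO loop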
any ideas?
brad
|
681.8 | Maybe DNS is the bottleneck? | TOOK::GUERTIN | E = mcc | Tue Feb 05 1991 20:27 | 20 |
| Brad,
Seems to me that MCC per se should be able to handle this load without
"breaking sweat". I'm guessing that the bottle neck is somewhere else.
Perhaps DNS?
Is it possible to do testing on something that doesn't use DNS so much?
For example, use a bridge name or address of a bridge which is
registered (the Bridge AM caches the information after the initial DNS
call). I'm not sure if the DNA4 AM calls DNS when you use just DNA4
addresses, but I seem to recall Jim Carey saying that it does not. So
you may want to try using DNA4 addresses as well. If MCC suddenly
"perks up", then the problem is probably overloading the DNS server.
By the way, is the DNS server on the same node?
We expect to do more thorough performance and endurance testing (and
tuning) in the near future (I hope).
-Matt.
|
681.9 | dns is local | JETSAM::WOODCOCK | | Wed Feb 06 1991 09:49 | 23 |
| Matt,
DNS is on the same system. Also, I don't have the environment to do
this testing with bridges. As far as whether this is a DNS or MCC
problem, they are one and the same from my standpoint.
I have conversed via mail with Jim recently and at this point I hope
to at least identify where the breakdown may be so I can better understand
where the limitations are. I think we may be breaking some new ground here
on performance of pollers (but not certain). We have tools now which have
in the past polled up to 350 nodes. These polls are broken down into 8
batches but I'm not sure how the timing is done at startup. Once started
they poll at an unrestricted rate asynchronously. Whereas what I have set
up in MCC is to start these polls with 10 batches, synchronously and with
a restricted rate of 10 minutes each.
Hopefully at some point, we can dedicate some time to get a better
understanding of the performance limitations and *what* is the limiting
factor. For now I'll tweak a bit more so that it works satisfactorily
until we go to an event solution. Addresses may be a good place to start.
brad...
|
681.10 | DNS hasn't been a suspect. Everything else though.... | TOOK::CAREY | | Wed Feb 06 1991 11:19 | 38 |
|
Brad and I have been trying to work the issues surrounding this problem
for awhile. The DECnet AM will only try the namespace if it can't
resolve the node name using the local database. Using the address
directly guarantees that we don't use the namespace, but I don't expect
that DNS is involved in this problem at all, even using the regular
node name.
Brad, you might ensure that that is the case by verifying that the
nodes you are using are defined in your local DECnet database. Check
for them as Remote Node entities on your local MCC node.
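A quick way to do that from DCL (the node name is made up):

    $ MCR NCP
    NCP> LIST NODE BRTR01            ! permanent database entry, if any
    NCP> SHOW NODE BRTR01 SUMMARY    ! volatile database and address
    NCP> EXIT

If NCP already knows the name, the AM should never need to go near the
namespace for it.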
I think DNS is probably not involved.
Some of our prime concerns:
- Four nearly simultaneous link requests from a single process to
DECnet. Likely, all four are doing a connect initiate at the same
time, not something that DECnet sees every day.
- Four nearly simultaneous link requests for a DECrouter. These
systems only allow a single management link, and are good at rejecting
attempts once a link is up. Do they get in trouble if bombarded with
many requests while trying to set up context?
- Four threads in the DECnet AM all doing the same thing at the same
time. We think we have reentrant code. It LOOKS reentrant.
Mitigating any or all of these possibilities for V1.1 will be difficult.
We know we can alleviate the strain by being prudent in setting up the
rules. This may be our best answer for V1.1 given the risk of major
changes at this stage of development.
-Jim Carey
|
681.11 | FYI: DECnet-VAX probably not the problem | MARVIN::COBB | Graham R. Cobb (Wide Area Comms.), REO2-G/H9, 830-3917 | Mon Feb 11 1991 05:54 | 10 |
| > - Four nearly simultaneous link requests from a single process to
> DECnet. Likely, all four are doing a connect initiate at the same
> time, not something that DECnet sees everyday.
VAX P.S.I. does this all the time. Particularly during our stress testing:
we connect and disconnect a *lot* of links (at least a couple of hundred)
from the same process very frequently. We have never seen any problems with
DECnet-VAX when doing this (lots of problems in our code, however!).
Graham
|
681.12 | | SUBURB::SMYTHI | Ian Smyth 830-3869 | Tue Feb 12 1991 08:11 | 13 |
|
> - Four nearly simultaneous link requests for a DECrouter. These
> systems only allow a single management link, and are good at rejecting
> attempts once a link is up. Do they get in trouble if bombarded with
> many requests while trying to set up context?
I don't understand this. I can get multiple NML links
up to any DECrouter 2000. I thought that the single management link
was only a restriction on the DECSA box.
Ian
|
681.13 | Thanks for your ideas, we're making some headway too.... | TOOK::CAREY | | Tue Feb 12 1991 09:52 | 50 |
|
Re: -2 -- thanks for the information about the DEMSA box. In that
case, "Sure, the DECmcc DNA4 AM supports 'em all."
Re: -1 -- my mistake. Of all of the DECrouter boxes, the DECrouter
2000 is the only one I've encountered that does support multiple links.
The others very happily reject additional requests. In my aching head,
the DECrouters all tend to get lumped together. I'll try to do better.
Specifically, the DECrouter 200, 250 and DECSA's all support only a
single management link. The DECrouter 2000 supports many.
FYI - it was testing against a DECrouter 250 that showed us the router
might need up to a second to recover its context after the management
link is dropped.
I'm also glad to hear that I should discount the possibility that we're
having a problem interacting with DECnet.
Finally, we've made some headway toward finding what we're doing wrong. We use
the phase4address for our transactions whenever possible, instead of
leaving DECnet to perform the translation. This allows us to fall back
by attempting to resolve the address from the DNS namespace if the
DECnet database doesn't know it, and gets the address into our
operations early, to maximize our ability to compare response entities
and create consistency through MCC. On VMS, we try to resolve the
address at the local database using NMLSHR to save time, and simplify
the process and user requirements of DECmcc.
Some interaction of these requests during our Local Processing is where
the damage seems to occur. We haven't successfully reproduced it
anywhere except on one 3520 dual processor machine, but on that machine
we have more trouble if the rules are polling locally than we do if we
attempt to go remote. Shutting off the NMLSHR back door gives us
consistently good results. It could be that our NMLSHR queueing isn't
completely reentrant, or it could be more directly related to the
hardware platform we're on. Anyway, we're pretty confident that we've
eliminated about 90% of the possibilities.
I'll post it here when we have some real results.
Thanks for your inputs.
-Jim Carey
|