
Conference azur::mcc

Title:	DECmcc user notes file. Does not replace IPMT.
Notice:	Use IPMT for problems. Newsletter location in note 6187
Moderator:	TAEC::BEROUD

Created:	Mon Aug 21 1989
Last Modified:	Wed Jun 04 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	6497
Total number of notes:	27359

84.0. "Is this a Bug or is it me ?" by PILOU::BONGARTZ (Huckleberry Finn, I presume ?) Tue Mar 27 1990 01:56

	Hi,

	I'm  currently  writing  an application that uses MCC to fetch
	DECnet Circuit Counters. I used the method suggested somewhere
	in this notesfile, invoking MCC as a subprocess with sys$input
	and sys$output being mailboxes.

	Doing this, I've come across two problems:

	My first approach was sending a one-line command to the sub-
	process, which was at DCL level, of the form
	"MCC SHOW NODE4 name CIRCUIT name ALL COUNTERS,TO FILE fname".

	This worked, but was obviously quite slow (image activation etc).

	The funny thing here was that, although the command completed ok,
	MCC had a reserved operand fault after each command (seen on
	sys$output). I have the exact address (somewhere in S0 space)
	here if needed.

	Due  to the time it took to invoke MCC, I decided to start MCC
	and then send commands directly to MCC, leaving it running all
	the  time.  No  more  reserved  operand faults, but - and this
	*REALLY* annoys me - the performance got worse.

	Time to return from a "SHOW NODE4 n CIRC ..." command consis-
	tently  increased,  from  an initial 13 seconds, to almost 10
	MINUTES!  The  Virtual  memory  used  by MCC also slowly went
	from ca. 12000 pages to 16000... I left it running over night
	and came in this morning to find the whole system hung... and
	a  message on the screen saying that the last "poll" took 563
	seconds to complete ! (Couldn't check the Virtual memory used
	by MCC at this time, everything was just dead.)


		Any clues ?

			Cheers,
				Marc.



	Below are a few more numbers, for those who are interested.

	What my code does is try to fetch circuit information for
	a set of 16 circuits in a 5 minute interval.

	commands entered (once for each node/circuit combination):

	  SHOW NODE n ADDRESS,TYPE,STATE,TO FILE f
	  SHOW NODE n CIRCUIT c ALL COUNTERS,TYPE,STATE,TO FILE f
	  SHOW CMON$SYNCH (to get an error message for synchronization)
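
The synchronization trick in the last command (send something that is guaranteed to produce a recognizable error, then read until that error appears) can be sketched as follows. This is only an illustration of the idea, not the poster's code: the function name `read_until_marker` and the sample output lines are made up, and it assumes the error text echoes the marker string.

```python
from typing import Iterable, List


def read_until_marker(lines: Iterable[str], marker: str) -> List[str]:
    """Collect output lines until one contains `marker`.

    The marker line itself is consumed but not returned. Raises EOFError
    if the stream ends before the marker shows up.
    """
    collected = []
    for line in lines:
        if marker in line:
            return collected
        collected.append(line.rstrip("\n"))
    raise EOFError(f"stream ended before marker {marker!r} was seen")


# Made-up transcript; real MCC output and error text will look different.
fake_output = iter([
    "circuit counters line 1\n",
    "circuit counters line 2\n",
    "error: unknown entity CMON$SYNCH\n",  # stand-in for the provoked error
    "output of the next command\n",
])
first_reply = read_until_marker(fake_output, "CMON$SYNCH")
print(first_reply)
```

Because the reader stops exactly at the marker line, whatever follows in the stream belongs to the next command, which is what keeps the caller and the subprocess in step.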

The numbers in the following table indicate the seconds elapsed until the above
three commands complete. As can be noticed, these numbers increase in an almost
linear fashion from 13 seconds (best case) to 154 !!! seconds. I can understand
the case of PILOU, which does not seem to be very responsive, or occasional
delays, but a constant increase of the time needed to poll ?

              Poll Loop Number
  1       2       3       4       5       6     Node
------  ------  ------  ------  ------  ------  ----------------------------------
 63(1)   48      64      90     137      83     GABIN   End Node (QNA-0, Ethernet)
 99(2)   94     118     146     128     148     PILOU   End Node (UNA-0, Ethernet)
 13      26      40      49      52      72     SUNNY1  Area4 (ETHERNET)
 16      28      43      80      94      73     SUNNY6  Area4 DDCMP Point
 17      30      48      89     112      75     SUNNY5  Area4 DDCMP Point
 18      31      51     102     116      75     SUNNY8  Area4 DDCMP Point
 18      34      75     105     103      77     SUNNY8  Area4 DDCMP Point
 19      35      83     107      65      83     RECTO   Area4 DDCMP Point
 21      38      82     107      65      78     RECTO   Area4 DDCMP Point
 20      37      87     120      66      83     SUNNY9  Area4 DDCMP Point
 22      37      88      97      74      79     RECTO   Area4 DDCMP Point
 22      40      70     105      67     113     RECTO   Area4 DDCMP Point
 25      42      61      92      69     147     SUNNY2  Area4 DDCMP Point
 24      39      57     136      68     138     SUNNY2  Area4 DDCMP Point
 24      41      59     122      70     151     SUNNY1  Area4 DDCMP Point
 25      42      51     125      70     154     SUNNY9  Area4 DDCMP Point


	(1) includes image activation of MCC
	(2) PILOU is a relatively loaded system.
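
To put a rough number on the "almost linear" growth, one can fit a straight line through a single row of the table. The sketch below uses the SUNNY1 (Area4, Ethernet) row; the helper name `least_squares_slope` is made up, and a fit over six points is an illustration rather than a measurement.

```python
def least_squares_slope(ys):
    """Slope of the best-fit line through (1, ys[0]), (2, ys[1]), ..."""
    n = len(ys)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Seconds for poll loops 1 to 6, copied from the SUNNY1 (ETHERNET) row.
sunny1_seconds = [13, 26, 40, 49, 52, 72]
slope = least_squares_slope(sunny1_seconds)
print(f"about {slope:.1f} extra seconds per poll loop")  # about 10.9 extra seconds per poll loop
```

For this row the fit gives roughly 11 additional seconds per loop, which matches the impression of a steady rather than occasional slowdown.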

Virtual Pages used by MCC increased slowly but consistently from an initial
12067 pages to 15395 pages at the end of the 6th poll loop. Most of its time,
MCC was COMputable, ticking away loads of cpu time.

    (maybe MCC doesn't cleanup after allocating memory data structures?)
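
For scale, the page counts above translate into bytes as sketched below. It relies on the VAX page size of 512 bytes; spreading the growth evenly over the six loops is a simplifying assumption, since only the start and end values are given.

```python
VAX_PAGE_BYTES = 512  # size of one VAX/VMS page

start_pages = 12067
end_pages = 15395
poll_loops = 6

growth_pages = end_pages - start_pages          # 3328 pages
growth_bytes = growth_pages * VAX_PAGE_BYTES    # 1703936 bytes
growth_mib = growth_bytes / (1024 * 1024)       # 1.625 MiB
pages_per_loop = growth_pages / poll_loops      # average, assumes even growth

print(f"{growth_pages} pages = {growth_mib:.3f} MiB, "
      f"about {pages_per_loop:.0f} pages per poll loop")
```

So the process grew by a bit over 1.6 MiB across the six loops, on the order of 550 pages per five-minute loop if the growth was even.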

T.R	Title	User	Personal Name	Date	Lines
84.1	INFO - are you running EFT kit?	GOSTE::CALLANDER		`Tue Mar 27 1990 14:49`	10
	Hi, You have hit upon some of the problems that we are currently working on. I would be interested in knowing if you are running the EFT kit. Especially the component version numbers of the DECnet NODE4 Access Module, the TRM Presentation Module, and the base system. Thanks for the additional information.
84.2	All T1.0.0 ...	PILOU::BONGARTZ	Huckleberry Finn, I presume ?	`Wed Mar 28 1990 05:55`	7
	> kit. Especially the component version numbers of the DECnet NODE4
	> Access Module, the TRM Presentation Module, and the base system.

	All three Component Versions are T1.0.0 ...

	(my workaround now is to exit and re-run MCC if it takes more than
	45 seconds for a poll...)
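
The workaround described here (restart the long-running session once a poll exceeds 45 seconds) might look like the sketch below. It is not the poster's code: `RestartingPoller`, `start_session`, and `poll` are hypothetical names, and the session object stands in for whatever handle the caller keeps on the MCC subprocess.

```python
import time


class RestartingPoller:
    """Run polls through a session and recycle the session after a slow poll."""

    def __init__(self, start_session, poll, limit_seconds=45.0, clock=time.monotonic):
        self._start_session = start_session  # callable returning a fresh session
        self._poll = poll                    # callable taking the session
        self._limit = limit_seconds
        self._clock = clock                  # injectable so the logic can be tested
        self._session = start_session()
        self.restarts = 0

    def poll_once(self):
        begin = self._clock()
        result = self._poll(self._session)
        elapsed = self._clock() - begin
        if elapsed > self._limit:
            # The slow poll has already completed; the restart only helps later polls.
            close = getattr(self._session, "close", None)
            if callable(close):
                close()
            self._session = self._start_session()
            self.restarts += 1
        return result, elapsed
```

Note that this only bounds how slow the session is allowed to become; it does nothing about a poll that never returns, which would need a timeout on the read itself.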
84.3	if you find another goods ones...	GOSTE::CALLANDER		`Wed Mar 28 1990 16:27`	12
	Thanks for the additional input. We will see what can be done. If you hit any other commands that go up at such a nice rate it would be useful if you posted them here. Since different commands go through different paths in the system, something that looks like a small leak on one command sometimes turns out to be something major given another command.

	jill
84.4	got one! (or two?)	PILOU::BONGARTZ	Huckleberry Finn, I presume ?	`Fri Mar 30 1990 05:14`	28
	> -< if you find another goods ones... >-

	Got another one... In my original polling loop, I also checked the
	counters on the local node (GABIN). Each poll created a SERVER_xxxx
	process, which apparently terminated after ca. 5 minutes... but as
	the commands were given in less time than that, the system filled up
	with these processes... and ended up doing nothing but paging and
	swapping.

	Another thing, though it might not be due to me, MCC or whatever
	else - "just a coincidence ?": I started my poll server in the
	afternoon before leaving work, and left it running over night,
	polling all the routers here in Valbonne. During the night, the
	whole network went down - systems crashed, etc. The last output from
	my server was at 03:13, and about that time the problems occurred.

	Whether my code crashed because of the problem, or the problem
	occurred because of the polls, is not clear to me - but if it's due
	to MCC or my server (no privs!), we'd better make sure this doesn't
	happen on a customer network...

	I'll let the thing run tonight and let you know if the net goes down
	the drain again.

	Regards,
		Marc.
84.5	Thanks for the additional information	PETE::BURGESS		`Fri Mar 30 1990 09:55`	36
	You have presented several problems to us which have been assigned
	to different engineers for resolution.

	1) The reserved operand fault which occurs when MCC is executed as a
	   sub-process with sys$input/output assigned to mailboxes.

	   This seems like a contained problem. I will try to reproduce your
	   experiment here and diagnose the problem. Would you send me the
	   exact commands which you used to create the MCC sub-process and
	   the commands used for communicating with the sub-process?
	   (enet: Pete::Burgess)

	2) Virtual memory expansion.

	   This is probably due to "vm leaks". We have instrumented test
	   versions of MCC with diagnostic tools for recording vm
	   deallocation problems, have been testing this problem since
	   December, and have fixed many problems. Our focus has probably
	   been on the normal successful operations and the most common
	   error paths. My hypothesis is that MCC is taking some error paths
	   without properly terminating its requested operations. We will be
	   trying to reproduce this problem with our instrumented version of
	   MCC.

	3) The performance problems.

	   The DECnet phase 4 project leader will be contacting you to
	   obtain more diagnostic information. My first concerns relate to
	   the large number of nml servers which are being created on your
	   routing servers.

	\Pete Burgess
84.6	Reduce NETSERVER$TIMEOUT to dump processes	TOOK::CAREY		`Fri Mar 30 1990 11:25`	26
	The only way we can see MCC "bringing down the network" is by
	applying huge loads on all of the routing nodes in the network. If
	we put enough pressure on them in terms of excessive NETSERVERs, it
	is conceivable that they will be unable to perform normal network
	communications. As soon as that happens, the routing traffic
	increases dramatically because the routers are trying to understand
	the topology. If you've got an appreciable number of routers, the
	network degrades rapidly.

	So, the first thing to do is get rid of the excessive NETSERVERs.

	We don't know why you spawn a new server with each connection. But
	until we do, you can at least cut down on the number of server
	processes that are out there by setting the NETSERVER process
	timeout lower. Do this by setting the system logical
	NETSERVER$TIMEOUT to just a few seconds instead of the default of
	around five minutes.

	You'll still suffer the process creation overhead, but at least you
	won't get the swapping and paging that you're seeing.

	Hope this helps, and I'll give you more on this server problem as
	soon as I can find out more.

	-Jim Carey
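
Why a shorter timeout helps can be seen with a little arithmetic. The sketch below is a simplified model, not a description of how DECnet schedules servers: it assumes every request spawns a new server, requests arrive at a steady interval, and each server exits a fixed time after it was started. The function name and the 20-second request interval are made up for illustration.

```python
import math


def lingering_servers(timeout_seconds: float, spawn_interval_seconds: float) -> int:
    """Number of servers alive at once in steady state under the model above."""
    if timeout_seconds <= 0 or spawn_interval_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(timeout_seconds / spawn_interval_seconds)


# Roughly five minutes of idle timeout, one request every 20 s (hypothetical).
print(lingering_servers(300, 20))  # 15
# Timeout cut to a few seconds, same request rate.
print(lingering_servers(5, 20))    # 1
```

Under these assumptions the number of idle processes scales with the timeout, so cutting it from minutes to seconds removes most of the memory pressure even though each request still pays for a process creation.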
84.7	We Can't Reproduce Multiple Server Problems	TOOK::CAREY		`Mon Apr 02 1990 12:03`	61
	Marc,

	I had a chance to do some experimenting on our network here, and was
	unable to reproduce a situation where multiple servers were spawned
	and weren't expected. Any details that you could give me about the
	exact nature of your requests could help, although I can't imagine
	what might be different about them.

	I created and checked out the following cases:

	- Connecting to a remote node with Proxy Access defined.

	  This worked fine. Subsequent requests connected to the spawned
	  server.

	- Connecting to a remote node using explicit access
	  (BY USER = "...")

	  This also worked fine. I did these close together, so the Proxy
	  Server was still out there, and a new server was created for the
	  explicit access case. This is normal because VMS has to consider
	  them to be different processes with different rights. As expected,
	  subsequent requests connected to the same server just spawned.

	- Connecting to a remote node using Default Access (no proxy, no
	  explicit accounting information)

	  This worked as expected too. After forming this connection, I had
	  three servers running: one for the Proxy access, one for the
	  Explicit Access, and one for the Default Access. Subsequent
	  requests didn't spawn any new servers.

	In fact, once I had the three servers running, I attempted to
	confuse the system by using Proxy, Explicit, and Default Access in
	different combinations. No problems were encountered, and no
	additional processes were spawned (by the way, connecting to an
	existing server cuts down the response to a circuit counters request
	from an estimated fifteen seconds to two or three seconds maximum).

	We also tried to reproduce the problem on a boundary condition. You
	mentioned that your servers were set up to last about five minutes
	and that you were requesting counters about every five minutes. We
	wondered if the server process could somehow get locked up if a
	request came in just as it was being stopped. Several attempts to
	cause this to happen were unsuccessful.

	Since you appear to reproduce this problem at will, we don't expect
	that the problem lies on that boundary.

	We still suspect that there is something funny about the NETSERVER
	processes that you are creating and will continue to pursue that
	angle. I hope that isolating and changing the appropriate network,
	system, or account parameters will clean up these servers and get
	your connections behaving more closely to what we expect.

	-Jim Carey
84.8	Defective Bridge responsible for Network problems	TOOK::CAREY		`Tue Apr 03 1990 10:20`	11
	Just a little added detail: While MCC was under suspicion of "bringing down the network", it appears that a defective bridge was the real culprit this time. We are still investigating the problems described in this note, but there are no grounds to fear that DECmcc will topple your network.

	-Jim Carey
84.9	Any progress on increasing response time problem?	DSTEG1::MCCANN		`Wed May 09 1990 09:40`	6
	Has the problem of the ever-increasing response times mentioned in .0 been solved, or its cause identified? If so, will it be fixed in EFT update? Jack
84.10	leaks being plugged	GOSTE::CALLANDER		`Wed May 09 1990 16:10`	24
	There were two things at work in the problems reported. The
	defective bridge was the cause of the crash and most of the "slow
	down" that was experienced. The other problem was due to some memory
	leaks (causing fragmentation of memory when run for extended periods
	of time), and the dictionary lookup overhead.

	For EFT update we have made quite a few advances in our memory
	management by implementing a local cache for the allocation and
	deallocation of temporary memory; a better caching algorithm for the
	dictionary lookups was implemented in the EFT release, and fine
	tuned for EFT update; quite a few leaks were plugged; and some of
	the slower code paths have been reviewed and condensed to provide a
	faster end user response time.

	So far people with early integration releases of the base system
	changes have been very pleased with the enhancements. I hope you are
	too. But we are not stopping there; work on performance and memory
	management is continuing.

	jill