
Conference bulova::decw_jan-89_to_nov-90

Title:	DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:	See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:	STAR::VATNE

Created:	Mon Oct 30 1989
Last Modified:	Mon Dec 31 1990
Last Successful Update:	Fri Jun 06 1997
Number of topics:	3726
Total number of notes:	19516

428.0. "Excessively slow client application startup times" by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 10:25

I wonder if I'm doing something wrong...

Lots of people tell me how wonderful it is to run an application on a big
server and have the display on a workstation client.  So I try it - on
our 8800, I do a SET DISPLAY/CREATE/NODE=QUARK:: and try running a few
applications.  The results are so poor, there has to be something amiss.

The first thing I tried was running the new DECwindows debugger.  The 8800
is running V5.1, but has the T5.2 debugger installed; my VSII/GPX has
T5.2-410.  After THREE MINUTES, the first debugger window started to appear.
Four minutes elapsed before all windows were present.  Clicking on a menu
required about 5-7 seconds before the pulldown menu appeared.  However,
I could not get very far before I got "connection aborted".  (Yes, I've
read the note about the server retry logical, and will ask our system managers
to define it, but I don't think that should affect startup time.)

Next I tried DECW$MAIL - Two minutes for the window to appear.

Then I tried DECW$CLOCK, simple, right?  After two and a half minutes with
nothing on the screen, I get the infamous 2DBA002 error and "connection
aborted".

Please tell me - what do I look at?  The debugger people in my group say
they have no problems (it only [sic] takes them a minute and 20
seconds to bring up the debugger...)

			Steve

T.R	Title	User	Personal Name	Date	Lines
428.1	network transmission problems?	STAR::BMATTHEWS		`Fri Mar 17 1989 10:31`	8
	Check the decnet error counts on your workstation. It could be that there are very many retransmissions going on. I believe you can increase the number of receive buffers to alleviate that problem. A decnet/ncp guru will have to provide more details on exactly how to find out what the error count is and how to increase the receive buffers. Bill
428.2	Also WSEXTENT and/or WSMAX	MCNALY::MILLER	Bush For President...Kate Bush!	`Fri Mar 17 1989 10:37`	9
	Lots of public clusters have really low WSEXTENTs in users' authorize parameters and/or a low WSMAX as a SYSGEN parameter. In my opinion, they should be at least 5000 pages. Regards, == ken miller ==
428.3		QUARK::LIONEL	The dream is alive	`Fri Mar 17 1989 11:04`	37
	Well, here's the circuit counters on my workstation - doesn't look bad to me...

	Circuit Counters as of 17-MAR-1989 11:09:06

	Circuit = QNA-0

	       21910  Seconds since last zeroed
	       36872  Terminating packets received
	       30754  Originating packets sent
	           0  Terminating congestion loss
	           0  Transit packets received
	           0  Transit packets sent
	           0  Transit congestion loss
	           0  Circuit down
	           0  Initialization failure
	           0  Adjacency down
	           1  Peak adjacencies
	       32215  Data blocks sent
	     9373289  Bytes sent
	       39227  Data blocks received
	     2286248  Bytes received
	           0  Unrecognized frame destination
	           7  User buffer unavailable

	And on the cluster, my WSQUOTA is 1024, WSEXTENT is 4096, WSMAX is 24600 and there's some 35000 free pages.

	Here's the executor characteristics that may be relevant:

	Maximum buffers = 100
	Buffer size     = 576
	Pipeline quota  = 65024

	What else can I look at?

	Steve
428.4	More things to look at	STAR::BMATTHEWS		`Fri Mar 17 1989 12:11`	6
	The user buffer unavailable is the problem I have seen in the past. Do a $ mcr ncp show known lines char command. On my vs2000/gpx I have 6 receive buffers which is probably low. The big nodes in our cluster have 20 buffers. My device buffer size is 1498. I don't know whether that is good or bad. Bill
428.5	Another piece of possibly relevant data	STAR::BMATTHEWS		`Fri Mar 17 1989 12:14`	2
	Look also at your line counters. $ mcr ncp show known lines count. - Bill
428.6		QUARK::LIONEL	The dream is alive	`Fri Mar 17 1989 12:27`	25
	Hmm... I have 10 receive buffers. I'll try setting it to 20 and see what I get. Here are my line counters.

	Line = QNA-0

	       28687  Seconds since last zeroed
	      148651  Data blocks received
	       88453  Multicast blocks received
	           0  Receive failure
	    12909746  Bytes received
	     8885300  Multicast bytes received
	           0  Data overrun
	       53386  Data blocks sent
	        2305  Multicast blocks sent
	          46  Blocks sent, multiple collisions
	          46  Blocks sent, single collision
	           0  Blocks sent, initially deferred
	    13414383  Bytes sent
	      166852  Multicast bytes sent
	           0  Send failure
	           0  Collision detect check failure
	           0  Unrecognized frame destination
	           0  System buffer unavailable
	         734  User buffer unavailable
428.7		QUARK::LIONEL	The dream is alive	`Fri Mar 17 1989 13:20`	4
	Ok, tried 20 receive buffers - no difference. This can't be so hard! Steve
428.8	client or ws	STAR::BMATTHEWS		`Fri Mar 17 1989 15:30`	4
	Try firing up your apps to someone else's workstation to see if it is a problem on the workstation side or the client side. Bill
428.9		QUARK::LIONEL	The dream is alive	`Fri Mar 17 1989 17:35`	5
	Ok, I tried it on someone else's WS - a VS2000 running V5.1. It's just as bad. So maybe it's our cluster? What server parameters matter? Steve
428.10		STAR::ORGOVAN	Vince Orgovan	`Sat Mar 18 1989 14:11`	42
	Steve, this is puzzling. I just started up DECW$MAIL on our 8800 with the display directed to my standalone VS2000/GPX with 6Mb. The window appeared in 30 seconds. It's hard to understand what could account for it taking four times longer in your configuration.

	Can you compile and run this program on your 8800 with the display directed at your server? It times the basic client-to-server and back again round trip. It runs in about 15 seconds elapsed time on our 8800.

	/*
	 * This program times a server round trip.
	 *
	 * To compile & link on VMS:
	 *
	 *	$ cc foo
	 *	$ link foo,sys$input/opt
	 *	sys$share:decw$xlibshr/share
	 *	sys$share:vaxcrtl/share
	 *	^Z
	 */
	#include <decw$include/Xlib.h>

	#define loopcount 1000

	extern int lib$init_timer();
	extern int lib$show_timer();

	int main()
	{
	    Display *dpy;
	    int i;

	    dpy = XOpenDisplay("");

	    lib$init_timer();		/* start the timer */
	    for (i=0; i < loopcount; i++)
	        XSync(dpy, 0);		/* each call is one client/server round trip */
	    lib$show_timer();		/* stop & display the timer */

	    XCloseDisplay(dpy);
	    exit(1);			/* an odd status means success on VMS */
	}
428.11	Is the 8800 simply a heavily used pig?	CVG::PETTENGILL	mulp	`Sat Mar 18 1989 16:47`	18
	Obviously the problem is on the 8800. Can you use another system to get a better feel for what it should be like; in my experience a 6220 is much better than you describe. What kind of load is there on the 8800? If you get less memory+cpu+io than on a vs2000 then you aren't going to be better off. Are you running in batch or interactive? Batch usually has a lower priority which will give you bad performance if a lot of `background' work is being done. Are your UAF or system working set figures very low? A large working set extent is needed for a lot of applications. Are you getting routed either due to cluster alias or because of the LAN configuration? This seems unlikely because when I've traced the comm traffic, it is very similar to LAT traffic except the messages are about twice as large (ie., ~200 bytes average instead of ~100 bytes).
428.12	Ouch!	QUARK::LIONEL	The dream is alive	`Mon Mar 20 1989 10:46`	18
	Ok, I ran Vince's program... The bad news is that I had to reduce the loop count from 1000 to 25 to get it to complete in a reasonable time... At a count of 25, it took 2 minutes and 57 seconds. While the program was running, my workstation was compute-bound. The 8800 I am running on is not heavily loaded, and the working set parameters are more than adequate. There is no cluster routing involved, and the two systems are on the same logical Ethernet (though perhaps on different segments). I ran the program interactively. I would agree that the problem would appear to be something on our cluster. It can't be just a matter of load - the performance figures I gave above are just TOO awful - something else has to be at work. Ideas and consulting offers welcome... Steve
428.13		STAR::ORGOVAN	Vince Orgovan	`Mon Mar 20 1989 11:06`	12
	Egads. With a loop count of 25 taking 177 seconds that means that your typical round trip is 7 seconds. It should be something like 0.015 seconds. And your workstation is compute-bound? I suspect that something is drastically wrong on your workstation. Maybe you've uncovered a server memory leak or something? Do a show process /continuous on your server process (from some set host connection that doesn't use the server) and rerun the round-trip timer. Is the server process getting much CPU time? Is it getting any page faults? How big is its virtual address space?
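The arithmetic in .13 can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the 0.015 second baseline is simply the "about 15 seconds for 1000 round trips" that .10 reports for a healthy setup.

```c
#include <assert.h>

/* Average seconds per round trip, given the elapsed time for the whole loop. */
static double seconds_per_round_trip(double elapsed_seconds, int loop_count)
{
    assert(loop_count > 0);
    return elapsed_seconds / (double)loop_count;
}

/* How many times slower a measured round trip is than the expected one. */
static double slowdown_factor(double measured_seconds, double expected_seconds)
{
    assert(expected_seconds > 0.0);
    return measured_seconds / expected_seconds;
}
```

With the figures from .12 (2 minutes 57 seconds, or 177 seconds, for 25 iterations), seconds_per_round_trip(177.0, 25) comes to about 7.08 seconds, and slowdown_factor(7.08, 0.015) puts that at several hundred times slower than the baseline, which is why load alone was an unlikely explanation.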
428.14	More info to gather	STAR::BMATTHEWS		`Mon Mar 20 1989 11:41`	9
	Boy there sure seems to be a lot of conflicting data here. If other people run apps on the 8800 to their workstations it runs ok but if you run apps to your ws or other ws's then performance is terrible. You should complete the matrix and see if someone else can fire off apps to your ws and see what performance is like. Also when you run vince's round trip test you get a ws that is compute bound. Where are the ws computes going? Are they all in the server at user mode? kernel mode? some other process? Bill
428.15	Works ok here...sorry	NECSC::LEVY	A leaf of all colors	`Tue Mar 21 1989 08:21`	14
	I thought I'd try this out just for giggles. I'm running FOO out of a DCL Command window from FileView which is running in a Priority 4 Batch queue on an 8350 client with only 1 user. The display is going to my 8 meg PVAX server. Here's output on a couple of runs.

	Nidus> r foo
	 ELAPSED: 0 00:00:18.98  CPU: 0:00:10.19  BUFIO: 2000  DIRIO: 0  FAULTS: 2
	Nidus> r foo
	 ELAPSED: 0 00:00:21.37  CPU: 0:00:10.36  BUFIO: 2000  DIRIO: 0  FAULTS: 2

	By the way, DECW$SERVER_0 goes to about 20% of CPU (If you can believe BANNER) while the program is running. We haven't done any special tuning other than following the recommendations here.
428.16	Weirder and weirder	QUARK::LIONEL	The dream is alive	`Tue Mar 21 1989 10:33`	16
	I ran the program on someone else's VS2000, directed at my WS, and had the same symptoms. Taking Vince's suggestion, I ran MONITOR and SHOW PROC/CONT from a SET HOST connection, and found that the system was NOT going compute-bound. In fact, what seemed to be happening was that the server process was going to sleep. It stayed in HIB mode a lot and got hardly any CPU time. A MONITOR MODES showed idle time jumping up to 90% (from 70%) while the test was running - no single process got any significant part of the CPU. Interrupt stack time was 5% or less. The server did not page fault or increase its memory requirements. What now? Steve
428.17		LESLIE::LESLIE	Bizarro Engineer	`Tue Mar 21 1989 11:07`	7
	Try doing this with a direct ethernet connection between the systems. If your ethernet is being flooded with bad packets by a faulty DEQNA or somesuch, all will work okay now and you can chase your facilities people about an ethernet problem. If not, well, it's another one off the checklist.
428.18	Could it be a lock problem?	NECSC::LEVY	A leaf of all colors	`Tue Mar 21 1989 11:49`	18
	This is a stab in the dark, but could you have a locking problem? When you state that the server is going into a HIB state, it seems that it's looking for a resource that is not available. The LOCKIDTBL entry on the 8350 client on which I run is:

	SYSGEN> SHOW LOCKIDTBL
	Parameter Name     Current   Default   Minimum   Maximum  Unit     Dynamic
	--------------     -------   -------   -------   -------  ----     -------
	LOCKIDTBL             7115       200        40     65535  Entries

	Could this be the problem???

	- Dave
428.19	Could be LOCKIDTBL!	QUARK::LIONEL	The dream is alive	`Tue Mar 21 1989 11:58`	13
	Re: .18 (LOCKIDTBL) Hmm.. could be! My WS has LOCKIDTBL set at 180 (I didn't set this - AUTOGEN must have - it's even below the default!) I will up this significantly and see what happens. Re: .17 When I ran from the other VS2000, it was in the office next to mine on the same Ethernet segment. Steve
428.20	Maybe QNA problems?	STAR::BMATTHEWS		`Tue Mar 21 1989 12:16`	9
	The server goes into HIB when it thinks it has nothing to do. The transport should wake the server when data arrives from a client or the driver when data arrives from the keyboard or mouse. The most likely scenario is that the network or QNA on your workstation is having problems. It is also possible that something is amiss in the server and it is not recognizing it has work to do. You could try @sys$manager:decw$startup restart to see if a new invocation of the server helps. Bill
428.21	Check user buffers one more time?	STAR::BMATTHEWS		`Tue Mar 21 1989 12:19`	4
	Did you do a $ MCR NCP SHO KNOWN LINES COUNT before and after running vince's program to see if the user buffer unavailable count is still going up? Bill
428.22	Well, it SOUNDED good...	QUARK::LIONEL	The dream is alive	`Tue Mar 21 1989 13:09`	22
	I ran AUTOGEN with feedback (first time since installing T5.2-410) and specified MIN_LOCKIDTBL as 2000. AUTOGEN noted that my LOCKIDTBL and RESHASHTBL (something like that) were low and raised them. I rebooted and tried Vince's program again. If anything, it's worse. I am running 20 receive buffers now - while running Vince's program, the count of "user buffer unavailable" went up by 3 (for the 25-count loop). Doesn't sound significant. I haven't noticed any other network-related problems. Another data point is that when I run the program, the server activity on my WS grinds to a stop - cursors stop flashing for a minute at a time, the calendar icon takes 30 seconds to repaint, etc. Yet there are ample cycles available, and the memory usage is minimal. Even if the Ethernet connection were bad, that shouldn't affect local server activity, should it? Other people in my group are reporting similar problems. I wish I could get to the bottom of this... Steve
428.23	Could be a network/dw transport/server sched interaction	STAR::BMATTHEWS		`Tue Mar 21 1989 14:09`	27
	I am running 20 receive buffers now - while running Vince's program, the count of "user buffer unavailable" went up by 3 (for the 25-count loop). Doesn't sound significant. I haven't noticed any other network-related problems.
	>
	>I think that 3 of 25 is significant. My user buffer unavailable count is
	>zero and stays at zero. If any DECNET gurus are out there maybe they can
	>help determine if this is significant or not. I think I remember that
	>if there is no buffer available that there is a retry involved and also
	>possibly a delta wait time before the retry is attempted. If so then
	>possibly the retry delta is way off.

	Another data point is that when I run the program, the server activity on my WS grinds to a stop - cursors stop flashing for a minute at a time, the calendar icon takes 30 seconds to repaint, etc. Yet there are ample cycles available, and the memory usage is minimal. Even if the Ethernet connection were bad, that shouldn't affect local server activity, should it?

	>It does make sense because Xsync requires a reply from the server and while
	>DECNET is doing its write I suspect the server is waiting for the write
	>to complete. Maybe Monty can explain how the DECNET writes from the server
	>work and what could happen if DECNET can't post the write immediately.

	Bill
428.24		QUARK::LIONEL	The dream is alive	`Tue Mar 21 1989 15:07`	22
	I have some more data... I ran Vince's program from the same 8800 to another person's VS_II/GPX running V5.1, and it ran quickly. We looked at his system's "User Buffer Unavailable" counter and it was zero, after a long time. We compared EXEC, LINE and CIRCUIT parameters between our systems and didn't see anything obvious - in fact, his parameters were often "worse" than mine (he had 6 receive buffers, for example). We concluded that the problem is related to the "user buffer unavailable" problems, but are unable to see why that is happening so often. Our understanding of the retry intervals matches Bill's in .23. By the way, once when I ran the program on the 8800, the elapsed time was over four minutes (with a loop count of 25), but the CPU time was under a second. I am wondering if this is something new with T5.2 - I will enter a QAR about it just in case. But if anyone wants to contact me offline, or here, to help resolve this, please do! I'm no longer quite so interested for myself, but I have a feeling that others may run into the same problem. Steve
428.25	decw$server_retry_write_m* logicals?	STAR::BMATTHEWS		`Tue Mar 21 1989 16:34`	4
	Steve, do you have the decw$serverretry logicals set up high? The parameters are now in ms, not 1/10th of a ms. Bill
428.26	No...	QUARK::LIONEL	The dream is alive	`Tue Mar 21 1989 18:32`	10
	Re: .25 No, I don't have them defined. (And while talking with Bill on the phone, I tried various permutations of those logicals to no effect.) I have entered a QAR about this matter. Steve
428.27	The culprit has been identified!	QUARK::LIONEL	The dream is alive	`Wed Mar 22 1989 13:20`	16
	After trying lots of different things with the help of Bill Matthews and Monty Brandenberg, Monty noted that my DECnet exec Pipeline Quota was very high - 65024 - where a value of 10000 is recommended by DECwindows. I had been told to raise it when I installed DFS. So I lowered it back to 10000, and - WOW! What a difference! Everything is lightning fast! The moral is - be careful how high you raise your pipeline quota. I don't yet know what values cause problems, but will be investigating this some more with DECnet and DFS people. Whew! Steve
428.28	Workaround found	STAR::BMATTHEWS		`Wed Mar 22 1989 13:21`	14
	Well on Monty's suggestion Steve lowered pipeline quota to 10000 and the problems went away. Steve had raised pipeline quota because that is supposed to enhance DFS performance. Now there is still something or many things broken here that need looking into but there is now a workaround for people who have this problem. First problem seems to be why does decnet get errors with a large pipeline quota. Second or multiple problems then appear because of some retry timers being much too long. This could be DECNET and/or the DECWindows server retry timers. Bill
428.29		QUARK::LIONEL	The dream is alive	`Wed Mar 22 1989 13:53`	10
	Doing some more experimentation shows that the pipeline quota can be as high as 60000 without noticeable problems, but at 65000 everything dies. I was also told just now that values over 25000 are pointless, since there is no performance gain, just memory use, above that value. I am still getting "User Buffer Unavailable" errors, but I am pursuing that elsewhere. Steve
428.30	Pipeline quota up to 64960 appears ok	STAR::BMATTHEWS		`Wed Mar 22 1989 14:13`	5
	After a bit of binary search it appears that all is well with a pipeline quota of up to 64960 but at 64961 and greater things go to pot. At least this is the value on my system. Bill
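The binary search mentioned in .30 can be sketched in C as follows. In practice each probe meant setting the pipeline quota by hand and rerunning the round-trip timer; here a stand-in predicate, demo_works, hard-codes the threshold Bill reported so the search itself can be followed. Both function names are hypothetical, and the sketch assumes that every value below the threshold works and every value above it fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A probe: returns true if performance is acceptable at this quota. */
typedef bool (*works_fn)(uint32_t quota);

/*
 * Largest value in [lo, hi] for which works() returns true.
 * Assumes works(lo) is true and that once a value fails, all larger ones fail.
 */
static uint32_t largest_working(uint32_t lo, uint32_t hi, works_fn works)
{
    assert(lo <= hi && works(lo));
    while (lo < hi) {
        /* Round the midpoint up so the interval shrinks on every pass. */
        uint32_t mid = lo + (hi - lo + 1) / 2;
        if (works(mid))
            lo = mid;        /* mid is fine: the answer is mid or higher */
        else
            hi = mid - 1;    /* mid fails: the answer is below mid */
    }
    return lo;
}

/* Stand-in for a manual test run, using the threshold reported in .30. */
static bool demo_works(uint32_t quota)
{
    return quota <= 64960u;
}
```

Calling largest_working(10000, 65535, demo_works) narrows the range down to 64960 in about sixteen probes, which is far fewer manual test runs than stepping through the values one at a time.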
428.31	Never trust anyone over 32767.	STAR::BRANDENBERG	Intelligence - just a good party trick?	`Wed Mar 22 1989 14:14`	2

428.32	As the ghostly outline of Al Eldridge moves across the screen	POOL::HALLYB	The Smart Money was on Goliath	`Wed Mar 22 1989 14:24`	7
	Interesting that the difference between 65536 and 64960 is exactly 576, a segment buffer size. Any bets that there's some 16-bit arithmetic going on somewhere? And Steve was getting effectively no "pipelining"? John
428.33		STAR::MFOLEY	Rebel without a Clue	`Wed Mar 22 1989 15:07`	8
	RE: .31 32767 is what the DFS folks recommend for DECnet pipeline quota. (At least when I managed their cluster that's what it was) mike
428.34	It's a bug in DECnet-VAX	BULEAN::CARSON	Knockwurst & Excommunications	`Tue Apr 25 1989 15:25`	10
	.32 is correct. A calculation using a DIVW thinks your pipeline quota is a small negative number. This is fixed in a future release. We will make a patch available to NCSS if anyone really needs 16 bits of Pipeline Quota. Pete Carson DECnet-VAX SW Maint. Eng.
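The effect described in .32 and .34 can be illustrated with a small C model. This is not the DECnet-VAX code: the function names are invented, the choice of dividing by the 576-byte buffer size is taken from the guess in .32, and the model assumes the signed divide truncates toward zero as C division does.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the low 16 bits of a value as a signed word. */
static int32_t as_signed_word(uint32_t value)
{
    int32_t w = (int32_t)(value & 0xFFFFu);
    return (w >= 0x8000) ? w - 0x10000 : w;
}

/*
 * Hypothetical model: the number of segments the quota allows when the
 * quota is treated as a signed 16-bit word before the divide.
 */
static int32_t segments_signed16(uint32_t pipeline_quota, int32_t segment_size)
{
    assert(segment_size > 0);
    return as_signed_word(pipeline_quota) / segment_size;
}
```

Under these assumptions Steve's quota of 65024 is seen as -512, and -512 / 576 truncates to 0 segments, which would match the "effectively no pipelining" guess in .32. The model also gives 0 for 64961 but -1 for 64960, lining up with the threshold found in .30, while the recommended 10000 yields 17 segments. How the real code treats a nonzero negative quotient is not stated in the thread.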
428.35	Need some help in tracking a problem	6317::FEATHERSTON	Ed Featherston	`Fri May 12 1989 12:17`	16
	I've used the program posted earlier in the replies to verify a problem I am having with running client apps on either of 2 8800's to display on any of the workstations in my group running DECWindows. If I use the program, I get elapsed times of anywhere between 30 seconds to several minutes. This is on either 8800 (plenty of null time when running the program), and doesn't depend on the display system (I have tried both VS-II's and VS-2000's). To verify it wasn't the workstations I ran the program on a uVAX-II and a VAX-6240. I consistently get an elapsed time between 20-22 seconds on both those systems. The 8800's are clustered together, VMS 5.1, 128MB each. PIPELINE QUOTA is 40000, receive buffers 20. Lots of free NPAGEDYN, SRP, IRP, and LRP's. I am tearing my hair out trying to find the cause. Any suggestions of where to look would be greatly appreciated. Thanks. /ed/
428.36		STAR::BRANDENBERG	Si vis pacem para bellum	`Fri May 12 1989 12:20`	5
	Drop PIPELINE down to 25000 as any more is meaningless. Once a client has connected, check link counters for anything 'interesting'.