
Conference bulova::decw_jan-89_to_nov-90

Title:DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:STAR::VATNE
Created:Mon Oct 30 1989
Last Modified:Mon Dec 31 1990
Last Successful Update:Fri Jun 06 1997
Number of topics:3726
Total number of notes:19516

428.0. "Excessively slow client application startup times" by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 10:25

I wonder if I'm doing something wrong...

Lots of people tell me how wonderful it is to run an application on a big
server and have the display on a workstation client.  So I try it - on
our 8800, I do a SET DISPLAY/CREATE/NODE=QUARK:: and try running a few
applications.  The results are so poor, there has to be something amiss.

The first thing I tried was running the new DECwindows debugger.  The 8800
is running V5.1, but has the T5.2 debugger installed; my VSII/GPX has
T5.2-410.  After THREE MINUTES, the first debugger window started to appear.
Four minutes elapsed before all windows were present.  Clicking on a menu
required about 5-7 seconds before the pulldown menu appeared.  However,
I could not get very far before I got "connection aborted".  (Yes, I've
read the note about the server retry logical, and will ask our system managers
to define it, but I don't think that should affect startup time.)

Next I tried DECW$MAIL - Two minutes for the window to appear.

Then I tried DECW$CLOCK, simple, right?  After two and a half minutes with
nothing on the screen, I get the infamous 2DBA002 error and "connection
aborted".

Please tell me - what do I look at?  The debugger people in my group say
they have no problems (it only [sic] takes them a minute and 20
seconds to bring up the debugger...)

			Steve

T.R  Title  User  Personal Name  Date  Lines
428.1. "network transmission problems?" by STAR::BMATTHEWS Fri Mar 17 1989 10:31 (8 lines)
Check the decnet error counts on your workstation. It could be that there
are very many retransmissions going on. I believe you can increase the
number of receive buffers to alleviate that problem. A decnet/ncp guru
will have to provide more details on exactly how to find out what the
error count is and how to increase the receive buffers.
						Bill


428.2. "Also WSEXTENT and/or WSMAX" by MCNALY::MILLER (Bush For President...Kate Bush!) Fri Mar 17 1989 10:37 (9 lines)
Lots of public clusters have really low WSEXTENTs in users' AUTHORIZE
parameters and/or low WSMAX as a SYSGEN parameter.

In my opinion, they should be *at least* 5000 pages.

Regards,

           == ken miller ==

428.3. by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 11:04 (37 lines)
Well, here are the circuit counters on my workstation - doesn't look bad to me...

Circuit Counters as of 17-MAR-1989 11:09:06

Circuit = QNA-0

       21910  Seconds since last zeroed
       36872  Terminating packets received
       30754  Originating packets sent
           0  Terminating congestion loss
           0  Transit packets received
           0  Transit packets sent
           0  Transit congestion loss
           0  Circuit down
           0  Initialization failure
           0  Adjacency down
           1  Peak adjacencies
       32215  Data blocks sent
     9373289  Bytes sent
       39227  Data blocks received
     2286248  Bytes received
           0  Unrecognized frame destination
           7  User buffer unavailable
   
And on the cluster, my WSQUOTA is 1024, WSEXTENT is 4096, WSMAX is 24600
and there's some 35000 free pages.  

Here's the executor characteristics that may be relevant:

Maximum buffers          = 100
Buffer size              = 576
Pipeline quota           = 65024

What else can I look at?

		Steve

428.4. "More things to look at" by STAR::BMATTHEWS Fri Mar 17 1989 12:11 (6 lines)
The user buffer unavailable is the problem I have seen in the past. Do a
$ mcr ncp show known lines char command. On my vs2000/gpx I have 6 receive
buffers which is probably low. The big nodes in our cluster have 20 buffers.
My device buffer size is 1498. I don't know whether that is good or bad.
						Bill

428.5. "Another piece of possibly relevant data" by STAR::BMATTHEWS Fri Mar 17 1989 12:14 (2 lines)
Look also at your line counters. $ mcr ncp show known lines count. - Bill

428.6. by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 12:27 (25 lines)
Hmm...  I have 10 receive buffers.  I'll try setting it to 20 and see
what I get.  Here are my line counters.

Line = QNA-0

       28687  Seconds since last zeroed
      148651  Data blocks received
       88453  Multicast blocks received
           0  Receive failure
    12909746  Bytes received
     8885300  Multicast bytes received
           0  Data overrun
       53386  Data blocks sent
        2305  Multicast blocks sent
          46  Blocks sent, multiple collisions
          46  Blocks sent, single collision
           0  Blocks sent, initially deferred
    13414383  Bytes sent
      166852  Multicast bytes sent
           0  Send failure
           0  Collision detect check failure
           0  Unrecognized frame destination
           0  System buffer unavailable
         734  User buffer unavailable

428.7. by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 13:20 (4 lines)
Ok, tried 20 receive buffers - no difference.  This can't be so hard!

	Steve

428.8. "client or ws" by STAR::BMATTHEWS Fri Mar 17 1989 15:30 (4 lines)
Try firing up your apps to someone else's workstation to see if it is
a problem on the workstation side or the client side.
				Bill

428.9. by QUARK::LIONEL (The dream is alive) Fri Mar 17 1989 17:35 (5 lines)
Ok, I tried it on someone else's WS - a VS2000 running V5.1.  It's
just as bad.  So maybe it's our cluster?  What server parameters matter?

			Steve

428.10. by STAR::ORGOVAN (Vince Orgovan) Sat Mar 18 1989 14:11 (42 lines)
    Steve, this is puzzling. I just started up DECW$MAIL on our 8800 with 
    the display directed to my standalone VS2000/GPX with 6Mb. The window 
    appeared in 30 seconds. It's hard to understand what could account for 
    it taking four times longer in your configuration.  
    
    Can you compile and run this program on your 8800 with the display
    directed at your server? It times the basic client-to-server and
    back again round trip. It runs in about 15 seconds elapsed time
    on our 8800. 
    
/*
 * This program times a server round trip.
 * 
 * To compile & link on VMS:
 *
 *	$ cc foo
 *	$ link foo,sys$input/opt
 *	sys$share:decw$xlibshr/share
 *	sys$share:vaxcrtl/share
 *	^Z
 */

#include <decw$include/Xlib.h>

#define loopcount 1000
extern int lib$init_timer();
extern int lib$show_timer();

int main()
{
    Display *dpy;
    int i;

    dpy = XOpenDisplay("");
    lib$init_timer();			/* start the timer */
    for (i=0; i < loopcount; i++) XSync(dpy, 0);
    lib$show_timer();			/* stop & display the timer */
    XCloseDisplay(dpy);
    exit(1);
}
    

428.11. "Is the 8800 simply a heavily used pig?" by CVG::PETTENGILL (mulp) Sat Mar 18 1989 16:47 (18 lines)
Obviously the problem is on the 8800.  Can you use another system to get a
better feel for what it should be like; in my experience a 6220 is much better
than you describe.

What kind of load is there on the 8800?  If you get less memory+cpu+io than on
a vs2000 then you aren't going to be better off.

Are you running in batch or interactive?  Batch usually has a lower priority
which will give you bad performance if a lot of `background' work is being done.

Are your UAF or system working set figures very low?  A large working set
extent is needed for a lot of applications.

Are you getting routed either due to cluster alias or because of the LAN
configuration?  This seems unlikely because when I've traced the comm traffic,
it is very similar to LAT traffic except the messages are about twice as large
(i.e., ~200 bytes average instead of ~100 bytes).

428.12. "Ouch!" by QUARK::LIONEL (The dream is alive) Mon Mar 20 1989 10:46 (18 lines)
Ok, I ran Vince's program...  The bad news is that I had to reduce the
loop count from 1000 to 25 to get it to complete in a reasonable time...  At
a count of 25, it took 2 minutes and 57 seconds.  While the program was
running, my workstation was compute-bound.

The 8800 I am running on is not heavily loaded, and the working set parameters
are more than adequate.  There is no cluster routing involved, and the
two systems are on the same logical Ethernet (though perhaps on different
segments).  I ran the program interactively.

I would agree that the problem would appear to be something on our cluster.
It can't be just a matter of load - the performance figures I gave above
are just TOO awful - something else has to be at work.

Ideas and consulting offers welcome...

				Steve

428.13. by STAR::ORGOVAN (Vince Orgovan) Mon Mar 20 1989 11:06 (12 lines)
    Egads. With a loop count of 25 taking 177 seconds that means that 
    your typical round trip is 7 seconds. It should be something like
    0.015 seconds. And your workstation is compute-bound? I suspect
    that something is drastically wrong on your workstation. 
    
    Maybe you've uncovered a server memory leak or something? Do a
    show process /continuous on your server process (from some set
    host connection that doesn't use the server) and rerun the
    round-trip timer. Is the server process getting much CPU time?
    Is it getting any page faults? How big is its virtual address
    space? 

428.14. "More info to gather" by STAR::BMATTHEWS Mon Mar 20 1989 11:41 (9 lines)
Boy there sure seems to be a lot of conflicting data here. If other people
run apps on the 8800 to their workstations it runs ok but if you run apps
to your ws or other ws's then performance is terrible. You should complete
the matrix and see if someone else can fire off apps to your ws and see what
performance is like. Also when you run vince's round trip test you get a ws
that is compute bound. Where are the ws computes going? Are they all in the
server at user mode? kernel mode? some other process?
						Bill

428.15. "Works ok here...sorry" by NECSC::LEVY (A leaf of all colors) Tue Mar 21 1989 08:21 (14 lines)
I thought I'd try this out just for giggles.  I'm running FOO out of a DCL
Command window from FileView which is running in a Priority 4 Batch queue on 
an 8350 client with only 1 user.  The display is going to my 8 meg PVAX server.
Here's output on a couple of runs.

Nidus> r foo
 ELAPSED:    0 00:00:18.98  CPU: 0:00:10.19  BUFIO: 2000  DIRIO: 0  FAULTS: 2
Nidus> r foo
 ELAPSED:    0 00:00:21.37  CPU: 0:00:10.36  BUFIO: 2000  DIRIO: 0  FAULTS: 2

By the way, DECW$SERVER_0 goes to about 20% of CPU (If you can believe BANNER)
while the program is running.  We haven't done any special tuning other than
following the recommendations here.

428.16. "Weirder and weirder" by QUARK::LIONEL (The dream is alive) Tue Mar 21 1989 10:33 (16 lines)
I ran the program on someone else's VS2000, directed at my WS, and had
the same symptoms.

Taking Vince's suggestion, I ran MONITOR and SHOW PROC/CONT from a SET HOST
connection, and found that the system was NOT going compute-bound.  In
fact, what seemed to be happening was that the server process was going
to sleep.  It stayed in HIB mode a lot and got hardly any CPU time.  A
MONITOR MODES showed idle time jumping up to 90% (from 70%) while the
test was running - no single process got any significant part of the CPU.
Interrupt stack time was 5% or less.  The server did not page fault or
increase its memory requirements.

What now?

			Steve

428.17. by LESLIE::LESLIE (Bizarro Engineer) Tue Mar 21 1989 11:07 (7 lines)
    Try doing this with a direct ethernet connection between the systems.
    If your ethernet is being flooded with bad packets by a faulty DEQNA or
    somesuch, all will work okay now and you can chase your facilities
    people about an ethernet problem.
    
    If not, well, it's another one off the checklist.

428.18. "Could it be a lock problem?" by NECSC::LEVY (A leaf of all colors) Tue Mar 21 1989 11:49 (18 lines)
This is a stab in the dark, but could you have a locking problem?

When you state that the server is going into a HIB state, it seems that it's
looking for a resource that is not available.

The LOCKIDTBL entry on the 8350 client on which I run is:

SYSGEN>  SHOW LOCKIDTBL
Parameter Name             Current   Default   Minimum   Maximum Unit  Dynamic
--------------             -------   -------   -------   ------- ----  -------
LOCKIDTBL                     7115       200        40     65535 Entries


Could this be the problem???


	- Dave

428.19. "Could be LOCKIDTBL!" by QUARK::LIONEL (The dream is alive) Tue Mar 21 1989 11:58 (13 lines)
Re: .18 (LOCKIDTBL)

Hmm.. could be!  My WS has LOCKIDTBL set at 180 (I didn't set this - AUTOGEN
must have - it's even below the default!)  I will up this significantly
and see what happens.

Re: .17

When I ran from the other VS2000, it was in the office next to mine on the
same Ethernet segment.

				Steve

428.20. "Maybe QNA problems?" by STAR::BMATTHEWS Tue Mar 21 1989 12:16 (9 lines)
The server goes into HIB when it thinks it has nothing to do. The transport
should wake the server when data arrives from a client or the driver when
data arrives from the keyboard or mouse. The most likely scenario is that
the network or QNA on your workstation is having problems. It is also possible
that something is amiss in the server and it is not recognizing it has work to
do. You could try @sys$manager:decw$startup restart to see if a new invocation
of the server helps.
							Bill

428.21. "Check user buffers one more time?" by STAR::BMATTHEWS Tue Mar 21 1989 12:19 (4 lines)
Did you do a $ MCR NCP SHO KNOWN LINES COUNT before and after running vince's
program to see if the user buffer unavailable count is still going up?
							Bill

428.22. "Well, it SOUNDED good..." by QUARK::LIONEL (The dream is alive) Tue Mar 21 1989 13:09 (22 lines)
I ran AUTOGEN with feedback (first time since installing T5.2-410) and
specified MIN_LOCKIDTBL as 2000.  AUTOGEN noted that my LOCKIDTBL and
RESHASHTBL (something like that) were low and raised them.  I rebooted
and tried Vince's program again.  If anything, it's worse.

I am running 20 receive buffers now - while running Vince's program, the
count of "user buffer unavailable" went up by 3 (for the 25-count loop).
Doesn't sound significant.  I haven't noticed any other network-related
problems.

Another data point is that when I run the program, the server activity
on my WS grinds to a stop - cursors stop flashing for a minute at a time,
the calendar icon takes 30 seconds to repaint, etc.  Yet there are ample
cycles available, and the memory usage is minimal.  Even if the Ethernet
connection were bad, that shouldn't affect local server activity, should
it?

Other people in my group are reporting similar problems.  I wish I could
get to the bottom of this...

				Steve

428.23. "Could be a network/dw transport/server sched interaction" by STAR::BMATTHEWS Tue Mar 21 1989 14:09 (27 lines)
I am running 20 receive buffers now - while running Vince's program, the
count of "user buffer unavailable" went up by 3 (for the 25-count loop).
Doesn't sound significant.  I haven't noticed any other network-related
problems.
>
>I think that 3 of 25 is significant. My user buffer unavailable count is
>zero and stays at zero. If any DECNET gurus are out there maybe they can
>help determine if this is significant or not. I think I remember that
>if there is no buffer available that there is a retry involved and also
>possibly a delta wait time before the retry is attempted. If so then
>possibly the retry delta is way off.

Another data point is that when I run the program, the server activity
on my WS grinds to a stop - cursors stop flashing for a minute at a time,
the calendar icon takes 30 seconds to repaint, etc.  Yet there are ample
cycles available, and the memory usage is minimal.  Even if the Ethernet
connection were bad, that shouldn't affect local server activity, should
it?

>It does make sense because Xsync requires a reply from the server and while
>DECNET is doing its write I suspect the server is waiting for the write
>to complete. Maybe Monty can explain how the DECNET writes from the server
>work and what could happen if DECNET can't post the write immediately.

						Bill

428.24. by QUARK::LIONEL (The dream is alive) Tue Mar 21 1989 15:07 (22 lines)
I have some more data...  I ran Vince's program from the same 8800 to
another person's VS_II/GPX running V5.1, and it ran quickly.  We looked
at his system's "User Buffer Unavailable" counter and it was zero, after
a long time.  We compared EXEC, LINE and CIRCUIT parameters between our
systems and didn't see anything obvious - in fact, his parameters were
often "worse" than mine (he had 6 receive buffers, for example).

We concluded that the problem is related to the "user buffer unavailable"
problems, but are unable to see why that is happening so often.  Our
understanding of the retry intervals matches Bill's in .23.

By the way, once when I ran the program on the 8800, the elapsed time
was over four minutes (with a loop count of 25), but the CPU time was
under a second.

I am wondering if this is something new with T5.2 - I will enter a QAR
about it just in case.  But if anyone wants to contact me offline, or
here, to help resolve this, please do!  I'm no longer quite so interested
for myself, but I have a feeling that others may run into the same problem.

				Steve

428.25. "decw$server_retry_write_m* logicals?" by STAR::BMATTHEWS Tue Mar 21 1989 16:34 (4 lines)
Steve, do you have the decw$server*retry* logicals set up high?
The parameters are now in ms, not 1/10th of a ms.
						Bill

428.26. "No..." by QUARK::LIONEL (The dream is alive) Tue Mar 21 1989 18:32 (10 lines)
    Re: .25
    
    No, I don't have them defined.  (And while talking with Bill on
    the phone, I tried various permutations of those logicals to
    no effect.)
    
    I have entered a QAR about this matter.
    
    			Steve

428.27. "The culprit has been identified!" by QUARK::LIONEL (The dream is alive) Wed Mar 22 1989 13:20 (16 lines)
After trying lots of different things with the help of Bill Matthews and
Monty Brandenberg, Monty noted that my DECnet exec Pipeline Quota was
very high - 65024 - where a value of 10000 is recommended by DECwindows.
I had been told to raise it when I installed DFS.

So I lowered it back to 10000, and - WOW!  What a difference!  Everything
is lightning fast!

The moral is - be careful how high you raise your pipeline quota.  I don't
yet know what values cause problems, but will be investigating this some more
with DECnet and DFS people.

Whew!

			Steve

428.28. "Workaround found" by STAR::BMATTHEWS Wed Mar 22 1989 13:21 (14 lines)
Well on Monty's suggestion Steve lowered pipeline quota to 10000 and the
problems went away. Steve had raised pipeline quota because that is supposed
to enhance DFS performance. Now there is still something or many things
broken here that need looking into but there is now a workaround for people
who have this problem.

First problem seems to be why does decnet get errors with a large pipeline
quota.

Second or multiple problems then appear because of some retry timers being
much too long. This could be DECNET and/or the DECWindows server retry timers.

					Bill

428.29. by QUARK::LIONEL (The dream is alive) Wed Mar 22 1989 13:53 (10 lines)
Doing some more experimentation shows that the pipeline quota can be as
high as 60000 without noticeable problems, but at 65000 everything dies.
I was also told just now that values over 25000 are pointless, since there
is no performance gain, just memory use, above that value.

I am still getting "User Buffer Unavailable" errors, but I am pursuing that
elsewhere.

					Steve

428.30. "Pipeline quota up to 64960 appears ok" by STAR::BMATTHEWS Wed Mar 22 1989 14:13 (5 lines)
After a bit of binary search it appears that all is well with a pipeline
quota of up to 64960 but at 64961 and greater things go to pot. At least this
is the value on my system.
						Bill

428.31. "Never trust anyone over 32767." by STAR::BRANDENBERG (Intelligence - just a good party trick?) Wed Mar 22 1989 14:14 (2 lines)
    

428.32. "As the ghostly outline of Al Eldridge moves across the screen" by POOL::HALLYB (The Smart Money was on Goliath) Wed Mar 22 1989 14:24 (7 lines)
    Interesting that the difference between 65536 and 64960 is
    exactly 576, a segment buffer size.  Any bets that there's
    some 16-bit arithmetic going on somewhere?  And Steve was
    getting effectively no "pipelining"?
    
      John

428.33. by STAR::MFOLEY (Rebel without a Clue) Wed Mar 22 1989 15:07 (8 lines)
RE: .31


	32767 is what the DFS folks recommend for DECnet pipeline quota.
	(At least when I managed their cluster that's what it was)

						mike

428.34. "It's a bug in DECnet-VAX" by BULEAN::CARSON (Knockwurst & Excommunications) Tue Apr 25 1989 15:25 (10 lines)
	.32 is correct.
	A calculation using a DIVW thinks your pipeline quota is a small
	negative number.

	This is fixed in a future release.  We will make a patch available
	to NCSS if anyone really needs 16 bits of Pipeline Quota.

					Pete Carson
					DECnet-VAX SW Maint. Eng.

428.35. "Need some help in tracking a problem" by 6317::FEATHERSTON (Ed Featherston) Fri May 12 1989 12:17 (16 lines)
I've used the program posted earlier in the replies to verify a problem I am
having with running client apps on either of 2 8800's to display on any of the
workstations in my group running DECWindows. If I use the program, I get
elapsed times of anywhere from 30 seconds to several minutes. This is on
either 8800 (plenty of null time when running the program), and doesn't depend
on the display system (I have tried both VS-II's and VS-2000's). To verify it
wasn't the workstations I ran the program on a uVAX-II and a VAX-6240. I
consistently get an elapsed time between 20-22 seconds on both those systems.

The 8800's are clustered together, VMS 5.1, 128MB each. PIPELINE QUOTA is 40000,
receive buffers 20. Lots of free NPAGEDYN, SRP, IRP, and LRP's. I am tearing
my hair out trying to find the cause. Any suggestions of where to look would
be greatly appreciated. Thanks.

					/ed/

428.36. by STAR::BRANDENBERG (Si vis pacem para bellum) Fri May 12 1989 12:20 (5 lines)
    
    Drop PIPELINE down to 25000 as any more is meaningless.  Once a client
    has connected, check link counters for anything 'interesting'.