T.R | Title | User | Personal Name | Date | Lines |
---|
428.1 | network transmission problems? | STAR::BMATTHEWS | | Fri Mar 17 1989 10:31 | 8 |
| Check the decnet error counts on your workstation. It could be that there
are very many retransmissions going on. I believe you can increase the
number of receive buffers to alleviate that problem. A decnet/ncp guru
will have to provide more details on exactly how to find out what the
error count is and how to increase the receive buffers.
Bill
|
428.2 | Also WSEXTENT and/or WSMAX | MCNALY::MILLER | Bush For President...Kate Bush! | Fri Mar 17 1989 10:37 | 9 |
| Lots of public clusters have really low WSEXTENTs in users' AUTHORIZE
parameters and/or a low WSMAX as a SYSGEN parameter.
In my opinion, they should be *at least* 5000 pages.
Regards,
== ken miller ==
|
428.3 | | QUARK::LIONEL | The dream is alive | Fri Mar 17 1989 11:04 | 37 |
| Well, here's the circuit counters on my workstation - doesn't look bad to me...
Circuit Counters as of 17-MAR-1989 11:09:06
Circuit = QNA-0
21910 Seconds since last zeroed
36872 Terminating packets received
30754 Originating packets sent
0 Terminating congestion loss
0 Transit packets received
0 Transit packets sent
0 Transit congestion loss
0 Circuit down
0 Initialization failure
0 Adjacency down
1 Peak adjacencies
32215 Data blocks sent
9373289 Bytes sent
39227 Data blocks received
2286248 Bytes received
0 Unrecognized frame destination
7 User buffer unavailable
And on the cluster, my WSQUOTA is 1024, WSEXTENT is 4096, WSMAX is 24600
and there's some 35000 free pages.
Here's the executor characteristics that may be relevant:
Maximum buffers = 100
Buffer size = 576
Pipeline quota = 65024
What else can I look at?
Steve
|
428.4 | More things to look at | STAR::BMATTHEWS | | Fri Mar 17 1989 12:11 | 6 |
| The user buffer unavailable is the problem I have seen in the past. Do a
$ mcr ncp show known lines char command. On my vs2000/gpx I have 6 receive
buffers which is probably low. The big nodes in our cluster have 20 buffers.
My device buffer size is 1498. I don't know whether that is good or bad.
Bill
|
428.5 | Another piece of possibly relevant data | STAR::BMATTHEWS | | Fri Mar 17 1989 12:14 | 2 |
| Look also at your line counters. $ mcr ncp show known lines count. - Bill
|
428.6 | | QUARK::LIONEL | The dream is alive | Fri Mar 17 1989 12:27 | 25 |
| Hmm... I have 10 receive buffers. I'll try setting it to 20 and see
what I get. Here are my line counters.
Line = QNA-0
28687 Seconds since last zeroed
148651 Data blocks received
88453 Multicast blocks received
0 Receive failure
12909746 Bytes received
8885300 Multicast bytes received
0 Data overrun
53386 Data blocks sent
2305 Multicast blocks sent
46 Blocks sent, multiple collisions
46 Blocks sent, single collision
0 Blocks sent, initially deferred
13414383 Bytes sent
166852 Multicast bytes sent
0 Send failure
0 Collision detect check failure
0 Unrecognized frame destination
0 System buffer unavailable
734 User buffer unavailable
|
428.7 | | QUARK::LIONEL | The dream is alive | Fri Mar 17 1989 13:20 | 4 |
| Ok, tried 20 receive buffers - no difference. This can't be so hard!
Steve
|
428.8 | client or ws | STAR::BMATTHEWS | | Fri Mar 17 1989 15:30 | 4 |
| Try firing up your apps to someone else's workstation to see if it is
a problem on the workstation side or the client side.
Bill
|
428.9 | | QUARK::LIONEL | The dream is alive | Fri Mar 17 1989 17:35 | 5 |
| Ok, I tried it on someone else's WS - a VS2000 running V5.1. It's
just as bad. So maybe it's our cluster? What server parameters matter?
Steve
|
428.10 | | STAR::ORGOVAN | Vince Orgovan | Sat Mar 18 1989 14:11 | 42 |
| Steve this is puzzling. I just started up DECW$MAIL on our 8800 with
the display directed to my standalone VS2000/GPX with 6Mb. The window
appeared in 30 seconds. It's hard to understand what could account for
it taking four times longer in your configuration.
Can you compile and run this program on your 8800 with the display
directed at your server? It times the basic client-to-server and
back again round trip. It runs in about 15 seconds elapsed time
on our 8800.
/*
* This program times a server round trip.
*
* To compile & link on VMS:
*
* $ cc foo
* $ link foo,sys$input/opt
* sys$share:decw$xlibshr/share
* sys$share:vaxcrtl/share
* ^Z
*/
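#include <stdio.h>
#include <stdlib.h>
#include <decw$include/Xlib.h>
#define loopcount 1000
extern int lib$init_timer();
extern int lib$show_timer();
int main()
{
    Display *dpy;
    int i;
    dpy = XOpenDisplay("");
    if (dpy == NULL) {                  /* no connection to the server */
        printf("cannot open display\n");
        exit(EXIT_FAILURE);
    }
    lib$init_timer();                   /* start the timer */
    /* each XSync waits for a reply, so this is one round trip per pass */
    for (i = 0; i < loopcount; i++) XSync(dpy, 0);
    lib$show_timer();                   /* stop & display the timer */
    XCloseDisplay(dpy);
    exit(1);                            /* 1 is a success status on VMS */
}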
|
428.11 | Is the 8800 simply a heavily used pig? | CVG::PETTENGILL | mulp | Sat Mar 18 1989 16:47 | 18 |
| Obviously the problem is on the 8800. Can you use another system to get a
better feel for what it should be like; in my experience a 6220 is much better
than you describe.
What kind of load is there on the 8800? If you get less memory+cpu+io than on
a vs2000 then you aren't going to be better off.
Are you running in batch or interactive? Batch usually has a lower priority
which will give you bad performance if a lot of `background' work is being done.
Are your UAF or system working set figures very low? A large working set
extent is needed for a lot of applications.
Are you getting routed either due to cluster alias or because of the LAN
configuration? This seems unlikely because when I've traced the comm traffic,
it is very similar to LAT traffic except the messages are about twice as large
(i.e., ~200 bytes average instead of ~100 bytes).
|
428.12 | Ouch! | QUARK::LIONEL | The dream is alive | Mon Mar 20 1989 10:46 | 18 |
| Ok, I ran Vince's program... The bad news is that I had to reduce the
loop count from 1000 to 25 to get it to complete in a reasonable time... At
a count of 25, it took 2 minutes and 57 seconds. While the program was
running, my workstation was compute-bound.
The 8800 I am running on is not heavily loaded, and the working set parameters
are more than adequate. There is no cluster routing involved, and the
two systems are on the same logical Ethernet (though perhaps on different
segments). I ran the program interactively.
I would agree that the problem would appear to be something on our cluster.
It can't be just a matter of load - the performance figures I gave above
are just TOO awful - something else has to be at work.
Ideas and consulting offers welcome...
Steve
|
428.13 | | STAR::ORGOVAN | Vince Orgovan | Mon Mar 20 1989 11:06 | 12 |
| Egads. With a loop count of 25 taking 177 seconds that means that
your typical round trip is 7 seconds. It should be something like
0.015 seconds. And your workstation is compute-bound? I suspect
that something is drastically wrong on your workstation.
Maybe you've uncovered a server memory leak or something? Do a
show process /continuous on your server process (from some set
host connection that doesn't use the server) and rerun the
round-trip timer. Is the server process getting much CPU time?
Is it getting any page faults? How big is its virtual address
space?
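
The arithmetic behind those two figures, as a small C sketch. The function names are made up for illustration; the inputs are the numbers quoted in this thread (2 minutes 57 seconds for 25 loops here, about 15 seconds for 1000 loops on our 8800).

```c
#include <stdio.h>

/* Average seconds per XSync round trip for a given elapsed time and loop count. */
double round_trip_seconds(double elapsed_seconds, int loop_count)
{
    return elapsed_seconds / (double)loop_count;
}

/* How many times slower an observed per-trip latency is than a reference one. */
double slowdown_factor(double observed, double expected)
{
    return observed / expected;
}

/* Print the comparison for the two runs described in this thread. */
void print_comparison(void)
{
    double slow = round_trip_seconds(177.0, 25);   /* 2 min 57 s, loop count 25 */
    double fast = round_trip_seconds(15.0, 1000);  /* about 15 s, loop count 1000 */

    printf("observed %.2f s, expected %.3f s, factor %.0f\n",
           slow, fast, slowdown_factor(slow, fast));
}
```

Calling print_comparison() shows the gap is a factor of several hundred, which is why ordinary load seems an unlikely explanation.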
|
428.14 | More info to gather | STAR::BMATTHEWS | | Mon Mar 20 1989 11:41 | 9 |
| Boy there sure seems to be a lot of conflicting data here. If other people
run apps on the 8800 to their workstations it runs ok but if you run apps
to your ws or other ws's then performance is terrible. You should complete
the matrix and see if someone else can fire off apps to your ws and see what
performance is like. Also when you run vince's round trip test you get a ws
that is compute bound. Where are the ws computes going? Are they all in the
server at user mode? kernel mode? some other process?
Bill
|
428.15 | Works ok here...sorry | NECSC::LEVY | A leaf of all colors | Tue Mar 21 1989 08:21 | 14 |
| I thought I'd try this out just for giggles. I'm running FOO out of a DCL
Command window from FileView which is running in a Priority 4 Batch queue on
an 8350 client with only 1 user. The display is going to my 8 meg PVAX server.
Here's output on a couple of runs.
Nidus> r foo
ELAPSED: 0 00:00:18.98 CPU: 0:00:10.19 BUFIO: 2000 DIRIO: 0 FAULTS: 2
Nidus> r foo
ELAPSED: 0 00:00:21.37 CPU: 0:00:10.36 BUFIO: 2000 DIRIO: 0 FAULTS: 2
By the way, DECW$SERVER_0 goes to about 20% of CPU (If you can believe BANNER)
while the program is running. We haven't done any special tuning other than
following the recommendations here.
|
428.16 | Weirder and weirder | QUARK::LIONEL | The dream is alive | Tue Mar 21 1989 10:33 | 16 |
| I ran the program on someone else's VS2000, directed at my WS, and had
the same symptoms.
Taking Vince's suggestion, I ran MONITOR and SHOW PROC/CONT from a SET HOST
connection, and found that the system was NOT going compute-bound. In
fact, what seemed to be happening was that the server process was going
to sleep. It stayed in HIB mode a lot and got hardly any CPU time. A
MONITOR MODES showed idle time jumping up to 90% (from 70%) while the
test was running - no single process got any significant part of the CPU.
Interrupt stack time was 5% or less. The server did not page fault or
increase its memory requirements.
What now?
Steve
|
428.17 | | LESLIE::LESLIE | Bizarro Engineer | Tue Mar 21 1989 11:07 | 7 |
| Try doing this with a direct ethernet connection between the systems.
If your ethernet is being flooded with bad packets by a faulty DEQNA or
somesuch, all will work okay now and you can chase your facilities
people about an ethernet problem.
If not, well, it's another one off the checklist.
|
428.18 | Could it be a lock problem? | NECSC::LEVY | A leaf of all colors | Tue Mar 21 1989 11:49 | 18 |
| This is a stab in the dark, but could you have a locking problem?
When you state that the server is going into a HIB state, it seems that it's
looking for a resource that is not available.
The LOCKIDTBL entry on the 8350 client on which I run is:
SYSGEN> SHOW LOCKIDTBL
Parameter Name Current Default Minimum Maximum Unit Dynamic
-------------- ------- ------- ------- ------- ---- -------
LOCKIDTBL 7115 200 40 65535 Entries
Could this be the problem???
- Dave
|
428.19 | Could be LOCKIDTBL! | QUARK::LIONEL | The dream is alive | Tue Mar 21 1989 11:58 | 13 |
| Re: .18 (LOCKIDTBL)
Hmm.. could be! My WS has LOCKIDTBL set at 180 (I didn't set this - AUTOGEN
must have - it's even below the default!) I will up this significantly
and see what happens.
Re: .17
When I ran from the other VS2000, it was in the office next to mine on the
same Ethernet segment.
Steve
|
428.20 | Maybe QNA problems? | STAR::BMATTHEWS | | Tue Mar 21 1989 12:16 | 9 |
| The server goes into HIB when it thinks it has nothing to do. The transport
should wake the server when data arrives from a client or the driver when
data arrives from the keyboard or mouse. The most likely scenario is that
the network or QNA on your workstation is having problems. It is also possible
that something is amiss in the server and it is not recognizing it has work to
do. You could try @sys$manager:decw$startup restart to see if a new invocation
of the server helps.
Bill
|
428.21 | Check user buffers one more time? | STAR::BMATTHEWS | | Tue Mar 21 1989 12:19 | 4 |
| Did you do a $ MCR NCP SHO KNOWN LINES COUNT before and after running vince's
program to see if the user buffer unavailable count is still going up?
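
A minimal sketch of that before/after comparison in C, assuming the counter values are copied by hand from the NCP output. The struct and function names are hypothetical, not part of any DECnet interface.

```c
/* One counter of interest, copied by hand from SHOW KNOWN LINES COUNT output. */
struct line_snapshot {
    unsigned long user_buffer_unavailable;
};

/* Events between two snapshots; 0 if the counters were zeroed in between. */
unsigned long counter_delta(struct line_snapshot before, struct line_snapshot after)
{
    if (after.user_buffer_unavailable < before.user_buffer_unavailable)
        return 0;
    return after.user_buffer_unavailable - before.user_buffer_unavailable;
}

/* Events per round trip for a test run of loop_count XSync calls. */
double events_per_round_trip(unsigned long delta, int loop_count)
{
    return loop_count > 0 ? (double)delta / (double)loop_count : 0.0;
}
```

With illustrative values of 734 before and 737 after a 25-loop run, the delta is 3, or 0.12 events per round trip. A delta of zero would point away from the receive buffers.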
Bill
|
428.22 | Well, it SOUNDED good... | QUARK::LIONEL | The dream is alive | Tue Mar 21 1989 13:09 | 22 |
| I ran AUTOGEN with feedback (first time since installing T5.2-410) and
specified MIN_LOCKIDTBL as 2000. AUTOGEN noted that my LOCKIDTBL and
RESHASHTBL (something like that) were low and raised them. I rebooted
and tried Vince's program again. If anything, it's worse.
I am running 20 receive buffers now - while running Vince's program, the
count of "user buffer unavailable" went up by 3 (for the 25-count loop).
Doesn't sound significant. I haven't noticed any other network-related
problems.
Another data point is that when I run the program, the server activity
on my WS grinds to a stop - cursors stop flashing for a minute at a time,
the calendar icon takes 30 seconds to repaint, etc. Yet there are ample
cycles available, and the memory usage is minimal. Even if the Ethernet
connection were bad, that shouldn't affect local server activity, should
it?
Other people in my group are reporting similar problems. I wish I could
get to the bottom of this...
Steve
|
428.23 | Could be a network/dw transport/server sched interaction | STAR::BMATTHEWS | | Tue Mar 21 1989 14:09 | 27 |
|
I am running 20 receive buffers now - while running Vince's program, the
count of "user buffer unavailable" went up by 3 (for the 25-count loop).
Doesn't sound significant. I haven't noticed any other network-related
problems.
>
>I think that 3 of 25 is significant. My user buffer unavailable count is
>zero and stays at zero. If any DECNET gurus are out there maybe they can
>help determine if this is significant or not. I think I remember that
>if there is no buffer available that there is a retry involved and also
>possibly a delta wait time before the retry is attempted. If so then
>possibly the retry delta is way off.
Another data point is that when I run the program, the server activity
on my WS grinds to a stop - cursors stop flashing for a minute at a time,
the calendar icon takes 30 seconds to repaint, etc. Yet there are ample
cycles available, and the memory usage is minimal. Even if the Ethernet
connection were bad, that shouldn't affect local server activity, should
it?
>It does make sense because Xsync requires a reply from the server and while
>DECNET is doing its write I suspect the server is waiting for the write
>to complete. Maybe Monty can explain how the DECNET writes from the server
>work and what could happen if DECNET can't post the write immediately.
Bill
|
428.24 | | QUARK::LIONEL | The dream is alive | Tue Mar 21 1989 15:07 | 22 |
| I have some more data... I ran Vince's program from the same 8800 to
another person's VS_II/GPX running V5.1, and it ran quickly. We looked
at his system's "User Buffer Unavailable" counter and it was zero, after
a long time. We compared EXEC, LINE and CIRCUIT parameters between our
systems and didn't see anything obvious - in fact, his parameters were
often "worse" than mine (he had 6 receive buffers, for example).
We concluded that the problem is related to the "user buffer unavailable"
problems, but are unable to see why that is happening so often. Our
understanding of the retry intervals matches Bill's in .23.
By the way, once when I ran the program on the 8800, the elapsed time
was over four minutes (with a loop count of 25), but the CPU time was
under a second.
I am wondering if this is something new with T5.2 - I will enter a QAR
about it just in case. But if anyone wants to contact me offline, or
here, to help resolve this, please do! I'm no longer quite so interested
for myself, but I have a feeling that others may run into the same problem.
Steve
|
428.25 | decw$server_retry_write_m* logicals? | STAR::BMATTHEWS | | Tue Mar 21 1989 16:34 | 4 |
| Steve, do you have the decw$server*retry* logicals set up high?
The parameters are now in ms, not 1/10th of a ms.
Bill
|
428.26 | No... | QUARK::LIONEL | The dream is alive | Tue Mar 21 1989 18:32 | 10 |
| Re: .25
No, I don't have them defined. (And while talking with Bill on
the phone, I tried various permutations of those logicals to
no effect.)
I have entered a QAR about this matter.
Steve
|
428.27 | The culprit has been identified! | QUARK::LIONEL | The dream is alive | Wed Mar 22 1989 13:20 | 16 |
| After trying lots of different things with the help of Bill Matthews and
Monty Brandenberg, Monty noted that my DECnet exec Pipeline Quota was
very high - 65024 - where a value of 10000 is recommended by DECwindows.
I had been told to raise it when I installed DFS.
So I lowered it back to 10000, and - WOW! What a difference! Everything
is lightning fast!
The moral is - be careful how high you raise your pipeline quota. I don't
yet know what values cause problems, but will be investigating this some more
with DECnet and DFS people.
Whew!
Steve
|
428.28 | Workaround found | STAR::BMATTHEWS | | Wed Mar 22 1989 13:21 | 14 |
| Well on Monty's suggestion Steve lowered pipeline quota to 10000 and the
problems went away. Steve had raised pipeline quota because that is supposed
to enhance DFS performance. Now there is still something or many things
broken here that need looking into but there is now a workaround for people
who have this problem.
First problem seems to be why does decnet get errors with a large pipeline
quota.
Second or multiple problems then appear because of some retry timers being
much too long. This could be DECNET and/or the DECWindows server retry timers.
Bill
|
428.29 | | QUARK::LIONEL | The dream is alive | Wed Mar 22 1989 13:53 | 10 |
| Doing some more experimentation shows that the pipeline quota can be as
high as 60000 without noticeable problems, but at 65000 everything dies.
I was also told just now that values over 25000 are pointless, since there
is no performance gain, just memory use, above that value.
I am still getting "User Buffer Unavailable" errors, but I am pursuing that
elsewhere.
Steve
|
428.30 | Pipeline quota up to 64960 appears ok | STAR::BMATTHEWS | | Wed Mar 22 1989 14:13 | 5 |
| After a bit of binary search it appears that all is well with a pipeline
quota of up to 64960 but at 64961 and greater things go to pot. At least this
is the value on my system.
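
The search itself can be sketched in C as below. The real test at each step is manual (set the quota, rerun the round-trip timer), so quota_works() here is only a stand-in that hard-codes the threshold reported above; both function names are invented for this sketch. It assumes a single good-to-bad transition between the two starting values.

```c
#include <stdbool.h>

/* Stand-in for "set the quota, rerun the timing test, see if it is fast".
   Hard-codes the threshold seen on one system; not a real measurement. */
bool quota_works(long quota)
{
    return quota <= 64960;
}

/* Largest value for which works() holds, given that works(good) is true,
   works(bad) is false, and there is one true-to-false transition between. */
long largest_working(long good, long bad, bool (*works)(long))
{
    while (bad - good > 1) {
        long mid = good + (bad - good) / 2;   /* halve the interval each trial */
        if (works(mid))
            good = mid;
        else
            bad = mid;
    }
    return good;
}
```

Starting from the values in .29 (60000 fine, 65000 bad), largest_working(60000, 65000, quota_works) narrows the range in a dozen or so trials.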
Bill
|
428.31 | Never trust anyone over 32767. | STAR::BRANDENBERG | Intelligence - just a good party trick? | Wed Mar 22 1989 14:14 | 2 |
|
|
428.32 | As the ghostly outline of Al Eldridge moves across the screen | POOL::HALLYB | The Smart Money was on Goliath | Wed Mar 22 1989 14:24 | 7 |
| Interesting that the difference between 65536 and 64960 is
exactly 576, a segment buffer size. Any bets that there's
some 16-bit arithmetic going on somewhere? And Steve was
getting effectively no "pipelining"?
John
|
428.33 | | STAR::MFOLEY | Rebel without a Clue | Wed Mar 22 1989 15:07 | 8 |
| RE: .31
32767 is what the DFS folks recommend for DECnet pipeline quota.
(At least when I managed their cluster that's what it was)
mike
|
428.34 | It's a bug in DECnet-VAX | BULEAN::CARSON | Knockwurst & Excommunications | Tue Apr 25 1989 15:25 | 10 |
| .32 is correct.
A calculation using a DIVW thinks your pipeline quota is a small
negative number.
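
To illustrate the effect, here is a sketch in C of what a signed 16-bit divide does to such a value. It is not the DECnet-VAX code. Dividing by the 576-byte buffer size is an assumption borrowed from the guess in .32, and the sketch assumes the quotient truncates toward zero as C99 integer division does.

```c
#include <stdint.h>

/* Reinterpret the low 16 bits of a value as a signed word (two's complement). */
int32_t as_signed_word(uint32_t value)
{
    int32_t low = (int32_t)(value & 0xFFFFu);
    return low >= 0x8000 ? low - 0x10000 : low;
}

/* Quotient of a signed-word divide, truncating toward zero.
   buffer_size is assumed to be the executor buffer size (576 in .3). */
int32_t word_divide(uint32_t quota, int32_t buffer_size)
{
    return as_signed_word(quota) / buffer_size;
}
```

Under these assumptions 65024 is seen as -512 and the quotient is 0, while 64960 is seen as -576 and the quotient is -1. That lines up with the threshold reported in .30, though it is only arithmetic, not a statement of how the quotient is actually used.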
This is fixed in a future release. We will make a patch available
to NCSS if anyone really needs 16 bits of Pipeline Quota.
Pete Carson
DECnet-VAX SW Maint. Eng.
|
428.35 | Need some help in tracking a problem | 6317::FEATHERSTON | Ed Featherston | Fri May 12 1989 12:17 | 16 |
| I've used the program posted earlier in the replies to verify a problem I am
having with running client apps on either of 2 8800's to display on any of the
workstations in my group running DECWindows. If I use the program, I get
elapsed times of anywhere between 30 seconds to several minutes. This is on
either 8800 (plenty of null time when running the program), and doesn't depend
on the display system (I have tried both VS-II's and VS-2000's). To verify it
wasn't the workstations I ran the program on a uVAX-II and a VAX-6240. I
consistently get an elapsed time between 20-22 seconds on both those systems.
The 8800's are clustered together, VMS 5.1, 128MB each. PIPELINE QUOTA is 40000,
receive buffers 20. Lots of free NPAGEDYN, SRP, IRP, and LRP's. I am tearing
my hair out trying to find the cause. Any suggestions of where to look would
be greatly appreciated. Thanks.
/ed/
|
428.36 | | STAR::BRANDENBERG | Si vis pacem para bellum | Fri May 12 1989 12:20 | 5 |
|
Drop PIPELINE down to 25000 as any more is meaningless. Once a client
has connected, check link counters for anything 'interesting'.
|