T.R | Title | User | Personal Name | Date | Lines |
---|
84.1 | INFO - are you running EFT kit? | GOSTE::CALLANDER | | Tue Mar 27 1990 15:49 | 10 |
|
Hi,
You have hit upon some of the problems that we are currently working
on. I would be interested in knowing if you are running the EFT
kit. Especially the component version numbers of the DECnet NODE4
Access Module, the TRM Presentation Module, and the base system.
Thanks for the additional information.
|
84.2 | All T1.0.0 ... | PILOU::BONGARTZ | Huckleberry Finn, I presume ? | Wed Mar 28 1990 06:55 | 7 |
| > kit. Especially the component version numbers of the DECnet NODE4
> Access Module, the TRM Presentation Module, and the base system.
All three Component Versions are T1.0.0 ...
( my workaround now is to exit and re-run mcc if it takes
more than 45 seconds for a poll... )
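The workaround above could be automated along these lines. This is only a rough sketch in Python, not what was actually run: it assumes each poll can be issued as a separate command, and the function name `run_poll` and the command passed to it are hypothetical. Only the 45-second threshold comes from the note.

```python
import subprocess
import sys

POLL_TIMEOUT_S = 45  # threshold quoted in the note above


def run_poll(command, timeout_s=POLL_TIMEOUT_S):
    """Run one poll in a fresh process; return its output, or None on timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the slow
        # process is gone and the caller can simply start a new one.
        return None
    return result.stdout
```

A caller would loop over the nodes, call `run_poll` for each, and treat a `None` result as the signal to start over with a fresh process.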
|
84.3 | if you find any other good ones... | GOSTE::CALLANDER | | Wed Mar 28 1990 17:27 | 12 |
|
Thanks for the additional input. We will see what can be done. If
you hit any other commands that go up at such a nice rate it would
be useful if you posted them here. Since different commands go through
different paths in the system, sometimes something that looks like
a small leak on one command turns out to be something major given
another command.
jill
|
84.4 | got one! (or two?) | PILOU::BONGARTZ | Huckleberry Finn, I presume ? | Fri Mar 30 1990 06:14 | 28 |
| > -< if you find any other good ones... >-
Got another one...
In my original polling loop, I also checked the counters on
the local node (GABIN). Each poll created a SERVER_xxxx process,
which apparently terminated after ca. 5 minutes... but as the
commands were given in less time than that, the system filled up
with these processes... and ended up doing nothing but paging
and swapping.
Another thing, though it might not be due to me, MCC or whatever
else - "just a coincidence ?" :
I started my poll server in the afternoon before leaving work,
and left it running overnight, polling all the routers here
in Valbonne. During the night, the whole network went down -
systems crashed, etc. The last output from my server was at
03:13, and about that time the problems occurred. Whether my
code crashed because of the problem, or the problem occurred
because of the polls, is not clear to me - but *if* it's due
to MCC or my server (no privs!), we'd better make sure this
doesn't happen on a customer network. I'll let the thing run
tonight and let you know if the net goes down the drain again.
Regards,
Marc.
|
84.5 | Thanks for the additional information | PETE::BURGESS | | Fri Mar 30 1990 10:55 | 36 |
| You have presented several problems to us which have been
assigned to different engineers for resolution.
1) The reserved operand fault which occurs when MCC is executed
as a sub-process assigning sys$input/output to mail-boxes.
This seems like a contained problem. I will try to reproduce
your experiment here and diagnose the problem. Would you
send me the exact commands which you used to create the mcc
sub-process and the commands used for communicating with
the sub-process?
(enet: Pete::Burgess)
2) Virtual memory expansion. This is probably due to "vm leaks".
We have instrumented test versions of MCC with diagnostic
tools for recording vm deallocation problems, and have been
testing this problem since December, and have fixed many problems.
Our focus has probably been on the normal successful operations,
and the most common error paths. My hypothesis is that
MCC is taking some error paths without properly terminating
its requested operations. We will be trying to reproduce
this problem with our instrumented version of MCC.
3) The performance problems: The DECnet phase 4 project leader
will be contacting you to obtain more diagnostic information.
My first concerns relate to the large number of NML servers which
are being created on your routing servers.
\Pete Burgess
|
84.6 | Reduce NETSERVER$TIMEOUT to dump processes | TOOK::CAREY | | Fri Mar 30 1990 12:25 | 26 |
|
The only way we can see MCC "bringing down the network" is by applying
huge loads on all of the routing nodes in the network. If we put
enough pressure on them in terms of excessive NETSERVERs, it is
conceivable that they will be unable to perform normal network
communications. As soon as that happens, the routing traffic increases
dramatically because the routers are trying to understand the topology.
If you've got an appreciable number of routers, the network degrades
rapidly.
So, the first thing to do is get rid of the excessive NETSERVERs.
We don't know why you spawn a new server with each connection. But
until we do, you can at least cut down on the number of server
processes that are out there by setting the NETSERVER process timeout
lower. Do this by setting the system logical NETSERVER$TIMEOUT to just
a few seconds instead of the default of around five minutes. You'll
still suffer the process creation overhead, but at least you won't get
the swapping and paging that you're seeing.
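To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. It assumes the worst case described in this note, where every poll spawns a new server that then lingers for the full timeout. The 30-second poll interval and the 10-second reduced timeout are made-up illustration values (the thread does not state the real ones), and `peak_servers` is a hypothetical helper, not part of any DECnet or MCC tool.

```python
def peak_servers(timeout_s, poll_interval_s):
    """Upper bound on servers alive at once on one node,
    assuming every poll spawns a new server (worst case)."""
    if timeout_s <= 0 or poll_interval_s <= 0:
        raise ValueError("both arguments must be positive")
    # Every server started within the last timeout_s seconds is still
    # alive; at most ceil(timeout / interval) polls fall in that window.
    return -(-timeout_s // poll_interval_s)  # integer ceiling division


DEFAULT_TIMEOUT_S = 300  # "around five minutes", per the note
REDUCED_TIMEOUT_S = 10   # assumed value for "just a few seconds"
POLL_INTERVAL_S = 30     # assumed; the real interval is not given

print(peak_servers(DEFAULT_TIMEOUT_S, POLL_INTERVAL_S))  # 10
print(peak_servers(REDUCED_TIMEOUT_S, POLL_INTERVAL_S))  # 1
```

Under these assumptions the lower timeout leaves about one lingering server per polled node instead of ten. The process creation cost per poll is unchanged, which matches the caveat above.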
Hope this helps, and I'll give you more on this server problem as soon
as I can find out more.
-Jim Carey
|
84.7 | We Can't Reproduce Multiple Server Problems | TOOK::CAREY | | Mon Apr 02 1990 13:03 | 61 |
|
Marc,
I had a chance to do some experimenting on our network here, and
was unable to reproduce a situation where multiple servers were
spawned unexpectedly. Any details that you could give me
about the exact nature of your requests could help, although I can't
imagine what might be different about them.
I created and checked out the following cases:
- Connecting to a remote node with Proxy Access defined.
This worked fine. Subsequent requests connected to the spawned
server.
- Connecting to a remote node using explicit access (BY USER = "...")
This also worked fine. I did these close together, so the Proxy
Server was still out there, and a new server was created for the
explicit access case. This is normal because VMS has to consider
them to be different processes with different rights. As expected,
subsequent requests connected to the server that had just been spawned.
- Connecting to a remote node using Default Access (no proxy, no
explicit accounting information)
This worked as expected too. After forming this connection, I had
three servers running: one for the Proxy access, one for the
Explicit Access, and one for the Default Access. Subsequent
requests didn't spawn any new servers.
In fact, once I had the three servers running, I attempted to confuse
the system by using Proxy, Explicit, and Default Access in different
combinations. No problems were encountered, and no additional
processes were spawned. (By the way, connecting to an existing server
cuts the response time for a circuit counters request from an estimated
fifteen seconds to two or three seconds at most.)
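The expected behaviour in the cases above can be summed up in a toy model: one server per distinct access identity on a node, reused by every later request with that identity. The Python sketch below only illustrates that rule; it is not how DECnet is implemented. The class name `ServerPool`, the node name `REMOTE`, and the numbering of the SERVER_ names are all invented, and server timeouts are ignored.

```python
class ServerPool:
    """Toy model: one server process per (node, access identity) pair."""

    def __init__(self):
        self._servers = {}  # (node, access) -> server process name
        self.spawned = 0    # number of servers created so far

    def connect(self, node, access):
        key = (node, access)
        if key not in self._servers:
            # First request with this identity: a new server is spawned.
            self.spawned += 1
            self._servers[key] = f"SERVER_{self.spawned:04d}"
        # Later requests with the same identity reuse the existing server.
        return self._servers[key]


pool = ServerPool()
requests = ["proxy", "explicit", "default", "default", "proxy", "explicit", "proxy"]
for access in requests:
    pool.connect("REMOTE", access)
print(pool.spawned)  # 3
```

A node that spawns a server for every request, as reported earlier in this note, would correspond to the lookup in `connect` never finding an existing entry.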
We also tried to reproduce the problem on a boundary condition. You
mentioned that your servers were set up to last about five minutes and
that you were requesting counters about every five minutes. We
wondered if the server process could somehow get locked up if a request
came in just as it was being stopped.
Several attempts to cause this to happen were unsuccessful. Since you
appear to reproduce this problem at will, we don't expect that the
problem lies on that boundary.
We still suspect that there is something funny about the NETSERVER
processes that you are creating and will continue to pursue that angle.
I hope that isolating and changing the appropriate network, system, or
account parameters will clean up these servers and get your connections
behaving more closely to what we expect.
-Jim Carey
|
84.8 | Defective Bridge responsible for Network problems | TOOK::CAREY | | Tue Apr 03 1990 11:20 | 11 |
|
Just a little added detail:
While MCC was under suspicion of "bringing down the network" it appears
that a defective bridge was the real culprit this time.
We are still investigating the problems described in this note, but
there are no grounds to fear that DECmcc will topple your network.
-Jim Carey
|
84.9 | Any progress on increasing response time problem? | DSTEG1::MCCANN | | Wed May 09 1990 10:40 | 6 |
|
Has the problem of the ever-increasing response times mentioned in .0
been solved, or its cause identified? If so, will it be fixed in EFT
update?
Jack
|
84.10 | leaks being plugged | GOSTE::CALLANDER | | Wed May 09 1990 17:10 | 24 |
|
There were two things at work in the problems reported. The defective
bridge was the cause of the crash and most of the "slow down" that
was experienced.
The other problem was due to some memory leaks (causing fragmentation
of memory when run for extended periods of time), and the dictionary
lookup overhead.
For EFT update we have made quite a few advances in our memory
management by implementing a local cache for the allocation and
deallocation of temporary memory; a better caching algorithm for
the dictionary lookups was implemented in the EFT release, and fine
tuned for EFT update; quite a few leaks were plugged; and some of
the slower code paths have been reviewed and condensed to provide
a faster end user response time.
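For readers unfamiliar with the idea, a local cache for temporary memory usually means keeping released buffers on a free list and handing them out again instead of going back to the system allocator each time. The Python sketch below shows only that general technique; it is an assumption about the approach, not the MCC code, and the name `BufferCache` is hypothetical.

```python
class BufferCache:
    """Keep released temporary buffers on per-size free lists for reuse."""

    def __init__(self):
        self._free = {}       # size -> list of idle buffers
        self.allocations = 0  # buffers actually created

    def acquire(self, size):
        idle = self._free.get(size)
        if idle:
            # Reuse a released buffer; its old contents are not cleared.
            return idle.pop()
        self.allocations += 1
        return bytearray(size)

    def release(self, buf):
        # Park the buffer for the next request of the same size.
        self._free.setdefault(len(buf), []).append(buf)


cache = BufferCache()
for _ in range(1000):
    buf = cache.acquire(512)
    cache.release(buf)
print(cache.allocations)  # 1
```

Because same-sized requests keep getting the same block back, a long-running process makes far fewer fresh allocations, which is one way to limit the fragmentation mentioned above.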
So far people with early integration releases of the base system
changes have been very pleased with the enhancements. I hope you
are too. But we are not stopping there; work on performance and
memory management is continuing.
jill
|