T.R | Title | User | Personal Name | Date | Lines |
---|
4160.1 | Reason found | UTRTSC::EISINK | No Kipling apes today | Wed Apr 09 1997 13:56 | 13 |
| Engineering has found the reason why the daemon process stalls.
While it is stalled, netlogon service requests are not honored, replication
does not work and, for example, the netlogon/alerter service can't be
started/stopped.
A side effect is that when the stall takes too long, PCOMS will run out
of netlogon message buffers and later out of the message buffers.
The reason for this is mostly a 'bad' network.
A workaround is to disable the alerter service in lanman.ini or
with net stop alerter.
Rob.
|
4160.2 | how bad is 'bad' ? | LNZALI::BACHNER | Mouse not found. Click OK to continue | Thu Apr 10 1997 12:42 | 9 |
| > The reason for this is mostly a 'bad' network.
Would you care to share your definition of 'bad' ? Too many errors of what
type, too many collisions, too many messages of what type ?
And can you confirm that the problem was introduced with ECO1 of V5.0E ?
Thanks,
Hans.
|
4160.3 | | UTRTSC::EISINK | No Kipling apes today | Fri Apr 11 1997 02:11 | 1 |
| This problem was always in.
|
4160.4 | | LNZALI::BACHNER | Mouse not found. Click OK to continue | Tue Apr 15 1997 11:34 | 8 |
| > This problem was always in.
Strange - I've never seen it before I installed ECO 1 (the log files show this).
Anyway, the additional parameters in PWRK.INI helped - as soon as I restarted
PATHWORKS on all cluster nodes. No need (so far) to disable the alerter service.
Hans.
|
4160.5 | | UTRTSC::SWEEP | I want a lolly... | Wed Apr 16 1997 04:10 | 33 |
| Hans
The fact that you run out of pcoms buffers for netlogon means
that the lmmcp (which receives the netlogon requests) queues
work messages to the daemon. For this it uses a pcoms netlogon
buffer. If the daemon for some reason is not able to process
the netlogon requests fast enough then it is possible that the
mcp process logs pcoms errors.
You have to find the reason why the daemon can't handle these
netlogon requests fast enough. 1 of the reasons could be because
the daemon is busy with something else (like replication). Another
reason could be that the daemon is synchronously waiting (like
on the alerter service or on streams (= network).
We found that the alerter works synchronously (= enters LEF or HIB state
for several seconds). So if there are lots of alerter messages,
it can happen that pcoms errors are reported. Switching off the
alerter makes the problem disappear (= proof). We have a fix for this
in that we made the alerter asynchronous.
Another possibility is a wait on streams. We have found 1 scenario
where we saw that the stream to UDP (tcp/ip datagrams) is full and
we had to wait for that (stream = write stream so sending responses
back to the client). We would love to think that we found the cause for
this but until test results come in we are not absolutely sure. It
could still be a UCX or a network problem.
You say that lanman.ini changes resolved the problem. Would you like
to tell what the changes were ?
Thanks
Adrie
|
4160.6 | Usual "Fix" | VMSNET::P_NUNEZ | | Wed Apr 16 1997 09:39 | 14 |
|
Adrie,
>You say that lanman.ini changes resolved the problem. Would you like
>to tell what the changes were ?
Hans made changes to PWRK.INI, not LANMAN.INI, so he likely added the
[PCOMS] section (MAX_NETLOGON_MESSAGES= and MAX_IPC_MESSAGES=).
But I think all he's really done is delay the problem (though it's
dictated by client activity). As I understand it, the server _can_
recover from this, right (it "stalls" rather than "hangs")?
Paul
|
4160.7 | Troubleshooting Ideas? | VMSNET::P_NUNEZ | | Wed Apr 16 1997 09:49 | 24 |
|
Adrie,
>You have to find the reason why the daemon can't handle these
>netlogon requests fast enough. 1 of the reasons could be because
>the daemon is busy with something else (like replication). Another
>reason could be that the daemon is synchronously waiting (like
>on the alerter service or on streams (= network).
Are there any tools available that one could use to see what the daemon
process is doing?
You can obviously tell if the alerter service is running, but is there
a way to tell how many alerts are waiting to be sent (ie, queued)?
Repeated lmmodal -l commands can be used to see if replication is
occurring (or is there a better way?).
And when you say the daemon could be waiting on streams (the network),
are you saying it's having problems sending its data over IP?
NetBEUI? Both? Would FDDI be a factor? Or just a busy/noisy wire?
Would this be evident in any counters?
Paul
|
4160.8 | | UTRTSC::SWEEP | I want a lolly... | Mon Apr 21 1997 06:48 | 23 |
| Paul
How we analyse it is by using SDA extensions, then looking
at the queues and thread stacks, so it's not something you
can quickly do in the field.
Later, when the SDA extensions are in common use, we can deliver
some more global addresses so that you CAN have a look. It's
a matter of experience...
Yes, it's a stall situation, not a hang, so it will resolve
by itself.
For streams it's a real hang, as far as we know right now.
It's IP only and it's related to flow control (a write stream
filling up before the packets can be transmitted onto the
net). The reason is unclear. It could be that there are large
amounts of incoming packets that are turned around and xmitted
out. It would then have to be that IP handles the incoming packets with
higher priority than the outgoing packets, so the write stream can
fill up.
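The suspected mechanism can be sketched with a small model. This is only an illustration of the hypothesis above, not UCX or PATHWORKS behaviour: the function name, the per-tick work budget and all numbers are assumptions. Incoming packets are handled first, each one is turned around into one outgoing packet, and only the leftover budget is used to transmit.

```python
def ticks_until_full(capacity, budget, incoming_per_tick, max_ticks=1000):
    """Return the tick at which the write stream is full, or None (toy model)."""
    queued = 0  # packets waiting in the write stream
    for tick in range(1, max_ticks + 1):
        # Incoming packets have priority and consume the budget first.
        handled = min(incoming_per_tick, budget)
        # Each handled packet is turned around into one outgoing packet.
        queued += handled
        # Only the leftover budget is available for transmitting.
        queued -= min(budget - handled, queued)
        if queued >= capacity:
            return tick
    return None


if __name__ == "__main__":
    print(ticks_until_full(100, 10, 8))  # 17
    print(ticks_until_full(100, 10, 4))  # None
```

As long as incoming traffic uses less than half of the budget, the write stream drains every tick; above that it grows steadily until flow control kicks in.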
Adrie
|
4160.9 | | HANSBC::BACHNER | Mouse not found. Click OK to continue | Fri Apr 25 1997 08:29 | 14 |
| Sorry for the late reply - I did not follow this string for a few days.
Yes, the changes that helped me were to PWRK.INI, as suggested earlier in this
notes file:
[PCOMS]
MAX_IPC_MESSAGES = 512
MAX_NETLOGON_MESSAGES = 256
This helped both on our local cluster (VAX & Alpha, OpenVMS V6.2, V7.0, V7.1)
and in my customer's environment, as I did not receive any more complaints since
I suggested the additions mentioned above.
Hans.
|