
Conference: noted::pwv50ift

Title: Kit: Note 4229; Please use NOTED::PWDOSWIN5 for V4.x server
Notice: Kit: Note 4229; Please use NOTED::PWDOSWIN5 for V4.x server
Moderator: CPEEDY::KENNEDY
Created: Fri Dec 18 1992
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 4319
Total number of notes: 18478

4160.0. "Unexplained PCOMS Netlogon Msg Buffer Exhaustion" by VMSNET::ALLERTON (Episode d'Azur) Fri Feb 14 1997 18:39

We have a customer with a 2-node VMS V5.5-2 VAXcluster running PATHWORKS
V5.0D ECO3, with a light client configuration (77 clients configured) from
which typically 30-40 sessions are established.  The environment is largely
concerned with printing address labels from database records via an MS Access
application.

Every few days, users complain that the server stops responding or that they
cannot re-establish sessions.  A server restart has been necessary to restore
responsiveness.  Investigation of the logged data hasn't turned up much,
except for ongoing references to PCOMS netlogon message buffer exhaustion.

What's perplexing is that the customer has already raised the netlogon message
buffer configuration from the default to 256 in PWRK.INI, and again, he has a
rather light client session load.  In fact, he continues to log PCOMS errors
when virtually no clients have sessions established.

We're at a loss to explain this and would like to know what (other than
increased client load) could account for PCOMS netlogon message buffer
exhaustion.

Some sample lines from PWRK$LMMCPxxx.log:

 9-FEB-1997 21:44:27.67 202010FB:002EBAF0 PCOMS: cannot get message buffer for send
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0 PCOMS: failed to allocate NET LOGON message buffer
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0 PCOMS: error occured at source line 1987
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0       free message buffers  = 320
 9-FEB-1997 21:47:47.71 202010FB:002EBAF0       free logon messages   = 0
 9-FEB-1997 21:47:47.71 202010FB:002EBAF0       free process elements = 6 (1 message buffers)
.
.  =< ongoing occurrences separated by approx. 3 minute intervals >=
.
 9-FEB-1997 23:37:49.17 202010FB:002EBAF0 PCOMS: cannot get message buffer for send
 9-FEB-1997 23:41:09.17 202010FB:002EBAF0 PCOMS: failed to allocate NET LOGON message buffer
 9-FEB-1997 23:41:09.17 PCOMS: error occured at source line 1987
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free message buffers  = 285
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free logon messages   = 0
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free process elements = 6 (1 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free fork elements    = 11 (15 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free fd elements      = 8 (11 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free pipe elements    = 5 (2 message buffers)

PWRK.INI:

[SERVERS]
   LICENSE_S = YES

[PCOMS]
   MAX_IPC_MESSAGES = 512
   MAX_NETLOGON_MESSAGES  = 256


Thank you.
    
    S. Allerton
    PW Support
4160.1. "Reason found" by UTRTSC::EISINK (No Kipling apes today) Wed Apr 09 1997 13:56
    Engineering has found the reason why the daemon process stalls.
    This means that netlogon service requests are not honored, replication
    does not work, and, for example, the netlogon/alerter service can't be
    started or stopped.
    
    A side effect is that when the stall lasts too long, PCOMS will run out
    of netlogon message buffers and, later, of message buffers in general.
    The reason for this is usually a 'bad' network.
    
    A workaround is to disable the alerter service in lanman.ini or to stop
    it with net stop alerter.
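    
    A hedged sketch of the lanman.ini side of that workaround (the
    srvservices line and the service names shown are assumptions; keep
    whatever services the site's lanman.ini already starts, minus ALERTER):
    
       [server]
          srvservices = netlogon, browser    ; ALERTER removed from the list
    
    On a running server the service can instead be stopped with
    net stop alerter, as mentioned above.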
    
    		Rob.
4160.2. "how bad is 'bad'?" by LNZALI::BACHNER (Mouse not found. Click OK to continue) Thu Apr 10 1997 12:42
>    The reason for this is mostly a 'bad' network.

Would you care to share your definition of 'bad' ?  Too many errors of what
type, too many collisions, too many messages of what type ?

And can you confirm that the problem was introduced with ECO1 of V5.0E ?

Thanks,
Hans.
4160.3. by UTRTSC::EISINK (No Kipling apes today) Fri Apr 11 1997 02:11
    This problem was always in.
4160.4. by LNZALI::BACHNER (Mouse not found. Click OK to continue) Tue Apr 15 1997 11:34
>    This problem was always in.

Strange - I've never seen it before I installed ECO 1 (the log files show this).

Anyway, the additional parameters in PWRK.INI helped - as soon as I restarted
PATHWORKS on all cluster nodes. No need (so far) to disable the alerter service.

Hans.
4160.5. by UTRTSC::SWEEP (I want a lolly...) Wed Apr 16 1997 04:10
    Hans
    
    The fact that you run out of pcoms buffers for netlogon means
    that the lmmcp (which receives the netlogon requests) queues
    work messages to the daemon. For this it uses a pcoms netlogon
    buffer. If the daemon for some reason is not able to process
    the netlogon requests fast enough, then it is possible that
    the mcp process logs pcoms errors.
    
    You have to find the reason why the daemon can't handle these
    netlogon requests fast enough. One of the reasons could be that
    the daemon is busy with something else (like replication). Another
    reason could be that the daemon is waiting synchronously (for
    example on the alerter service or on streams (= the network)).
    
    We found that the alerter works synchronously (it enters LEF or HIB
    state for several seconds). So if there are lots of alerter messages,
    it can happen that pcoms errors are reported. Switching off the
    alerter makes the problem disappear, which proves it. We have a fix
    for this in that we made the alerter asynchronous.
    
    Another possibility is a wait on streams. We have found one scenario
    where we saw that the stream to UDP (tcp/ip datagrams) was full and
    we had to wait for it (the write stream, i.e. sending responses
    back to the client). We'd love to think that we have found the cause
    of this, but until test results come in we are not absolutely sure.
    It could still be a UCX or a network problem.
    
    You say that lanman.ini changes resolved the problem. Would you
    tell us what the changes were?
    
    Thanks
    Adrie
4160.6. "Usual 'Fix'" by VMSNET::P_NUNEZ Wed Apr 16 1997 09:39
    
    Adrie,
    
    >You say that lanman.ini changes resolved the problem. Would you like
    >to tell what the changes were ?
    
    Hans made changes to PWRK.INI, not LANMAN.INI, so he likely added the 
    [PCOMS] section with MAX_NETLOGON_MESSAGES= and MAX_IPC_MESSAGES=.  
    
    But I think all he's really done is delay the problem (though that's 
    dictated by client activity).  From what I understand, the server _can_
    recover from this, right (it "stalls" rather than "hangs")?
    
    Paul
4160.7. "Troubleshooting Ideas?" by VMSNET::P_NUNEZ Wed Apr 16 1997 09:49
    
    Adrie,
    
    >You have to find the reason why the daemon can't handle these
    >netlogon requests fast enough. 1 of the reasons could be because
    >the daemon is busy with something else (like replication). Another
    >reason could be that the daemon is synchronously waiting (like
    >on the alerter service or on streams (= network).
    
    Are there any tools available that one could use to see what the daemon
    process is doing?  
    
    You can obviously tell if the alerter service is running, but is there
    a way to tell how many alerts are waiting to be sent (ie, queued)? 
    
    Repeated lmmodal -l commands can be used to see if replication is
    occurring (or is there a better way?).  
    
    And when you say the daemon could be waiting on streams (the network),
    are you saying it's having problems sending its data over IP? 
    NetBEUI?  Both?  Would FDDI be a factor?  Or just a busy/noisy wire? 
    Would this be evident in any counters?
    
    <Paul
4160.8. by UTRTSC::SWEEP (I want a lolly...) Mon Apr 21 1997 06:48
    Paul
    
    How we analyse it is by using SDA extensions and then looking
    at the queues and thread stacks, so it's not something you
    can quickly do in the field.
    
    Later, when the SDA extensions are in common use, we can deliver
    some more global addresses so that you CAN have a look. It's
    a matter of experience...
    
    Yes, it's a stall situation, not a hang, so it will resolve
    by itself.
    
    For streams it's a real hang, as far as we know right now.
    It's IP only and it's related to flow control (a write stream
    filling up before the packets can be transmitted onto the
    net). The reason is unclear. It could be that there are large
    amounts of incoming packets that are turned around and transmitted
    back out. If so, IP would be handling the incoming packets with
    higher priority than the outgoing packets, so the write stream can
    fill up.
    
    Adrie
4160.9. by HANSBC::BACHNER (Mouse not found. Click OK to continue) Fri Apr 25 1997 08:29
Sorry for the late reply - I did not follow this string for a few days.

Yes, the changes that helped me were to PWRK.INI, as suggested earlier in this
notes file:

[PCOMS]
  MAX_IPC_MESSAGES = 512
  MAX_NETLOGON_MESSAGES = 256

This helped both on our local cluster (VAX & Alpha, OpenVMS V6.2, V7.0, V7.1)
and in my customer's environment; I have not received any more complaints since
I suggested the additions mentioned above.

Hans.