
Conference noted::pwv50ift

Title:	Kit: Note 4229; Please use NOTED::PWDOSWIN5 for V4.x server
Notice:	Kit: Note 4229; Please use NOTED::PWDOSWIN5 for V4.x server
Moderator:	CPEEDY::KENNEDY

Created:	Fri Dec 18 1992
Last Modified:	Fri Jun 06 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	4319
Total number of notes:	18478

4160.0. "Unexplained PCOMS Netlogon Msg Buffer Exhaustion" by VMSNET::ALLERTON (Episode d'Azur) Fri Feb 14 1997 18:39

We have a customer who has a 2 node VMS 5.5-2 VAXcluster, 5.0d eco3
PATHWORKS, with a light client configuration (77 clients configured) from
which typically 30-40 sessions are established.  The environment is largely
concerned with MS Access application printing of address labels from 
database records.

Every few days, users complain of discontinued server response or inability
to re-establish sessions.  A server restart has been necessary to restore
responsiveness.  Investigation of logged data hasn't turned up too much,
excepting ongoing references to PCOMS Netlogon message buffer exhaustion.

What's perplexing is, the customer has raised his netlogon message buffer
configuration from default to 256 in PWRK.INI, and again he has a rather 
light client session load.  In fact, he continues to log PCOMS errors
when virtually no clients have sessions established.

We have little understanding of, and would like to know, what (other than
increasing client load) would account for PCOMS netlogon message buffer
exhaustion.

Some sample lines from PWRK$LMMCPxxx.log:

 9-FEB-1997 21:44:27.67 202010FB:002EBAF0 PCOMS: cannot get message buffer for send
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0 PCOMS: failed to allocate NET LOGON message buffer
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0 PCOMS: error occured at source line 1987
 9-FEB-1997 21:47:47.67 202010FB:002EBAF0       free message buffers  = 320
 9-FEB-1997 21:47:47.71 202010FB:002EBAF0       free logon messages   = 0
 9-FEB-1997 21:47:47.71 202010FB:002EBAF0       free process elements = 6 (1 message buffers)
.
.  =< ongoing occurrences separated by approx. 3 minute intervals >=
.
 9-FEB-1997 23:37:49.17 202010FB:002EBAF0 PCOMS: cannot get message buffer for send
 9-FEB-1997 23:41:09.17 202010FB:002EBAF0 PCOMS: failed to allocate NET LOGON message buffer
 9-FEB-1997 23:41:09.17 PCOMS: error occured at source line 1987
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free message buffers  = 285
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free logon messages   = 0
 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free process elements = 6 (1 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free fork elements    = 11 (15 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free fd elements      = 8 (11 message buffers)
 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free pipe elements    = 5 (2 message buffers)
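
For anyone who wants to track these counters over a longer log, here is a minimal Python sketch that pulls the "free ..." values out of log records like the ones above. It assumes each record sits on a single line (wrapped lines rejoined first). The names COUNTER_RE, parse_free_counters and exhausted_pools are made up for this illustration and are not part of PATHWORKS.

```python
import re

# Matches counter records such as:
#  9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free message buffers  = 285
# The process/thread identifier is optional because some records lack it.
COUNTER_RE = re.compile(
    r"^\s*(?P<ts>\d{1,2}-[A-Z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{2})\s+"
    r"(?:[0-9A-F]{8}:[0-9A-F]{8}\s+)?"
    r"free (?P<name>[a-z ]+?)\s*=\s*(?P<count>\d+)"
)

def parse_free_counters(lines):
    """Return (timestamp, pool name, free count) for every counter record."""
    rows = []
    for line in lines:
        m = COUNTER_RE.match(line)
        if m:
            rows.append((m.group("ts"), m.group("name"), int(m.group("count"))))
    return rows

def exhausted_pools(rows):
    """Return the distinct pool names that were ever reported with 0 free."""
    return sorted({name for _, name, count in rows if count == 0})

# Sample records copied from the log excerpt above.
SAMPLE = [
    " 9-FEB-1997 23:41:09.17 PCOMS: error occured at source line 1987",
    " 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free message buffers  = 285",
    " 9-FEB-1997 23:41:09.19 202010FB:002EBAF0       free logon messages   = 0",
    " 9-FEB-1997 23:41:09.20 202010FB:002EBAF0       free fork elements    = 11 (15 message buffers)",
]

if __name__ == "__main__":
    for row in parse_free_counters(SAMPLE):
        print(row)
    print("exhausted:", exhausted_pools(parse_free_counters(SAMPLE)))
```

On the sample records, only the logon message pool shows up as exhausted, while the general message buffers still have plenty free.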

PWRK.INI:

[SERVERS]
   LICENSE_S = YES

[PCOMS]
   MAX_IPC_MESSAGES = 512
   MAX_NETLOGON_MESSAGES  = 256


Thank you.
    
    S. Allerton
    PW Support

T.R	Title	User	Personal Name	Date	Lines
4160.1	Reason found	UTRTSC::EISINK	No Kipling apes today	`Wed Apr 09 1997 12:56`	13
	Engineering has found the reason why the daemon process stalls. This means the netlogon service requests are not honored, replication does not work, and, for example, the netlogon/alerter service can't be started or stopped.

	A side effect is that when the stall takes too long, PCOMS will run out of netlogon message buffers and later of message buffers.

	The reason for this is mostly a 'bad' network. A workaround is to disable the alerter service in lanman.ini or with net stop alerter.

	Rob.
4160.2	how bad is 'bad' ?	LNZALI::BACHNER	Mouse not found. Click OK to continue	`Thu Apr 10 1997 11:42`	9
	> The reason for this is mostly a 'bad' network.

	Would you care to share your definition of 'bad'? Too many errors of what type, too many collisions, too many messages of what type?

	And can you confirm that the problem was introduced with ECO1 of V5.0E?

	Thanks,
	Hans.
4160.3		UTRTSC::EISINK	No Kipling apes today	`Fri Apr 11 1997 01:11`	1
	This problem was always in.
4160.4		LNZALI::BACHNER	Mouse not found. Click OK to continue	`Tue Apr 15 1997 10:34`	8
	> This problem was always in.

	Strange - I had never seen it before I installed ECO 1 (the log files show this).

	Anyway, the additional parameters in PWRK.INI helped, as soon as I restarted PATHWORKS on all cluster nodes. No need (so far) to disable the alerter service.

	Hans.
4160.5		UTRTSC::SWEEP	I want a lolly...	`Wed Apr 16 1997 03:10`	33
	Hans,

	The fact that you run out of pcoms buffers for netlogon means that the lmmcp (which receives the netlogon requests) queues work messages to the daemon. For this it uses a pcoms netlogon buffer. If the daemon for some reason is not able to process the netlogon requests fast enough, then it is possible that the mcp process logs pcoms errors.

	You have to find the reason why the daemon can't handle these netlogon requests fast enough. 1 of the reasons could be because the daemon is busy with something else (like replication). Another reason could be that the daemon is synchronously waiting (like on the alerter service or on streams (= network)).

	We found that the alerter works synchronously (= enters LEF or HIB state for several seconds). So if there are lots of alerter messages, it can happen that pcoms errors are reported. Switching off the alerter makes the problem disappear, which proves it. We have a fix for this in that we made the alerter asynchronous.

	Another possibility is a wait on streams. We have found 1 scenario where we saw that the stream to UDP (tcp/ip datagrams) is full and we had to wait for that (stream = write stream, so sending responses back to the client). We would love to think that we found the cause for this, but until test results come in we are not absolutely sure. It could still be a UCX or a network problem.

	You say that lanman.ini changes resolved the problem. Would you like to tell what the changes were ?

	Thanks
	Adrie
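
	To make the mechanism above concrete, here is a small toy model in Python. It is only a sketch under simplifying assumptions (fixed arrival and service rates, discrete ticks) and is not the PCOMS code; the function simulate and all its parameters are invented for this illustration. A producer takes one buffer per incoming request from a fixed pool, and a consumer returns buffers unless it is blocked in a synchronous wait.

```python
def simulate(pool_size, arrivals_per_tick, service_per_tick, stall_ticks, total_ticks):
    """Count buffer allocation failures in a toy producer/consumer model.

    pool_size         -- number of buffers in the fixed pool
    arrivals_per_tick -- requests the producer receives per tick
    service_per_tick  -- requests the consumer can finish per tick
    stall_ticks       -- set of ticks during which the consumer does no work
    total_ticks       -- length of the run
    """
    free = pool_size
    queued = 0
    failures = 0
    for tick in range(total_ticks):
        # Producer: every incoming request needs one buffer from the pool.
        for _ in range(arrivals_per_tick):
            if free > 0:
                free -= 1
                queued += 1
            else:
                failures += 1  # stands in for a "failed to allocate" log entry
        # Consumer: returns buffers unless it is stuck in a synchronous wait.
        if tick not in stall_ticks:
            done = min(queued, service_per_tick)
            queued -= done
            free += done
    return failures


if __name__ == "__main__":
    stall = set(range(5, 15))  # consumer blocked for 10 ticks
    print("no stall, small pool:", simulate(8, 2, 3, set(), 20))
    print("stall, small pool:   ", simulate(8, 2, 3, stall, 20))
    print("stall, larger pool:  ", simulate(32, 2, 3, stall, 20))
```

	In this model the load alone never exhausts the pool as long as the consumer keeps up; failures only appear once the consumer stalls. A larger pool absorbs a longer stall, but it still runs dry if the stall outlasts it.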
4160.6	Usual "Fix"	VMSNET::P_NUNEZ		`Wed Apr 16 1997 08:39`	14
	Adrie,

	>You say that lanman.ini changes resolved the problem. Would you like
	>to tell what the changes were ?

	Hans made changes to PWRK.INI, not LANMAN.INI, so he likely added the [PCOMS] section (MAX_NETLOGON_MESSAGES= and MAX_IPC_MESSAGES=). But I think all he's really done is delay the problem (though it's dictated by client activity).

	From what I understand, the server _can_ recover from this, right (it "stalls" rather than "hangs")?

	Paul
4160.7	Troubleshooting Ideas?	VMSNET::P_NUNEZ		`Wed Apr 16 1997 08:49`	24
	Adrie,

	>You have to find the reason why the daemon can't handle these
	>netlogon requests fast enough. 1 of the reasons could be because
	>the daemon is busy with something else (like replication). Another
	>reason could be that the daemon is synchronously waiting (like
	>on the alerter service or on streams (= network)).

	Are there any tools available that one could use to see what the daemon process is doing?

	You can obviously tell if the alerter service is running, but is there a way to tell how many alerts are waiting to be sent (i.e., queued)?

	Repeated lmmodal -l commands can be used to see if replication is occurring (or is there a better way?).

	And when you say the daemon could be waiting on streams (the network), are you saying it's having problems sending its data over IP? NetBEUI? Both? Would FDDI be a factor? Or just a busy/noisy wire? Would this be evident in any counters?

	Paul
4160.8		UTRTSC::SWEEP	I want a lolly...	`Mon Apr 21 1997 05:48`	23
	Paul,

	We analyse it by using SDA extensions and then looking at the queues and thread stacks, so it's not something you can quickly do in the field. Later, when the SDA extensions are in common use, we can deliver some more global addresses so that you CAN have a look. It's a matter of experience...

	Yes, it's a stall situation, not a hang, so it will resolve by itself.

	For streams it's a real hang, as far as we know right now. It's IP only and it's related to flow control (a write stream filling up before the packets can be transmitted onto the net). The reason is unclear. It could be that there are large amounts of incoming packets that are turned around and transmitted out. Then it would have to be that IP handles the incoming packets with higher priority than the outgoing packets, so the write stream can fill up.

	Adrie
4160.9		HANSBC::BACHNER	Mouse not found. Click OK to continue	`Fri Apr 25 1997 07:29`	14
	Sorry for the late reply - I did not follow this string for a few days.

	Yes, the changes that helped me were to PWRK.INI, as suggested earlier in this notes file:

	[PCOMS]
	   MAX_IPC_MESSAGES = 512
	   MAX_NETLOGON_MESSAGES = 256

	This helped both on our local cluster (VAX & Alpha, OpenVMS V6.2, V7.0, V7.1) and in my customer's environment: I have not received any more complaints since I suggested the additions mentioned above.

	Hans.
