Title: | NAS Message Queuing Bus |
Notice: | KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10 |
Moderator: | PAMSRC::MARCUS EN |
Created: | Wed Feb 27 1991 |
Last Modified: | Fri Jun 06 1997 |
Last Successful Update: | Fri Jun 06 1997 |
Number of topics: | 2898 |
Total number of notes: | 12363 |
Hello,

I have a couple of questions regarding DECmessageQ and the DMQ Adapter (DMQA)! I hope someone can enlighten a true amateur in this field....

Thanks,
Ann Pegert

1. The customer is running the DMQ Adapter on VMS Alpha and Data General (DG) at several production sites. One relatively new site is experiencing problems with the DMQA offspring processes, which seem to stop working after a while. Eventually everything stops with "no free server slots available" and a restart is necessary. DMQA is only used to send files from the DG to the Alpha. The setup is the same at each site: UCX version 4.1, DMQ V3.2A, DMQA V??, VMS 6.1-1H3.

How do I know that a DMQA process really is a zombie process? They all seem to be in use when I do a @dmqa_show, but if I look in UCX, I can see that only one or two sockets have actually increased their I/O count (since the day before). Can I consider the process to be a zombie if it suddenly stops increasing the UCX device/socket I/O count, CPU time or what?

Is there anything we can do to prevent this happening all the time? Where should we start to look? At the network? A new DMQA version? UCX configuration?
$>@dmqa_show
The following servers are running at 31-JAN-1997 04:57:22.57
00000F03 DmQA_S_4000 (server)
00000F04 DmQA_S_4000A (server)
    DmQA Server at 4001 is in use (%EFN 64 in DMQA_4000_EFC clear)
0000011E DmQA_S_4000B (server)
    DmQA Server at 4002 is in use (%EFN 65 in DMQA_4000_EFC clear)
00000F1F DmQA_S_4000C (server)
    DmQA Server at 4003 is in use (%EFN 66 in DMQA_4000_EFC clear)
00000120 DmQA_S_4000D (server)
    DmQA Server at 4004 is in use (%EFN 67 in DMQA_4000_EFC clear)
00000F21 DmQA_S_4000E (server)
    DmQA Server at 4005 is in use (%EFN 68 in DMQA_4000_EFC clear)
00000F22 DmQA_S_4000F (server)
    DmQA Server at 4006 is in use (%EFN 69 in DMQA_4000_EFC clear)
00000F23 DmQA_S_4000G (server)
    DmQA Server at 4007 is in use (%EFN 70 in DMQA_4000_EFC clear)
00000F24 DmQA_S_4000H (server)
    DmQA Server at 4008 is in use (%EFN 71 in DMQA_4000_EFC clear)

This server/socket has had no further I/Os completed during the last 8-10 days, no CPU consumption etc.:

$>ucx sh dev bg2395/ful
Device_socket: bg2395    Type: STREAM
                  LOCAL        REMOTE
        Port:      4002             0
        Host:    USAV01       0.0.0.0
     Service:

                                        RECEIVE       SEND
                Queued I/O                    0          0
Q0LEN      0    Socket buffer bytes           0          0
QLEN       0    Socket buffer quota        4096       4096
QLIMIT     5    Total buffer alloc            0          0
TIMEO      0    Total buffer limit        16384      16384
ERROR      0    Buffer or I/O waits         589          0
OOBMARK    0    Buffer or I/O drops           0          0
                I/O completed               588          0
                Bytes transferred             0          0

Options:  ACCEPT REUSEADR KEEP
State:    PRIV
RCV Buff: WAIT
SND Buff: None

A socket that is still receiving data:

Device_socket: bg2406    Type: STREAM
                  LOCAL        REMOTE
        Port:      4006             0
        Host:    USAV01       0.0.0.0
     Service:

                                        RECEIVE       SEND
                Queued I/O                    0          0
Q0LEN      0    Socket buffer bytes           0          0
QLEN       0    Socket buffer quota        4096       4096
QLIMIT     5    Total buffer alloc            0          0
TIMEO      0    Total buffer limit        16384      16384
ERROR      0    Buffer or I/O waits         538          0
OOBMARK    0    Buffer or I/O drops           0          0
                I/O completed               537          0
                Bytes transferred             0          0

Options:  ACCEPT REUSEADR KEEP
State:    PRIV
RCV Buff: WAIT
SND Buff: None

A couple of PAMS__LINK_DOWN errors have been reported in the log files!
******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000B.LOG;73
17 Jan 12:0:7: SioReceive returned error PAMS__LINK_DOWN, errno 0
******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000C.LOG;23
22 Jan 8:35:4: SioReceive returned error PAMS__LINK_DOWN, errno 0
23 Jan 14:5:10: SioReceive returned error PAMS__LINK_DOWN, errno 0
23 Jan 19:52:33: SioReceive returned error PAMS__LINK_DOWN, errno 0
******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000F.LOG;3
30 Jan 19:42:58: SioReceive returned error PAMS__LINK_DOWN, errno 0

2. I tried to follow the instructions in one of the DMQA documents on how to stop and restart the offspring processes and the listener. I have done it before and it worked fine! What happened this time, as you can see below, is that all eight offspring processes were created, leaving me with "double processes". Did I do anything wrong?

PPM>@dmqa_shutdown 4000a
DMQA_SHUTDOWN - DECmessageQ Queue Adapter Shutdown Procedure
The following servers are running at 31-JAN-1997 05:04:49.05
00000F03 DmQA_S_4000 (server)
00000F04 DmQA_S_4000A (server)
0000011E DmQA_S_4000B (server)
00000F1F DmQA_S_4000C (server)
00000120 DmQA_S_4000D (server)
00000F21 DmQA_S_4000E (server)
00000F22 DmQA_S_4000F (server)
00000F23 DmQA_S_4000G (server)
00000F24 DmQA_S_4000H (server)
Shutting down link 4000A
Deleting 00000F04 (DmQA_S_4000A)

PPM>@dmqa_shutdown 4000g
.........
Shutting down link 4000G
Deleting 00000F23 (DmQA_S_4000G)

PPM>@dmqa_shutdown 4000h
........
Shutting down link 4000H
Deleting 00000F24 (DmQA_S_4000H)

PPM>@dmqa_shutdown 4000 l
DMQA_SHUTDOWN - DECmessageQ Queue Adapter Shutdown Procedure
The following servers are running at 31-JAN-1997 05:05:39.77
00000F03 DmQA_S_4000 (server)
0000011E DmQA_S_4000B (server)
00000F1F DmQA_S_4000C (server)
00000120 DmQA_S_4000D (server)
00000F21 DmQA_S_4000E (server)
00000F22 DmQA_S_4000F (server)
Shutting down link 4000
Deleting 00000F03 (DmQA_S_4000)

PPM>@dmqa_startup "" "" dmqa ERROR 4000 3 "" 8
DMQA_STARTUP - DECmessageQ Queue Adapter Startup Procedure
Starting DECmessageQ Queue Adapter Server at 31-JAN-1997 05:06:16.05
Image: SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_SERVER.EXE
Process: DmQA_S_4000
DmQ location: SYS$SYSDEVICE:[DMQ$V32.EXE]
Bus: 0001 Group: 01800
DMQ$DEBUG: ERROR
Output: SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000.LOG
Params: 4000, 3, , 8
Ok to start this server (Yes/No) <No>? : Y
%RUN-S-PROC_ID, identification of created process is 00001CB6
Process completed at 31-JAN-1997 05:06:31.76

PPM>@DMQA_SHOW
The following servers are running at 31-JAN-1997 05:06:42.03
00001CB6 DmQA_S_4000 (server)
00001CB7 DmQA_S_4000A (server)
    DmQA Server at 4001 is available (%EFN 64 in DMQA_4000_EFC set)
0000011E DmQA_S_4000B (server)
    DmQA Server at 4002 is available (%EFN 65 in DMQA_4000_EFC set)
00001CB8 DmQA_S_4000B (server)
    DmQA Server at 4002 is available (%EFN 65 in DMQA_4000_EFC set)
00000F1F DmQA_S_4000C (server)
    DmQA Server at 4003 is available (%EFN 66 in DMQA_4000_EFC set)
00001CB9 DmQA_S_4000C (server)
    DmQA Server at 4003 is available (%EFN 66 in DMQA_4000_EFC set)
00000120 DmQA_S_4000D (server)
    DmQA Server at 4004 is available (%EFN 67 in DMQA_4000_EFC set)
00001CBA DmQA_S_4000D (server)
    DmQA Server at 4004 is available (%EFN 67 in DMQA_4000_EFC set)
00000F21 DmQA_S_4000E (server)
    DmQA Server at 4005 is available (%EFN 68 in DMQA_4000_EFC set)
00001CBB DmQA_S_4000E (server)
    DmQA Server at 4005 is available (%EFN 68 in DMQA_4000_EFC set)
00000F22 DmQA_S_4000F (server)
    DmQA Server at 4006 is available (%EFN 69 in DMQA_4000_EFC set)
00002ABC DmQA_S_4000F (server)
    DmQA Server at 4006 is available (%EFN 69 in DMQA_4000_EFC set)
00002ABD DmQA_S_4000G (server)
    DmQA Server at 4007 is available (%EFN 70 in DMQA_4000_EFC set)
00002ABE DmQA_S_4000H (server)
    DmQA Server at 4008 is available (%EFN 71 in DMQA_4000_EFC set)

PPM>UCX SH DEV
                           Port                      Remote
Device_socket  Type     Local  Remote  Service       Host
bg3            STREAM     513       0  RLOGIN        0.0.0.0
bg4            STREAM      23       0  TELNET        0.0.0.0
bg2395         STREAM    4002       0                0.0.0.0
bg2399         STREAM    4003       0                0.0.0.0
bg2401         STREAM    4004       0                0.0.0.0
bg2403         STREAM    4005       0                0.0.0.0
bg2406         STREAM    4006       0                0.0.0.0
bg7189         STREAM      23    1569  TELNET        151.183.7.50
bg7194         STREAM    4000       0                0.0.0.0
bg7195         STREAM    4001       0                0.0.0.0
bg7197         STREAM    4002       0                0.0.0.0
bg7199         STREAM    4003       0                0.0.0.0
bg7201         STREAM    4004       0                0.0.0.0
bg7204         STREAM    4005       0                0.0.0.0
bg7206         STREAM    4006       0                0.0.0.0
bg7208         STREAM    4007       0                0.0.0.0
bg7210         STREAM    4008       0                0.0.0.0
T.R | Title | User | Personal Name | Date | Lines |
---|---|---|---|---|---|
2759.1 | | PAMSIC::STEPHENS | | Thu Feb 06 1997 08:45 | 57 |
Hi Ann,

I'll try and lend a hand here. On the surface, to restate your problem in .0, the system (multiple DGs <-> Alpha) was working until the customer expanded with another site. When this new DG (DMQA client) tries to connect to the Alpha, there are problems.

On the server side, the error "no free server slots available" happens because there is a limited number of DmQA server processes that will accept connections from DmQA clients, and in this case it seems to be 8. The DmQA startup will create 8 DmQA servers for up to 8 clients. The master process (in this case DmQA_S_4000) creates the A-H copies of itself and uses a common event flag cluster to communicate activity. When a new DmQA client requests a connection (during the attach), the server (DmQA_S_4000) will look for the first available offspring (A-H) to hand off the connection for this client. When the bit in the event flag cluster is set, the offspring is not busy and can be used for a connection. From your first dmqa_show, all of the bits are clear, meaning all 'slots' are taken and no other clients are able to connect.

My guess is there is some problem with this new site that causes an unexpected exit of the client, either from a network failure or a program bust, which leaves the DmQA server (A-H) on the Alpha in a 'hung' situation. Depending on the state of the socket and on the UCX keepalive timers (probe and drop), it may take a while for the server to realize the client is gone. However, it "should" clean up and go back to the idle state, ready to accept a new connection.

To your questions, specifically:

>How do I know that a DMQA process really is a zombie process? They all seem

If the dmqa_show shows the EFN clear (it is busy), yet you see no activity for the socket via UCX for, say, 10 minutes after you know the client is dead, then something is broken, and you will have to restart that server.

>before). Can I consider the process to be a zombie if it suddenly stops
>increasing the UCX device/socket I/O count, CPU time or what?

Not exactly, unless you know the client is gone. If the client is just in an idle state you won't see any activity.

>Is there anything we can do to prevent this happening all the time? Where

It is important for the client code to exit gracefully. The Qadapter does not have 'great' link error recovery, so my best advice is to always try to issue a pams_exit from the client when exiting the DG program.

>should we start to look? At the network? A new DMQA version? UCX configuration?

You can try changing the probe/drop timers in UCX, but it sounds like once these servers get into this state, they never recover (2 days). The default probe/drop timers give you about 10 minutes, then the link is dead, and it should signal the server for cleanup.

Hope this helps,
Bruce.
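
The slot hand-off and the rule of thumb for spotting a hung server described in .1 can be sketched roughly as follows. This is only an illustrative Python model, not DMQA code. The bitmask representation of the event flag cluster, the function and class names (first_free_slot, describe, Sample, classify), and the idea of comparing two snapshots of the UCX "I/O completed" counter are all assumptions made for the example. The port and EFN numbers are taken from the dmqa_show output in .0, and the 10-minute window from the default probe/drop estimate in .1.

```python
from dataclasses import dataclass
from typing import Optional

NUM_SLOTS = 8             # offspring A-H, as in this thread
BASE_EFN = 64             # EFN reported for slot A by dmqa_show
BASE_PORT = 4001          # port reported for slot A by dmqa_show
STALE_SECONDS = 10 * 60   # "about 10 minutes" with default probe/drop timers


def first_free_slot(flags: int) -> Optional[int]:
    """Return the index (0 = A) of the first slot whose bit is set (idle).

    Assumption: bit n of `flags` models the event flag of offspring n.
    """
    for slot in range(NUM_SLOTS):
        if flags & (1 << slot):
            return slot
    return None  # every bit clear: "no free server slots available"


def describe(slot: int) -> str:
    """Map a slot index back to the names shown by dmqa_show."""
    letter = chr(ord("A") + slot)
    return f"DmQA_S_4000{letter} port {BASE_PORT + slot} EFN {BASE_EFN + slot}"


@dataclass
class Sample:
    efn_set: bool       # True = available, False = in use
    io_completed: int   # hypothetical snapshot of the UCX "I/O completed" counter


def classify(prev: Sample, curr: Sample, elapsed: int,
             client_known_dead: bool) -> str:
    """Apply the rule of thumb from .1 to two snapshots of one slot."""
    if curr.efn_set:
        return "available"
    if curr.io_completed > prev.io_completed:
        return "active"
    # Busy, no I/O: only suspicious if the client is known to be gone
    # and the keepalive window has passed.
    if client_known_dead and elapsed >= STALE_SECONDS:
        return "hung"
    return "idle or unknown"
```

Note that a busy slot with no I/O is deliberately not reported as hung unless the client is known to be dead, since an idle but healthy client looks exactly the same from the server side.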