
Conference pamsrc::decmessageq

Title:	NAS Message Queuing Bus
Notice:	KITS/DOC, see 4.*; Entering QARs, see 9.1; Register in 10
Moderator:	PAMSRC::MARCUSEN

Created:	Wed Feb 27 1991
Last Modified:	Thu Jun 05 1997
Last Successful Update:	Fri Jun 06 1997
Number of topics:	2898
Total number of notes:	12363

2759.0. "DMQA zombie process?" by MALM01::PEGERT () Thu Feb 06 1997 05:10

Hello,

I have a couple of questions regarding DECmessageQ and the DMQ Adapter. 
I hope someone can enlighten a true amateur in this field.

Thanks,
Ann Pegert


1. Customer is running the DMQ Adapter on VMS Alpha and Data General at 
several production sites.
One relatively new site is experiencing problems with the DMQA offspring 
processes, which seem to stop working after a while. Eventually everything 
stops with "no free server slots available" and a restart is necessary. DMQA 
is only used to send files from the DG to the Alpha. The setup is the same at 
each site.
UCX version is 4.1, DMQ V3.2A, DMQA V??, VMS 6.1-1H3.

How do I know that a DMQA process really is a zombie process? They all seem 
to be in use when I do a @dmqa_show, but if I look in UCX, I can see that only 
one or two sockets have actually increased their I/O count (since the day 
before). Can I consider the process to be a zombie if it suddenly stops 
increasing the UCX device/socket I/O count, CPU time or what? 
Is there anything we can do to prevent this happening all the time? Where 
should we start to look? At the network? A new DMQA version? UCX configuration?


$>@dmqa_show

The following servers are running at 31-JAN-1997 04:57:22.57

  00000F03      DmQA_S_4000     (server)
  00000F04      DmQA_S_4000A    (server)
  DmQA Server at 4001 is in use (%EFN 64 in DMQA_4000_EFC clear)
  0000011E      DmQA_S_4000B    (server)
  DmQA Server at 4002 is in use (%EFN 65 in DMQA_4000_EFC clear)
  00000F1F      DmQA_S_4000C    (server)
  DmQA Server at 4003 is in use (%EFN 66 in DMQA_4000_EFC clear)
  00000120      DmQA_S_4000D    (server)
  DmQA Server at 4004 is in use (%EFN 67 in DMQA_4000_EFC clear)
  00000F21      DmQA_S_4000E    (server)
  DmQA Server at 4005 is in use (%EFN 68 in DMQA_4000_EFC clear)
  00000F22      DmQA_S_4000F    (server)
  DmQA Server at 4006 is in use (%EFN 69 in DMQA_4000_EFC clear)
  00000F23      DmQA_S_4000G    (server)
  DmQA Server at 4007 is in use (%EFN 70 in DMQA_4000_EFC clear)
  00000F24      DmQA_S_4000H    (server)
  DmQA Server at 4008 is in use (%EFN 71 in DMQA_4000_EFC clear)
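
A listing like the one above can be tallied mechanically if you want to track slot usage over time. This is only a minimal Python sketch, assuming the status lines always look exactly as they do in this note; the name `parse_dmqa_show` is made up for illustration and is not part of DMQA.

```python
import re

# Matches status lines such as:
#   DmQA Server at 4001 is in use (%EFN 64 in DMQA_4000_EFC clear)
STATUS_RE = re.compile(
    r"DmQA Server at (\d+) is (in use|available) \(%EFN (\d+) in (\S+) (clear|set)\)"
)

def parse_dmqa_show(text):
    """Return a list of (port, efn, busy) tuples, one per status line."""
    slots = []
    for match in STATUS_RE.finditer(text):
        port, state, efn, _cluster, _bit = match.groups()
        slots.append((int(port), int(efn), state == "in use"))
    return slots

# Two entries copied from the listing in this note.
sample = """\
  00000F04      DmQA_S_4000A    (server)
  DmQA Server at 4001 is in use (%EFN 64 in DMQA_4000_EFC clear)
  0000011E      DmQA_S_4000B    (server)
  DmQA Server at 4002 is in use (%EFN 65 in DMQA_4000_EFC clear)
"""

slots = parse_dmqa_show(sample)
busy_ports = [port for port, _efn, busy in slots if busy]
print(f"{len(busy_ports)} of {len(slots)} slots busy: {busy_ports}")
```

Saving one such tally per day would make it easier to see when a slot went busy and never came back.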



This server's socket has not completed any further I/Os during the last 8-10
days, and the process has consumed no CPU time.

$>ucx sh dev bg2395/ful
Device_socket: bg2395      Type: STREAM      LOCAL               REMOTE
                                      Port:   4002                    0
                              Host:         USAV01              0.0.0.0
                              Service:

                                                             RECEIVE       SEND
                                 Queued I/O                        0          0
       Q0LEN         0           Socket buffer bytes               0          0
       QLEN          0           Socket buffer quota            4096       4096
       QLIMIT        5           Total buffer alloc                0          0
       TIMEO         0           Total buffer limit            16384      16384
       ERROR         0           Buffer or I/O waits             589          0
       OOBMARK       0           Buffer or I/O drops               0          0
                                 I/O completed                   588          0
                                 Bytes transferred                 0          0

  Options:  ACCEPT REUSEADR KEEP
  State:    PRIV
  RCV Buff: WAIT
  SND Buff: None


A socket that is still receiving data:
Device_socket: bg2406      Type: STREAM         LOCAL               REMOTE
                                      Port:      4006                    0
                                      Host:    USAV01              0.0.0.0
                                      Service:

                                                          RECEIVE       SEND
                                   Queued I/O                   0          0
       Q0LEN         0             Socket buffer bytes          0          0
       QLEN          0             Socket buffer quota       4096        4096
       QLIMIT        5             Total buffer alloc           0           0
       TIMEO         0             Total buffer limit       16384       16384
       ERROR         0             Buffer or I/O waits        538           0
       OOBMARK       0             Buffer or I/O drops          0           0
                                   I/O completed              537           0
                                   Bytes transferred            0           0

  Options:  ACCEPT REUSEADR KEEP
  State:    PRIV
  RCV Buff: WAIT
  SND Buff: None






A couple of PAMS__LINK_DOWN errors have been reported in the log files:

******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000B.LOG;73

17 Jan  12:0:7: SioReceive returned error PAMS__LINK_DOWN, errno 0

******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000C.LOG;23

22 Jan  8:35:4: SioReceive returned error PAMS__LINK_DOWN, errno 0
23 Jan  14:5:10: SioReceive returned error PAMS__LINK_DOWN, errno 0
23 Jan  19:52:33: SioReceive returned error PAMS__LINK_DOWN, errno 0

******************************
SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000F.LOG;3

30 Jan  19:42:58: SioReceive returned error PAMS__LINK_DOWN, errno 0




2. I tried to follow the instructions in one of the DMQA documents on how 
to stop and restart the offspring processes and the listener. I have done
it before and it worked fine! What happened this time, as you can see below, 
is that all eight offspring processes were created, leaving me with "double"
processes. Did I do anything wrong?


PPM>@dmqa_shutdown 4000a

DMQA_SHUTDOWN - DECmessageQ Queue Adapter Shutdown Procedure


The following servers are running at 31-JAN-1997 05:04:49.05

   00000F03     DmQA_S_4000     (server)
   00000F04     DmQA_S_4000A    (server)
   0000011E     DmQA_S_4000B    (server)
   00000F1F     DmQA_S_4000C    (server)
   00000120     DmQA_S_4000D    (server)
   00000F21     DmQA_S_4000E    (server)
   00000F22     DmQA_S_4000F    (server)
   00000F23     DmQA_S_4000G    (server)
   00000F24     DmQA_S_4000H    (server)


Shutting down link 4000A

    Deleting 00000F04 (DmQA_S_4000A)


PPM>@dmqa_shutdown 4000g
.........

Shutting down link 4000G

    Deleting 00000F23 (DmQA_S_4000G)


PPM>@dmqa_shutdown 4000h
........

Shutting down link 4000H

    Deleting 00000F24 (DmQA_S_4000H)


PPM>@dmqa_shutdown 4000 l

DMQA_SHUTDOWN - DECmessageQ Queue Adapter Shutdown Procedure


The following servers are running at 31-JAN-1997 05:05:39.77

   00000F03     DmQA_S_4000     (server)
   0000011E     DmQA_S_4000B    (server)
   00000F1F     DmQA_S_4000C    (server)
   00000120     DmQA_S_4000D    (server)
   00000F21     DmQA_S_4000E    (server)
   00000F22     DmQA_S_4000F    (server)


Shutting down link 4000

    Deleting 00000F03 (DmQA_S_4000)


PPM>@dmqa_startup "" "" dmqa ERROR 4000 3 "" 8

DMQA_STARTUP - DECmessageQ Queue Adapter Startup Procedure

Starting DECmessageQ Queue Adapter Server at 31-JAN-1997 05:06:16.05

          Image: SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_SERVER.EXE
        Process: DmQA_S_4000
   DmQ location: SYS$SYSDEVICE:[DMQ$V32.EXE]
            Bus: 0001
          Group: 01800
      DMQ$DEBUG: ERROR
         Output: SYS$SYSDEVICE:[DMQ$V32.DMQA]DMQA_4000.LOG
         Params: 4000, 3, , 8

Ok to start this server (Yes/No) <No>? : Y

%RUN-S-PROC_ID, identification of created process is 00001CB6

Process completed at 31-JAN-1997 05:06:31.76

PPM>@DMQA_SHOW

The following servers are running at 31-JAN-1997 05:06:42.03

  00001CB6      DmQA_S_4000     (server)
  00001CB7      DmQA_S_4000A    (server)
  DmQA Server at 4001 is available (%EFN 64 in DMQA_4000_EFC set)
  0000011E      DmQA_S_4000B    (server)
  DmQA Server at 4002 is available (%EFN 65 in DMQA_4000_EFC set)
  00001CB8      DmQA_S_4000B    (server)
  DmQA Server at 4002 is available (%EFN 65 in DMQA_4000_EFC set)
  00000F1F      DmQA_S_4000C    (server)
  DmQA Server at 4003 is available (%EFN 66 in DMQA_4000_EFC set)
  00001CB9      DmQA_S_4000C    (server)
  DmQA Server at 4003 is available (%EFN 66 in DMQA_4000_EFC set)
  00000120      DmQA_S_4000D    (server)
  DmQA Server at 4004 is available (%EFN 67 in DMQA_4000_EFC set)
  00001CBA      DmQA_S_4000D    (server)
  DmQA Server at 4004 is available (%EFN 67 in DMQA_4000_EFC set)
  00000F21      DmQA_S_4000E    (server)
  DmQA Server at 4005 is available (%EFN 68 in DMQA_4000_EFC set)
  00001CBB      DmQA_S_4000E    (server)
  DmQA Server at 4005 is available (%EFN 68 in DMQA_4000_EFC set)
  00000F22      DmQA_S_4000F    (server)
  DmQA Server at 4006 is available (%EFN 69 in DMQA_4000_EFC set)
  00002ABC      DmQA_S_4000F    (server)
  DmQA Server at 4006 is available (%EFN 69 in DMQA_4000_EFC set)
  00002ABD      DmQA_S_4000G    (server)
  DmQA Server at 4007 is available (%EFN 70 in DMQA_4000_EFC set)
  00002ABE      DmQA_S_4000H    (server)
  DmQA Server at 4008 is available (%EFN 71 in DMQA_4000_EFC set)

PPM>UCX SH DEV
                            Port                       Remote
Device_socket  Type    Local  Remote  Service           Host

  bg3         STREAM     513       0  RLOGIN           0.0.0.0
  bg4         STREAM      23       0  TELNET           0.0.0.0
  bg2395      STREAM    4002       0                   0.0.0.0
  bg2399      STREAM    4003       0                   0.0.0.0
  bg2401      STREAM    4004       0                   0.0.0.0
  bg2403      STREAM    4005       0                   0.0.0.0
  bg2406      STREAM    4006       0                   0.0.0.0
  bg7189      STREAM      23    1569  TELNET           151.183.7.50
  bg7194      STREAM    4000       0                   0.0.0.0
  bg7195      STREAM    4001       0                   0.0.0.0
  bg7197      STREAM    4002       0                   0.0.0.0
  bg7199      STREAM    4003       0                   0.0.0.0
  bg7201      STREAM    4004       0                   0.0.0.0
  bg7204      STREAM    4005       0                   0.0.0.0
  bg7206      STREAM    4006       0                   0.0.0.0
  bg7208      STREAM    4007       0                   0.0.0.0
  bg7210      STREAM    4008       0                   0.0.0.0

2759.1 by PAMSIC::STEPHENS, Thu Feb 06 1997 08:45 (57 lines)

Hi Ann,

I'll try and lend a hand here.   On the surface, to restate your 
problem in .0, the system (multiple DGs <-> Alpha) was working
until the customer expanded with another site.   When this new
DG (DMQA client) tries to connect to the Alpha, there are problems.

On the server side, the error "no free server slots available" happens
because there are a limited number of DmQA server processes that will
accept connections from DmQA clients, and it seems in this case 8.
The DmQA startup will create 8 DmQA servers for up to 8 clients.  The
master process (in this case DmQA_S_4000) creates the A-H copies of
itself and uses a common event flag to communicate activity.  When
a new DmQA client requests a connection (during the attach), the server
(DmQA_S_4000) will look for the first available offspring (A-H) to 
hand off the connection for this client.   When the bit in the event
flag is set, the offspring is not busy and can be used for connection.
From your first dmqa_show, all of the bits are clear, meaning all 
'slots' are taken and no other clients are able to connect.
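
To make the hand-off concrete, here is a toy Python model of the behaviour described above. It is only a sketch and not the DmQA source: the class `SlotTable` and its methods are invented for illustration, and a plain list of booleans stands in for the bits of the common event flag cluster (True = bit set = offspring idle).

```python
# Toy model of the hand-off described above; NOT the DmQA source.
# A set bit means the offspring is idle; a clear bit means it is busy.

class SlotTable:
    def __init__(self, n_slots=8):
        # All offspring start idle, so every bit starts set.
        self.bits = [True] * n_slots

    def accept(self):
        """Hand a new client to the first idle offspring; return its index."""
        for index, idle in enumerate(self.bits):
            if idle:
                self.bits[index] = False  # offspring is now busy
                return index
        raise RuntimeError("no free server slots available")

    def release(self, index):
        """Offspring finished cleanup: set its bit again."""
        self.bits[index] = True

table = SlotTable()
for _ in range(8):
    table.accept()          # eight clients fill slots A-H

try:
    table.accept()          # a ninth client finds no set bit
except RuntimeError as err:
    print(err)

table.release(2)            # offspring C cleans up
print(table.accept())       # the next client lands on slot index 2
```

The point of the model is that a slot only becomes usable again when `release` runs. An offspring that never notices its client has gone never sets its bit, and the table fills up.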

My guess is there is some problem with this new site that causes an
unexpected exit of the client, either from a network failure or program
bust, which leaves the DmQA server (A-H) on the Alpha in a 'hung' 
situation.   How long it takes the server to realize the client is gone
depends on the state of the socket and on the keepalive timers for UCX
(probe and drop). Eventually, however, it "should" clean up and go
back to the idle state, ready to accept a new connection.

To your questions, specifically:

>How do I know that a DMQA process really is a zombie process? They all seem 

If dmqa_show shows the EFN clear (the server is busy), yet you see no activity
for the socket via UCX for, say, 10 minutes after you know the client is
dead, then something is broken, and you will have to restart that server.

>before). Can I consider the process to be a zombie if it suddenly stops 
>increasing the UCX device/socket I/O count, CPU time or what? 

Not exactly, unless you know the client is gone.  If the client is just
in an idle state you won't see any activity.
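
Put as a rule of thumb, the two answers above combine into one check. The Python below is just a sketch of that reasoning: `looks_hung` and its parameters are hypothetical names, the 10-minute grace period is the figure mentioned in this reply, and whether the client is really dead is something you have to establish yourself on the DG side.

```python
def looks_hung(efn_clear, io_before, io_after,
               client_known_dead, minutes_since_death, grace_minutes=10):
    """Rule of thumb: busy slot + dead client + no I/O past the grace period."""
    no_activity = io_after == io_before
    if not efn_clear:
        return False            # bit set: the server is idle, not hung
    if not client_known_dead:
        return False            # an idle but live client also shows no I/O
    return no_activity and minutes_since_death >= grace_minutes

# Port 4002 from .0: "I/O completed" stayed at 588. Here we ASSUME the
# client has been confirmed dead for an hour.
print(looks_hung(True, 588, 588, client_known_dead=True, minutes_since_death=60))
# Same counters, but nobody has confirmed that the client is gone.
print(looks_hung(True, 588, 588, client_known_dead=False, minutes_since_death=0))
```

A flat I/O count on its own is therefore not enough; the second call returns False because an idle client looks exactly the same from the UCX side.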

>Is there anything we can do to prevent this happening all the time? Where 

It is important for the client code to exit gracefully.  The Qadapter does
not have 'great' link error recovery, so my best advice is to always try
to issue a pams_exit from the client when exiting the DG program.
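
The general pattern is "always detach, whatever happens on the way out". The Python below only illustrates that pattern: the real DG client would call the actual DmQA client library, and the two `*_stub` functions here are hypothetical stand-ins that just record that they were called.

```python
from contextlib import contextmanager

calls = []  # records the order of stub calls for the demonstration

def pams_attach_stub():
    """Hypothetical stand-in for the real attach call."""
    calls.append("attach")

def pams_exit_stub():
    """Hypothetical stand-in for pams_exit."""
    calls.append("exit")

@contextmanager
def dmq_session():
    pams_attach_stub()
    try:
        yield
    finally:
        # Runs on normal return and on exceptions alike.
        pams_exit_stub()

def send_files(fail):
    with dmq_session():
        if fail:
            raise IOError("simulated failure while sending")
        calls.append("send")

try:
    send_files(fail=True)
except IOError:
    pass

print(calls)   # the exit stub ran even though the send failed
```

This covers program errors only. If the client process is killed outright or the network drops, no exit call is made, and the server side has to rely on the keepalive timers.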

>should we start to look? At the network? A new DMQA version? UCX configuration?

You can try changing the probe/drop timers in UCX, but it sounds like 
once these servers get into this state, they never recover (2 days).  
The default probe/drop timers give you about 10 minutes, after which the
link is declared dead and the server should be signaled for cleanup.
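
For reference, the probe/drop timers only apply to sockets that have keepalive switched on. The KEEP option in the UCX displays in .0 presumably corresponds to the standard SO_KEEPALIVE socket option, but that mapping is my assumption. The Python sketch below shows how that option is set and checked on a generic TCP socket; it does not touch the timers themselves, which are system-wide settings in UCX.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Ask the TCP stack to probe the connection when it has been idle,
    # so that a vanished peer is eventually detected.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    print("keepalive enabled:", enabled != 0)
finally:
    sock.close()
```

Since the sockets in .0 already show KEEP, the option itself is probably not the problem; the open question is why the servers do not recover once the timers expire.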

Hope this helps,
Bruce.