
Conference hydra::axp-developer

Title: Alpha Developer Support
Notice: [email protected], 800-332-4786
Moderator: HYDRA::SYSTEM
Created: Mon Jun 06 1994
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 3722
Total number of notes: 11359

3266.0. "Wilco International - Point 21564" by KZIN::ASAP () Tue Mar 04 1997 07:48

    Company Name :  Wilco International - Point 21564
    Contact Name :  Nick Everest
    Phone        :  0171.418.45.00
    Fax          :  0171.418.45.04
    Email        :  [email protected]
    Date/Time in :   4-MAR-1997 12:47:56
    Entered by   :  John Wood
    SPE center   :  REO

    Category     :  UNIX
    OS Version   :  
    System H/W   :  


    Brief Description of Problem:
    -----------------------------

From:	ESSB::ESSB::MRGATE::"ILO::ESSC::rallen"  4-MAR-1997 12:42:11.52
To:	RDGENG::ASAP
CC:	
Subj:	ESCALATION: POINT No.21564, Wilco International TO ASAP READING:    

From:	NAME: ESCTECH@ILO          
	TEL: (822-)6704          
	ADDR: ILO                  <rallen@ESSC@ILO>
To:	ASAP@RDGENG@MRGATE


Hello - 

POINT Log Number	 21564

Company Name 	Wilco International

Engineers name	Nick Everest

Telephone Number 	0171.418.45.00	

Fax Number		0171.418.45.04

E-mail Address	[email protected]

Operating System, Version	UNIX

Platform			

Problem Statement		

     We have C programs running on an OSF UNIX machine, using JAM screens 
     for a user interface. In case of application errors we trap signals in 
     these processes by setting up signal handling as follows:
     
          signal (SIGABRT, w_signal);
          signal (SIGFPE,  w_signal);
          signal (SIGILL,  w_signal);
          signal (SIGINT,  w_signal); 
          signal (SIGSEGV, w_signal);
          signal (SIGTERM, w_signal);
     
     w_signal is our signal handling function. The structure of w_signal 
     is:
     
     void w_signal (int pi_signal)
     {
        printf ("Found system error ('%d')", -pi_signal);
     
        w_db_close ();  /* Closes all our database connections */
     
        signal (SIGABRT, SIG_DFL);
        signal (SIGFPE,  SIG_DFL);
        signal (SIGILL,  SIG_DFL);
        signal (SIGSEGV, SIG_DFL);
        signal (SIGTERM, SIG_DFL);
        signal (SIGINT,  SIG_DFL);
     
        exit (1);
     }
     
     We have periodically experienced problems when our application 
     takes a segmentation fault (SIGSEGV). The users report that they see 
     the message 'Found system error ('11')', and then their session hangs. 
     They then press Control-C in an attempt to regain control of the 
     session. The session is seriously corrupted, though, and they then 
     close the connection with the UNIX machine.
     
     After some time it becomes apparent that the machine is slowing 
     considerably. By executing 'ps' the slowdown can be traced to the 
     existence of some 'zombie' processes on the machine, left over from 
     the applications that faulted.
     
     Do you think that we are trapping the signals or exiting incorrectly? 
     Any advice or suggestions would be appreciated.
     


Regards,

Richard
Richard Allen
Pre-Sales Technical Support                      DTN fax 822.44.45  
European Software Center                        DTN phone 822.43.52  
Digital Equipment International B.V.    [email protected] 
        ~ FREEPHONE NUMBERS ARE AVAILABLE ON REQUEST ~
T.R     Title   User   Personal Name   Date   Lines
3266.1  RDGENG::WOOD_J  [email protected]  Thu Mar 06 1997 08:08  168 lines
Date:	 6-MAR-1997 12:54:40.40
From:	DEC:.REO.REOVTX::WOOD_J       "[email protected]"
Subj:	Digital ASAP #21564: defunct (zombie) processes & signals
To:	smtp%"[email protected]"

Nick,

I did find mention of a bug in older Digital UNIX whereby when
the system is short on memory, the output of "ps" is affected
such that it incorrectly displays "<defunct>" for some processes.
However, you're using Digital UNIX v3.2G, which is fairly recent.


Who is the parent process of the defunct processes? E.g. do
something like:
    % ps aux | grep defunct
to identify the PID (second column) of defunct processes, then
do:
    % ps j -p <PID>
and look at the PPID parent-PID field (third column). If your
application is the parent, then I suspect a programming bug;
otherwise it might be an o/s bug. Let me know.


I have done some reading about signals and under what
circumstances a child process becomes defunct (aka zombie).
Attached is an example program which I have been using,
which includes some comments. It may be that your 
application needs to invoke the code of the do_sigaction()
routine.

Does your application create many child processes? If so,
it is possible that your application is not handling the
termination of the child processes properly. This could be
likely if your program was developed on a System-V variant
of UNIX, and is being ported to Digital UNIX where the
default signal handling is different. If possible, I would 
recommend the use of POSIX 1003.1a signals (e.g. sigaction(),
sigprocmask(), etc.).

I'm not convinced that having defunct processes should
unduly slow down your system, because my understanding is
that defunct processes occupy a minimum of memory. However,
you could be running out of process slots, but if this
was the case I'd expect you to get an error message.


A comment about your signal-handler code "w_signal()". There
are restrictions as to what can safely be performed within
a signal handler: see the listing of safe (re-entrant)
functions under the "man 4 signal" man-page. Note that
printf() is not amongst them! 


Hope this helps.

Regards,
  John Wood
  Software Partner Engineering (UK)
  Digital Equipment Co
-----------------------

/* 
    defunct.c 		John Wood  6-March-1997
    
    Program to examine behaviour of defunct (zombie) processes.
    
    A defunct process is created when a parent process forks a child,
    and the child process exits but the parent does not wait for, or
    receive a signal from, the child.
    
    A defunct process has freed up the program's text & data segments,
    and has closed all files, but it still takes up a process table slot,
    and a bit of memory for its status.
    
    On Digital UNIX v3.0 and greater, this program will by default create
    a number of defunct children, which are only tidied up when the parent
    process exits.
        
    Run this program in the background, then use "ps" to see the defunct
    (zombie) child-processes.
    
    E.g. 
    	cc defunct.c -o defunct.exe	-build .exe
    	defunct.exe &		-run program in background
    	ps aux | grep defunct	-should see all the defunct child procs
       
    To prevent the child processes from becoming defunct, you can make the
    do_sigaction() routine get called. E.g.
    	cc -DDO_SIG defunct.c -o defunct_dosig.exe
    	defunct_dosig.exe &
    	ps aux | grep defunct	-won't see any defunct children
    	
    Alternatively the program can call waitpid() to reap the children. E.g.
    	cc -DDO_WAIT defunct.c -o defunct_dowait.exe
    	defunct_dowait.exe &
    	ps aux | grep defunct	-won't see any defunct children
    
*/

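#include <stdio.h>
#include <stdlib.h>	/* exit() */
#include <signal.h>
#include <errno.h>
#include <unistd.h>	/* fork(), sleep() */
#include <sys/types.h>	/* pid_t */
#include <sys/wait.h>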


#define NUM_CHILDREN	10	/* number of children to fork */
#define PARENT_SLEEP	60	/* sleep time for parent in seconds */


void do_sigaction()
{
    struct sigaction action;

    /* On Digital UNIX v3.0 and above, need this code to terminate child */
    /* processes so they don't hang around and clutter up the process */
    /* table as <defunct>. See sigaction(2) man-page, ref. SA_NOCLDWAIT */

    /* Set the fields by name: the member order of struct sigaction is */
    /* not the same on every system, so a positional initializer is fragile */
    action.sa_handler = SIG_IGN;
    sigemptyset( &action.sa_mask );
    action.sa_flags = SA_NOCLDWAIT;

    printf( "\n\nCalling sigaction() for SIGCHLD with SIG_IGN & SA_NOCLDWAIT\n\n" );
    if (0 != sigaction( SIGCHLD, &action, 0 ))
        perror( "sigaction" ), exit(1);
}


void fork_children()
{
    int count;
    pid_t childpid;
    
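    for (count=0; count < NUM_CHILDREN; count++)
    {
        childpid = fork();
        if (childpid == (pid_t) -1)
            perror( "fork" ), exit(1);	/* fork failed; no child created */
        else if (childpid > 0)
            printf( "Child %d process %d created\n", count, (int) childpid );
        else
            exit( 0 );	/* child; exit now => defunct zombie */
    }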
}


void do_waitpid()
{
    pid_t pid;
    int status_locn;
    
    printf( "\nCalling waitpid() with WNOHANG\n" );
    while ((pid = waitpid( (pid_t) -1, &status_locn, WNOHANG )) > 0)
    {
        printf( "waitpid() returned <%d>\n", (int) pid );
    }
}


int main(void)
{
#ifdef DO_SIG
    do_sigaction();
#endif

    fork_children();
       
#ifdef DO_WAIT
    sleep( 1 );		/* give the children time to exit before reaping */
    do_waitpid();
#endif

    sleep( PARENT_SLEEP );	/* wait a while so user can see defunct children */
    printf( "parent exiting now\n" );
    return 0;
}

3266.2  RDGENG::WOOD_J  [email protected]  Fri Mar 14 1997 02:46  24 lines
Date:	 6-MAR-1997 15:50:27.79
From:	DEC:.REO.REOVTX::WOOD_J       "[email protected]"
Subj:	Re: Digital ASAP #21564: defunct (zombie) processes & signal
To:	SMTP%"[email protected]"

>     The processes for which these problems occur are actually 
>     single-threaded - we don't create child processes. The comments re: 
>     our signal handling function, w_signal, are interesting. Could the 
>     problem be a result of our call to printf, or any Sybase dblib calls?

Yes, the problem *could* be a result of printf or the Sybase dblib calls.
It would be better if, for example, your signal-handler for Ctrl-C
set a global flag indicating that the user wants to exit. Control would
then return from the signal handler back to main-line code, and the
current transaction could be completed. The global flag could then be
examined to see if a graceful exit should be performed before starting
the next transaction. You would need to declare such a global flag as
volatile to prevent compiler optimisations. Retro-fitting this to your
existing application probably isn't trivial. Nor can I guarantee that
it will resolve your defunct process problems. However, it is somewhere
to start.

  John Wood

3266.3  RDGENG::WOOD_J  [email protected]  Fri Mar 14 1997 04:52  16 lines
Date:	14-MAR-1997 09:46:24.60
From:	SMTP%"[email protected]"
Subj:	Re[2]: Digital ASAP #21564: defunct (zombie) processes & sig
To:	[email protected] ([email protected])

     John,
     
     I've decided to take on board your suggestions re. the function calls 
     within our signal handler. Since this problem is not a known problem 
     I'm happy for you to close this call.
     
     Thanks for your assistance,
     
     Nick