How do I get rid of zombie processes that persevere?

Unfortunately, it's impossible to generalize how the death of
child processes should be handled, because the exact mechanism
varies from one flavor of Unix to another.

First of all, by default, you have to do a wait() for child
processes under ALL flavors of Unix. That is, there is no flavor
of Unix that I know of that will automatically clean up child
processes that exit if you don't do anything to tell it to do so.
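To make this concrete, here is a minimal sketch, assuming a POSIX system with fork() and waitpid(); the helper name run_and_reap() is hypothetical. The child exits immediately, and it remains a zombie until the parent collects its exit status.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code`, then
   reap it. Until waitpid() returns, the exited child is a zombie.
   Returns the child's exit status, or -1 on error. */
static int run_and_reap(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child: exit at once */

    int status;
    if (waitpid(pid, &status, 0) == -1)   /* parent: collect it */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_and_reap(7) should return 7 once the child has been reaped.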

Second, under some SysV-derived systems, if you do
"signal(SIGCHLD, SIG_IGN)" (well, actually, it may be SIGCLD
instead of SIGCHLD, but most of the newer SysV systems have
"#define SIGCHLD SIGCLD" in the header files), then child
processes will be cleaned up automatically, with no further
effort on your part. The best way to find out if it works at
your site is to try it, although if you are trying to write
portable code, it's a bad idea to rely on this in any case.
Unfortunately, POSIX doesn't let you rely on it either: the
original POSIX.1 standard leaves the effect of setting SIGCHLD
to SIG_IGN unspecified, so you can't do it if your program is
supposed to be POSIX-compliant.
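One way to "try it" at your site is a small probe like the following sketch (the helper name sigchld_ign_autoreaps() is hypothetical). It relies on how systems that do automatic clean-up, such as Linux, behave: waitpid() waits for the child to terminate and then fails with ECHILD, because no zombie is left to collect.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe. Returns 1 if ignoring SIGCHLD made the system
   clean up the child automatically, 0 if the child still had to be
   waited for, -1 on error. */
static int sigchld_ign_autoreaps(void)
{
    void (*old)(int) = signal(SIGCHLD, SIG_IGN);
    if (old == SIG_ERR)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        signal(SIGCHLD, old);
        return -1;
    }
    if (pid == 0)
        _exit(0);                 /* child: exit at once */

    /* With automatic clean-up there is nothing to reap, so waitpid()
       fails with ECHILD once the child is gone. */
    int result;
    if (waitpid(pid, NULL, 0) == -1)
        result = (errno == ECHILD) ? 1 : -1;
    else
        result = 0;

    signal(SIGCHLD, old);         /* restore the previous disposition */
    return result;
}
```

A result of 0 means this trick does not work on that system and you need one of the approaches described below.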

So, what's the POSIX way? As mentioned earlier, you must wait
for your children; the usual approach is to install a SIGCHLD
handler that does the waiting. Under POSIX, signal handlers are
installed with sigaction(). Since you are interested only in
terminated children, not in "stopped" ones, add SA_NOCLDSTOP to
sa_flags. Waiting without blocking is done with waitpid(): the
first argument should be -1 (wait for any child) and the third
should be WNOHANG. This is the most portable way, and it is
likely to become more portable in the future.
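A minimal sketch of the POSIX approach, assuming a POSIX.1-2008 environment; the names reap_children(), install_reaper(), and the reaped_count counter are hypothetical. The handler calls waitpid(-1, NULL, WNOHANG) in a loop, because a single SIGCHLD may stand for several dead children (as explained below), and it saves and restores errno so the interrupted code does not see it change.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter, only so the effect can be observed. */
static volatile sig_atomic_t reaped_count = 0;

/* SIGCHLD handler: reap every child that has already terminated,
   returning as soon as none are left (WNOHANG). */
static void reap_children(int sig)
{
    int saved_errno = errno;   /* waitpid() may change errno */
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_count++;
    errno = saved_errno;
}

/* Install the handler with sigaction(); SA_NOCLDSTOP means stopped
   children do not trigger it. Returns 0 on success, -1 on error. */
static int install_reaper(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

Call install_reaper() once, before the first fork(); from then on, children are reaped as they exit.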

If your system doesn't support POSIX, there are a number of ways.
The easiest way is signal(SIGCHLD, SIG_IGN), if it works.
If SIG_IGN cannot be used to force automatic clean-up, then you've
got to write a signal handler to do it. It isn't easy at all to
write a signal handler that does things right on all flavors of
Unix, because of the following inconsistencies:

On some flavors of Unix, the SIGCHLD signal handler is called if
one *or more* children have died. This means that if your signal
handler only does one wait() call, then it won't clean up all of
the children. Fortunately, I believe that all Unix flavors for
which this is the case have available to the programmer the
wait3() or waitpid() call, which allows the WNOHANG option to
check whether or not there are any children waiting to be cleaned
up. Therefore, on any system that has wait3()/waitpid(), your
signal handler should call wait3()/waitpid() over and over again
with the WNOHANG option until there are no children left to clean
up. The waitpid() call is the preferred interface, since it is
part of POSIX.
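The following sketch shows why one wait() per signal is not enough on systems where signals do not queue (it assumes POSIX.1-2008; naive_handler(), handler_calls, and leftover_zombies() are hypothetical names). SIGCHLD is blocked while several children exit, waitid() with WNOWAIT confirms each exit without reaping the child, and when the signal is unblocked the handler runs only once.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter of handler invocations. */
static volatile sig_atomic_t handler_calls = 0;

/* Deliberately naive SIGCHLD handler: one wait() per call. */
static void naive_handler(int sig)
{
    (void)sig;
    handler_calls++;
    wait(NULL);
}

/* Hypothetical demo: fork n children that all exit while SIGCHLD is
   blocked, then unblock it. Returns how many dead children the naive
   handler left behind (they are reaped here), or -1 on error. */
static int leftover_zombies(int n)
{
    struct sigaction sa;
    sigset_t block, old;
    int left = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = naive_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        if (pid > 0) {
            siginfo_t info;
            /* Wait for the exit, but leave the child a zombie. */
            waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT);
        }
    }

    /* The n SIGCHLDs collapsed into one pending signal, which is
       delivered before sigprocmask() returns. */
    sigprocmask(SIG_SETMASK, &old, NULL);

    /* Reap whatever the naive handler missed. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        left++;
    return left;
}
```

On Linux, for example, leftover_zombies(3) returns 2 and handler_calls ends up at 1; a handler that loops on waitpid() with WNOHANG would have collected all three children.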

On SysV-derived systems, SIGCHLD signals are regenerated if there
are child processes still waiting to be cleaned up after you exit
the SIGCHLD signal handler. Therefore, it's safe on most SysV
systems to assume, when the signal handler gets called, that you
only have to clean up one child, and that the handler will be
called again if there are more to clean up after it exits.

On older systems, there is no way to prevent signal handlers
from being automatically reset to SIG_DFL when the signal
handler gets called. On such systems, you have to put
"signal(SIGCHLD, catcher_func)" (where "catcher_func" is the
name of the handler function) as the last thing in the signal
handler, so that the handler is re-installed.

Fortunately, newer implementations allow signal handlers to be
installed without being reset to SIG_DFL when the handler
function is called. On systems that do not have wait3()/waitpid()
but do have SIGCLD, you need to re-install the handler with a
call to signal() each time the handler is called, and only after
doing at least one wait() within it: on such systems, installing
a SIGCLD handler while a dead child is still waiting to be cleaned
up raises the signal again immediately. For backward-compatibility
reasons, System V keeps the old semantics of signal() (the handler
is reset when called). Signal handlers that stick can be installed
with sigaction() or sigset().
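For a system without wait3()/waitpid() and with the old reset-on-call signal(), the handler might look like this sketch. The name catcher_func comes from the text above; the catches counter is hypothetical, and some SysV systems spell the signal SIGCLD. It does exactly one wait() and re-installs itself only as its last action. Where signal() does not reset the handler (Linux with glibc, for example), the re-installation is redundant but harmless.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counter of child deaths handled. */
static volatile sig_atomic_t catches = 0;

/* Old-style handler for systems without wait3()/waitpid():
   exactly one wait() per call, then re-install as the last step. */
static void catcher_func(int sig)
{
    int saved_errno = errno;
    (void)sig;
    if (wait(NULL) > 0)                /* reap one child first... */
        catches++;
    signal(SIGCHLD, catcher_func);     /* ...then re-install */
    errno = saved_errno;
}
```

Install it once with signal(SIGCHLD, catcher_func) before forking; each invocation then arms the next one.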

The summary of all this is that on systems that have waitpid()
(POSIX) or wait3(), you should use that and your signal handler
should loop, and on systems that don't, you should have one call
to wait() per invocation of the signal handler.

One more thing -- if you don't want to go through all of this
trouble, there is a portable way to avoid this problem, although
it is somewhat less efficient. Your parent process should fork
and then immediately wait for the child process to terminate.
The child process then forks again, giving you a
child and a grandchild. The child exits immediately (and hence
the parent waiting for it notices its death and continues to
work), and the grandchild does whatever the child was originally
supposed to. Since its parent died, it is inherited by init,
which will do whatever waiting is needed. This method is
inefficient because it requires an extra fork, but is pretty much
completely portable.
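The double-fork technique can be sketched as follows, assuming POSIX.1-2008; double_fork() and its task/arg parameters are hypothetical. The parent waits only for the short-lived child, so no zombie is left behind, and the grandchild is adopted by init once its parent has exited.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run task(arg) in a grandchild that init will
   reap. Returns 0 on success, -1 on failure. */
static int double_fork(void (*task)(void *), void *arg)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        /* Child: fork the grandchild, then exit immediately. */
        pid_t grandchild = fork();
        if (grandchild == -1)
            _exit(EXIT_FAILURE);
        if (grandchild == 0) {
            /* Grandchild: do the real work. Its parent is about to
               exit, so init adopts it and does the waiting. */
            if (task != NULL)
                task(arg);
            _exit(EXIT_SUCCESS);
        }
        _exit(EXIT_SUCCESS);
    }

    /* Parent: wait right away for the short-lived child. */
    int status;
    while (waitpid(child, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
               ? 0 : -1;
}
```

Passing NULL as task simply creates and discards a grandchild; after double_fork() returns 0, the parent has no children left to wait for.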


