谈谈守护进程与僵尸进程

最新推荐文章于 2023-09-20 20:51:38 发布

最新推荐文章于 2023-09-20 20:51:38 发布 · 347 阅读

文章标签：

#运维 #数据结构与算法

本文深入探讨了在创建守护进程时采用二次fork的原因，解析了内核中进程状态的变化，特别是ZOMBIE状态的由来及处理机制。通过内核代码分析，解释了守护进程二次fork如何避免僵尸进程的产生，确保系统资源的有效管理和稳定性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

04年时维护的第一个商业服务就用了两次fork产生守护进程的做法，前两天在网上看到许多帖子以及一些unix书籍，认为一次fork后产生守护进程足够了，各有道理吧，不过多了一次fork到底是出于什么目的呢？

进程也就是task，看看内核里维护进程的数据结构task_struct，这里有两个成员：

struct task_struct { volatile long state; int exit_state; ... }
看看include/linux/sched.h里的value取值：

#define TASK_RUNNING 0 #define TASK_INTERRUPTIBLE 1 #define TASK_UNINTERRUPTIBLE 2 #define __TASK_STOPPED 4 #define __TASK_TRACED 8 /* in tsk->exit_state */ #define EXIT_ZOMBIE 16 #define EXIT_DEAD 32 /* in tsk->state again */ #define TASK_DEAD 64 #define TASK_WAKEKILL 128 #define TASK_WAKING 256 #define TASK_STATE_MAX 512
可以看到，进程状态里除了大家都理解的running/interuptible/uninterruptible/stop等状态外，还有一个ZOMBIE状态，这个状态是怎么回事呢？

这是因为linux里的进程都属于一颗树，树的根结点是linux系统初始化结束阶段时启动的init进程，这个进程的pid是1，所有的其他进程都是它的子孙。除了init，任何进程一定有他的父进程，而父进程会负责分配（fork）、回收（wait4）它申请的进程资源。这个树状关系也比较健壮，当某个进程还在运行时，它的父进程却退出了，这个进程却没有成为孤儿进程，因为linux有一个机制，init进程会接管它，成为它的父进程。这也是守护进程的由来了，因为守护进程的其中一个要求就是希望init成为守护进程的父进程。

如果某个进程自身终止了，在调用exit清理完相关的内容文件等资源后，它就会进入ZOMBIE状态，它的父进程会调用wait4来回收这个task_struct，但是，如果父进程一直没有调用wait4去释放子进程的task_struct，问题就来了，这个task_struct谁来回收呢？永远没有人，除非父进程终止后，被init进程接管这个ZOMBIE进程，然后调用wait4来回收进程描述符。如果父进程一直在运行着，这个ZOMBIE会永远的占用系统资源，用KILL发任何信号量也不能释放它。这是很可怕的，因为服务器上可能会出现无数ZOMBIE进程导致机器挂掉。

来看看内核代码吧。进程在退出时执行sys_exit（C程序里在main函数返回会执行到），而它会调用do_exit，do_exit首先清理进程使用的资源，然后调用exit_notify方法，将进程置为僵尸ZOMBIE状态，决定是否要以init进程做为当前进程的父进程，最后通知当前进程的父进程：

kernel/exit.c

static void exit_notify(struct task_struct *tsk) { int state; struct task_struct *t; struct list_head ptrace_dead, *_p, *_n; if (signal_pending(tsk) && !tsk->signal->group_exit && !thread_group_empty(tsk)) { /* * This occurs when there was a race between our exit * syscall and a group signal choosing us as the one to * wake up. It could be that we are the only thread * alerted to check for pending signals, but another thread * should be woken now to take the signal since we will not. * Now we'll wake all the threads in the group just to make * sure someone gets all the pending signals. */ read_lock(&tasklist_lock); spin_lock_irq(&tsk->sighand->siglock); for (t = next_thread(tsk); t != tsk; t = next_thread(t)) if (!signal_pending(t) && !(t->flags & PF_EXITING)) { recalc_sigpending_tsk(t); if (signal_pending(t)) signal_wake_up(t, 0); } spin_unlock_irq(&tsk->sighand->siglock); read_unlock(&tasklist_lock); } write_lock_irq(&tasklist_lock); /* * This does two things: * * A. Make init inherit all the child processes * B. Check to see if any process groups have become orphaned * as a result of our exiting, and if they have any stopped * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2) */ INIT_LIST_HEAD(&ptrace_dead); forget_original_parent(tsk, &ptrace_dead); BUG_ON(!list_empty(&tsk->children)); BUG_ON(!list_empty(&tsk->ptrace_children)); /* * Check to see if any process groups have become orphaned * as a result of our exiting, and if they have any stopped * jobs, send them a SIGHUP and then a SIGCONT. (POSIX 3.2.2.2) * * Case i: Our father is in a different pgrp than we are * and we were the only connection outside, so our pgrp * is about to become orphaned. */ t = tsk->real_parent; if ((process_group(t) != process_group(tsk)) && (t->signal->session == tsk->signal->session) && will_become_orphaned_pgrp(process_group(tsk), tsk) && has_stopped_jobs(process_group(tsk))) { __kill_pg_info(SIGHUP, (void *)1, process_group(tsk)); __kill_pg_info(SIGCONT, (void *)1, process_group(tsk)); } /* Let father know we died * * Thread signals are configurable, but you aren't going to use * that to send signals to arbitary processes. * That stops right now. * * If the parent exec id doesn't match the exec id we saved * when we started then we know the parent has changed security * domain. * * If our self_exec id doesn't match our parent_exec_id then * we have changed execution domain as these two values started * the same after a fork. * */ if (tsk->exit_signal != SIGCHLD && tsk->exit_signal != -1 && ( tsk->parent_exec_id != t->self_exec_id || tsk->self_exec_id != tsk->parent_exec_id) && !capable(CAP_KILL)) tsk->exit_signal = SIGCHLD; /* If something other than our normal parent is ptracing us, then * send it a SIGCHLD instead of honoring exit_signal. exit_signal * only has special meaning to our real parent. */ if (tsk->exit_signal != -1 && thread_group_empty(tsk)) { int signal = tsk->parent == tsk->real_parent ? tsk->exit_signal : SIGCHLD; do_notify_parent(tsk, signal); } else if (tsk->ptrace) { do_notify_parent(tsk, SIGCHLD); } state = EXIT_ZOMBIE; if (tsk->exit_signal == -1 && tsk->ptrace == 0) state = EXIT_DEAD; tsk->exit_state = state; /* * Clear these here so that update_process_times() won't try to deliver * itimer, profile or rlimit signals to this task while it is in late exit. */ tsk->it_virt_value = 0; tsk->it_prof_value = 0; write_unlock_irq(&tasklist_lock); list_for_each_safe(_p, _n, &ptrace_dead) { list_del_init(_p); t = list_entry(_p,struct task_struct,ptrace_list); release_task(t); } /* If the process is dead, release it - nobody will wait for it */ if (state == EXIT_DEAD) release_task(tsk); /* PF_DEAD causes final put_task_struct after we schedule. */ preempt_disable(); tsk->flags |= PF_DEAD; }
大家可以看到这段内核代码的注释非常全。forget_original_parent这个函数还会把该进程的所有子孙进程重设父进程，交给init进程接管。

回过头来，看看为什么守护进程要fork两次。这里有一个假定，父进程生成守护进程后，还有自己的事要做，它的人生意义并不只是为了生成守护进程。这样，如果父进程fork一次创建了一个守护进程，然后继续做其它事时阻塞了，这时守护进程一直在运行，父进程却没有正常退出。如果守护进程因为正常或非正常原因退出了，就会变成ZOMBIE进程。

如果fork两次呢？父进程先fork出一个儿子进程，儿子进程再fork出孙子进程做为守护进程，然后儿子进程立刻退出，守护进程被init进程接管，这样无论父进程做什么事，无论怎么被阻塞，都与守护进程无关了。所以，fork两次的守护进程很安全，避免了僵尸进程出现的可能性。