The Birth of a Bug: A Deadlock Caused by Signal Handling

Original article. First published 2019-09-05 23:35:09; last modified 2024-07-22 18:35:07.

License: CC 4.0 BY-SA

Tags: #signal #deadlock #c++

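This article examines a puzzling deadlock in a multithreaded program in which the parent process keeps one child process alive, using a signal handler together with a mutex for thread synchronization. It analyzes the cause of the deadlock and proposes a fix, stressing the importance of lock-free structures in signal handlers.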

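This bug comes from a strange symptom in a real project: the code had no obvious locking problem, yet at run time it behaved as if it had deadlocked. I have reduced the business logic to this: a parent process keeps one child process alive at all times. (If you repost this article, please credit breaksoftware's CSDN blog.)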

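First we define a struct, ProcessGuard, which holds the child's process ID and the lock that protects it, so that the struct can be accessed safely from multiple threads.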

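#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/types.h>

// The child's PID and the mutex that protects it.
struct ProcessGuard {
    pthread_mutex_t pids_mutex;
    pid_t pid;
};

// Shared by the monitor thread and the signal handler.
struct ProcessGuard* g_guard = NULL;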

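The parent's main thread starts a thread that continuously checks whether the pid in ProcessGuard is 0, meaning that no child process exists. If so, it creates a child process and records the child's process ID in pid.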

// Body of the child process: print a heartbeat once per second, forever.
void child_process(void) {
    while (1) {
        printf("This is the child process. My PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());
        sleep(1);
    }
}

// Monitor thread: whenever no child is recorded (pid == 0), fork a new one.
void* create_process_routine(void* arg) {
    (void)arg;
    printf("This is the child thread of parent process. My PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());
    while (1) {
        pthread_mutex_lock(&g_guard->pids_mutex);

        if (g_guard->pid != 0) {
            // A child is already running: release the lock before polling again,
            // so that every lock is paired with an unlock.
            pthread_mutex_unlock(&g_guard->pids_mutex);
            continue;
        }

        pid_t pid = fork();
        sleep(1);
        printf("Create child process %d.\n", pid);

        if (pid < 0) {
            perror("fork failed");
        }
        else if (pid == 0) {
            // child process: child_process() never returns
            child_process();
            break;
        }
        else {
            // parent process
            g_guard->pid = pid;
            printf("dispatch task to process. pid is %d.\n", pid);
        }

        pthread_mutex_unlock(&g_guard->pids_mutex);
    }
    return NULL;
}

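In the parent's main thread we register a handler for SIGCHLD. When the child is killed, the handler sets pid in ProcessGuard to 0, so that the parent's monitor thread starts a new child process.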

// SIGCHLD handler: mark the child as gone so the monitor thread forks a new one.
void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), (unsigned long)pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}

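Finally, the parent's main function: after initializing the shared structure, it registers the signal handler and starts the thread that creates child processes.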

int main(void) {
    pthread_t creat_process_tid;

    g_guard = malloc(sizeof(struct ProcessGuard));
    if (g_guard == NULL) {
        perror("malloc failed");
        exit(1);
    }
    if (pthread_mutex_init(&g_guard->pids_mutex, NULL) != 0) {
        perror("init pids_mutex error.");
        exit(1);
    }
    g_guard->pid = 0;

    printf("This is the Main thread of parent process.PID is %d.My thread_id is %lu.\n", getpid(), (unsigned long)pthread_self());

    // Install the SIGCHLD handler before any child is forked.
    signal(SIGCHLD, sighandler);

    pthread_create(&creat_process_tid, NULL, (void* (*)(void*))create_process_routine, NULL);

    while (1) {
        printf("Get task from network.\n");
        sleep(1);
    }

    // Not reached: the loop above never exits.
    pthread_mutex_destroy(&g_guard->pids_mutex);

    return 0;
}

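In the code above, the lock is used only in the thread function create_process_routine and in the signal handler sighandler. Neither calls the other anywhere in the code, so a deadlock should be impossible. In practice, that is not what happens.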

We run the program and kill the child process, and find that the parent never starts a new one.

$ ./test      
This is the Main thread of parent process.PID is 17641.My thread_id is 140014057678656.
Get task from network.
This is the child thread of parent process. My PID is 17641.My thread_id is 140014049122048.
Create child process 17643.
dispatch task to process. pid is 17643.
Create child process 0.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
This is the child process. My PID is 17643.My thread_id is 140014049122048.
Get task from network.
This is the parent process.Catch signal 17.My PID is 17641.My thread_id is 140014049122048.
Get task from network.
Get task from network.
Get task from network.
Get task from network.
Get task from network.

This contradicts the design and does not seem logical, so we attach gdb to the parent process.

Attaching to process 17641
[New LWP 17642]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
28      ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7f57902be740 (LWP 17641) "test" 0x00007f578fb7a9d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fffd2b41190, remaining=remaining@entry=0x7fffd2b41190)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
  2    Thread 0x7f578fa95700 (LWP 17642) "test" __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
(gdb) t 2
[Switching to thread 2 (Thread 0x7f578fa95700 (LWP 17642))]
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#2  0x000055c512c29a9d in sighandler ()
#3  <signal handler called>
#4  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:133
#5  0x00007f578fe91023 in __GI___pthread_mutex_lock (mutex=0x55c51383e260) at ../nptl/pthread_mutex_lock.c:78
#6  0x000055c512c29b42 in create_process_routine ()
#7  0x00007f578fe8e6db in start_thread (arg=0x7f578fa95700) at pthread_create.c:463
#8  0x00007f578fbb788f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

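Looking at thread 2's call stack, frame 5 and frame 1 both try to lock the same mutex (0x55c51383e260). In the thread function every lock is meant to be paired with an unlock, so where does the second lock come from?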

Frame 1's lock comes from the function in frame 2, sighandler, that is, the code below.

void sighandler(int signum) {
    printf("This is the parent process.Catch signal %d.My PID is %d.My thread_id is %lu.\n", signum, getpid(), (unsigned long)pthread_self());
    pthread_mutex_lock(&g_guard->pids_mutex);
    g_guard->pid = 0;
    pthread_mutex_unlock(&g_guard->pids_mutex);
}

So here is the question: create_process_routine never calls sighandler, so where does this call come from?

The Linux manual page signal(7) contains this passage about signals:

A process-directed signal may be delivered to any
one of the threads that does not currently have the signal blocked.
If more than one of the threads has the signal unblocked, then the
kernel chooses an arbitrary thread to which to deliver the signal.

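In other words, a process-directed signal is delivered to any one thread that does not currently block that signal, and the kernel decides which one. So sighandler may run on the main thread or on the child thread. The program output confirms this: the thread ID printed by sighandler (140014049122048) is that of the thread running create_process_routine, not that of the main thread. The handler interrupted that thread while it owned pids_mutex and then waited for the same mutex, which only the interrupted code could release. That is the deadlock we saw above.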

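How do we fix it? The official approach is to use sigprocmask (pthread_sigmask in a multithreaded program) so that threads with a potential deadlock do not receive these signals. In a complex system this approach has a weakness: projects usually pull in open-source or third-party libraries, and we cannot control the threads they start. So my advice is: in a signal handler, use lock-free structures wherever possible, and design intermediate data that isolates the complex business code from the signal handler.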