Linux Processes, Threads, and Other Miscellaneous Notes

Article link: https://blog.youkuaiyun.com/jltxgcy/article/details/74836472

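0x00 Differences between processes and threads

Both processes and threads are created by a parent; the difference is whether the new task shares the parent's page directory and page tables.

A process is created with the fork system call. The child gets a brand-new page directory and page tables, and the kernel applies a copy-on-write strategy to the pages.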

asmlinkage int sys_fork(struct pt_regs regs)  
{  
    return do_fork(SIGCHLD, regs.esp, &regs, 0);  
}
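The effect of copy-on-write can be observed from user space. The following is a minimal sketch, not taken from the original post; the names counter and parent_value_after_child_write are made up for illustration. The child overwrites a global variable, and the parent checks that its own copy is untouched.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;            /* data-segment variable, copied on write */

/* Fork a child that overwrites `counter`, wait for it, and return the value
 * the parent still sees afterwards. */
int parent_value_after_child_write(void)
{
    counter = 1;

    pid_t pid = fork();
    assert(pid >= 0);              /* fail loudly if fork() did not work */

    if (pid == 0) {
        counter = 100;             /* the write gives the child a private page */
        _exit(0);
    }

    pid_t reaped = waitpid(pid, NULL, 0);
    assert(reaped == pid);
    return counter;                /* the parent's copy was never modified */
}
```

Called from any main function, parent_value_after_child_write() returns 1, because the child's write went to the child's own copy of the page.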

In user-space C/C++ code, a thread is created with pthread_create, which eventually calls clone:

int clone(int (*fn)(void *arg), void *child_stack, int flags, void *arg), and the flags passed in are:

    int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGNAL
    | CLONE_SETTLS | CLONE_PARENT_SETTID
    | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM
    | 0);

asmlinkage int sys_clone(struct pt_regs regs)  
{  
    unsigned long clone_flags;  
    unsigned long newsp;  
  
    clone_flags = regs.ebx; /* the flags argument from user space */
    newsp = regs.ecx; /* the child_stack argument from user space */
    if (!newsp)  
        newsp = regs.esp;  
    return do_fork(clone_flags, newsp, &regs, 0);  
}

A thread shares the parent process's page directory and page tables. Each thread has its own stack, while global variables are shared.
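A small user-space sketch of that statement, assuming POSIX threads are available; shared_global, worker, and thread_shares_globals_not_stack are illustrative names, not from the original post. The new thread writes a global and records the address of one of its locals; the caller then checks that it sees the write and that the local lived at a different address than its own.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

static int shared_global = 0;          /* one copy, visible to every thread */

static void *worker(void *arg)
{
    uintptr_t *stack_addr = arg;
    int local = 0;                     /* lives on the new thread's own stack */

    *stack_addr = (uintptr_t)&local;   /* report where that stack slot is */
    shared_global = 42;                /* write to the shared data segment */
    return NULL;
}

/* Returns 1 if the thread's write is visible to the caller and the thread's
 * local variable sat at a different address than the caller's local. */
int thread_shares_globals_not_stack(void)
{
    pthread_t tid;
    uintptr_t thread_local_addr = 0;
    int local = 0;

    shared_global = 0;
    int rc = pthread_create(&tid, NULL, worker, &thread_local_addr);
    assert(rc == 0);
    pthread_join(tid, NULL);

    return shared_global == 42 && thread_local_addr != (uintptr_t)&local;
}
```

On older toolchains this may need to be linked with -pthread.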

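There is also the task created by vfork, which shares the parent's address space and has no stack of its own. The parent therefore has to wait while the child runs; once the child calls exec or exits, the parent is notified and continues.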

asmlinkage int sys_vfork(struct pt_regs regs)  
{  
    return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0); /* the key difference is the two flags CLONE_VFORK and CLONE_VM */
}
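Because the vfork child runs on the parent's stack, user code has to keep the child extremely short: it should only call one of the exec functions or _exit. A minimal sketch under that rule, with the illustrative name exit_status_of_vfork_child and an arbitrary exit code of 7:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* vfork a child that exits immediately with status 7 and return the status
 * the parent collects. */
int exit_status_of_vfork_child(void)
{
    pid_t pid = vfork();
    assert(pid >= 0);

    if (pid == 0)
        _exit(7);      /* child: borrowed stack, so nothing but exec or _exit */

    /* The parent only gets here after the child has called _exit. */
    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    assert(reaped == pid);

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child must not return from the function or modify the parent's variables, since both tasks are using the same stack frame while the parent is suspended.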

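0x01 Process scheduling

Scheduling is either voluntary (active) or involuntary (passive).

1. Voluntary scheduling, illustrated with this example: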

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
    pid_t child;
    char *args[] = {"/bin/echo", "Hello", "World!", NULL};

    if (!(child = fork()))
    {
        /* child */
        execve("/bin/echo", args, NULL);
        /* execve only returns if it failed */
        printf("I am back, something is wrong!\n");
        _exit(1);
    }
    else
    {
        /* father */
        wait4(child, NULL, 0, NULL);
    }
    return 0;
}

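1) The parent forks the child.

2) The parent calls wait4, which calls schedule and switches to the child.

3) The child starts executing execve.

4) When /bin/echo has finished, the child calls do_exit, which partially tears down the child and calls schedule to switch back to the parent waiting in sys_wait4.

5) The parent finishes destroying the child.

Here we have seen two instances of voluntary scheduling.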

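2. When does involuntary scheduling happen?

When a process is woken up, or when the timer interrupt finds that the current process has used up its time slice, need_resched in the current process's task_struct is set to 1. Before an interrupt, exception, or system call returns to user space, schedule is called to carry out the switch.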
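From user space, the two kinds of switch can be told apart with getrusage, which reports voluntary switches in ru_nvcsw and involuntary ones in ru_nivcsw. The sketch below is illustrative only (the function name is made up); it measures how many voluntary switches a short sleep costs. On a typical Linux system the result is at least 1, though some sandboxed environments do not maintain these counters.

```c
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* Sleep for 10 ms and return how many voluntary context switches the
 * process accumulated while doing so. */
long voluntary_switches_during_sleep(void)
{
    struct rusage before, after;
    struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */

    int rc = getrusage(RUSAGE_SELF, &before);
    assert(rc == 0);

    nanosleep(&ts, NULL);      /* blocks, so the process gives up the CPU itself */

    rc = getrusage(RUSAGE_SELF, &after);
    assert(rc == 0);

    return after.ru_nvcsw - before.ru_nvcsw;
}
```

A CPU-bound loop measured the same way would instead tend to increase ru_nivcsw, since the process is then switched out only when its time slice runs out or a higher-priority task becomes runnable.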

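0x02 How is a process switch carried out?

First, note that the basic units Linux schedules are threads and processes.

1. Switching to a child process

A child created by fork has a kernel stack that holds a pt_regs structure with the register values to restore on return to user space. Its eax, for instance, has already been set to 0, which is how fork returns 0 to identify the child.

The key code for the process switch is: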

#define switch_to(prev,next,last) do {                                  \
    asm volatile("pushl %%esi\n\t"      /* push esi onto prev's stack */ \
                 "pushl %%edi\n\t"      /* push edi onto prev's stack */ \
                 "pushl %%ebp\n\t"      /* push ebp onto prev's stack */ \
                 "movl %%esp,%0\n\t"    /* save ESP: prev's esp goes into prev->thread.esp */ \
                 "movl %3,%%esp\n\t"    /* restore ESP: load next->thread.esp; the stack is now switched */ \
                 "movl $1f,%1\n\t"      /* save EIP: the address of label 1 goes into prev->thread.eip */ \
                 "pushl %4\n\t"         /* restore EIP: push next->thread.eip onto the new stack */ \
                 "jmp __switch_to\n"    /* entered by jmp, so the ret in __switch_to pops next->thread.eip as its return address */ \
                 "1:\t"                 /* unless next is a new child, next->thread.eip is this label, saved when next was switched out */ \
                 "popl %%ebp\n\t"       /* the stack is already next's, so these pops restore what next pushed when it was switched out */ \
                 "popl %%edi\n\t"                                       \
                 "popl %%esi\n\t"                                       \
                 :"=m" (prev->thread.esp),"=m" (prev->thread.eip),      \
                  "=b" (last)                                           \
                 :"m" (next->thread.esp),"m" (next->thread.eip),        \
                  "a" (prev), "d" (next),                               \
                  "b" (prev));                                          \
} while (0)

Because fork set up a few values in advance, they are used here:

int copy_thread(int nr, unsigned long clone_flags, unsigned long esp,  
    unsigned long unused,  
    struct task_struct * p, struct pt_regs * regs)  
{  
    struct pt_regs * childregs;  
  
    childregs = ((struct pt_regs *) (THREAD_SIZE + (unsigned long) p)) - 1; /* points at the pt_regs on the child's kernel stack */
    struct_cpy(childregs, regs); /* copy the pt_regs from the current process's kernel stack */
    childregs->eax = 0; /* set eax in the child's pt_regs to 0 */
    childregs->esp = esp; /* set esp in the child's pt_regs to the esp argument; for fork this is regs.esp from just before the do_fork() call, so it is effectively unchanged */

    p->thread.esp = (unsigned long) childregs; /* start address of pt_regs on the child's kernel stack */
    p->thread.esp0 = (unsigned long) (childregs+1); /* top of the child's kernel stack */
  
    p->thread.eip = (unsigned long) ret_from_fork;  
  
    savesegment(fs,p->thread.fs);  
    savesegment(gs,p->thread.gs);  
  
    unlazy_fpu(current);  
    struct_cpy(&p->thread.i387, &current->thread.i387);
  
    return 0;  
}

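So once switch_to has run, execution continues at ret_from_fork, which pops the stack and returns to user mode. Since eax on the stack was set to 0, fork returns 0 to user space in the child.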
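The user-visible result of childregs->eax = 0 is the familiar fork convention. A minimal sketch, with the illustrative name fork_return_values_ok and an arbitrary exit code of 42: the branch taken when fork returns 0 runs only in the child, and the parent receives the child's pid.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if fork() returned 0 in the child and the child's pid in the
 * parent. */
int fork_return_values_ok(void)
{
    pid_t ret = fork();
    assert(ret >= 0);

    if (ret == 0)
        _exit(42);     /* only the child, which saw 0, reaches this line */

    /* Parent: ret is the child's pid, so waiting on it must reap that child
     * and yield the exit code chosen above. */
    int status = 0;
    pid_t reaped = waitpid(ret, &status, 0);

    return reaped == ret && WIFEXITED(status) && WEXITSTATUS(status) == 42;
}
```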

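2. Switching to another process

In switch_to, the eip at which a switched-out process later resumes is the address of popl %%ebp\n\t; by that point the stack has already been switched back to that process's own stack.

The difference from switching to a child is the resume eip: a new child's eip points to ret_from_fork, while any other process's eip points to the address of popl %%ebp\n\t.

The stacks differ as well: the child's kernel stack holds only regs, whereas another process's stack also holds the frames of the function calls in progress. Either way, the kernel stack is empty by the time the process returns to user space.

3. Loading the physical address of the new process's page directory into control register CR3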

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk, unsigned cpu)  
{  
    if (prev != next) {  
        /* stop flush ipis for the previous mm */  
        clear_bit(cpu, &prev->cpu_vm_mask);  
        /* 
         * Re-load LDT if necessary 
         */  
        if (prev->context.segments != next->context.segments)  
            load_LDT(next);  
#ifdef CONFIG_SMP  
        cpu_tlbstate[cpu].state = TLBSTATE_OK;  
        cpu_tlbstate[cpu].active_mm = next;  
#endif  
        set_bit(cpu, &next->cpu_vm_mask);  
        /* Re-load page tables */  
        asm volatile("movl %0,%%cr3": :"r" (__pa(next->pgd))); /* the only line we care about: load the physical address of the new process's page directory into CR3 */
    }  
#ifdef CONFIG_SMP  
    else {  
        cpu_tlbstate[cpu].state = TLBSTATE_OK;  
        if(cpu_tlbstate[cpu].active_mm != next)  
            BUG();  
        if(!test_and_set_bit(cpu, &next->cpu_vm_mask)) {  
            /* We were in lazy tlb mode and leave_mm disabled  
             * tlb flush IPI delivery. We must flush our tlb. 
             */  
            local_flush_tlb();  
        }  
    }  
#endif  
}

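4. Updating the kernel stack pointer

Linux does not set up a TSS for every process. Instead each CPU uses a single TSS, which the tr register points to. On a process switch, only the esp0 field of that one TSS is updated to point to the new process's kernel stack.

The switch_to code above calls __switch_to: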

void __switch_to(struct task_struct *prev_p, struct task_struct *next_p)  
{  
    struct thread_struct *prev = &prev_p->thread,  
                 *next = &next_p->thread;  
    struct tss_struct *tss = init_tss + smp_processor_id();  
  
    unlazy_fpu(prev_p);  
  
    /* 
     * Reload esp0, LDT and the page table pointer: 
     */  
    tss->esp0 = next->esp0; /* point the ring-0 stack pointer in the TSS at next->esp0, the top of the next process's kernel stack */
  
    /* 
     * Save away %fs and %gs. No need to save %es and %ds, as 
     * those are always kernel segments while inside the kernel. 
     */  
    asm volatile("movl %%fs,%0":"=m" (*(int *)&prev->fs));  
    asm volatile("movl %%gs,%0":"=m" (*(int *)&prev->gs));  
  
    /* 
     * Restore %fs and %gs. 
     */  
    loadsegment(fs, next->fs);  
    loadsegment(gs, next->gs);  
  
    /* 
     * Now maybe reload the debug registers 
     */  
    if (next->debugreg[7]){  
        loaddebug(next, 0);  
        loaddebug(next, 1);  
        loaddebug(next, 2);  
        loaddebug(next, 3);  
        /* no 4 and 5 */  
        loaddebug(next, 6);  
        loaddebug(next, 7);  
    }  
  
    if (prev->ioperm || next->ioperm) {  
        if (next->ioperm) {  
            /* 
             * 4 cachelines copy ... not good, but not that 
             * bad either. Anyone got something better? 
             * This only affects processes which use ioperm(). 
             * [Putting the TSSs into 4k-tlb mapped regions 
             * and playing VM tricks to switch the IO bitmap 
             * is not really acceptable.] 
             */  
            memcpy(tss->io_bitmap, next->io_bitmap,  
                 IO_BITMAP_SIZE*sizeof(unsigned long));  
            tss->bitmap = IO_BITMAP_OFFSET;  
        } else  
            /* 
             * a bitmap offset pointing outside of the TSS limit 
             * causes a nicely controllable SIGSEGV if a process 
             * tries to use a port IO instruction. The first 
             * sys_ioperm() call sets up the bitmap properly. 
             */  
            tss->bitmap = INVALID_IO_BITMAP_OFFSET;  
    }  
}

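0x04 Interrupts, exceptions, and system calls

For all three, the stack used on entry to kernel mode is the one given by esp0 in the TSS.

On the way back to user space, each of them checks whether there are bottom halves to run, whether rescheduling is needed, and whether there are signals to handle.

The top half of an interrupt generally runs with interrupts disabled and does short, hardware-related work; the bottom half generally runs with interrupts enabled and does the longer-running work.

An interrupt on a given line cannot preempt a handler for the same line, but interrupts on different lines can preempt one another.

A tasklet is a special kind of softirq. Unlike an ordinary softirq, a given piece of tasklet code can run on only one CPU at any moment, whereas an ordinary softirq handler (the action function pointer in struct softirq_action) can be executed by several CPUs concurrently.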

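0x05 Page directories and page tables of a process

Process 0 -> kernel process 1 -> user process 1 (init) -> getty -> shell; for details see http://blog.youkuaiyun.com/gongxifacai_believe/article/details/53771464.

Taking the Android Linux kernel as an example, process 0 creates not only init but also kernel process 2, which in turn creates many kernel threads. Each kernel thread has its own stack and shares kernel data.

Kernel threads share the page directory and page tables with their parent, but they have no mappings for user-space pages.

On 32-bit x86, every user-mode process or thread sees a 4 GB virtual address space: 0 to 3 GB is the user-mode region and 3 GB to 4 GB is the kernel region (corresponding to physical addresses 0 to 1 GB).

So each user-mode process has its own page directory and page tables, while threads share those of their parent process.

On a process switch, cr3 is loaded with the new page directory. When the switch is between threads that share one mm, switch_mm sees prev == next and skips the reload.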

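0x06 Sleeping, waiting, and waking up

Linux has two kinds of waiting state: interruptible (TASK_INTERRUPTIBLE) and uninterruptible (TASK_UNINTERRUPTIBLE). A process in the interruptible state can be woken by a signal: on receiving one, it moves from waiting to runnable and is put on the run queue to await scheduling. A process in the uninterruptible state is waiting because some hardware condition is not yet met, for example a specific system resource. Signals never interrupt it, and it can only be woken explicitly, for example by wake_up(). See http://blog.youkuaiyun.com/jansonzhe/article/details/47341383

TASK_INTERRUPTIBLE:

1. The process state is set to TASK_INTERRUPTIBLE, and wait_event_interruptible adds the current process to a wait queue.

2. Only the process state is set to TASK_INTERRUPTIBLE.

TASK_UNINTERRUPTIBLE:

1. The process state is set to TASK_UNINTERRUPTIBLE, and wait_event adds the current process to a wait queue.

Waking up:

1. Applies to both TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE

When the condition of the wait queue becomes true, wake_up sets the state of the target process on the wait queue to TASK_RUNNING and sets need_resched to 1; schedule is then called on return from the system call to perform the switch.

So what is the difference between wait_event and wait_event_interruptible?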

#define __wait_event_interruptible(wq, condition, ret)      \
do {                                                        \
    DEFINE_WAIT(__wait);                                    \
    for (;;) {                                              \
        prepare_to_wait(&wq, &__wait, TASK_INTERRUPTIBLE);  \
        if (condition)                                      \
            break;                                          \
        if (!signal_pending(current)) {                     \
            schedule();                                     \
            continue;                                       \
        }                                                   \
        ret = -ERESTARTSYS;                                 \
        break;                                              \
    }                                                       \
    finish_wait(&wq, &__wait);                              \
} while (0)

#define wait_event(wq, condition)                                       \
do {                                                                    \
    if (condition)      /* condition already true: no need to wait */   \
        break;                                                          \
    __wait_event(wq, condition);  /* otherwise go into __wait_event */  \
} while (0)

/*
 * DEFINE_WAIT defines and initializes a wait queue entry, which is added to
 * the wait queue below. Initialization sets its func callback to
 * autoremove_wake_function, which in turn calls default_wake_function.
 */
#define __wait_event(wq, condition)                                     \
do {                                                                    \
    DEFINE_WAIT(__wait);                                                \
                                                                        \
    for (;;) {                                                          \
        /* add the entry to the wait queue and mark the task */         \
        /* TASK_UNINTERRUPTIBLE */                                      \
        prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);            \
        if (condition)  /* check the condition again */                 \
            break;                                                      \
        schedule();     /* still false: give up the CPU and sleep */    \
    }                                                                   \
    /* condition is true and the loop has exited: set the task back */  \
    /* to TASK_RUNNING and remove the entry from the wait queue */      \
    finish_wait(&wq, &__wait);                                          \
} while (0)

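wait_event_interruptible does not require condition to become true: a pending signal also makes it return, with ret set to -ERESTARTSYS.

wake_up wakes every non-exclusive waiter on the queue but at most one exclusive waiter, which is the 1 passed as the third argument below.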

#define wake_up(x)          __wake_up(x, TASK_NORMAL, 1, NULL)  
#define wake_up_nr(x, nr)       __wake_up(x, TASK_NORMAL, nr, NULL)  
#define wake_up_all(x)          __wake_up(x, TASK_NORMAL, 0, NULL)  
#define wake_up_locked(x)       __wake_up_locked((x), TASK_NORMAL)  
  
#define wake_up_interruptible(x)    __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)  
#define wake_up_interruptible_nr(x, nr) __wake_up(x, TASK_INTERRUPTIBLE, nr, NULL)  
#define wake_up_interruptible_all(x)    __wake_up(x, TASK_INTERRUPTIBLE, 0, NULL)  
#define wake_up_interruptible_sync(x)   __wake_up_sync((x), TASK_INTERRUPTIBLE, 1)
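The wait_event loop has a close user-space relative in POSIX condition variables. The sketch below is only an analogy, not kernel code; the names lock, cond, condition, waiter, and wait_then_wake are made up for illustration. The waiter re-checks the condition in a loop around the blocking call, just as __wait_event re-checks it around schedule(), and the waker makes the condition true before waking, which mirrors the usual order around wake_up.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int condition = 0;     /* the thing being waited for */
static int observed  = 0;     /* what the waiter saw when it woke up */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!condition)                     /* re-check after every wakeup */
        pthread_cond_wait(&cond, &lock);   /* block; plays the role of schedule() */
    observed = condition;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start a waiter, make the condition true, wake the waiter, and return the
 * value of `condition` the waiter observed. */
int wait_then_wake(void)
{
    pthread_t tid;

    condition = 0;
    observed = 0;
    int rc = pthread_create(&tid, NULL, waiter, NULL);
    assert(rc == 0);

    pthread_mutex_lock(&lock);
    condition = 1;                 /* first make the condition true ... */
    pthread_cond_signal(&cond);    /* ... then wake one waiter */
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return observed;
}
```

Because the waiter tests the condition before blocking, the result is the same whether the signal arrives before or after the waiter goes to sleep. pthread_cond_broadcast would be the rough counterpart of wake_up_all.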

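2. Applies only to TASK_INTERRUPTIBLE

When a signal is sent to a target process, the target's state is checked; if it is TASK_INTERRUPTIBLE, it is set to TASK_RUNNING. The code is:

if ((t->state & TASK_INTERRUPTIBLE) && signal_pending(t))  
        wake_up_process(t); /* sets the process state to TASK_RUNNING */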

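0x07 copy_from_user and copy_to_user

These run in kernel mode and are written directly in assembly; they copy data from user space into the kernel and from the kernel into user space.

How it works: in kernel mode the CPU has permission to access any mapped virtual address of the current process, user pages included. The routines still check that the user range is valid, and a fault on a bad user address makes the copy fail instead of crashing the kernel.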