CFS Scheduler Study Notes

Latest recommended article published on 2023-01-06 17:00:34

废言Pro

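This article examines how the Linux CFS (Completely Fair Scheduler) works: how it allocates run time according to process weights so that scheduling stays fair between processes. It walks through the key implementation details in four scenarios: process creation, process wake-up, voluntary scheduling, and the timer interrupt.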

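First, a declaration: these notes draw on many CFS articles found online, including:
《使用完全公平调度程序（CFS）进行多任务处理》 (Multitasking with the Completely Fair Scheduler (CFS)) -- Avinesh Kumar
《Linux进程管理之CFS组调度分析》 (Linux process management: an analysis of CFS group scheduling) -- ericxiao
Inside the Linux 2.6 Completely Fair Scheduler -- M. Tim Jones
完全公平调度（CFS） (Completely Fair Scheduling (CFS)) -- wxc200
My thanks to all of these authors.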

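I started writing this down only so that I would not forget it, then felt like tidying it up as I went, and ended up with this. It is not exactly easy reading,
but it should still do its job as a set of notes. If you find it useful, feel free to browse. There are bound to be mistakes and omissions, so please point them out.
by peimichael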

I. Overview

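Let me start with the basic design idea.
The idea behind CFS is simple: run time is handed out according to each process's weight (where the weight comes from is covered later).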

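The run time of a process is computed as: run time allotted to a process = scheduling period * process weight / sum of the weights of all processes (Formula 1)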

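The scheduling period is easy to understand: it is the time needed to run every process in the TASK_RUNNING state once. It roughly corresponds to the time between two swaps of the active and expired queues in the O(1) scheduler (I am not very familiar with the O(1) scheduler, so please point out any mistake).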

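For example, take just two processes A and B with weights 1 and 2 and a scheduling period of 30ms. The CPU time allotted to A is 30ms * (1/(1+2)) = 10ms,
and the CPU time allotted to B is 30ms * (2/(1+2)) = 20ms. So within these 30ms, A runs for 10ms and B runs for 20ms.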
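
Formula 1 can be checked with a few lines of C. This is a minimal user-space sketch, not kernel code; the helper name `cfs_slice` is made up for illustration, and the time unit is whatever the caller passes in (ms here).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): the share of one
 * scheduling period that a process receives under Formula 1.
 */
static uint64_t cfs_slice(uint64_t period, uint64_t weight,
                          uint64_t total_weight)
{
    return period * weight / total_weight;
}
```

With a period of 30 and weights 1 and 2, `cfs_slice(30, 1, 3)` gives 10 and `cfs_slice(30, 2, 3)` gives 20, the same numbers as in the example above.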

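So where is the fairness? Their run times are not even equal!
The fairness shows up in a different quantity called virtual runtime (vruntime). It records how long a process has already run, but not directly: the run time is scaled up or down according to the weight of the process.
Here is the formula that converts real run time to vruntime:
vruntime = real run time * 1024 / process weight. (Formula 2)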

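To keep things simple I wrote 1024 directly. It is really the weight of a process with nice 0, which is NICE_0_LOAD in the code.
In other words, every process uses the nice-0 weight of 1024 as the baseline when working out how fast its own vruntime grows.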

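Take processes A and B again. B's weight is twice A's, so B's vruntime grows at only half the rate of A's. Now replace the real run time in Formula 2 with Formula 1, and we get:
vruntime = (scheduling period * process weight / total weight of all processes) * 1024 / process weight = scheduling period * 1024 / total weight of all processes
See the pattern? Although the weights differ, over one scheduling period every process's vruntime should advance by the same amount, independent of its weight.
So, since all vruntime values should move forward together when viewed at a coarse scale, vruntime can be used to choose the process to run: whoever has the smaller vruntime has had less CPU so far
and has been treated "unfairly", so it runs next. This picks processes fairly and still lets higher-priority processes get more run time. That is the main idea of CFS.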
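
The claim that A and B advance by the same vruntime per period can be verified numerically. The sketch below is a user-space model of Formula 2 only; `to_vruntime` is a hypothetical name, not the kernel's `calc_delta_fair`.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL /* weight of a nice-0 process, as in the text */

/* Hypothetical helper: convert real run time to vruntime (Formula 2). */
static uint64_t to_vruntime(uint64_t real_runtime, uint64_t weight)
{
    return real_runtime * NICE_0_LOAD / weight;
}
```

A (weight 1) runs 10ms and B (weight 2) runs 20ms in one period, and both calls, `to_vruntime(10, 1)` and `to_vruntime(20, 2)`, return 10240.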

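One more note on where weights come from: there is a one-to-one mapping between a process's nice value and its weight, looked up through the global array prio_to_weight. The larger the nice value, the lower the weight.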
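
For reference, here is a user-space copy of the table as I recall it from the 2.6-series kernel source, so please check it against your own tree. The wrapper `nice_to_weight` is hypothetical; the kernel indexes the array directly.

```c
/* Weight table indexed by nice + 20, covering nice -20 .. 19. */
static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/* Hypothetical wrapper: map a nice value in [-20, 19] to its weight. */
static int nice_to_weight(int nice)
{
    return prio_to_weight[nice + 20];
}
```

Each nice step changes the weight by a factor of roughly 1.25, which is why the two ends of the table are so far apart.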

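Now for the code. There are already plenty of CFS articles online, so I will take a different approach and analyze a few concrete scenarios:
process creation, process wake-up, voluntary scheduling (schedule), and the timer interrupt.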

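Before the code, a quick look at the CFS data structures.
The first is the scheduling entity, sched_entity. It represents one schedulable unit, and with group scheduling turned off it can be treated as a process. Every task_struct contains a sched_entity, and the process's vruntime and weight are stored there.
How are all the sched_entity structures organized? In a red-black tree. Each sched_entity is inserted with its vruntime as the key (actually the key is vruntime - min_vruntime; I guess this guards against overflow, and the resulting order is the same either way). The leftmost node, the one with the smallest vruntime, is cached so that the process with the smallest vruntime can be picked quickly.
Note that only runnable processes waiting for the CPU are on this tree. Sleeping processes and the currently running process are not.
I borrowed a figure from IBM developerWorks to show how they relate:
[figure not available]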

Now let's go through CFS scenario by scenario.

II. Creating a process

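The first scenario is the initialization of CFS-related variables when a process is created.
Linux creates processes through system calls such as fork, clone or vfork, all of which end up in do_fork.
If CLONE_STOPPED is not set, do_fork calls wake_up_new_task. Here is the key part of that function: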

/*
* wake_up_new_task - wake up a newly created task for the first time.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created context, then puts the task
* on the runqueue and wakes it.
*/
void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
{
    .....
    if (!p->sched_class->task_new || !current->se.on_rq) {
        activate_task(rq, p, 0);
    } else {
        /*
         * Let the scheduling class do new task startup
         * management (if any):
         */
        p->sched_class->task_new(rq, p);
        inc_nr_running(rq);
    }
    check_preempt_curr(rq, p, 0);
    .....
}

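I do not know when the if condition above is true. I tested it by adding a counter to each branch: the condition was true only twice (my unfounded guess is the idle process and the init process) and false nearly ten thousand times.
So we only look at the else branch. If anyone knows the real answer, please tell me; I would be very grateful. After that comes the preemption check: if the new process can preempt the current one, a process switch follows.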

Let's look at the functions one at a time.
p->sched_class->task_new corresponds to task_new_fair:

/*
* Share the fairness runtime between parent and child, thus the
* total amount of pressure for CPU stays equal - new tasks
* get a chance to run but frequent forkers are not allowed to
* monopolize the CPU. Note: the parent runqueue is locked,
* the child is not running yet.
*/
static void task_new_fair(struct rq *rq, struct task_struct *p)
{
    struct cfs_rq *cfs_rq = task_cfs_rq(p);
    struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
    int this_cpu = smp_processor_id();

    sched_info_queued(p);
    update_curr(cfs_rq);
    place_entity(cfs_rq, se, 1);

    /* 'curr' will be NULL if the child belongs to a different group */
    if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
            curr && curr->vruntime < se->vruntime) {
        /*
         * Upon rescheduling, sched_class::put_prev_task() will place
         * 'current' within the tree based on its new key value.
         */
        swap(curr->vruntime, se->vruntime);
        resched_task(rq->curr);
    }
    enqueue_task_fair(rq, p, 0);
}

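There are two important functions here: update_curr and place_entity.
update_curr can be ignored for now. It updates per-process information that changes over time, and we will come back to it later.
place_entity sets the vruntime of the new process so that it can be inserted into the red-black tree. Once the new process's vruntime is decided there is a check, and the vruntime values of parent and child are swapped when all of the following hold:
1. sysctl is set so that the child runs first
2. the forked child is on the same CPU as the parent
3. the parent entity (curr) is not NULL (per the code comment, curr is NULL when the child belongs to a different group)
4. the parent's vruntime is smaller than the child's vruntime
These are fairly easy to follow. A word on the fourth: CFS always picks the process with the smallest vruntime, so the child's vruntime must end up smaller than the parent's. The author did not simply set the child's vruntime to a small value
but swapped the two instead, which prevents a program from grabbing large amounts of CPU time by forking new processes, as explained shortly. Finally, enqueue_task_fair inserts the new process into the CFS red-black tree.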

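Next, how place_entity computes the vruntime of a new process (the full function is listed in the wake-up section).

For a new process it computes the initial vruntime. It starts from the min_vruntime of the cfs queue and adds the vruntime the process would gain in one scheduling period. This is not a computation of how long the process should run. It sets the process's already-run time to a fairly large value. But the process has not run at all yet, so why do this? Suppose every new process got the smallest vruntime (min_vruntime). Then the new process would always be scheduled first, and a programmer could keep a program on the CPU forever by forking again and again, which is clearly unreasonable. It is the same reasoning as in the older timeslice-based kernels, where parent and child split the parent's timeslice.

A word on min_vruntime: each cfs queue has one. It is normally less than or equal to the smallest vruntime of all runnable processes, with exceptions. For example, compensating a sleeping process can leave its vruntime below min_vruntime.
I will not go through the details of sched_vslice. Roughly, it combines the two formulas from the overview:
sched_vslice = (scheduling period * process weight / total weight of all processes) * NICE_0_LOAD / process weight
That is, it computes the real CPU time the process should be allotted and then converts it to vruntime.

Once this vruntime is added to the process, the new process is treated as if it had already run in this scheduling round.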
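
The placement of a new process can be modelled in user space as follows. This is only a sketch of the description above, under the assumption that START_DEBIT is enabled; `model_vslice` and `model_initial_vruntime` are hypothetical names, and the real sched_vslice works on kernel load structures rather than plain integers.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

/* Hypothetical model of sched_vslice as described in the text. */
static uint64_t model_vslice(uint64_t period, uint64_t weight,
                             uint64_t total_weight)
{
    uint64_t slice = period * weight / total_weight; /* real time */

    return slice * NICE_0_LOAD / weight;             /* as vruntime */
}

/* Hypothetical model of place_entity() with initial == 1. */
static uint64_t model_initial_vruntime(uint64_t min_vruntime,
                                       uint64_t period, uint64_t weight,
                                       uint64_t total_weight)
{
    return min_vruntime + model_vslice(period, weight, total_weight);
}
```

For a 20ms period (20000000 ns) and a nice-0 process holding half of the total weight, the new process starts 10000000 ns of vruntime behind min_vruntime, so it cannot jump the queue just by being new.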

Good, now we can return to wake_up_new_task (I hope you are not dizzy yet and can still backtrack :-))
and look at check_preempt_curr(rq, p, 0);. This function simply calls check_preempt_wakeup:

/*
* Preempt the current task with a newly woken task if needed:
* (some less important code has been omitted from this listing)
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int sync)
{
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &curr->se, *pse = &p->se; //se is the current process, pse is the new process
    /*
     * Only set the backward buddy when the current task is still on the
     * rq. This can happen when a wakeup gets interleaved with schedule on
     * the ->pre_schedule() or idle_balance() point, either of which can
     * drop the rq lock.
     *
     * Also, during early boot the idle thread is in the fair class, for
     * obvious reasons its a bad idea to schedule back to the idle thread.
     */
    if (sched_feat(LAST_BUDDY) && likely(se->on_rq && curr != rq->idle))
        set_last_buddy(se);
    set_next_buddy(pse);
    while (se) {
        if (wakeup_preempt_entity(se, pse) == 1) {
            resched_task(curr);
            break;
        }
        se = parent_entity(se);
        pse = parent_entity(pse);
    }
}

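First, a word on the last and next fields.
When these two fields are not NULL, last points to the process most recently switched out and next points to the process that is to be put on the CPU. For example, if A is running and is preempted by B, last points to A and next points to B. What are these two pointers for?

When CFS picks the next process at a scheduling point, it favours these two processes. We will see that later; for now just remember it. The two pointers are used only once: after the function above returns, schedule is triggered on the way back to user space,
and when the next process is chosen there, next is preferred first and last second. Once the choice is made, both pointers are cleared.

The reason for this design is that a positive result from the check above does not mean preemption has happened. It only sets the reschedule flag, and when schedule finally runs, the preempting process B is not necessarily the one chosen. Why? We decided B could preempt because its vruntime is smaller than that of the running process A, but the red-black tree may hold a process C whose vruntime is smaller still, in which case the scheduler would pick C rather than B. We would of course prefer to run B, since running B is the whole reason the flag was set, so the next pointer gives B a back door. If B does not measure up and its vruntime is too large, then continuing with the preempted process A is the more reasonable choice, so last points to the preempted process. That is a slightly smaller back door than next: if next fails to get through, the preempted process A gets a try. If A does not measure up either and its vruntime is also too large, the process with the smallest vruntime is taken from the red-black tree. Whether or not the back doors were used, both pointers are cleared as soon as the next process is chosen; the back door cannot stay open forever. Note that schedule clears these two pointers only in kernel 2.6.29 and later; earlier kernels do not have that statement.

Then wakeup_preempt_entity is called to check whether the preemption condition is met. If it is (return value 1), the TIF_NEED_RESCHED flag is set on the current process, and the schedule function is triggered on exit from the system call to perform the switch.
That function is discussed later.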

Let's look at wakeup_preempt_entity(se, pse) and see how it decides whether the latter can preempt the former:

/*
* Should 'se' preempt 'curr'.
*
*             |s1
*        |s2
*   |s3
*         g
*      |<--->|c
*
*  w(c, s1) = -1
*  w(c, s2) =  0
*  w(c, s3) =  1
*
*/
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
    s64 gran, vdiff = curr->vruntime - se->vruntime;
    if (vdiff <= 0)
        return -1;
    gran = wakeup_gran(curr);
    if (vdiff > gran)
        return 1;
    return 0;
}

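A return value of -1 means the new process's vruntime is greater than or equal to the current process's, so it certainly cannot preempt.
A return value of 0 means the new process's vruntime is smaller than the current process's, but not by as much as the scheduling granularity, so it normally cannot preempt either.
A return value of 1 means the new process's vruntime is smaller than the current process's by more than the scheduling granularity, so it can preempt.
What is the scheduling granularity? It is easy to understand, but it requires a small adjustment to what was said earlier. I said the process with the smallest vruntime is always chosen, and that is not entirely true.
Suppose the vruntime values of A and B are very close. A runs for one tick and its vruntime exceeds B's; B runs for one tick and its vruntime exceeds A's; then back to A, and so on. A and B would switch constantly, which hurts performance badly.
So as long as the current process has not used up its time, a switch happens only when some process's vruntime is smaller than the current one's by more than the scheduling granularity.
That is what the diagram in the comment above the function shows:
The horizontal axis is vruntime. s1, s2 and s3 are three possible new processes, c is the current process, and g is the scheduling granularity. s3 can definitely preempt c, and s1 cannot possibly preempt c.
s2 has a smaller vruntime than c but lies within the granularity. Whether it can preempt depends on the situation, and in the case shown it cannot.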
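
The three cases in the diagram can be tried out with a user-space copy of the decision. This sketch takes the granularity as a parameter instead of calling wakeup_gran, and `model_wakeup_preempt` is a hypothetical name.

```c
#include <stdint.h>

/*
 * Same three-way decision as wakeup_preempt_entity(), with the
 * granularity passed in by the caller (an assumption of this sketch).
 */
static int model_wakeup_preempt(uint64_t curr_vruntime,
                                uint64_t se_vruntime, int64_t gran)
{
    /* signed difference, so the comparison tolerates wraparound */
    int64_t vdiff = (int64_t)(curr_vruntime - se_vruntime);

    if (vdiff <= 0)
        return -1;
    if (vdiff > gran)
        return 1;
    return 0;
}
```

Taking c = 1000 and g = 100: s1 = 1200 yields -1, s2 = 950 yields 0, and s3 = 800 yields 1.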

That covers the scheduling-related code for process creation.

III. Waking up a process
Now for what CFS does when a process is woken. The function is try_to_wake_up. It is long, so only a few lines are kept here:

/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
* @state: the mask of task states that can be woken
* @sync: do a synchronous wakeup?
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
* re-schedule is in progress), and as such you're allowed to do
* the simpler "current->state = TASK_RUNNING" to mark yourself
* runnable without the overhead of this.
*
* returns failure only if the task is already active.
*/
static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
{
    int cpu, orig_cpu, this_cpu, success = 0;
    unsigned long flags;
    struct rq *rq;
    rq = task_rq_lock(p, &flags);
    if (p->se.on_rq)
        goto out_running;
    update_rq_clock(rq);
    activate_task(rq, p, 1);
    success = 1;
out_running:
    check_preempt_curr(rq, p, sync);
    p->state = TASK_RUNNING;
out:
    current->se.last_wakeup = current->se.sum_exec_runtime;
    task_rq_unlock(rq, &flags);
    return success;
}

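update_rq_clock updates the run queue's clock to keep it in step with system time.
The important call is activate_task, which adds the process to the red-black tree and adjusts its vruntime.
Then check_preempt_curr checks whether the preemption condition is met (this is not the same thing as the conditions for kernel preemption). If preemption is possible, the TIF_NEED_RESCHED flag is set.

===> If the kernel is preemptible, why can't the switch happen directly, and why is this flag still needed?? Is waking a new process simply not a preemption point??
check_preempt_curr has already been covered, so we only follow this chain:
activate_task
-->enqueue_task
-->enqueue_task_fair
-->enqueue_entity
-->place_entity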
static void activate_task(struct rq *rq, struct task_struct *p, int wakeup)
{
    if (task_contributes_to_load(p))
        rq->nr_uninterruptible--;
    enqueue_task(rq, p, wakeup);
    inc_nr_running(rq); //one more runnable task on the run queue
}
static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup)
{
    sched_info_queued(p);
    p->sched_class->enqueue_task(rq, p, wakeup);
    p->se.on_rq = 1;  //a woken task gets on_rq set to 1
}
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &p->se;
    for_each_sched_entity(se) {
        if (se->on_rq)
            break;
        cfs_rq = cfs_rq_of(se);
        enqueue_entity(cfs_rq, se, wakeup);
        wakeup = 1;
    }
    hrtick_update(rq);
}
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int wakeup)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    account_entity_enqueue(cfs_rq, se);
    if (wakeup) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }
    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);  //insert the entity into the CFS red-black tree
}

place_entity needs another look here. We saw it before, but the third argument is different: when it is 0 a different path is taken. Here it is:

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;
    /*
     * The 'current' period is already promised to the current tasks,
     * however the extra weight of the new task will slow them down a
     * little, place the new task so that it fits in the slot that
     * stays open at the end.
     */
    if (initial && sched_feat(START_DEBIT))
        vruntime += sched_vslice(cfs_rq, se);
    if (!initial) {
        /* sleeps upto a single latency don't count. */
        if (sched_feat(NEW_FAIR_SLEEPERS)) {
            unsigned long thresh = sysctl_sched_latency;
            /*
             * convert the sleeper threshold into virtual time
             */
            if (sched_feat(NORMALIZED_SLEEPER))
                thresh = calc_delta_fair(thresh, se);
            vruntime -= thresh;
        }

        /* ensure we never gain time by being placed backwards. */
        vruntime = max_vruntime(se->vruntime, vruntime);
    }
    se->vruntime = vruntime;
}

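How do the two paths differ depending on initial? Path 1 initializes the vruntime of a newly created task and places it toward the right end of the red-black tree. The lower path handles the wake-up of a sleeping task. Imagine a task that has slept for a long time: its vruntime has not been updated the whole time, so when it wakes its vruntime is far smaller than that of any task on the run queue, and it would hold the CPU for a long time, which is clearly unreasonable. So the vruntime of a woken task has to be adjusted. As we can see, the code takes min_vruntime and subtracts thresh. thresh is sysctl_sched_latency converted into the task's vruntime, and sysctl_sched_latency is the default scheduling period, usually 20ms on a single-core CPU. The subtraction compensates the sleeper so that it gets the CPU quickly after waking.

I find this design very clever. The old O(1) scheduler had a complicated formula (which I still cannot remember) for telling interactive processes from CPU-bound ones; see ULK and similar books for the details. CFS no longer needs that formula: a process that sleeps often will have a very small vruntime when it wakes and can run right away, which removes a lot of special-case handling. Also note the comment "ensure we never gain time by being placed backwards". The compensation is meant for processes whose vruntime has fallen far below min_vruntime after a long sleep. Some processes sleep only briefly, and their vruntime is still greater than min_vruntime when they wake. A process must not gain extra run time through sleeping, so the code takes the larger of the compensated value and the process's original vruntime.
That is all for place_entity. I do have one question, though: why is the whole scheduling period converted to vruntime when computing thresh? It seems to me that (scheduling period * process weight / total weight of all processes) converted to vruntime would be more reasonable. Isn't a whole scheduling period too much compensation?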
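
The wake-up path can be modelled in a few lines. The sketch assumes NEW_FAIR_SLEEPERS is enabled and that thresh has already been converted to vruntime units. Both function names are hypothetical, and the wraparound-safe comparison is written in the style of the kernel's max_vruntime.

```c
#include <stdint.h>

/* Wraparound-safe max of two vruntime values. */
static uint64_t model_max_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) > 0) ? b : a;
}

/* Hypothetical model of place_entity() with initial == 0. */
static uint64_t model_wakeup_vruntime(uint64_t min_vruntime,
                                      uint64_t old_vruntime,
                                      uint64_t thresh)
{
    /* compensated value, but never earlier than the task's own vruntime */
    return model_max_vruntime(old_vruntime, min_vruntime - thresh);
}
```

With min_vruntime = 1000 and thresh = 200, a long sleeper whose old vruntime is 100 wakes at 800, while a short sleeper whose old vruntime is 950 keeps 950 and gains nothing from sleeping.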

IV. Process scheduling: schedule

Now the voluntary scheduling code, schedule.

/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable(); //being preempted in here could cause trouble, so disable preemption first
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_qsctr_inc(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;
    release_kernel_lock(prev);

need_resched_nonpreemptible:
    spin_lock_irq(&rq->lock);
    update_rq_clock(rq);
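    clear_tsk_need_resched(prev); //clear the need-resched bit

    //state == 0 is TASK_RUNNING; non-zero means prev is about to sleep and would normally be removed from the run queue.
    //But first check for a pending signal: if there is one and the sleep is interruptible, make prev runnable again.
    //For a process that really goes to sleep, deactivate_task removes it from the queue and clears on_rq.
    //deactivate_task is simple and is not examined here.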
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);
        switch_count = &prev->nvcsw;
    }
    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    //both of these functions matter; they are analyzed below
    prev->sched_class->put_prev_task(rq, prev);
    next = pick_next_task(rq, prev);

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;
        //performs the actual process switch; not covered here since it has nothing to do with CFS
        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        spin_unlock_irq(&rq->lock);
    if (unlikely(reacquire_kernel_lock(current) < 0))
        goto need_resched_nonpreemptible;
    preempt_enable_no_resched();
    //the new process may have TIF_NEED_RESCHED set as well; if so, schedule once more
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}

First put_prev_task. It is put_prev_task_fair, which basically just calls put_prev_entity:

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
    /*
     * If still on the runqueue then deactivate_task()
     * was not called and update_curr() has to be done:
     */
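    //Remember on_rq? In schedule(), if the process state is not TASK_RUNNING,
    //deactivate_task removes prev from the run queue and clears on_rq. So vruntime and the
    //other statistics only need updating here when prev is still runnable.
    //If prev is switched out because it was preempted or its time ran out, on_rq is still 1.
    if (prev->on_rq)
        update_curr(cfs_rq);

    check_spread(cfs_rq, prev);
    //Same here: the update is needed only when prev is still runnable.
    //A process running on the CPU is not in the red-black tree; only processes waiting for the CPU are.
    //So the process being switched out is put back into the tree here. on_rq does not mean "in the red-black tree"; it means "runnable".
    if (prev->on_rq) {
        update_stats_wait_start(cfs_rq, prev);
        /* Put 'current' back into the tree. */
        //not followed here: it inserts the process into the red-black tree with (vruntime - min_vruntime) as the key
        __enqueue_entity(cfs_rq, prev);
    }
    //there is no current process now; it is set again in pick_next_task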
    cfs_rq->curr = NULL;
}

Back in schedule, pick_next_task basically just calls pick_next_task_fair:

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct task_struct *p;
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;
    if (unlikely(!cfs_rq->nr_running))
        return NULL;
    do {
        //these two functions are the key part: they choose the next task to run
        se = pick_next_entity(cfs_rq);
        set_next_entity(cfs_rq, se);
        cfs_rq = group_cfs_rq(se);
    } while (cfs_rq);
    p = task_of(se);
    hrtick_start_fair(rq, p);
    return p;
}

The ones to look at are pick_next_entity and set_next_entity:

static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
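    //__pick_next_entity simply takes the cached leftmost node of the red-black tree, the one with the smallest vruntime
    struct sched_entity *se = __pick_next_entity(cfs_rq);

    //wakeup_preempt_entity was covered above; scroll back if you have forgotten it.
    //This is the preferential treatment of next and last mentioned earlier: the process chosen by __pick_next_entity
    //runs only if its vruntime is smaller than both next's and last's by more than the scheduling granularity; otherwise next or last runs.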
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, se) < 1)
        return cfs_rq->next;
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, se) < 1)
        return cfs_rq->last;
    return se;
}
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* 'current' is not kept within the tree. */
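    //When is this condition false? I thought a freshly woken process might not be on the rq,
    //but looking back, a woken process also gets on_rq set to 1 through activate_task,
    //and so does a newly created process. I cannot think of a case where it is false.
    //I tested this too, with a counter on the true path and one on the false path:
    //after nearly five hundred thousand passes through the true path, the false path had been taken only once,
    //so we can assume the if block is almost always executed.
    if (se->on_rq) {
        /* (my question: is this comment wrong, with "dequeued" written as "enqueued"?)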
         * Any task has to be enqueued before it get to execute on
         * a CPU. So account for the time it spent waiting on the
         * runqueue.
         */
        update_stats_wait_end(cfs_rq, se);
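        //removes the node from the red-black tree; as said before, the process holding the CPU is not on the tree
        __dequeue_entity(cfs_rq, se);
    }
    update_stats_curr_start(cfs_rq, se);
    cfs_rq->curr = se;  //OK, the curr cleared in put_prev_entity is set again here
    //Save the total run time in prev_sum_exec_runtime, so that the run time of this stint can be computed as:
    //CPU time used in this stint = sum_exec_runtime - prev_sum_exec_runtime
    //sum_exec_runtime is updated on every clock tick.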
    se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

That covers the schedule function.
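
The buddy preference in pick_next_entity can be modelled in user space like this. It is a sketch only: the granularity is a parameter, the `has_next` and `has_last` flags stand in for non-NULL pointers, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum model_pick { PICK_LEFTMOST, PICK_NEXT, PICK_LAST };

/* Same decision as wakeup_preempt_entity(), granularity passed in. */
static int model_preempt(uint64_t curr_v, uint64_t se_v, int64_t gran)
{
    int64_t vdiff = (int64_t)(curr_v - se_v);

    if (vdiff <= 0)
        return -1;
    return vdiff > gran ? 1 : 0;
}

/* Model of pick_next_entity(): next first, then last, then leftmost. */
static enum model_pick model_pick_next(uint64_t leftmost_v,
                                       bool has_next, uint64_t next_v,
                                       bool has_last, uint64_t last_v,
                                       int64_t gran)
{
    if (has_next && model_preempt(next_v, leftmost_v, gran) < 1)
        return PICK_NEXT;
    if (has_last && model_preempt(last_v, leftmost_v, gran) < 1)
        return PICK_LAST;
    return PICK_LEFTMOST;
}
```

With a granularity of 100, a next buddy at 950 still wins against a leftmost entity at 900, but loses against one at 500. In that case last gets its chance, and the leftmost entity runs only if last is also more than 100 ahead of it.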

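On the difference between dequeue_task, dequeue_entity and __dequeue_entity:
The first two are much alike, and I have not understood the part where they differ... The main point is that both clear on_rq.
I think they are called when a process is about to leave the TASK_RUNNING state; they take the process off the run queue.
__dequeue_entity does not clear on_rq. It only removes the process from the red-black tree.
I think it is mostly used when a process is about to get the CPU: it has to come off the red-black tree but stay on the rq.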

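V. Timer interrupt

The next scenario is the timer interrupt. It is initialized in time_init_hook, and the interrupt handler is timer_interrupt.
The call path is:
timer_interrupt
-->do_timer_interrupt_hook
-->a callback is invoked here; in my test on my machine it was tick_handle_oneshot_broadcast
-->I have not worked out the path after tick_handle_oneshot_broadcast; I will trace it with kgdb when I have time
-->in any case it ends up in tick_handle_periodic
-->tick_periodic
-->update_process_times
-->scheduler_tick: the CFS-related part here is mainly the update of the run queue clock
-->a callback leads to task_tick_fair, which does very little and goes straight into entity_tick
-->entity_tick: worth a look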

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    //....unrelated code omitted
    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT))
        check_preempt_tick(cfs_rq, curr);
}

entity_tick updates the statistics and then checks whether the preemption condition is met. We have ignored update_curr so far, and now it is time to look at it:

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock; //this clock was just updated in scheduler_tick
    unsigned long delta_exec;
    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
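    //exec_start records the time of the previous update_curr call. Subtracting it from the current time
    //gives how long the process has run since its vruntime was last computed. That time is converted to vruntime
    //and added to vruntime; all of this happens in __update_curr.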
    delta_exec = (unsigned long)(now - curr->exec_start);
    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;
    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}
/*
* Update the current task's runtime statistics. Skip current tasks that
* are not in our scheduling class.
*/
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
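    unsigned long delta_exec_weighted;

    //sum_exec_runtime, mentioned earlier, is accumulated here; it is the total CPU time the process has used since it was created
    curr->sum_exec_runtime += delta_exec;

    //"weighted" in the variable name means the value is vruntime, obtained from run time by taking the weight into account. The formula once more:
    //vruntime (delta_exec_weighted) = real run time (delta_exec) * 1024 / process weight
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);

    //the time the process has just run, converted to vruntime, is added to its vruntime right away
    curr->vruntime += delta_exec_weighted;

    //A process's vruntime has changed, so the cfs_rq's min_vruntime may need to change too. Update it.
    //This function is not hard and is not followed here: it first takes tmp = min(curr->vruntime, leftmost->vruntime)
    //and then sets cfs_rq->min_vruntime = max(tmp, cfs_rq->min_vruntime)
    update_min_vruntime(cfs_rq);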
}
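
The min/max rule described in the comments can be written out as a small user-space model. It assumes both a current entity and a leftmost entity exist, which the kernel does not take for granted, and the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Model of the update rule: min_vruntime follows the smaller of the
 * current and leftmost vruntime, but never moves backwards.
 */
static uint64_t model_update_min_vruntime(uint64_t min_vruntime,
                                          uint64_t curr_v,
                                          uint64_t leftmost_v)
{
    /* tmp = min(curr_v, leftmost_v), wraparound-safe */
    uint64_t tmp = ((int64_t)(curr_v - leftmost_v) < 0) ? curr_v : leftmost_v;

    /* result = max(tmp, min_vruntime), wraparound-safe */
    return ((int64_t)(tmp - min_vruntime) > 0) ? tmp : min_vruntime;
}
```

Starting from min_vruntime = 1000, entities at 1200 and 1100 move it up to 1100. An entity at 900 (a compensated sleeper, say) leaves it at 1000, so min_vruntime only ever increases.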

OK, with the CFS state updated we return to entity_tick, which now has to check whether the preemption condition is met. This is another key part of CFS:

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
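    //sched_slice here is much like sched_vslice from earlier, except that sched_vslice is converted to vruntime
    //while this one is real time, unconverted. The return value is how long this process should run in one scheduling period.
    ideal_runtime = sched_slice(cfs_rq, curr);

    //This formula was mentioned above: it gives the CPU time the process has used in this stint. If that exceeds its share (ideal_runtime),
    //TIF_NEED_RESCHED is set, and schedule is called on the way out of the timer interrupt to switch processes.
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    if (delta_exec > ideal_runtime)
        resched_task(rq_of(cfs_rq)->curr);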
}
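
Stripped of kernel types, the tick check comes down to one comparison. The following is a user-space sketch with a hypothetical function name; the caller supplies ideal_runtime, which the kernel obtains from sched_slice.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of check_preempt_tick(): true when the time used since the
 * process was last picked exceeds its share of the period.
 */
static bool model_tick_needs_resched(uint64_t sum_exec_runtime,
                                     uint64_t prev_sum_exec_runtime,
                                     uint64_t ideal_runtime)
{
    uint64_t delta_exec = sum_exec_runtime - prev_sum_exec_runtime;

    return delta_exec > ideal_runtime;
}
```

For process A from the overview (a 10ms share), the check stays false while A has used 8ms of the current stint and becomes true once it has used 11ms.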

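Appendix: a small test

Looking at the prio_to_weight array, the weights for different nice values are very far apart: the largest weight is about 6000 times the smallest.
Does that mean that if two processes A and B run at the same time, A with nice -20 and B with nice 19, A gets
about 6000 times the run time of B? Yes. I ran an experiment, starting with this program: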

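#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    errno = 0;
    if (fork()) {
        /* parent: highest priority (a negative nice value needs root) */
        setpriority(PRIO_PROCESS, 0, -20);
    } else {
        /* child: lowest priority */
        setpriority(PRIO_PROCESS, 0, 19);
    }
    if (errno) {
        printf("failed\n");
        exit(EXIT_FAILURE);
    }
    printf("pid:%d\n", getpid());
    while (1)
        ; /* spin so that both processes compete for the CPU */
    return 0;
}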

Then insert the following module:

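#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

#define T1 0 /* placeholder: set to the pid of the first process */
#define T2 0 /* placeholder: set to the pid of the second process */
static int __init sched_test_init(void)
{
    struct task_struct *p;
    for_each_process(p) {
        if (p->pid == T1 || p->pid == T2)
            printk("%d runtime:%llu\n", p->pid,
                   (unsigned long long)p->se.sum_exec_runtime);
    }
    return -1; //return -1 so the module is never really inserted; we only need the output above
}
module_init(sched_test_init);
MODULE_LICENSE("GPL");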

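Then check the result with dmesg. I ran the test twice. The first time the nice values were 0 and 6, so the weight ratio is
1024 / 272 = 3.7647, and the measured run time ratio was 146390068851 / 38892147894 = 3.7640, which is very close.
The second time the nice values were -20 and 19, so the weight ratio is 88761 / 15 = 5917.4000, and the measured run time ratio was 187800781308 / 32603290 = 5760.1788, also close.
So the earlier conclusion that run time is proportional to weight holds.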

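In fact, once I started a program with nice -20, the whole system became extremely sluggish, almost a slide show, which also shows how much difference the nice value makes.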

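One more thing worth mentioning: although the whole system was sluggish, it still responded quickly to the mouse and keyboard, and there was almost no delay when I typed. This suggests that although CFS does not use a complicated empirical formula to identify
interactive processes, its design naturally gives interactive processes a response that may be even better than under the O(1) scheduler.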