EEVDF Code Walkthrough (Part 1)

R-Linux

Category: Process Scheduling. Tags: linux, C, algorithms

Article link: https://blog.youkuaiyun.com/2503_90541313/article/details/145646179

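Preface

EEVDF (Earliest Eligible Virtual Deadline First) is the scheduling algorithm that Peter Zijlstra implemented for Linux and that was merged into mainline in kernel 6.6, where it replaced CFS as the default scheduler for normal tasks. For a basic introduction, see the article "CFS的完美进化版 – EEVDF调度器简介" (roughly, "The perfect evolution of CFS: an introduction to the EEVDF scheduler"). This article walks through some of the key EEVDF code.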

1. How the next task is picked

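The core idea behind EEVDF's pick is:

Consider only eligible tasks (that is, tasks with lag >= 0)
Among the eligible tasks, pick the one with the earliest virtual deadline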
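
The two rules above can be written as a plain O(n) reference pick. This is only a sketch for illustration, not kernel code: `struct toy_se`, `toy_eligible` and `toy_pick` are hypothetical names, V is taken to be the weighted average of all vruntimes in the array, and the comparisons ignore the wraparound handling and min_vruntime-relative arithmetic that the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy entity; the fields only loosely mirror struct sched_entity. */
struct toy_se {
	int64_t vruntime;	/* v_i */
	int64_t deadline;	/* vd_i */
	int64_t weight;		/* w_i */
};

/*
 * Eligible means lag >= 0, i.e. V >= v_i. With V = sum(w_j * v_j) / sum(w_j)
 * this is equivalent to sum(w_j * v_j) >= v_i * sum(w_j), which avoids the
 * division.
 */
static int toy_eligible(const struct toy_se *q, size_t n, size_t i)
{
	int64_t sum_wv = 0, sum_w = 0;

	assert(i < n);
	for (size_t j = 0; j < n; j++) {
		sum_wv += q[j].weight * q[j].vruntime;
		sum_w += q[j].weight;
	}
	return sum_wv >= q[i].vruntime * sum_w;
}

/* Earliest deadline among the eligible entities; -1 if the queue is empty. */
static int toy_pick(const struct toy_se *q, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (!toy_eligible(q, n, i))
			continue;
		if (best < 0 || q[i].deadline < q[best].deadline)
			best = (int)i;
	}
	return best;
}
```

For example, with three equal-weight entities whose (vruntime, deadline) pairs are (100, 130), (90, 150) and (120, 125), V is about 103.3, so the third entity is not eligible even though its deadline is the earliest, and the first entity is picked.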

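Deadlines are tracked in the same red-black tree (which is ordered by vruntime): each se additionally maintains min_deadline, the smallest deadline in its subtree, so that the minimum virtual deadline can be found quickly
se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
This borrows the idea of a min-heap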
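
The min_deadline rule can be sketched on a plain binary tree as follows. This is an assumption-laden illustration rather than the kernel's augmented rbtree code: `struct toy_node`, `toy_update_min_deadline` and `toy_find_min_deadline` are hypothetical names, children are assumed to be up to date before their parent is recomputed, and deadlines are compared with a plain `<` instead of the wrap-safe comparison the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node carrying a deadline and the subtree minimum. */
struct toy_node {
	int64_t deadline;
	int64_t min_deadline;
	struct toy_node *left, *right;
};

static int64_t min64(int64_t a, int64_t b)
{
	return a < b ? a : b;
}

/* min_deadline = min(own deadline, left->min_deadline, right->min_deadline) */
static void toy_update_min_deadline(struct toy_node *n)
{
	assert(n != NULL);
	n->min_deadline = n->deadline;
	if (n->left)
		n->min_deadline = min64(n->min_deadline, n->left->min_deadline);
	if (n->right)
		n->min_deadline = min64(n->min_deadline, n->right->min_deadline);
}

/* Walk down to the node whose own deadline equals the subtree minimum. */
static struct toy_node *toy_find_min_deadline(struct toy_node *n)
{
	while (n) {
		if (n->deadline == n->min_deadline)
			return n;
		if (n->left && n->left->min_deadline == n->min_deadline)
			n = n->left;
		else
			n = n->right;
	}
	return NULL;
}
```

The descent visits one node per level, so the lookup costs O(height) instead of a scan over every entity. This is the same shape as the second loop of __pick_eevdf.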

Code

pick_next_entity is now straightforward; most of the logic lives in pick_eevdf

static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}

pick_eevdf

static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_eevdf(cfs_rq);

	if (!se) {
		struct sched_entity *left = __pick_first_entity(cfs_rq);
		if (left) {
			pr_err("EEVDF scheduling fail, picking leftmost\n");
			return left;
		}
	}

	return se;
}

__pick_eevdf

There are two main steps:

Find best_left, the branch that contains min_deadline
Within best_left, find the scheduling entity whose deadline equals min_deadline

static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *curr = cfs_rq->curr;
	struct sched_entity *best = NULL;
	struct sched_entity *best_left = NULL;

	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL;
	best = curr;

	/*
	 * Once selected, run a task until it either becomes non-eligible or
	 * until it gets a new slice. See the HACK in set_next_entity().
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;

	while (node) {
		struct sched_entity *se = __node_2_se(node);

		/*
		 * If this se is not eligible, go straight to its left subtree.
		 */
		if (!entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		/*
		 * If best is NULL or se has an earlier deadline, update best.
		 */
		if (!best || deadline_gt(deadline, best, se))
			best = se;

		/*
		 * Every se in a left branch is eligible, keep track of the
		 * branch with the best min_deadline
		 */
		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			/*
			 * best_left tracks the branch that contains the min_deadline; if the
			 * left subtree of this node has a smaller min_deadline, update best_left.
			 */
			if (!best_left || deadline_gt(min_deadline, best_left, left))
				best_left = left;

			/*
			 * If this se's min_deadline equals the min_deadline of its left
			 * subtree, the minimum lies in that (fully eligible) left branch,
			 * so stop searching here.
			 */
			if (left->min_deadline == se->min_deadline)
				break;
		}

		/* If this se's own deadline is the min_deadline, no need to search the right subtree */
		if (se->deadline == se->min_deadline)
			break;

		/* Otherwise min_deadline is in the right subtree */
		node = node->rb_right;
	}

	/*
	 * We ran into an eligible node which is itself the best.
	 * (Or nr_running == 0 and both are NULL)
	 */
	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
		return best;

	/*
	 * We now have best_left, the branch that contains min_deadline, and every
	 * task in that branch is eligible, so all that remains is to find the se
	 * in this branch whose deadline == min_deadline.
	 */
	node = &best_left->run_node;
	while (node) {
		struct sched_entity *se = __node_2_se(node);

		/* min_deadline is the current node */
		if (se->deadline == se->min_deadline)
			return se;

		/* min_deadline is in the left branch */
		if (node->rb_left &&
		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
			node = node->rb_left;
			continue;
		}

		/* else min_deadline is in the right branch */
		node = node->rb_right;
	}
	return NULL;
}

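2. Task placement

(1). How should the lag value be set?

Suppose we want to place a task while preserving a certain amount of lag. How should the lag value be set? Adding a task to the queue changes the total weight and therefore the average virtual time V, so the lag the task actually has after enqueue differs from the lag we set beforehand. We need to compute that deviation in advance so that the lag we set produces the effect we intended.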

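(2). Derivation

What we ultimately want is the effect of weight on lag, so the goal is the ratio between l_i', the lag after enqueue, and l_i, the lag set before enqueue. This ratio should depend on the queue's current total weight W and on the task's weight w_i.

Let l_i be the lag of task i, V the average virtual time, and v_i the virtual runtime of task i. Then: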

l_i = V - v_i ⇒ v_i = V - l_i

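The average virtual runtime after the task is enqueued, V', can be written as:

// W and V are the total weight and the average virtual runtime before the task is enqueued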
V' = (W * V + w_i * v_i) / (W + w_i)  
   = (W * V + w_i * (V - l_i)) / (W + w_i)
   = (V * (W + w_i) - w_i * l_i) / (W + w_i)
   = V - w_i * l_i / (W + w_i)

The lag after enqueue, l_i', can be written as:

l_i' = V' - v_i
     = V - w_i * l_i / (W + w_i) - (V - l_i)
     = l_i - w_i * l_i / (W + w_i)
     = l_i * W / (W + w_i)

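So for any nonzero lag, l_i' after enqueue is strictly smaller in magnitude than the l_i set before enqueue, and the ratio of the two lags equals the ratio of the total weights before and after enqueue:

l_i'/l_i = W/W' // W' is the total weight after enqueue

So if we multiply the lag by (W + w_i)/W before enqueue, the lag after enqueue matches what we intended.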
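
The derivation can be checked numerically with a small sketch. The helper names `toy_scale_lag` and `toy_lag_after_enqueue` are hypothetical, and the values are chosen so that the integer divisions are exact. The second helper recomputes V' directly from the weighted-average definition instead of using the closed form, so it serves as an independent check.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-scale the desired lag before enqueue: l_i * (W + w_i) / W */
static int64_t toy_scale_lag(int64_t lag, int64_t W, int64_t w_i)
{
	assert(W > 0);
	return lag * (W + w_i) / W;
}

/*
 * Place the task at v_i = V - placed_lag, recompute the weighted average
 * V' = (W * V + w_i * v_i) / (W + w_i), and return the resulting lag V' - v_i.
 */
static int64_t toy_lag_after_enqueue(int64_t V, int64_t W, int64_t w_i,
				     int64_t placed_lag)
{
	int64_t v_i = V - placed_lag;
	int64_t V_new;

	assert(W + w_i > 0);
	V_new = (W * V + w_i * v_i) / (W + w_i);
	return V_new - v_i;
}
```

With V = 10000, W = 3072 and w_i = 1024, placing the task with an unscaled lag of 3000 leaves it with a lag of only 2250 after enqueue (3000 * 3072 / 4096). Pre-scaling gives 4000, and after enqueue the lag is exactly the intended 3000.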

(3). Code

With the derivation above in mind, the code is easy to follow

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 vslice, vruntime = avg_vruntime(cfs_rq);
	s64 lag = 0;
	// Compute the virtual slice the task is entitled to, based on its weight
	se->slice = sysctl_sched_base_slice;
	vslice = calc_delta_fair(se->slice, se);

	/*
	 * Due to how V is constructed as the weighted average of entities,
	 * adding tasks with positive lag, or removing tasks with negative lag
	 * will move 'time' backwards, this can screw around with the lag of
	 * other tasks.
	 *
	 * EEVDF: placement strategy #1 / #2
	 */
	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
		struct sched_entity *curr = cfs_rq->curr;
		unsigned long load;

		lag = se->vlag;
		
		load = cfs_rq->avg_load;
		if (curr && curr->on_rq)
			load += scale_load_down(curr->load.weight);

		// Compute l_i * (W + w_i)
		lag *= load + scale_load_down(se->load.weight);
		if (WARN_ON_ONCE(!load))
			load = 1;
		// Compute l_i * (W + w_i) / W
		lag = div_s64(lag, load);
	}
	
	// Set the initial virtual runtime to the current average virtual time minus the lag: v_i = V - l_i
	se->vruntime = vruntime - lag;

	/*
	 * When joining the competition; the exisiting tasks will be,
	 * on average, halfway through their slice, as such start tasks
	 * off with half a slice to ease into the competition.
	 */
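	 /* Halve the new task's virtual slice so that it runs a little earlier.
	  * When the task joins, the tasks already queued have on average used about
	  * half of their slices, so halving the newcomer's slice puts its deadline
	  * roughly at the average and lets it run in time.
	  */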
	if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
		vslice /= 2;

	/*
	 * EEVDF: vd_i = ve_i + r_i/w_i
	 */
	se->deadline = se->vruntime + vslice;
}

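3. How the deadline is updated

The deadline is updated only once the current slice has been used up, that is, when vruntime >= deadline.
deadline = vruntime + vslice
where vslice is the requested virtual slice, which depends on the weight:
vslice = r_i / w_i
r_i is the slice actually requested. The larger the weight, the smaller vslice and the earlier the deadline, so the task is more likely to be picked first.

The principle is the same as for vruntime: for the same amount of physical time, the larger the weight, the smaller the increase in vruntime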
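
A minimal sketch of the vslice arithmetic follows. It assumes, as a simplification, that the virtual slice is the requested slice scaled by the nice-0 weight (1024) over the task's weight. The kernel's calc_delta_fair reaches the same kind of result through fixed-point arithmetic with precomputed inverse weights. `toy_vslice`, `toy_next_deadline` and the 3000000 ns slice in the example are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024	/* weight of a nice-0 task */

/* vslice = r_i * NICE_0_WEIGHT / w_i: heavier tasks get a shorter virtual slice. */
static uint64_t toy_vslice(uint64_t slice_ns, uint64_t weight)
{
	assert(weight > 0);
	return slice_ns * TOY_NICE_0_WEIGHT / weight;
}

/* deadline = vruntime + vslice */
static uint64_t toy_next_deadline(uint64_t vruntime, uint64_t slice_ns,
				  uint64_t weight)
{
	return vruntime + toy_vslice(slice_ns, weight);
}
```

For a requested slice of 3000000 ns, a task of weight 1024 gets a virtual slice of 3000000, a task of weight 2048 gets 1500000, and a task of weight 512 gets 6000000. At equal vruntime, the heavier task therefore has the earlier deadline.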

Code

static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/*
	 * If vruntime is still below the deadline, the slice has not been used up and no update is needed
	 */
	if ((s64)(se->vruntime - se->deadline) < 0)
		return;

	/*
	 * For EEVDF the virtual time slope is determined by w_i (iow.
	 * nice) while the request time r_i is determined by
	 * sysctl_sched_base_slice.
	 */
	se->slice = sysctl_sched_base_slice;

	/*
	 * The deadline is the current vruntime plus the requested virtual slice (which depends on the weight w_i)
	 * EEVDF: vd_i = ve_i + r_i / w_i
	 */
	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);

	/*
	 * The current task has used up its requested slice; request a reschedule so it can be preempted
	 */
	if (cfs_rq->nr_running > 1) {
		resched_curr(rq_of(cfs_rq));
		clear_buddies(cfs_rq, se);
	}
}
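
The check `(s64)(se->vruntime - se->deadline) < 0` in update_deadline compares through a signed difference instead of using `<` directly, which stays correct when the u64 counters wrap around. A small sketch of the idea, with hypothetical helper names `toy_before` and `toy_wrap_demo`. The conversion from uint64_t to int64_t relies on two's-complement behaviour, which mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* True if a is "before" b, even when the counters have wrapped around. */
static int toy_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

static void toy_wrap_demo(void)
{
	uint64_t deadline = UINT64_MAX - 5;	/* just before wraparound */
	uint64_t vruntime = deadline + 10;	/* wraps around to 4 */

	/* A naive comparison claims the slice is not yet used up. */
	assert(vruntime < deadline);
	/* The signed difference sees that vruntime is 10 past the deadline. */
	assert(!toy_before(vruntime, deadline));
}
```

The comparison is only meaningful while the two values are less than 2^63 apart. That is assumed to hold here, because a task's vruntime and deadline stay close to each other.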