Documentation/CFS Scheduler

本文详细介绍了Linux 2.6.23版本中引入的完全公平调度器(CFS)。CFS旨在为桌面应用程序提供更公平的CPU分配,通过使用虚拟运行时间和红黑树数据结构实现任务调度,支持多种调度策略如SCHED_NORMAL、SCHED_BATCH和SCHED_IDLE。
Chinese translated version of Documentation/CFS Scheduler


If you have any comment or update to the content, please contact the
original document maintainer directly.  However, if you have a problem
communicating in English you can also ask the Chinese maintainer for
help.  Contact the Chinese maintainer if this translation is outdated
or if there is a problem with the translation.


Chinese maintainer: 赵晶  anana53@qq.com
---------------------------------------------------------------------
Documentation/CFS Scheduler 的中文翻译


如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
译存在问题,请联系中文版维护者。


中文版维护者: 赵晶  anana53@qq.com
中文版翻译者: 赵晶  anana53@qq.com
中文版校译者: 赵晶  anana53@qq.com


以下为正文
---------------------------------------------------------------------
                      =============
                      CFS Scheduler
                      =============




1.  OVERVIEW


CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.


                      =============
                      CFS 调度器
                      =============
1、综述


CFS 的意思是“完全公平调度器”(Completely Fair Scheduler),是由 Ingo Molnar 实现、
并在 Linux 2.6.23 中合并的新“桌面”进程调度器。它取代了此前原生(vanilla)调度器中
SCHED_OTHER 的交互性代码。


80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.


CFS 80% 的设计可以用一句话概括:CFS 基本上是在真实硬件上模拟了一个“理想的、精确的多任务 CPU”。


"Ideal multi-tasking CPU" is a (non-existent  :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed.  For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.


“理想的多任务 CPU”是一种(并不存在的 :-))CPU:它拥有 100% 的物理算力,
能够以精确相等的速度并行运行每个任务,每个任务各占 1/nr_running 的速度。
例如:如果有 2 个任务在运行,那么每个任务各以 50% 的物理算力运行,也就是说,真正地并行执行。


On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime."  The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above.  In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.


在真实的硬件上,我们一次只能运行一个任务,因此必须引入“虚拟运行时间”(virtual runtime)
的概念。任务的虚拟运行时间指明:在上面描述的理想多任务 CPU 上,该任务的下一个时间片
将在何时开始执行。在实践中,任务的虚拟运行时间就是其实际运行时间按正在运行的任务总数
进行归一化后的值。
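

To make the normalization above concrete, here is a minimal C sketch of how a
task's virtual runtime might advance.  It is an illustration only, loosely
modeled on CFS's weight-based scaling (the real update lives in sched/fair.c);
the names toy_task and toy_update_vruntime are invented for this example.

    #include <stdint.h>

    #define NICE_0_WEIGHT 1024          /* weight of a nice-0 task */

    struct toy_task {
            uint64_t      vruntime;     /* nanoseconds, normalized         */
            unsigned long weight;       /* load weight from the nice level */
    };

    /*
     * Advance a task's virtual runtime after it ran for delta_ns of real
     * CPU time.  A nice-0 task ages at wall-clock speed; heavier
     * (higher-weight) tasks age more slowly, lighter ones faster.
     */
    static void toy_update_vruntime(struct toy_task *t, uint64_t delta_ns)
    {
            t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
    }

With two runnable tasks of equal weight that share the CPU fairly, both
vruntime values advance at the same rate, which is exactly the 1/nr_running
behaviour described above.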


2.  FEW IMPLEMENTATION DETAILS


In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.


[ small detail: on "ideal" hardware, at any time all tasks would have the same
  p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
  would ever get "out of balance" from the "ideal" share of CPU time.  ]


CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far).  CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.


Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.


2、一些实现细节


在 CFS 中,虚拟运行时间通过每个任务的 p->se.vruntime(以纳秒为单位)值来表示和跟踪。
这样就可以精确地打时间戳,并衡量一个任务“本应获得的 CPU 时间”。


[小细节:在“理想”硬件上,任何时刻所有任务的 p->se.vruntime 值都相同,
也就是说,所有任务会同时执行,没有任何任务会偏离“理想”的 CPU 时间份额而“失衡”。]


CFS 的任务选择逻辑基于这个 p->se.vruntime 值,因此非常简单:它总是尝试运行
p->se.vruntime 值最小的任务(即到目前为止执行得最少的任务)。CFS 总是尽可能地
把 CPU 时间在可运行任务之间划分得接近“理想的多任务硬件”。


CFS 其余的大部分设计都由这个非常简单的概念自然推导而来,再加上一些附加修饰,
例如 nice 级别、多处理器支持,以及用于识别睡眠者的各种算法变体。
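

As a toy illustration of this pick logic (before the rbtree of section 3
enters the picture), the sketch below simply scans an array of runnable tasks
and returns the one with the smallest vruntime.  It reuses the hypothetical
struct toy_task from the earlier sketch; the real kernel reads the leftmost
node of a red-black tree instead of scanning.

    /*
     * Pick the runnable task that has executed least so far (smallest
     * vruntime).  Purely illustrative: CFS never scans, it keeps tasks
     * sorted in an rbtree and takes the leftmost entry.
     */
    static struct toy_task *toy_pick_next(struct toy_task *tasks, int nr_running)
    {
            struct toy_task *best = NULL;
            int i;

            for (i = 0; i < nr_running; i++) {
                    if (!best || tasks[i].vruntime < best->vruntime)
                            best = &tasks[i];
            }
            return best;
    }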






3.  THE RBTREE


CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).


CFS also maintains the rq->cfs.min_vruntime value, which is a monotonic
increasing value tracking the smallest vruntime among all tasks in the
runqueue.  The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.


The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.


3、RBTREE


CFS 的设计相当激进:它没有为运行队列沿用旧的数据结构,而是使用按时间排序的
红黑树(rbtree)为未来的任务执行构建一条“时间线”,因此不存在“数组切换”(array
switch)带来的副作用(以前的原生调度器和 RSDL/SD 都受其影响)。


CFS 还维护 rq->cfs.min_vruntime 值,它是一个单调递增的值,跟踪运行队列中所有任务
里最小的 vruntime。系统完成的工作总量就用 min_vruntime 来跟踪;该值用于把新激活的
调度实体尽可能放置在红黑树的左侧。


运行队列中正在运行的任务总数通过 rq->cfs.load 值来统计,该值是排在该运行队列上的
所有任务的权重之和。
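

A minimal sketch of the two pieces of bookkeeping just described, using
hypothetical names (the real fields live in the kernel's struct cfs_rq):

    #include <stdint.h>

    struct toy_cfs_rq {
            uint64_t      min_vruntime;   /* monotonically increasing floor */
            unsigned long load;           /* sum of queued task weights     */
            unsigned long nr_running;
    };

    /*
     * Keep min_vruntime monotonic: it only moves forward, toward the
     * smallest vruntime currently in the queue (the leftmost task).
     */
    static void toy_update_min_vruntime(struct toy_cfs_rq *cfs_rq,
                                        uint64_t leftmost_vruntime)
    {
            if (leftmost_vruntime > cfs_rq->min_vruntime)
                    cfs_rq->min_vruntime = leftmost_vruntime;
    }

    /* Account a newly enqueued task's weight into the runqueue load. */
    static void toy_account_enqueue(struct toy_cfs_rq *cfs_rq, unsigned long weight)
    {
            cfs_rq->load += weight;
            cfs_rq->nr_running++;
    }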


CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to
account for possible wraparounds).  CFS picks the "leftmost" task from this
tree and sticks to it.
As the system progresses forwards, the executed tasks are put into the tree
more and more to the right --- slowly but surely giving a chance for every task
to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.


Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime.  Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and trash the cache), then the new leftmost task is
picked and the current task is preempted.


CFS 维护一棵按时间排序的红黑树,所有可运行的任务都以 p->se.vruntime 为键排序
(其中会减去 rq->cfs.min_vruntime,以应对可能的回绕)。CFS 从这棵树上挑选“最左端”
的任务并一直运行它。随着系统向前推进,已经执行过的任务会被放到树中越来越靠右的
位置,缓慢但确定地让每个任务都有机会成为“最左端的任务”,从而在确定的时间内获得 CPU。


总结起来,CFS 是这样工作的:它让一个任务运行一小会儿,当该任务发生调度(或者调度器
时钟节拍到来)时,就对这个任务的 CPU 使用进行“记账”:把它刚刚使用物理 CPU 的那一小段
时间加到 p->se.vruntime 上。一旦 p->se.vruntime 增大到使另一个任务成为它所维护的按
时间排序的红黑树上的“最左端任务”(再加上相对于最左端任务的一小段“粒度”距离,以免
过度调度任务、破坏缓存),新的最左端任务就会被选中,当前任务则被抢占。
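

The following sketch ties these pieces together: tasks are keyed by vruntime
minus min_vruntime (so keys stay small and tolerate counter wraparound), and a
waiting task only preempts the current one once it is ahead by more than a
small granularity.  The rbtree itself is elided; the function names and the
granularity value are invented for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define TOY_GRANULARITY_NS 1000000LL    /* 1 ms, an arbitrary example value */

    /*
     * Key used to order tasks in the time-ordered tree.  Subtracting
     * min_vruntime keeps the key relative, which copes with wraparound.
     */
    static int64_t toy_entity_key(uint64_t vruntime, uint64_t min_vruntime)
    {
            return (int64_t)(vruntime - min_vruntime);
    }

    /*
     * Should the leftmost (smallest-vruntime) task preempt the currently
     * running one?  Only when it leads by more than a small granularity,
     * so that we do not over-schedule tasks and trash the cache.
     */
    static bool toy_should_preempt(uint64_t curr_vruntime,
                                   uint64_t leftmost_vruntime)
    {
            int64_t delta = (int64_t)(curr_vruntime - leftmost_vruntime);

            return delta > TOY_GRANULARITY_NS;
    }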


4.  SOME FEATURES OF CFS


CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever.  There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):


   /proc/sys/kernel/sched_min_granularity_ns


which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads.  It defaults to a setting suitable
for desktop workloads.  SCHED_BATCH is handled by the CFS scheduler module too.


Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c, massive_intr.c all work fine and do not impact
interactivity and produce the expected behavior.


The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively.


SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used.  The balancing code got quite a bit simpler as a
result.


4、CFS的一些特点


CFS 使用纳秒粒度进行记账,不依赖任何 jiffies 或其他 HZ 细节。因此,CFS 调度器没有
以前调度器那种“时间片”的概念,也没有任何启发式算法。它只有一个核心可调参数
(需要打开 CONFIG_SCHED_DEBUG):


   /proc/sys/kernel/sched_min_granularity_ns


它可以用来把调度器从“桌面”型(即低延迟)负载调整到“服务器”型(即适合批处理)负载。
它的默认值适合桌面负载。SCHED_BATCH 也由 CFS 调度器模块处理。
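

For example, assuming a kernel built with CONFIG_SCHED_DEBUG, the tunable can
be read and raised for a more server-like (batchier) behaviour as follows (the
value written is only an example):

# cat /proc/sys/kernel/sched_min_granularity_ns
# echo 4000000 > /proc/sys/kernel/sched_min_granularity_ns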


由于其设计,CFS 调度器不会受到如今针对原有调度器启发式算法的各种“攻击”的影响:
fiftyp.c、thud.c、chew.c、ring-test.c、massive_intr.c 都能正常工作,不会影响
交互性,并产生预期的行为。


与以前的原生调度器相比,CFS 调度器对 nice 级别和 SCHED_BATCH 的处理要强得多:
这两类负载都被更加彻底地隔离开来。


SMP 负载均衡已经被重写/整理:负载均衡代码中不再有遍历运行队列的假设,改为使用
调度模块提供的迭代器。均衡代码因此简化了不少。


5. Scheduling policies


CFS implements three scheduling policies:


  - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
    policy that is used for regular tasks.


  - SCHED_BATCH: Does not preempt nearly as often as regular tasks
    would, thereby allowing tasks to run longer and make better use of
    caches but at the cost of interactivity. This is well suited for
    batch jobs.


  - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
    idle timer scheduler, in order to avoid getting into priority
    inversion problems which would deadlock the machine.


SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.


The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.


5、调度策略


CFS实现了三种调度策略:


  - SCHED_NORMAL(传统上称为 SCHED_OTHER):用于普通任务的调度策略。


  - SCHED_BATCH:它不会像普通任务那样频繁地抢占,因此任务可以运行得更久、
    更好地利用缓存,但代价是交互性变差。它非常适合批处理作业。


  - SCHED_IDLE:它比 nice 19 还要弱,但它并不是真正的空闲定时器调度器,
    这是为了避免陷入会导致机器死锁的优先级反转问题。


SCHED_FIFO/_RR 在 sched/rt.c 中实现,其行为遵循 POSIX 的规定。
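

As an illustration of requesting one of these POSIX policies from user space,
the short program below asks for SCHED_FIFO via sched_setscheduler(); the
priority value is an arbitrary example and the call needs suitable privileges.

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            /* Request SCHED_FIFO for the calling process (pid 0). */
            struct sched_param sp = { .sched_priority = 10 };

            if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
                    perror("sched_setscheduler");
                    return 1;
            }
            return 0;
    }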


util-linux-ng 2.13.1.1 中的 chrt 命令可以设置除 SCHED_IDLE 之外的所有这些策略。
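

For example (the program name and <pid> are placeholders; options as provided
by util-linux's chrt):

# chrt -b 0 ./my_batch_job     # start a command under SCHED_BATCH
# chrt -p <pid>                # show the policy of an existing task
# chrt -b -p 0 <pid>           # switch an existing task to SCHED_BATCH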


6.  SCHEDULING CLASSES


The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules.  These modules
encapsulate scheduling policy details and are handled by the scheduler core
without the core code assuming too much about them.


sched/fair.c implements the CFS scheduler described above.


sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did.  It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.


Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.



6、调度类


新的 CFS 调度器在设计上引入了“调度类”(Scheduling Classes),即一个可扩展的调度器
模块层次结构。这些模块封装了各自调度策略的细节,由调度器核心代码统一处理,而核心
代码无需对它们做过多假设。


sched/fair.c 实现了上文所述的 CFS 调度器。


sched/rt.c 以比以前的原生调度器更简单的方式实现了 SCHED_FIFO 和 SCHED_RR 的语义。
它使用 100 个运行队列(对应全部 100 个实时优先级,而不是以前调度器中的 140 个),
并且不需要过期数组(expired array)。


调度类通过 sched_class 结构实现,该结构包含一组钩子函数,每当相关事件发生时,
这些钩子就必须被调用。


This is the (partial) list of the hooks:


 - enqueue_task(...)


   Called when a task enters a runnable state.
   It puts the scheduling entity (task) into the red-black tree and
   increments the nr_running variable.


 - dequeue_task(...)


   When a task is no longer runnable, this function is called to keep the
   corresponding scheduling entity out of the red-black tree.  It decrements
   the nr_running variable.


 - yield_task(...)


   This function is basically just a dequeue followed by an enqueue, unless the
   compat_yield sysctl is turned on; in that case, it places the scheduling
   entity at the right-most end of the red-black tree.


 - check_preempt_curr(...)


   This function checks if a task that entered the runnable state should
   preempt the currently running task.


 - pick_next_task(...)


   This function chooses the most appropriate task eligible to run next.


 - set_curr_task(...)


   This function is called when a task changes its scheduling class or changes
   its task group.


 - task_tick(...)


   This function is mostly called from time tick functions; it might lead to
   process switch.  This drives the running preemption.


这是钩子函数的 (部分) 列表:


 - enqueue_task(...)


   当任务进入可运行状态时调用。它把调度实体(任务)放入红黑树,
   并递增 nr_running 变量。


 - dequeue_task(...)


   当任务不再可运行时,调用此函数把相应的调度实体移出红黑树。
   它会递减 nr_running 变量。


 - yield_task(...)


   此函数基本上相当于一次出队紧跟一次入队,除非 compat_yield sysctl 被打开;
   在那种情况下,它会把调度实体放到红黑树的最右端。


 - check_preempt_curr(...)


   此函数检查刚进入可运行状态的任务是否应当抢占当前正在运行的任务。


 - pick_next_task(...)


   此函数选择下一个最适合运行的任务。


 - set_curr_task(...)


   当一项任务更改其调度类或更改其任务组时将调用此函数。


 - task_tick(...)


   此函数主要由时钟节拍(tick)相关函数调用,它可能导致进程切换,
   从而驱动运行中的抢占。
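

To show the shape of such a scheduling class, here is a much-simplified sketch
of a structure of hooks.  The field names follow the list above, but the
argument lists are reduced placeholders, not the kernel's actual prototypes.

    struct rq;              /* per-CPU runqueue (opaque here) */
    struct task_struct;     /* a task (opaque here)           */

    /*
     * Simplified scheduling-class descriptor: the scheduler core calls
     * these hooks without needing to know the policy details behind them.
     */
    struct toy_sched_class {
            void (*enqueue_task)(struct rq *rq, struct task_struct *p);
            void (*dequeue_task)(struct rq *rq, struct task_struct *p);
            void (*yield_task)(struct rq *rq);
            void (*check_preempt_curr)(struct rq *rq, struct task_struct *p);
            struct task_struct *(*pick_next_task)(struct rq *rq);
            void (*set_curr_task)(struct rq *rq);
            void (*task_tick)(struct rq *rq, struct task_struct *p);
    };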


7.  GROUP SCHEDULER EXTENSIONS TO CFS


Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group.  For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.


CONFIG_CGROUP_SCHED strives to achieve exactly that.  It lets tasks be
grouped and divides CPU time fairly among such groups.


CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.


CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.


   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
   Documentation/cgroups/cgroups.txt for more information about this filesystem.


When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem.  See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.


7、CFS 的组调度器扩展


通常情况下,调度器针对单个任务进行操作,力求为每个任务提供公平的 CPU 时间。
有时,可能需要把任务分组,并为每个这样的任务组提供公平的 CPU 时间。例如,可能
希望先为系统上的每个用户提供公平的 CPU 时间,再在属于同一用户的各个任务之间
公平分配。


CONFIG_CGROUP_SCHED 正是为了实现这一点。它允许把任务分组,并在这些组之间
公平地划分 CPU 时间。


CONFIG_RT_GROUP_SCHED 允许对实时任务(即 SCHED_FIFO 和 SCHED_RR)进行分组。


CONFIG_FAIR_GROUP_SCHED 允许对 CFS 任务(即 SCHED_NORMAL 和 SCHED_BATCH)进行分组。


   这些选项需要定义 CONFIG_CGROUPS,它们允许管理员使用“cgroup”伪文件系统创建
   任意的任务组。关于该文件系统的更多信息,请参阅
   Documentation/cgroups/cgroups.txt。


当定义了 CONFIG_FAIR_GROUP_SCHED 时,会为使用该伪文件系统创建的每个组生成一个
"cpu.shares" 文件。参见下面的示例步骤,了解如何使用 "cgroups" 伪文件系统创建
任务组并修改它们的 CPU 份额。


# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu


# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks


# #Configure the multimedia group to receive twice the CPU bandwidth
# #that of browser group


# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares


# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks


# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
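

To verify the setup, the share values and group membership can be read back
from the same pseudo filesystem (paths as created in the steps above):

# cat multimedia/cpu.shares browser/cpu.shares
# cat browser/tasks multimedia/tasks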