cgroup source code analysis, based on the CentOS 3.10.0-693.25.4 kernel


License: CC 4.0 BY-SA

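This article analyzes the implementation of cgroup (control groups) in the Linux kernel, including data structures such as css_set and cgroup_subsys_state, the filesystem interface, and the subsystem implementations. cgroup is used to manage and limit process resources; at its core it provides resource control through these data structures and the filesystem interface. The cgroup_attach_task function plays the key role when a task is attached to a cgroup. The cgroup framework allows subsystems to be added to limit different resources; for example, the cpuacct subsystem accounts for the CPU usage of processes.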


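After a kernel upgrade, the test team ran the ltprun suite and found that cgroup had stopped working once it finished. The system looked healthy and the kernel logged no errors, and I was not familiar with the cgroup implementation. After digging through the code for a while I found that cgroup registers a notifier chain for CPU hotplug. When I then checked what LTP actually tests, CPU hotplug was right there on the list. I should have looked at what LTP tests first; that would have saved a lot of effort.
What cgroup is and what it is used for is not repeated here; this article analyzes parts of the cgroup implementation. cgroup can be divided into three parts: the data structures that describe objects such as subsystems (subsys) and cgroups and the attachment relationships between them; the filesystem interface exposed to user space; and the individual subsystems.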

Data structures

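First, the concepts used in cgroup:

Task. In cgroups, a task is a process in the system.
Control group. A control group is a set of processes grouped by some criterion. Resource control in cgroups is implemented per control group. A process can join a control group and can also migrate from one control group to another. Processes in a control group can use the resources that cgroups allocates to that control group, and are subject to the limits that cgroups sets for it.
Hierarchy. Control groups can be organized hierarchically, that is, as a tree of control groups. A child node in the tree is a child of its parent control group and inherits certain attributes from it.
Subsystem. A subsystem is a resource controller; for example, the cpu subsystem is the controller for CPU time allocation. A subsystem must be attached to a hierarchy to take effect. Once a subsystem is attached to a hierarchy, every control group in that hierarchy is controlled by it.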

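These concepts are related as follows:

Whenever a new hierarchy is created in the system, all tasks in the system are initially members of that hierarchy's default cgroup (called the root cgroup; it is created automatically together with the hierarchy, and every cgroup created later in the hierarchy is its descendant).
A subsystem can be attached to at most one hierarchy.
A hierarchy can have multiple subsystems attached.
A task can be a member of multiple cgroups, but those cgroups must be in different hierarchies.
When a process (task) creates a child process (task), the child automatically becomes a member of the cgroup its parent is in. The child can then be moved to a different cgroup as needed, but it always starts out inheriting its parent's cgroup.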
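
To make the rules above concrete, here is a small userspace sketch. It is not kernel code: the toy_ types, MAX_HIERARCHIES and both functions are made up for illustration. It only models two rules, "at most one cgroup per hierarchy for each task" and "a child starts in its parent's cgroups".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HIERARCHIES 4   /* hypothetical limit for this sketch */

/* Hypothetical, simplified types; not the kernel's definitions. */
struct toy_cgroup {
    int hierarchy_id;       /* which hierarchy this cgroup lives in */
    const char *name;
};

struct toy_task {
    /* One slot per hierarchy: a task is in one cgroup of each hierarchy. */
    struct toy_cgroup *membership[MAX_HIERARCHIES];
};

/* Attaching a task replaces its cgroup in that hierarchy only. */
static void toy_attach(struct toy_task *t, struct toy_cgroup *cg)
{
    assert(cg->hierarchy_id >= 0 && cg->hierarchy_id < MAX_HIERARCHIES);
    t->membership[cg->hierarchy_id] = cg;
}

/* A forked child starts in the same cgroups as its parent. */
static void toy_fork(const struct toy_task *parent, struct toy_task *child)
{
    for (int i = 0; i < MAX_HIERARCHIES; i++)
        child->membership[i] = parent->membership[i];
}
```

Because membership is indexed by hierarchy, moving a task to another cgroup of the same hierarchy overwrites the old slot, while its cgroups in other hierarchies stay untouched.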

css_set

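The relationship between tasks and cgroups is many-to-many, between a cgroup and subsystems one-to-many, and between tasks and subsystems many-to-many as well (a task can be attached to several cgroups, and one cgroup may have several subsystems and many tasks attached). These relationships are not easy to describe. If a task had to go through each cgroup to reach each subsys and then fetch the resource limits from the subsys, access would be inefficient. From the task's point of view, however, the limits that each subsystem imposes on it are fixed. The kernel therefore uses css_set to describe a combination of per-subsystem states. A task learns which limits apply to it through its css_set, which speeds up access, and because the number of such combinations is limited, this also reduces the complexity of the kernel data structures.
Now look at struct css_set. A css_set represents one set of resource limits (for example, cpu 20% mem 40% and cpu 30% mem 20% are different resource limits and use different css_sets) and connects processes to subsystems.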

struct css_set {

	/* Reference count */
	atomic_t refcount;

	/*
	 * List running through all cgroup groups in the same hash
	 * slot. Protected by css_set_lock
	 */
	struct hlist_node hlist;	// links into the global css_set hash table

	/*
	 * List running through all tasks using this cgroup
	 * group. Protected by css_set_lock
	 */
	struct list_head tasks;	// list of processes that reference this css_set

	/*
	 * List of cg_cgroup_link objects on link chains from
	 * cgroups referenced from this css_set. Protected by
	 * css_set_lock
	 */
	struct list_head cg_links;	// list of cgroups associated with this css_set, linked through cg_cgroup_link structures

	/*
	 * Set of subsystem states, one for each subsystem. This array
	 * is immutable after creation apart from the init_css_set
	 * during subsystem registration (at boot time) and modular subsystem
	 * loading/unloading.
	 */
	// Pointers to the concrete subsys states of this css_set; every subsys has a slot here, whether or not it is actually used
	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

	/* For RCU-protected deletion */
	struct rcu_head rcu_head;
};

// process descriptor
struct task_struct {
...
#ifdef CONFIG_CGROUPS
	/* Control Group info protected by css_set_lock */
	struct css_set __rcu *cgroups;	// points to the css_set associated with this task
	/* cg_list protected by css_set_lock and tsk->alloc_lock */
	struct list_head cg_list;	// links into the tasks list of the associated css_set
#endif
...
};

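The tasks list of a css_set holds all processes that use that css_set. The cgroups pointer in a process's task_struct points to the css_set of that process, and the process is linked into that css_set's tasks list through cg_list.
A css_set is linked through hlist into the global css_set_table hash table, which makes css_sets easy to look up.
cg_links links all the cgroups related to this css_set. The cgroups are not linked into this list_head directly but through cg_cgroup_link structures.
subsys is an array of cgroup_subsys_state pointers; every subsys has an instance structure in it.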
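
The lookup side of this can be sketched in userspace. The code below is only a model under my own assumptions, not the kernel's implementation: the toy_ names, the table size and the hash function are invented. It shows the idea that tasks with the same combination of per-subsystem states end up sharing one refcounted css_set found through a hash table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SUBSYS_COUNT 3   /* stand-in for CGROUP_SUBSYS_COUNT */
#define TOY_HASH_SIZE    8   /* hypothetical table size */

struct toy_css { int limit; };   /* stand-in for cgroup_subsys_state */

struct toy_css_set {
    int refcount;                              /* users of this set */
    struct toy_css *subsys[TOY_SUBSYS_COUNT];  /* one slot per subsystem */
    struct toy_css_set *next;                  /* hash chain, like hlist */
};

static struct toy_css_set *toy_table[TOY_HASH_SIZE];

/* Hash the pointer values of the per-subsystem states. */
static unsigned toy_hash(struct toy_css *const css[])
{
    uintptr_t h = 0;

    for (int i = 0; i < TOY_SUBSYS_COUNT; i++)
        h = h * 31 + (uintptr_t)css[i];
    return (unsigned)(h % TOY_HASH_SIZE);
}

/* Return the set for this combination, creating it on first use. */
static struct toy_css_set *toy_find_css_set(struct toy_css *const css[])
{
    unsigned slot = toy_hash(css);
    struct toy_css_set *cs;

    for (cs = toy_table[slot]; cs; cs = cs->next) {
        if (memcmp(cs->subsys, css, sizeof(cs->subsys)) == 0) {
            cs->refcount++;          /* reuse the existing set */
            return cs;
        }
    }
    cs = calloc(1, sizeof(*cs));
    assert(cs != NULL);
    memcpy(cs->subsys, css, sizeof(cs->subsys));
    cs->refcount = 1;
    cs->next = toy_table[slot];      /* push onto the hash chain */
    toy_table[slot] = cs;
    return cs;
}
```

Two tasks limited by the same cpu and mem states get the same toy_css_set pointer back; a task with a different cpu state gets a new one.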

A css_set can be called a cgroup group, that is, it represents a group of cgroups; one css_set can be associated with many cgroups.
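
The many-to-many link between cgroups and css_sets goes through a separate link object that sits on two chains at once. The sketch below models that idea with plain singly linked chains instead of the kernel's list_head; all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_cgroup;
struct toy_css_set;

/* One link object per (cgroup, css_set) pair, chained on both sides. */
struct toy_link {
    struct toy_cgroup *cgrp;
    struct toy_css_set *cg;
    struct toy_link *next_in_cgroup;   /* chain headed at cgroup->css_sets */
    struct toy_link *next_in_set;      /* chain headed at css_set->cg_links */
};

struct toy_cgroup  { const char *name; struct toy_link *css_sets; };
struct toy_css_set { int id; struct toy_link *cg_links; };

/* Record that css_set s includes cgroup c. */
static struct toy_link *toy_link_pair(struct toy_cgroup *c,
                                      struct toy_css_set *s)
{
    struct toy_link *l = malloc(sizeof(*l));

    assert(l != NULL);
    l->cgrp = c;
    l->cg = s;
    l->next_in_cgroup = c->css_sets;
    c->css_sets = l;
    l->next_in_set = s->cg_links;
    s->cg_links = l;
    return l;
}

/* Number of cgroups that a css_set is associated with. */
static int toy_cgroups_of(const struct toy_css_set *s)
{
    int n = 0;

    for (const struct toy_link *l = s->cg_links; l; l = l->next_in_set)
        n++;
    return n;
}

/* Number of css_sets that a cgroup takes part in. */
static int toy_sets_of(const struct toy_cgroup *c)
{
    int n = 0;

    for (const struct toy_link *l = c->css_sets; l; l = l->next_in_cgroup)
        n++;
    return n;
}
```

Starting from either side, walking the matching chain and reading the other pointer of each link gives the objects on the opposite side.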

cgroup_subsys_state

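Next, struct cgroup_subsys_state. This structure is the bridge from a css_set to a concrete subsys instance. css_set holds an array of cgroup_subsys_state pointers with CGROUP_SUBSYS_COUNT elements, which means every subsys has a cgroup_subsys_state instance in it, but this structure does not itself contain the actual control information. So where is that information? cgroup_subsys_state plays a role similar to kobject: it holds the information common to all subsystems. The subsystem's own instance structure is obtained from it with container_of, and the subsystem's private control information lives in that structure.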

/* Per-subsystem/per-cgroup state maintained by the system. */
struct cgroup_subsys_state {
	/*
	 * The cgroup that this subsystem is attached to. Useful
	 * for subsystems that want to know about the cgroup
	 * hierarchy structure
	 */
	struct cgroup *cgroup;	// the cgroup this subsys state is attached to

	/*
	 * State maintained by the cgroup system to allow subsystems
	 * to be "busy". Should be accessed via css_get(),
	 * css_tryget() and css_put().
	 */

	atomic_t refcnt;

	unsigned long flags;
	/* ID for this css, if possible */
	struct css_id __rcu *id;

	/* Used to put @cgroup->dentry on the last css_put() */
	struct work_struct dput_work;
};

A task can find the cgroup it is attached to through task_struct->cgroups->subsys[subsys_id]->cgroup.
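
The container_of step and the pointer chain from a task to its cgroup can be tried out in userspace. This is a minimal sketch with invented toy_ types; toy_container_of is a simplified version of the kernel macro without its type checking, and toy_cpuacct is only a stand-in for a subsystem's private structure.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace version of the kernel's container_of(), without type checks. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define TOY_SUBSYS_COUNT 2
enum { TOY_CPUACCT_ID = 0, TOY_MEM_ID = 1 };   /* hypothetical ids */

struct toy_cgroup { const char *name; };

/* Common part shared by every subsystem, like cgroup_subsys_state. */
struct toy_css { struct toy_cgroup *cgroup; };

/* A subsystem's private state embeds the common part. */
struct toy_cpuacct {
    unsigned long long usage_ns;   /* private accounting data */
    struct toy_css css;
};

struct toy_css_set { struct toy_css *subsys[TOY_SUBSYS_COUNT]; };
struct toy_task { struct toy_css_set *cgroups; };

/* task -> css_set -> per-subsystem state -> private structure */
static struct toy_cpuacct *toy_task_cpuacct(const struct toy_task *t)
{
    struct toy_css *css = t->cgroups->subsys[TOY_CPUACCT_ID];

    assert(css != NULL);
    return toy_container_of(css, struct toy_cpuacct, css);
}

/* task -> css_set -> per-subsystem state -> cgroup */
static struct toy_cgroup *toy_task_cgroup(const struct toy_task *t, int id)
{
    return t->cgroups->subsys[id]->cgroup;
}
```

The css_set only stores pointers to the embedded common part; subtracting the member offset recovers the enclosing private structure.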

cgroup

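struct cgroup describes a cgroup (one control group). In the cgroup filesystem, every directory is a control group. The main job of struct cgroup is to tie css_sets and subsystems together. No direct link from a task to a cgroup is needed: although each directory in the filesystem contains a tasks file, a task reaches its cgroup through its css_set.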

struct cgroup {
	unsigned long flags;		/* "unsigned long" so bitops work */

	/*
	 * count users of this cgroup. >0 means busy, but doesn't
	 * necessarily indicate the number of tasks in the cgroup
	 */
	atomic_t count;

	int id;				/* ida allocated in-hierarchy ID */

	/*
	 * We link our 'sibling' struct into our parent's 'children'.
	 * Our children link their 'sibling' into our 'children'.
	 */
	struct list_head sibling;	/* my parent's children */
	struct list_head children;	/* my children */
	struct list_head files;		/* my files */

	struct cgroup *parent;		/* my parent */
	struct dentry *dentry;		/* cgroup fs entry, RCU protected */

	/*
	 * This is a copy of dentry->d_name, and it's needed because
	 * we can't use dentry->d_name in cgroup_path().
	 *
	 * You must acquire rcu_read_lock() to access cgrp->name, and
	 * the only place that can change it is rename(), which is
	 * protected by parent dir's i_mutex.
	 *
	 * Normally you should use cgroup_name() wrapper rather than
	 * access it directly.
	 */
	struct cgroup_name __rcu *name;

	/* Private pointers for each registered subsystem */
	struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

	struct cgroupfs_root *root;	// points to the hierarchy structure

	/*
	 * List of cg_cgroup_links pointing at css_sets with
	 * tasks in this cgroup. Protected by css_set_lock
	 */
	struct list_head css_sets;

	struct list_head allcg_node;	/* cgroupfs_root->allcg_list */
	struct list_head cft_q_node;	/* used during cftype add/rm */

	/*
	 * Linked list running through all cgroups that can
	 * potentially be reaped by the release agent. Protected by
	 * release_list_lock
	 */
	struct list_head release_list;

	/*
	 * list of pidlists, up to two for each namespace (one for procs, one
	 * for tasks); created on demand.
	 */
	struct list_head pidlists;
	struct mutex pidlist_mutex;

	/* For RCU-protected deletion */
	struct rcu_head rcu_head;
	struct work_struct free_work;

	/* List of events which userspace want to receive */
	struct list_head event_list;
	spinlock_t event_list_lock;

	/* directory xattrs */
	struct simple_xattrs xattrs;
};

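For struct cgroup we only look at a few key fields (to be honest, I have not fully understood the others yet):

sibling, children, parent: link the cgroup to its parent, sibling and child cgroups
files, dentry: used by the filesystem
name: stores the cgroup's name, the same as dentry->d_name
subsys: array of cgroup_subsys_state pointers, one element per subsys
root: points to the cgroupfs_root; every cgroup filesystem has one cgroupfs_root
css_sets: list of the css_sets that this cgroup takes part in
allcg_node: links into cgroupfs_root->allcg_list
release_list: related to the release files in cgroup filesystem directories and to the cgroup release feature; not analyzed here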
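
How sibling, children and parent form the tree is easier to see in a small userspace model. The list_head helpers below are a cut-down copy of the kernel idiom written for this sketch, and toy_cgroup keeps only the tree-linking fields; none of this is the kernel's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

#define toy_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down cgroup with only the tree-linking fields. */
struct toy_cgroup {
    const char *name;
    struct toy_cgroup *parent;
    struct list_head sibling;    /* node in parent->children */
    struct list_head children;   /* head of this cgroup's child list */
};

static void toy_cgroup_init(struct toy_cgroup *cg, const char *name,
                            struct toy_cgroup *parent)
{
    assert(name != NULL);
    cg->name = name;
    cg->parent = parent;
    list_init(&cg->sibling);
    list_init(&cg->children);
    if (parent)
        list_add_tail(&cg->sibling, &parent->children);
}

/* Count all cgroups below cg by walking the children lists recursively. */
static int toy_count_descendants(const struct toy_cgroup *cg)
{
    int n = 0;
    const struct list_head *pos;

    for (pos = cg->children.next; pos != &cg->children; pos = pos->next) {
        const struct toy_cgroup *child =
            toy_entry(pos, struct toy_cgroup, sibling);
        n += 1 + toy_count_descendants(child);
    }
    return n;
}

/* Depth of cg below the root, following parent pointers. */
static int toy_depth(const struct toy_cgroup *cg)
{
    int d = 0;

    while (cg->parent) {
        cg = cg->parent;
        d++;
    }
    return d;
}
```

Note that a child hangs its sibling node, not its children head, onto the parent's children list, which is exactly what the comment in struct cgroup describes.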

cgroupfs_root

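struct cgroup describes one directory of the cgroup filesystem. A cgroup belongs to a hierarchy, and the hierarchy is described by a dedicated structure of its own, much as a filesystem is described by a super_block. This structure is called cgroupfs_root, and it is associated with the sb of the corresponding cgroup filesystem.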

/*
 * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
 * associated with a superblock to form an active hierarchy.  This is
 * internal to cgroup core.  Don't access directly from controllers.
 */
struct cgroupfs_root {
	struct super_block *sb;

	/*
	 * The bitmask of subsystems intended to be attached to this
	 * hierarchy
	 */
	unsigned long subsys_mask;

	/* Unique id for this hierarchy. */
	int hierarchy_id;

	/* The bitmask of subsystems currently attached to this hierarchy */
	unsigned long actual_subsys_mask;

	/* A list running through the attached subsystems */
	struct list_head subsys_list;

	/* The root cgroup for this hierarchy */
	struct cgroup top_cgroup;

	/* Tracks how many cgroups are currently defined in hierarchy.*/
	int number_of_cgroups;

	/* A list running through the active hierarchies */
	struct list_head root_list;

	/* All cgroups on this root, cgroup_mutex protected */
	struct list_head allcg_list;

	/* Hierarchy-specific flags */
	unsigned long flags;

	/* IDs for cgroups in this hierarchy */
	struct ida cgroup_ida;

	/* The path to use for release notifications. */
	char release_agent_path[PATH_MAX];

	/* The name for this hierarchy - may be empty */
	char name[MAX_CGROUP_ROOT_NAMELEN];
};

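sb: the super_block of the filesystem associated with this hierarchy
subsys_mask, actual_subsys_mask: subsys_mask is the bitmask of subsystems intended to be attached to this hierarchy, and actual_subsys_mask is the bitmask of those currently attached. Used together with the cgroup->subsys array of the cgroups in this hierarchy, they let a cgroup know which subsystems are attached to it.
subsys_list: list of the subsystems attached to this cgroupfs_root at run time
top_cgroup: the cgroup associated with the root directory
number_of_cgroups: total number of cgroups in this hierarchy
root_list: links this cgroupfs_root into the global list roots
allcg_list: list of all cgroups in this hierarchy
release_agent_path: related to release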
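
The two masks can be illustrated with a few lines of bit arithmetic. This is a userspace sketch under my own assumptions: the toy_ names and the global toy_bound_anywhere are invented, and the bind step only imitates the rule that a subsystem can be attached to at most one hierarchy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subsystem ids; the last entry is the count. */
enum { TOY_CPU_ID, TOY_CPUACCT_ID, TOY_MEM_ID, TOY_SUBSYS_COUNT };

struct toy_root {
    unsigned long subsys_mask;         /* subsystems requested for this hierarchy */
    unsigned long actual_subsys_mask;  /* subsystems currently bound to it */
};

/* Subsystems already owned by some hierarchy (assumption of this sketch). */
static unsigned long toy_bound_anywhere;

/* Bind every newly requested subsystem; fail if one is owned elsewhere. */
static bool toy_bind_requested(struct toy_root *r)
{
    unsigned long added = r->subsys_mask & ~r->actual_subsys_mask;

    if (added & toy_bound_anywhere)
        return false;              /* a subsystem lives in one hierarchy only */
    toy_bound_anywhere |= added;
    r->actual_subsys_mask |= added;
    return true;
}

/* Test one bit of the "currently attached" mask. */
static bool toy_is_attached(const struct toy_root *r, int id)
{
    assert(id >= 0 && id < TOY_SUBSYS_COUNT);
    return (r->actual_subsys_mask >> id) & 1UL;
}
```

A bit that is set in actual_subsys_mask tells you that the slot with the same index in cgroup->subsys is meaningful for cgroups of that hierarchy.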

cgroup_subsys

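cgroup_subsys describes a subsys. The kernel's subsystems are fixed in the code; they are not created dynamically. All subsystems are kept in the array subsys. Its elements are declared through linux/cgroup_subsys.h, while each element is actually defined in the source file of its own subsystem; for example, mem_cgroup_subsys, the subsystem for mem_cgroup, is defined in mm/memcontrol.c.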

static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};
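
Including a header in the middle of an array initializer looks odd at first; it is the usual X-macro trick. The sketch below reproduces the pattern in a single file, with a macro list standing in for linux/cgroup_subsys.h. The toy_ names and the three listed subsystems are assumptions for illustration, and as far as I understand the kernel redefines SUBSYS() in a similar way before each include.

```c
#include <assert.h>
#include <string.h>

struct toy_subsys { const char *name; int subsys_id; };

/* Stand-in for linux/cgroup_subsys.h: one SUBSYS() entry per subsystem. */
#define TOY_SUBSYS_LIST \
    SUBSYS(cpu)         \
    SUBSYS(cpuacct)     \
    SUBSYS(memory)

/* Pass 1: generate the ids and the count. */
#define SUBSYS(x) toy_##x##_id,
enum { TOY_SUBSYS_LIST TOY_SUBSYS_COUNT };
#undef SUBSYS

/* Pass 2: define each subsystem object (in the kernel, one per source file). */
#define SUBSYS(x) \
    static struct toy_subsys toy_##x##_subsys = { #x, toy_##x##_id };
TOY_SUBSYS_LIST
#undef SUBSYS

/* Pass 3: build the table indexed by id. */
#define SUBSYS(x) [toy_##x##_id] = &toy_##x##_subsys,
static struct toy_subsys *toy_subsys[TOY_SUBSYS_COUNT] = { TOY_SUBSYS_LIST };
#undef SUBSYS

/* Look a subsystem up by name; returns NULL if it is not in the list. */
static struct toy_subsys *toy_find_subsys(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < TOY_SUBSYS_COUNT; i++)
        if (strcmp(toy_subsys[i]->name, name) == 0)
            return toy_subsys[i];
    return NULL;
}
```

Adding a subsystem then means adding one SUBSYS() line; the id enum, the count and the table all follow automatically.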

struct cgroup_subsys {
	struct cgroup_subsys_state *(*css_alloc)(struct cgroup *cgrp);
	int (*css_online)(struct cgroup *cgrp);
	void (*css_offline)(struct cgroup *cgrp);
	void (*css_free)(struct cgroup *cgrp);

	int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	void (*attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
	RH_KABI_REPLACE(void (*fork)(struct task_struct *task),
			void (*fork)(struct task_struct *task, void *priv))
	void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
		     struct task_struct *task);
	void (*bind)(struct cgroup *root);

	int subsys_id;
	int disabled;
	int early_init;
	/*
	 * True if this subsys uses ID. ID is not available before cgroup_init()
	 * (not available in early_init time.)
	 */
	bool use_id;

	/*
	 * If %false, this subsystem is properly hierarchical -
	 * configuration, resource accounting and restriction on a parent
	 * cgroup cover those of its children.  If %true, hierarchy support
	 * is broken in some ways - some subsystems ignore hierarchy
	 * completely while others are only implemented half-way.
	 *
	 * It's now disallowed to create nested cgroups if the subsystem is
	 * broken and cgroup core will emit a warning message on such
	 * cases.  Eventually, all subsystems will be made properly
	 * hierarchical and this will go away.
	 */
	bool broken_hierarchy;
	bool warned_broken_hierarchy;

#define MAX_CGROUP_TYPE_NAMELEN 32
	const char *name;

	/*
	 * Link to parent, and list entry in parent's children.
	 * Protected by cgroup_lock()
	 */
	struct cgroupfs_root *root;
	struct list_head sibling;
	/* used when use_id == true */