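After a kernel upgrade, our test colleague ran the LTP suite (ltprun) and found that cgroup had stopped working once the run finished. The system otherwise looked healthy and the kernel logged no errors. I was not familiar with the cgroup implementation, so I dug through the code and eventually found that cgroup registers a notifier on the CPU hotplug notifier chain. I then went to see what LTP actually tests, and CPU hotplug was right there on the list. I should have checked the LTP test list first; it would have saved a lot of effort.
I won't repeat what cgroup is or what it is used for. This post analyzes parts of the cgroup implementation. cgroup can be split into three parts: the data structures that describe subsystems (subsys), cgroups and related objects and how they attach to one another; the filesystem interface exposed to user space; and the individual subsystems.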
Data structures
First, the concepts cgroup works with:
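- Task. In cgroups, a task is simply a process in the system.
- Control group. A control group is a set of processes grouped by some criterion. Resource control in cgroups is applied per control group. A process can join a control group and can migrate from one control group to another. The processes in a control group can use the resources that cgroups allocates to that group, and are subject to the limits that cgroups sets on that group.
- Hierarchy. Control groups can be organized hierarchically, that is, as a tree of control groups. A child control group in the tree inherits certain attributes of its parent control group.
- Subsystem. A subsystem is a resource controller; for example, the cpu subsystem controls how CPU time is allocated. A subsystem must be attached to a hierarchy to take effect. Once a subsystem is attached to a hierarchy, every control group in that hierarchy is controlled by it.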
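These concepts are related as follows:
- Whenever a new hierarchy is created, every task in the system starts out as a member of that hierarchy's default cgroup. We call it the root cgroup; it is created automatically together with the hierarchy, and every cgroup created later in the hierarchy is its descendant.
- A subsystem can be attached to at most one hierarchy.
- A hierarchy can have several subsystems attached.
- A task can be a member of several cgroups, but those cgroups must be in different hierarchies.
- When a process (task) creates a child process (task), the child automatically becomes a member of its parent's cgroup. It can later be moved to a different cgroup as needed, but it always starts out inheriting its parent's cgroup.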
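The membership rules above can be sketched as a small userspace model. This is only an illustration under my own assumptions: toy_cgroup, toy_task, toy_mount, toy_fork, toy_move and MAX_HIERARCHIES are hypothetical names, not kernel types or functions, and a "hierarchy" is reduced to an array index.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HIERARCHIES 4   /* hypothetical limit, for this sketch only */

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_cgroup {
    int hierarchy;              /* index of the hierarchy this group lives in */
    struct toy_cgroup *parent;  /* NULL for the root cgroup of the hierarchy */
};

struct toy_task {
    /* At most one cgroup per hierarchy; NULL if that hierarchy does not exist. */
    struct toy_cgroup *member_of[MAX_HIERARCHIES];
};

/* Creating a hierarchy: every existing task starts in its root cgroup. */
static void toy_mount(struct toy_task *tasks, size_t n, struct toy_cgroup *root)
{
    assert(root->parent == NULL);
    for (size_t i = 0; i < n; i++)
        tasks[i].member_of[root->hierarchy] = root;
}

/* fork: the child starts in exactly the cgroups of its parent. */
static void toy_fork(const struct toy_task *parent, struct toy_task *child)
{
    for (int h = 0; h < MAX_HIERARCHIES; h++)
        child->member_of[h] = parent->member_of[h];
}

/* Moving a task replaces its membership in the destination's hierarchy only. */
static void toy_move(struct toy_task *t, struct toy_cgroup *dst)
{
    t->member_of[dst->hierarchy] = dst;
}
```

Because membership is stored as one slot per hierarchy, "several cgroups, but each in a different hierarchy" holds by construction: moving a task inside hierarchy 0 never touches its slot for hierarchy 1.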
css_set
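Tasks and cgroups are many-to-many, a cgroup and subsystems are one-to-many, and tasks and subsystems are also many-to-many (a task can be attached to several cgroups, and one cgroup may have several subsystems and many tasks attached to it). These relationships are not easy to represent. If a task had to go through each of its cgroups to reach each subsys and then read its resource limits from there, that would be inefficient. From the task's point of view, however, the limits each subsystem imposes on it are fixed, so the kernel uses css_set to describe one combination of per-subsystem state. A task learns which limits apply to it through its css_set, which speeds up access, and because the number of distinct combinations is limited, this also keeps the kernel data structures simpler.
Now the css_set structure. A css_set represents one set of resource limits (for example, cpu 20% mem 40% and cpu 30% mem 20% are different limits and use different css_sets) and connects tasks to subsystems.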
struct css_set {
/* Reference count */
atomic_t refcount;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
*/
struct hlist_node hlist; // links into the global css_set hash table
/*
* List running through all tasks using this cgroup
* group. Protected by css_set_lock
*/
struct list_head tasks; // list of the tasks that reference this css_set
/*
* List of cg_cgroup_link objects on link chains from
* cgroups referenced from this css_set. Protected by
* css_set_lock
*/
struct list_head cg_links; // list of the cgroups associated with this css_set, connected through cg_cgroup_link structures
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
* during subsystem registration (at boot time) and modular subsystem
* loading/unloading.
*/
// Reaches the concrete subsys state behind this css_set; every subsys has a slot here, whether or not it is actually in use
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
/* For RCU-protected deletion */
struct rcu_head rcu_head;
};
// the task structure
struct task_struct {
...
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock */
struct css_set __rcu *cgroups; // points to the css_set associated with this task
/* cg_list protected by css_set_lock and tsk->alloc_lock */
struct list_head cg_list; // links into the tasks list of the associated css_set
#endif
...
};
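- The tasks list of a css_set holds every task that uses this css_set. The cgroups pointer in task_struct points to the task's css_set, and cg_list links the task into that css_set's tasks list.
- A css_set is linked through hlist into the global css_set_table hash table, which makes css_sets easy to look up.
- cg_links links all the cgroups related to this css_set. The cgroups are not linked to this list_head directly; they are connected through cg_cgroup_link structures.
- subsys is an array of subsys pointers; every subsys has an instance structure in it.
A css_set can be called a "cgroup group": it stands for a group of cgroups, and one css_set can be associated with many cgroups.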
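To make the sharing idea concrete, here is a minimal userspace sketch of "one css_set per distinct combination of subsys state". It is not the kernel's code: the kernel looks css_sets up through the css_set_table hash, while this sketch uses a linear scan over a small fixed array, and toy_css, toy_css_set, toy_find_css_set, TOY_SUBSYS_COUNT and TOY_MAX_SETS are all names I made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SUBSYS_COUNT 2   /* say cpu and memory; hypothetical */
#define TOY_MAX_SETS     8   /* hypothetical table size */

struct toy_css { int limit; };   /* stand-in for cgroup_subsys_state */

struct toy_css_set {
    int refcount;                              /* number of users of this set */
    struct toy_css *subsys[TOY_SUBSYS_COUNT];  /* one slot per subsystem */
};

static struct toy_css_set toy_sets[TOY_MAX_SETS];
static int toy_nr_sets;

/*
 * Return the css_set for exactly this combination of subsys states,
 * creating it on first use. Two tasks with the same combination share it.
 */
static struct toy_css_set *toy_find_css_set(struct toy_css *want[TOY_SUBSYS_COUNT])
{
    for (int i = 0; i < toy_nr_sets; i++) {
        if (memcmp(toy_sets[i].subsys, want, sizeof(toy_sets[i].subsys)) == 0) {
            toy_sets[i].refcount++;
            return &toy_sets[i];
        }
    }
    if (toy_nr_sets == TOY_MAX_SETS)
        return NULL;                 /* table full in this toy model */

    struct toy_css_set *cg = &toy_sets[toy_nr_sets++];
    cg->refcount = 1;
    memcpy(cg->subsys, want, sizeof(cg->subsys));
    return cg;
}
```

With this layout a task needs a single pointer to reach all of its per-subsystem state, and the number of css_set objects grows with the number of distinct combinations rather than with the number of tasks.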
cgroup_subsys_state
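Next, the cgroup_subsys_state structure. It is the bridge from a css_set to a concrete subsys instance. A css_set has an array of cgroup_subsys_state pointers with CGROUP_SUBSYS_COUNT elements, which means every subsys has a cgroup_subsys_state instance in it. This structure does not contain the actual control information, though. So where is that information? cgroup_subsys_state plays a role similar to kobject: it holds the information common to all subsystems. Each subsystem embeds it in its own instance structure, container_of recovers that instance structure from the embedded member, and the subsystem's private control information lives in the instance structure.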
/* Per-subsystem/per-cgroup state maintained by the system. */
struct cgroup_subsys_state {
/*
* The cgroup that this subsystem is attached to. Useful
* for subsystems that want to know about the cgroup
* hierarchy structure
*/
struct cgroup *cgroup; // the cgroup this subsys state is attached to
/*
* State maintained by the cgroup system to allow subsystems
* to be "busy". Should be accessed via css_get(),
* css_tryget() and css_put().
*/
atomic_t refcnt;
unsigned long flags;
/* ID for this css, if possible */
struct css_id __rcu *id;
/* Used to put @cgroup->dentry on the last css_put() */
struct work_struct dput_work;
};
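A task can find the cgroup it is attached to for a given subsystem through task_struct->cgroups->subsys[subsys_id]->cgroup, where cgroups is the css_set pointer in task_struct.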
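The embedding trick and the pointer chain can be tried out in user space. The following is a sketch under assumptions: toy_container_of is a simplified version of the kernel's container_of without its type checking, and toy_mem_state is a hypothetical controller structure I invented for illustration, not a real kernel type.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the container_of idiom. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_cgroup { const char *name; };

/* Common part shared by all subsystems, like cgroup_subsys_state. */
struct toy_css { struct toy_cgroup *cgroup; };

/* Hypothetical controller: private data plus the embedded common part. */
struct toy_mem_state {
    long limit_bytes;
    struct toy_css css;   /* embedded, not pointed to */
};

struct toy_css_set { struct toy_css *subsys[1]; };
struct toy_task    { struct toy_css_set *cgroups; };

/* task -> css_set -> common state -> enclosing controller structure */
static struct toy_mem_state *toy_mem_from_task(struct toy_task *t)
{
    struct toy_css *state = t->cgroups->subsys[0];
    return toy_container_of(state, struct toy_mem_state, css);
}

/* task -> css_set -> common state -> cgroup */
static struct toy_cgroup *toy_task_cgroup(struct toy_task *t, int subsys_id)
{
    return t->cgroups->subsys[subsys_id]->cgroup;
}
```

The css_set only stores pointers to the common part, so the core code never needs to know the layout of any controller; only the controller itself steps back to its own structure.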
cgroup
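struct cgroup describes a cgroup (one control group). In the cgroup filesystem, each directory is a control group. The main job of the cgroup structure is to tie css_sets and subsystems together. A task does not need a direct link to its cgroup, even though each directory in the filesystem contains a tasks file; a task reaches its cgroup through its css_set.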
struct cgroup {
unsigned long flags; /* "unsigned long" so bitops work */
/*
* count users of this cgroup. >0 means busy, but doesn't
* necessarily indicate the number of tasks in the cgroup
*/
atomic_t count;
int id; /* ida allocated in-hierarchy ID */
/*
* We link our 'sibling' struct into our parent's 'children'.
* Our children link their 'sibling' into our 'children'.
*/
struct list_head sibling; /* my parent's children */
struct list_head children; /* my children */
struct list_head files; /* my files */
struct cgroup *parent; /* my parent */
struct dentry *dentry; /* cgroup fs entry, RCU protected */
/*
* This is a copy of dentry->d_name, and it's needed because
* we can't use dentry->d_name in cgroup_path().
*
* You must acquire rcu_read_lock() to access cgrp->name, and
* the only place that can change it is rename(), which is
* protected by parent dir's i_mutex.
*
* Normally you should use cgroup_name() wrapper rather than
* access it directly.
*/
struct cgroup_name __rcu *name;
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
struct cgroupfs_root *root; // points to the hierarchy structure
/*
* List of cg_cgroup_links pointing at css_sets with
* tasks in this cgroup. Protected by css_set_lock
*/
struct list_head css_sets;
struct list_head allcg_node; /* cgroupfs_root->allcg_list */
struct list_head cft_q_node; /* used during cftype add/rm */
/*
* Linked list running through all cgroups that can
* potentially be reaped by the release agent. Protected by
* release_list_lock
*/
struct list_head release_list;
/*
* list of pidlists, up to two for each namespace (one for procs, one
* for tasks); created on demand.
*/
struct list_head pidlists;
struct mutex pidlist_mutex;
/* For RCU-protected deletion */
struct rcu_head rcu_head;
struct work_struct free_work;
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
/* directory xattrs */
struct simple_xattrs xattrs;
};
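We only look at a few key fields of struct cgroup (I have not fully understood the others yet):
- sibling, children, parent: link the cgroup to its sibling, child and parent cgroups
- files, dentry: used by the filesystem
- name: holds the cgroup's name, the same as dentry->d_name
- subsys: array of cgroup_subsys_state pointers, one element per subsys
- root: points to the cgroupfs_root; each cgroup filesystem has one cgroupfs_root
- css_sets: list of the css_sets this cgroup takes part in
- allcg_node: links into cgroupfs_root->allcg_list
- release_list: related to the release files in the cgroup filesystem directory and to the cgroup release feature; not analyzed here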
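The many-to-many relation between cgroup->css_sets and css_set->cg_links goes through cg_cgroup_link objects. The sketch below models that idea, assuming each link carries a pointer to both sides and sits on one chain per side. The kernel uses list_head for this; I use plain singly linked chains to keep the example short, and every toy_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cgroup;
struct toy_css_set;

/* One link object per (cgroup, css_set) pair; it sits on two chains at once. */
struct toy_link {
    struct toy_cgroup  *cgrp;
    struct toy_css_set *cg;
    struct toy_link *next_in_cgroup;    /* chain headed at cgroup->css_sets */
    struct toy_link *next_in_css_set;   /* chain headed at css_set->cg_links */
};

struct toy_cgroup  { struct toy_link *css_sets; };
struct toy_css_set { struct toy_link *cg_links; };

/* Record that css_set cg contains cgroup cgrp by pushing l onto both chains. */
static void toy_link_pair(struct toy_link *l, struct toy_cgroup *cgrp,
                          struct toy_css_set *cg)
{
    l->cgrp = cgrp;
    l->cg = cg;
    l->next_in_cgroup = cgrp->css_sets;
    cgrp->css_sets = l;
    l->next_in_css_set = cg->cg_links;
    cg->cg_links = l;
}

/* How many css_sets does this cgroup take part in? */
static int toy_count_css_sets(const struct toy_cgroup *cgrp)
{
    int n = 0;
    for (const struct toy_link *l = cgrp->css_sets; l; l = l->next_in_cgroup)
        n++;
    return n;
}

/* How many cgroups make up this css_set? */
static int toy_count_cgroups(const struct toy_css_set *cg)
{
    int n = 0;
    for (const struct toy_link *l = cg->cg_links; l; l = l->next_in_css_set)
        n++;
    return n;
}
```

Neither side stores an array of the other side; walking from a cgroup gives its css_sets, walking from a css_set gives its cgroups, and each pairing costs exactly one link object.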
cgroupfs_root
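struct cgroup describes one directory in the cgroup filesystem, and every cgroup belongs to a hierarchy. The hierarchy has a dedicated structure of its own, much as a filesystem is described by a super_block. That structure is called cgroupfs_root, and it is associated with the super_block (sb) of the corresponding cgroup filesystem.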
/*
 * A cgroupfs_root represents the root of a cgroup hierarchy, and may be
 * associated with a superblock to form an active hierarchy. This is
 * internal to cgroup core. Don't access directly from controllers.
*/
struct cgroupfs_root {
struct super_block *sb;
/*
* The bitmask of subsystems intended to be attached to this
* hierarchy
*/
unsigned long subsys_mask;
/* Unique id for this hierarchy. */
int hierarchy_id;
/* The bitmask of subsystems currently attached to this hierarchy */
unsigned long actual_subsys_mask;
/* A list running through the attached subsystems */
struct list_head subsys_list;
/* The root cgroup for this hierarchy */
struct cgroup top_cgroup;
/* Tracks how many cgroups are currently defined in hierarchy.*/
int number_of_cgroups;
/* A list running through the active hierarchies */
struct list_head root_list;
/* All cgroups on this root, cgroup_mutex protected */
struct list_head allcg_list;
/* Hierarchy-specific flags */
unsigned long flags;
/* IDs for cgroups in this hierarchy */
struct ida cgroup_ida;
/* The path to use for release notifications. */
char release_agent_path[PATH_MAX];
/* The name for this hierarchy - may be empty */
char name[MAX_CGROUP_ROOT_NAMELEN];
};
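- sb: the super_block of the filesystem associated with this hierarchy
- subsys_mask, actual_subsys_mask: bitmasks of subsystems. According to the comments in the struct, subsys_mask is the set intended to be attached to this hierarchy and actual_subsys_mask is the set currently attached. They are used together with the cgroup->subsys array of the cgroups in this hierarchy, so that a cgroup can tell which subsystems are attached to it.
- subsys_list: list of the subsystems attached to this cgroupfs_root at run time
- top_cgroup: the cgroup associated with the root directory
- number_of_cgroups: total number of cgroups in this hierarchy
- root_list: links this cgroupfs_root into the global list roots
- allcg_list: list of all cgroups in this hierarchy
- release_agent_path: related to release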
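How a bitmask like subsys_mask can encode "which subsystems are attached to this hierarchy", and how the "at most one hierarchy per subsystem" rule could be checked with it, is sketched below. This is a toy model, not the kernel's bind logic: toy_root, toy_root_has, toy_attach and the TOY_ ids are hypothetical, and a plain -1 stands in for a real error code.

```c
#include <assert.h>

/* Hypothetical subsystem ids; bit i of the mask stands for subsystem i. */
enum { TOY_CPU_ID, TOY_MEM_ID, TOY_BLKIO_ID, TOY_SUBSYS_COUNT };

struct toy_root { unsigned long subsys_mask; };

/* Is subsystem id attached to this hierarchy? Returns 0 or 1. */
static int toy_root_has(const struct toy_root *r, int id)
{
    return (int)((r->subsys_mask >> id) & 1UL);
}

/*
 * Attach subsystem id to roots[target]. Refuse with -1 if another
 * hierarchy already has it, because a subsystem can be attached to at
 * most one hierarchy.
 */
static int toy_attach(struct toy_root *roots, int nr_roots, int target, int id)
{
    assert(id >= 0 && id < TOY_SUBSYS_COUNT);
    for (int i = 0; i < nr_roots; i++)
        if (i != target && toy_root_has(&roots[i], id))
            return -1;
    roots[target].subsys_mask |= 1UL << id;
    return 0;
}
```

Several bits may be set in one mask (a hierarchy with several subsystems), while a given bit is never set in two masks at the same time.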
cgroup_subsys
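cgroup_subsys describes one subsys. The set of subsystems is fixed by the kernel source; subsystems are not created on the fly at run time. All of them are stored in the subsys array. The array elements are declared in linux/cgroup_subsys.h, while each element is actually defined in its own subsystem's file; for example mem_cgroup_subsys, the mem_cgroup subsystem, is defined in mm/memcontrol.c.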
static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};
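Filling an array by including a header inside the initializer is the X-macro pattern. Here is a self-contained sketch of it, assuming the header is essentially a list of SUBSYS(name) entries that gets expanded with different definitions of SUBSYS. Since a single block cannot include a second file, the TOY_SUBSYS_LIST macro stands in for the header; the toy_ names and the three example controllers are my own choices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the header: one SUBSYS(name) entry per controller. */
#define TOY_SUBSYS_LIST \
    SUBSYS(cpu)         \
    SUBSYS(memory)      \
    SUBSYS(blkio)

/* Expansion 1: generate one id per subsystem, plus the total count. */
#define SUBSYS(x) x##_subsys_id,
enum toy_subsys_id { TOY_SUBSYS_LIST TOY_SUBSYS_COUNT };
#undef SUBSYS

struct toy_subsys { const char *name; int subsys_id; };

/* Expansion 2: the objects. A real controller would define its own
 * object in its own source file; here they are generated in place. */
#define SUBSYS(x) static struct toy_subsys x##_subsys = { #x, x##_subsys_id };
TOY_SUBSYS_LIST
#undef SUBSYS

/* Expansion 3: the array, indexed by id through designated initializers. */
#define SUBSYS(x) [x##_subsys_id] = &x##_subsys,
static struct toy_subsys *toy_subsys_table[TOY_SUBSYS_COUNT] = { TOY_SUBSYS_LIST };
#undef SUBSYS

/* Look a subsystem up by name; returns NULL if there is no such subsystem. */
static struct toy_subsys *toy_find_subsys(const char *name)
{
    for (int i = 0; i < TOY_SUBSYS_COUNT; i++)
        if (strcmp(toy_subsys_table[i]->name, name) == 0)
            return toy_subsys_table[i];
    return NULL;
}
```

The benefit of the pattern is that the list of subsystems is written once; the ids, the count and the array all stay in sync when an entry is added or removed.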
struct cgroup_subsys {
struct cgroup_subsys_state *(*css_alloc)(struct cgroup *cgrp);
int (*css_online)(struct cgroup *cgrp);
void (*css_offline)(struct cgroup *cgrp);
void (*css_free)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
RH_KABI_REPLACE(void (*fork)(struct task_struct *task),
void (*fork)(struct task_struct *task, void *priv))
void (*exit)(struct cgroup *cgrp, struct cgroup *old_cgrp,
struct task_struct *task);
void (*bind)(struct cgroup *root);
int subsys_id;
int disabled;
int early_init;
/*
* True if this subsys uses ID. ID is not available before cgroup_init()
* (not available in early_init time.)
*/
bool use_id;
/*
* If %false, this subsystem is properly hierarchical -
* configuration, resource accounting and restriction on a parent
* cgroup cover those of its children. If %true, hierarchy support
* is broken in some ways - some subsystems ignore hierarchy
* completely while others are only implemented half-way.
*
* It's now disallowed to create nested cgroups if the subsystem is
* broken and cgroup core will emit a warning message on such
* cases. Eventually, all subsystems will be made properly
* hierarchical and this will go away.
*/
bool broken_hierarchy;
bool warned_broken_hierarchy;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
/*
* Link to parent, and list entry in parent's children.
* Protected by cgroup_lock()
*/
struct cgroupfs_root *root;
struct list_head sibling;
/* used when use_id == true */