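Linux process management and scheduling 5 (based on the 6.1 kernel) --- Process 0 on Linux
1. Three special processes
Linux has three special processes: the idle process (PID = 0), the init process (PID = 1), and kthreadd (PID = 2).
* The idle process is set up by the kernel itself and runs in kernel mode.
The idle process has pid=0. Its predecessor is the first process in the system, and it is the only one that is not produced by fork or kernel_thread. Once the system has been brought up, it turns into the idle (swapper) task that runs whenever the CPU has nothing else to do.
* The init process is created by idle through kernel_thread (user_mode_thread() in the 6.1 source). It finishes initialization in kernel space, loads the init program, and finally moves to user space.
It is created by process 0 and completes system initialization. It is the ancestor of all other user processes in the system.
Every user-space process in Linux descends from the init process. The kernel boots first, then starts the init process, which in turn starts the other system processes. Once boot is complete, init stays around as a daemon that supervises the rest of the system.
* kthreadd is created by idle through kernel_thread and always runs in kernel space. It is responsible for creating and managing all other kernel threads.
It loops in the kthreadd function, which services the requests queued on the global kthread_create_list. A kernel thread requested through kthread_create() is first added to this list and then created by kthreadd, so all such kernel threads have kthreadd as their direct or indirect parent.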
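The hand-off to kthreadd can be modelled in user space. The sketch below is not kernel code: `create_request`, `queue_create_request`, and `kthreadd_drain` are hypothetical names, and the fixed-size array stands in for kthread_create_list. It only illustrates the idea that callers queue a request and a single daemon creates the threads, which is why they all end up with PID 2 as parent.

```c
#include <assert.h>
#include <stddef.h>

#define KTHREADD_PID 2          /* PID of kthreadd after a normal boot */
#define MAX_REQUESTS 8          /* arbitrary limit for this model */

/* Hypothetical stand-in for an entry on kthread_create_list. */
struct create_request {
    const char *name;   /* name of the thread to be created */
    int parent_pid;     /* filled in when the request is serviced */
    int pid;            /* PID given to the new thread, 0 while pending */
};

static struct create_request *pending[MAX_REQUESTS];
static size_t n_pending;
static int next_pid = 3;        /* PIDs 0, 1 and 2 are already taken */

/* Caller side: queue a request instead of creating the thread directly. */
int queue_create_request(struct create_request *req)
{
    if (n_pending == MAX_REQUESTS)
        return -1;
    req->pid = 0;
    req->parent_pid = -1;
    pending[n_pending++] = req;
    return 0;
}

/* Daemon side: one pass of the loop, servicing every queued request. */
size_t kthreadd_drain(void)
{
    size_t created = 0;

    for (size_t i = 0; i < n_pending; i++) {
        pending[i]->parent_pid = KTHREADD_PID;
        pending[i]->pid = next_pid++;
        created++;
    }
    n_pending = 0;
    return created;
}
```

Queueing two requests and then calling `kthreadd_drain()` once leaves both with `parent_pid == 2`, which mirrors what `ps -ef` shows for kernel threads on a running system.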
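2. Creation of idle
On an SMP system each processor has its own run queue, and each run queue has its own idle process, so there are as many idle processes as there are processors. The system's idle time is simply the "run time" of the idle processes.
An embedded system boots from firmware into uboot, which loads the Linux kernel. The boot sequence ends once the designated shell is running, and from that point the user can work with Linux.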
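2.1 Context of process 0: the init_task descriptor
init_task is the prototype of the task_struct of every process and thread in the kernel. During kernel initialization a task_struct instance named init_task is built by static definition, and late in initialization rest_init() creates the kernel init thread and the kthreadd kernel thread.
- The kernel init thread eventually executes /sbin/init and becomes the root of all user-mode programs (as pstree shows), that is, the user-space init process. It starts out as a kernel thread created through kernel_thread. After finishing its initialization work it switches to user space and becomes the ancestor of all user processes.
- The kthreadd kernel thread becomes the parent of all other kernel-mode daemon threads. It services kthread_create_list and creates the kernel threads requested there, so all of them have kthreadd as their direct or indirect parent.
So init_task determines the "genes" of every process and thread in the system. After it completes initialization it becomes process 0, idle, and runs in kernel mode.
Once the init and kthreadd kernel threads have been created, the kernel schedules for the first time. From then on init_task serves as the task_struct of the idle process: whenever the system has nothing to do, the scheduler runs it, and it gives up the CPU by putting the processor to sleep in an endless loop. Looking at the init_task structure, its comm field is swapper, which is the name of the idle process.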
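When idle runs
init_task is given a priority of MAX_PRIO-20 (120, the same value a nice-0 task gets). In early versions idle took part in scheduling like any other task and only ran when no other process was runnable. In current versions idle is not queued on the run queue at all. Instead, the run queue structure holds an idle pointer to the idle process, and the scheduler switches to it when it finds the run queue empty.
In short, init_task is the process descriptor used by process 0 and the first process descriptor in Linux. It is not created through kernel_thread (and certainly not through fork). It is defined statically by the kernel developers.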
init/init_task.c
struct task_struct init_task
#ifdef CONFIG_ARCH_TASK_STRUCT_ON_STACK
__init_task_data
#endif
__aligned(L1_CACHE_BYTES)
= {
#ifdef CONFIG_THREAD_INFO_IN_TASK
.thread_info = INIT_THREAD_INFO(init_task),
.stack_refcount = REFCOUNT_INIT(1),
#endif
.__state = 0,
.stack = init_stack,
.usage = REFCOUNT_INIT(2),
.flags = PF_KTHREAD,
.prio = MAX_PRIO - 20,
.static_prio = MAX_PRIO - 20,
.normal_prio = MAX_PRIO - 20,
.policy = SCHED_NORMAL,
.cpus_ptr = &init_task.cpus_mask,
.user_cpus_ptr = NULL,
.cpus_mask = CPU_MASK_ALL,
.nr_cpus_allowed= NR_CPUS,
.mm = NULL,
.active_mm = &init_mm,
.restart_block = {
.fn = do_no_restart_syscall,
},
.se = {
.group_node = LIST_HEAD_INIT(init_task.se.group_node),
},
.rt = {
.run_list = LIST_HEAD_INIT(init_task.rt.run_list),
.time_slice = RR_TIMESLICE,
},
.tasks = LIST_HEAD_INIT(init_task.tasks),
#ifdef CONFIG_SMP
.pushable_tasks = PLIST_NODE_INIT(init_task.pushable_tasks, MAX_PRIO),
#endif
#ifdef CONFIG_CGROUP_SCHED
.sched_task_group = &root_task_group,
#endif
.ptraced = LIST_HEAD_INIT(init_task.ptraced),
.ptrace_entry = LIST_HEAD_INIT(init_task.ptrace_entry),
.real_parent = &init_task,
.parent = &init_task,
.children = LIST_HEAD_INIT(init_task.children),
.sibling = LIST_HEAD_INIT(init_task.sibling),
.group_leader = &init_task,
RCU_POINTER_INITIALIZER(real_cred, &init_cred),
RCU_POINTER_INITIALIZER(cred, &init_cred),
.comm = INIT_TASK_COMM,
.thread = INIT_THREAD,
.fs = &init_fs,
.files = &init_files,
#ifdef CONFIG_IO_URING
.io_uring = NULL,
#endif
.signal = &init_signals,
.sighand = &init_sighand,
.nsproxy = &init_nsproxy,
.pending = {
.list = LIST_HEAD_INIT(init_task.pending.list),
.signal = {{0}}
},
.blocked = {{0}},
.alloc_lock = __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
#ifdef CONFIG_AUDIT
.loginuid = INVALID_UID,
.sessionid = AUDIT_SID_UNSET,
#endif
#ifdef CONFIG_PERF_EVENTS
.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
#endif
#ifdef CONFIG_PREEMPT_RCU
.rcu_read_lock_nesting = 0,
.rcu_read_unlock_special.s = 0,
.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
.rcu_blocked_node = NULL,
#endif
#ifdef CONFIG_TASKS_RCU
.rcu_tasks_holdout = false,
.rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list),
.rcu_tasks_idle_cpu = -1,
#endif
#ifdef CONFIG_TASKS_TRACE_RCU
.trc_reader_nesting = 0,
.trc_reader_special.s = 0,
.trc_holdout_list = LIST_HEAD_INIT(init_task.trc_holdout_list),
.trc_blkd_node = LIST_HEAD_INIT(init_task.trc_blkd_node),
#endif
#ifdef CONFIG_CPUSETS
.mems_allowed_seq = SEQCNT_SPINLOCK_ZERO(init_task.mems_allowed_seq,
&init_task.alloc_lock),
#endif
#ifdef CONFIG_RT_MUTEXES
.pi_waiters = RB_ROOT_CACHED,
.pi_top_task = NULL,
#endif
INIT_PREV_CPUTIME(init_task)
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
.vtime.seqcount = SEQCNT_ZERO(init_task.vtime_seqcount),
.vtime.starttime = 0,
.vtime.state = VTIME_SYS,
#endif
#ifdef CONFIG_NUMA_BALANCING
.numa_preferred_nid = NUMA_NO_NODE,
.numa_group = NULL,
.numa_faults = NULL,
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
#endif
#ifdef CONFIG_KCSAN
.kcsan_ctx = {
.scoped_accesses = {LIST_POISON1, NULL},
},
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
.softirqs_enabled = 1,
#endif
#ifdef CONFIG_LOCKDEP
.lockdep_depth = 0, /* no locks held yet */
.curr_chain_key = INITIAL_CHAIN_KEY,
.lockdep_recursion = 0,
#endif
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
.ret_stack = NULL,
.tracing_graph_pause = ATOMIC_INIT(0),
#endif
#if defined(CONFIG_TRACING) && defined(CONFIG_PREEMPTION)
.trace_recursion = 0,
#endif
#ifdef CONFIG_LIVEPATCH
.patch_state = KLP_UNDEFINED,
#endif
#ifdef CONFIG_SECURITY
.security = NULL,
#endif
#ifdef CONFIG_SECCOMP_FILTER
.seccomp = { .filter_count = ATOMIC_INIT(0) },
#endif
};
EXPORT_SYMBOL(init_task);
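init_task is the first thread in the Linux kernel and is used throughout system initialization. It is also the only kernel-mode process (kernel thread) that is not created with kernel_thread().
Late in its execution, init_task creates the first kernel process, kernel_init, and carries on with initialization. When that is done, init_task degenerates into the idle process, which gets the CPU whenever the run queue of core 0 has no other runnable process. The newly created process 1, kernel_init, brings up the secondary CPUs one by one and finally creates the user process.
The idle process on core 0 is what init_task degenerates into, while the idle processes of the APs (secondary processors) are created one by one on the BSP (boot processor) later, through fork_idle() rather than an ordinary fork().
init_thread_union
The init_task process uses the memory described by init_thread_union as its stack. On configurations without CONFIG_THREAD_INFO_IN_TASK, its thread_info shares that same memory: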
extern struct thread_info init_thread_info;
init_thread_info is an architecture-specific definition found in [/arch/<architecture>/include/asm/thread_info.h]. Most architectures define it as follows:
#define init_thread_info (init_thread_union.thread_info)
#define init_stack (init_thread_union.stack)
The type of init_thread_union, union thread_union, is defined in include/linux/sched.h:
union thread_union {
#ifndef CONFIG_ARCH_TASK_STRUCT_ON_STACK
struct task_struct task;
#endif
#ifndef CONFIG_THREAD_INFO_IN_TASK
struct thread_info thread_info;
#endif
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
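A small user-space sketch can make the overlay concrete. It assumes the configuration in which thread_info is not embedded in task_struct. `demo_thread_info`, `DEMO_THREAD_SIZE`, and the helper functions are hypothetical names chosen for illustration; the real thread_info layout and THREAD_SIZE are architecture-specific.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_THREAD_SIZE 8192   /* assumed size; the real THREAD_SIZE is per-arch */

/* Hypothetical, cut-down thread_info; the kernel's layout is per-arch. */
struct demo_thread_info {
    unsigned long flags;
    int cpu;
};

/* Same shape as thread_union when thread_info lives in the stack area. */
union demo_thread_union {
    struct demo_thread_info thread_info;
    unsigned long stack[DEMO_THREAD_SIZE / sizeof(long)];
};

static union demo_thread_union demo_init_thread_union;

/* Returns 1 when thread_info sits at the lowest address of the stack area. */
int thread_info_overlays_stack(void)
{
    void *info = &demo_init_thread_union.thread_info;
    void *stack = demo_init_thread_union.stack;

    return info == stack;
}

/* With a downward-growing stack, the initial stack pointer is the top of the union. */
unsigned long *initial_stack_top(void)
{
    return demo_init_thread_union.stack + DEMO_THREAD_SIZE / sizeof(long);
}
```

Because both members start at the same address, the stack grows down from the top of the area towards the thread_info at the bottom, and the whole object occupies exactly `DEMO_THREAD_SIZE` bytes.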
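The __init_task_data annotation places the object in a dedicated section of the kernel image (.data..init_task). After building the kernel, the symbol and its address can be looked up in the generated System.map.
The virtual address space of init_task is defined statically in the same way.
Because init_task is a kernel thread running in kernel space, its mm is NULL. It still needs page tables to run on, so active_mm is set to init_mm.
init_mm is defined in mm/init-mm.c: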
/*
* For dynamically allocated mm_structs, there is a dynamically sized cpumask
* at the end of the structure, the size of which depends on the maximum CPU
* number the system can see. That way we allocate only as much memory for
* mm_cpumask() as needed for the hundreds, or thousands of processes that
* a system typically runs.
*
* Since there is only one init_mm in the entire system, keep it simple
* and size this cpu_bitmask to NR_CPUS.
*/
struct mm_struct init_mm = {
.mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, init_mm.mmap_lock),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
#ifdef CONFIG_PER_VMA_LOCK
.mm_lock_seq = 0,
#endif
.user_ns = &init_user_ns,
.cpu_bitmap = CPU_BITS_NONE,
#ifdef CONFIG_IOMMU_SVA
.pasid = IOMMU_PASID_INVALID,
#endif
INIT_MM_CONTEXT(init_mm)
};
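3. Evolution of process 0
3.1 rest_init creates the init process (PID = 1) and the kthreadd process (PID = 2)
Before any notion of a process exists, Linux runs from the early initialization code to start_kernel and then to the last function start_kernel calls, rest_init.
The vmlinux entry point (startup_32 in head.S) sets up the execution environment for the original process with pid 0, which then runs start_kernel() to initialize the kernel: page tables, the interrupt vector table, the system time, and so on.
Linux starts producing processes in rest_init. Because init_task is built statically with pid=0, everything executed from the earliest assembly code up to start_kernel is attributed to the init_task context.
rest_init is therefore executed by process 0, and it is in this function that process 0 creates the init process and the kthreadd process.
The code is in init/main.c: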
noinline void __ref __noreturn rest_init(void)
{
struct task_struct *tsk;
int pid;
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
/*
* Pin init on the boot CPU. Task migration is not properly working
* until sched_init_smp() has been run. It will set the allowed
* CPUs for init to the non isolated CPUs.
*/
rcu_read_lock();
tsk = find_task_by_pid_ns(pid, &init_pid_ns);
tsk->flags |= PF_NO_SETAFFINITY;
set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
rcu_read_unlock();
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
/*
* Enable might_sleep() and smp_processor_id() checks.
* They cannot be enabled earlier because with CONFIG_PREEMPTION=y
* kernel_thread() would trigger might_sleep() splats. With
* CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
* already, but it's stuck on the kthreadd_done completion.
*/
system_state = SYSTEM_SCHEDULING;
complete(&kthreadd_done);
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
}
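1. user_mode_thread() creates kernel thread 1 (kernel_init), which later switches to user space and becomes the init process.
2. kernel_thread() creates the kthreadd kernel thread.
3. complete(&kthreadd_done) tells kernel_init, which waits on this completion, that kthreadd exists and that it may continue.
4. Older kernels called init_idle_bootup_task() at this point to move init_task into the idle scheduling class, i.e. to select the idle scheduling functions, because process 0 ends up as the idle process. The 6.1 code shown above has no such call; the boot task is already assigned to the idle class by init_idle() during sched_init().
5. schedule_preempt_disabled() switches away from the current process. Before this call only process 0 (init_task) has run; process 1 (kernel_init) and process 2 (kthreadd) have only just been created. After the call, process 1 kernel_init gets to run.
6. cpu_startup_entry() puts process 0 into the idle loop, where it keeps checking whether another task needs the CPU.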
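In rest_init, the kernel produces the first real process (pid=1) with the following code:
pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
This is the init process with pid 1. It completes the remaining initialization work and then executes /sbin/init, becoming the ancestor of all other processes in the system.
Older kernels performed this last step in init_post(); in 6.1, kernel_init() itself tries the candidate init programs through run_init_process().
This raises a question: init should be a user-space process, yet here it is created as a kernel thread. Would that not make it a kernel thread that stays in kernel mode forever? How does it turn into a real user-space init process?
After kernel_init (process 1) has finished configuring Linux (including starting the APs), it looks for an init program under /sbin, /etc, and /bin. That program replaces kernel_init. No new process is created to run init. Instead the process transforms itself: an execve performed inside the kernel replaces the text of the kernel process, turning kernel_init into the user process init. From then on process 1 runs in user space. A SysV-style init then initializes applications according to /etc/inittab, and eventually a shell such as /bin/sh gives the user a way to interact with the system.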
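The claim that exec replaces the program without creating a new process can be checked from user space. The sketch below is only an analogy to what kernel_init does, not the kernel code path. It assumes a POSIX system with /bin/sh available, and `exec_keeps_pid` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, let it exec /bin/sh, and have the shell print its own PID.
 * Returns 1 if that PID equals the PID fork() reported for the child,
 * 0 if it differs, and -1 on any error.
 */
int exec_keeps_pid(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: send stdout into the pipe, then replace the program image. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execl("/bin/sh", "sh", "-c", "echo $$", (char *)NULL);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: read the PID the exec'ed shell reports about itself. */
    close(fds[1]);
    char buf[32] = {0};
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    close(fds[0]);

    int status;
    if (waitpid(child, &status, 0) < 0 || n <= 0)
        return -1;

    return atol(buf) == (long)child;
}
```

The child is a copy of the calling program until `execl()` runs; afterwards the same PID belongs to the shell. Process 1 goes through the same kind of change when kernel_init executes the init program.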
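2. Creating kthreadd
In rest_init, the kernel creates kthreadd (pid=2) with the following code:
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
Its job is to create and manage the other kernel threads. It loops in the kthreadd function, servicing the global kthread_create_list. A kernel thread requested through kthread_create() is added to this list and then created by kthreadd, so all such kernel threads have kthreadd as their direct or indirect parent.
3.2 Process 0 becomes idle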
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
Looking back at the process with pid=0: after creating the init process, it calls cpu_startup_entry() and becomes the idle process.
kernel/sched/idle.c
void cpu_startup_entry(enum cpuhp_state state)
{
current->flags |= PF_IDLE;
arch_cpu_idle_prepare();
cpuhp_online_idle(state);
while (1)
do_idle();
}
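In short, the original process (pid=0) creates the init process (pid=1) and then turns into the idle process (pid=0). The init process creates one idle process (pid=0) for each secondary processor (run queue) and then turns into /sbin/init.
4. Running and scheduling idle
4.1 The idle workload: do_idle (cpu_idle_loop in older kernels)
As shown above, idle is only scheduled when no other runnable process exists. On both the boot processor and the secondary processors it ends up executing do_idle, which is the event loop of the idle process.
Because the idle process does no useful work, two things matter most:
- saving power
- low exit latency
The code is in kernel/sched/idle.c: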
/*
* Generic idle loop implementation
*
* Called with polling cleared.
*/
static void do_idle(void)
{
int cpu = smp_processor_id();
/*
* Check if we need to update blocked load
*/
nohz_run_idle_balance(cpu);
/*
* If the arch has a polling bit, we maintain an invariant:
*
* Our polling bit is clear if we're not scheduled (i.e. if rq->curr !=
* rq->idle). This means that, if rq->idle has the polling bit set,
* then setting need_resched is guaranteed to cause the CPU to
* reschedule.
*/
__current_set_polling();
tick_nohz_idle_enter();
while (!need_resched()) {
rmb();
local_irq_disable();
if (cpu_is_offline(cpu)) {
tick_nohz_idle_stop_tick();
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}
arch_cpu_idle_enter();
rcu_nocb_flush_deferred_wakeup();
/*
* In poll mode we reenable interrupts and spin. Also if we
* detected in the wakeup from idle path that the tick
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
tick_nohz_idle_restart_tick();
cpu_idle_poll();
} else {
cpuidle_idle_call();
}
arch_cpu_idle_exit();
}
/*
* Since we fell out of the loop above, we know TIF_NEED_RESCHED must
* be set, propagate it into PREEMPT_NEED_RESCHED.
*
* This is required because for polling idle loops we will not have had
* an IPI to fold the state for us.
*/
preempt_set_need_resched();
tick_nohz_idle_exit();
__current_clr_polling();
/*
* We promise to call sched_ttwu_pending() and reschedule if
* need_resched() is set while polling is set. That means that clearing
* polling needs to be visible before doing these things.
*/
smp_mb__after_atomic();
/*
* RCU relies on this call to be done outside of an RCU read-side
* critical section.
*/
flush_smp_call_function_queue();
schedule_idle();
if (unlikely(klp_patch_pending(current)))
klp_update_patch_state(current);
}
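The loop keeps checking need_resched to keep exit latency low, and calls into the architecture's idle routine to save power.
On x86 the default idle implementation is the hlt instruction, which halts the CPU until a hardware interrupt arrives and thereby saves power.
1. idle is a process with pid 0.
2. The idle of the boot processor evolves from the original process (pid=0). The idle processes of the secondary processors are created by the init process through fork_idle(), but they all have pid 0 as well.
3. The idle process sits below every other task: it is not queued on the run queue and is only picked when the run queue is empty.
4. The idle loop waits for need_resched to be set and by default uses hlt to save power.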
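The shape of the idle loop can be reduced to a few lines. The following is a user-space model under simplifying assumptions, not the kernel implementation: `demo_cpu`, `demo_halt_until_interrupt`, and `demo_do_idle` are hypothetical names, and an "interrupt" is simulated by a counter instead of real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model; not a kernel structure. */
struct demo_cpu {
    bool need_resched;      /* set when another task becomes runnable */
    int halts;              /* how many times the CPU "slept" */
    int wakeup_after;       /* interrupt number that makes a task runnable */
};

/* Stand-in for hlt: sleep until the next interrupt, which may request a reschedule. */
void demo_halt_until_interrupt(struct demo_cpu *cpu)
{
    cpu->halts++;
    if (cpu->halts >= cpu->wakeup_after)
        cpu->need_resched = true;
}

/* One pass of the idle loop: sleep while nothing is runnable, then give up the CPU. */
int demo_do_idle(struct demo_cpu *cpu)
{
    while (!cpu->need_resched)
        demo_halt_until_interrupt(cpu);

    cpu->need_resched = false;  /* the scheduler consumes the request */
    return cpu->halts;
}
```

With `wakeup_after = 3` the model halts three times before leaving the loop. If `need_resched` is already set on entry, it never halts at all, which is the low-exit-latency half of the trade-off.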
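4.2 Scheduling idle and when it runs
Linux picks tasks in the order of the scheduling classes: real-time processes (rt class) first, then normal processes (CFS), and finally idle. (The stop and deadline classes rank above rt.)
So idle can only be scheduled when neither rt nor CFS has a runnable task. How is that implemented?
The class for normal processes is the CFS fair scheduler in kernel/sched/fair.c, where we find: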
const struct sched_class fair_sched_class;
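That is, if there is no normal process in the system, the scheduler moves on to the class with the next lower priority, i.e. to tasks handled by the idle_sched_class scheduling class.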
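The fallback through the classes can be sketched as a loop over an ordered table of pick functions. This is a simplified user-space model: the `demo_*` types and `pick_rt`, `pick_fair`, `pick_idle` are hypothetical names, each class is reduced to a single slot, and the stop and deadline classes are left out.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task { const char *comm; };

/* Hypothetical run queue: one slot per class plus the always-present idle task. */
struct demo_rq {
    struct demo_task *rt;       /* next real-time task, or NULL */
    struct demo_task *fair;     /* next CFS task, or NULL */
    struct demo_task *idle;     /* never NULL */
};

typedef struct demo_task *(*pick_fn)(struct demo_rq *);

static struct demo_task *pick_rt(struct demo_rq *rq)   { return rq->rt; }
static struct demo_task *pick_fair(struct demo_rq *rq) { return rq->fair; }
static struct demo_task *pick_idle(struct demo_rq *rq) { return rq->idle; }

/* Highest-priority class first; the idle class is last and always succeeds. */
static const pick_fn demo_classes[] = { pick_rt, pick_fair, pick_idle };

struct demo_task *demo_pick_next_task(struct demo_rq *rq)
{
    for (size_t i = 0; i < sizeof(demo_classes) / sizeof(demo_classes[0]); i++) {
        struct demo_task *p = demo_classes[i](rq);

        if (p)
            return p;
    }
    return NULL;    /* not reached while rq->idle is set */
}
```

With both the rt and fair slots empty the function returns `rq->idle`; as soon as a fair or rt task appears, that task wins instead. Idle is therefore never compared against other tasks; it is simply what remains when every other class has nothing to offer.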
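When the system is idle, the pick function of the idle class is what finally gets called. The entry point is pick_next_task in kernel/sched/core.c. The version below is the core-scheduling variant, which simply defers to __pick_next_task() when core scheduling is disabled: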
static struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct task_struct *next, *p, *max = NULL;
const struct cpumask *smt_mask;
bool fi_before = false;
bool core_clock_updated = (rq == rq->core);
unsigned long cookie;
int i, cpu, occ = 0;
struct rq *rq_i;
bool need_sync;
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);
cpu = cpu_of(rq);
/* Stopper task is switching into idle, no need core-wide selection. */
if (cpu_is_offline(cpu)) {
/*
* Reset core_pick so that we don't enter the fastpath when
* coming online. core_pick would already be migrated to
* another cpu during offline.
*/
rq->core_pick = NULL;
return __pick_next_task(rq, prev, rf);
}
/*
* If there were no {en,de}queues since we picked (IOW, the task
* pointers are all still valid), and we haven't scheduled the last
* pick yet, do so now.
*
* rq->core_pick can be NULL if no selection was made for a CPU because
* it was either offline or went offline during a sibling's core-wide
* selection. In this case, do a core-wide selection.
*/
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
rq->core->core_pick_seq != rq->core_sched_seq &&
rq->core_pick) {
WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
next = rq->core_pick;
if (next != prev) {
put_prev_task(rq, prev);
set_next_task(rq, next);
}
rq->core_pick = NULL;
goto out;
}
put_prev_task_balance(rq, prev, rf);
smt_mask = cpu_smt_mask(cpu);
need_sync = !!rq->core->core_cookie;
/* reset state */
rq->core->core_cookie = 0UL;
if (rq->core->core_sibidle_count) {
if (!core_clock_updated) {
update_rq_clock(rq->core);
core_clock_updated = true;
}
sched_core_account_sibidle(rq);
/* reset after accounting force idle */
rq->core->core_sibidle_start = 0;
rq->core->core_sibidle_start_task = 0;
rq->core->core_sibidle_count = 0;
rq->core->core_sibidle_occupation = 0;
if (rq->core->core_forceidle_count) {
rq->core->core_forceidle_count = 0;
need_sync = true;
fi_before = true;
}
}
/*
* core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
*
* @task_seq guards the task state ({en,de}queues)
* @pick_seq is the @task_seq we did a selection on
* @sched_seq is the @pick_seq we scheduled
*
* However, preemptions can cause multiple picks on the same task set.
* 'Fix' this by also increasing @task_seq for every pick.
*/
rq->core->core_task_seq++;
/*
* Optimize for common case where this CPU has no cookies
* and there are no cookied tasks running on siblings.
*/
if (!need_sync) {
next = pick_task(rq);
if (!next->core_cookie) {
rq->core_pick = NULL;
/*
* For robustness, update the min_vruntime_fi for
* unconstrained picks as well.
*/
WARN_ON_ONCE(fi_before);
task_vruntime_update(rq, next, false);
goto out_set_next;
}
}
/*
* For each thread: do the regular task pick and find the max prio task
* amongst them.
*
* Tie-break prio towards the current CPU
*/
for_each_cpu_wrap(i, smt_mask, cpu) {
rq_i = cpu_rq(i);
/*
* Current cpu always has its clock updated on entrance to
* pick_next_task(). If the current cpu is not the core,
* the core may also have been updated above.
*/
if (i != cpu && (rq_i != rq->core || !core_clock_updated))
update_rq_clock(rq_i);
p = rq_i->core_pick = pick_task(rq_i);
if (!max || prio_less(max, p, fi_before))
max = p;
}
cookie = rq->core->core_cookie = max->core_cookie;
/*
* For each thread: try and find a runnable task that matches @max or
* force idle.
*/
for_each_cpu(i, smt_mask) {
rq_i = cpu_rq(i);
p = rq_i->core_pick;
if (!cookie_equals(p, cookie)) {
p = NULL;
if (cookie)
p = sched_core_find(rq_i, cookie);
if (!p)
p = idle_sched_class.pick_task(rq_i);
}
rq_i->core_pick = p;
if (p == rq_i->idle) {
rq->core->core_sibidle_count++;
if (rq_i->nr_running) {
rq->core->core_forceidle_count++;
if (!fi_before)
rq->core->core_forceidle_seq++;
}
} else {
occ++;
}
}
if (schedstat_enabled() && rq->core->core_sibidle_count) {
rq->core->core_sibidle_start = rq_clock(rq->core);
rq->core->core_sibidle_start_task = rq_clock_task(rq->core);
rq->core->core_sibidle_occupation = occ;
}
rq->core->core_pick_seq = rq->core->core_task_seq;
next = rq->core_pick;
rq->core_sched_seq = rq->core->core_pick_seq;
/* Something should have been selected for current CPU */
WARN_ON_ONCE(!next);
/*
* Reschedule siblings
*
* NOTE: L1TF -- at this point we're no longer running the old task and
* sending an IPI (below) ensures the sibling will no longer be running
* their task. This ensures there is no inter-sibling overlap between
* non-matching user state.
*/
for_each_cpu(i, smt_mask) {
rq_i = cpu_rq(i);
/*
* An online sibling might have gone offline before a task
* could be picked for it, or it might be offline but later
* happen to come online, but its too late and nothing was
* picked for it. That's Ok - it will pick tasks for itself,
* so ignore it.
*/
if (!rq_i->core_pick)
continue;
/*
* Update for new !FI->FI transitions, or if continuing to be in !FI:
* fi_before fi update?
* 0 0 1
* 0 1 1
* 1 0 1
* 1 1 0
*/
if (!(fi_before && rq->core->core_forceidle_count))
task_vruntime_update(rq_i, rq_i->core_pick, !!rq->core->core_forceidle_count);
if (rq->core->core_forceidle_count)
rq_i->core_pick->core_occupation = occ;
if (i == cpu) {
rq_i->core_pick = NULL;
continue;
}
/* Did we break L1TF mitigation requirements? */
WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
if (rq_i->curr == rq_i->core_pick) {
rq_i->core_pick = NULL;
continue;
}
resched_curr(rq_i);
}
out_set_next:
set_next_task(rq, next);
out:
if (rq->core->core_forceidle_count && next == rq->idle)
queue_core_balance(rq);
return next;
}
During start_kernel, sched_init() calls init_idle(), which installs the current process (process 0) as the idle task of the boot CPU's run queue:
rq->curr = rq->idle = idle;
Here idle is the process that calls start_kernel, i.e. process 0.
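5. Summary
The system lets a process create new processes, its children, and those children can create children of their own, which yields a process tree. All processes of a Linux system form such a tree. Its root is constructed by the system itself (or, put differently, written by hand by the kernel developers): process 0, which runs in kernel mode and is the remote ancestor of all processes.
On an SMP system each processor has its own run queue, and each run queue has its own idle process, so there are as many idle processes as there are processors.
1. The idle process has pid=0. Its predecessor is the first process in the system (which we call init_task), and it is the only one that is not produced by fork or kernel_thread.
2. init_task is the prototype of the task_struct of every process and thread in the kernel. A task_struct instance named init_task is built by static definition during kernel initialization. Late in initialization, rest_init() creates two kernel threads, the kernel init thread and the kthreadd kernel thread. The former later moves to user space and becomes the ancestor of all user processes; the latter becomes the parent of all other kernel-mode daemon threads and takes over the creation of kernel threads.
3. init_task then becomes the idle process, which belongs to the idle scheduling class (idle_sched_class). At this point the system has only three processes: 0 (idle), 1 (init), and 2 (kthreadd). Process 0 then performs one scheduling pass, which switches the CPU to init.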