Chapter 3. Process Management
[1] The other fundamental abstraction is files.
[2] The kernel implements the wait4() system call. Linux systems, via the C library, typically provide the wait(), waitpid(), wait3(), and wait4() functions. All these functions return status about a terminated process, albeit with slightly different semantics.
Process Descriptor and the Task Structure
[3] Some texts on operating system design call this list the task array. Because the Linux implementation is a linked list and not a static array, it is called the task list.
Figure 3.1. The process descriptor and task list.
Allocating the Process Descriptor
[4] Register-impaired architectures were not the only reason for creating struct thread_info.
Figure 3.2. The process descriptor and kernel stack.
struct thread_info {
        struct task_struct    *task;
        struct exec_domain    *exec_domain;
        unsigned long         flags;
        unsigned long         status;
        __u32                 cpu;
        __s32                 preempt_count;
        mm_segment_t          addr_limit;
        struct restart_block  restart_block;
        unsigned long         previous_esp;
        __u8                  supervisor_stack[0];
};
Storing the Process Descriptor
[5] An opaque type is a data type whose physical representation is unknown or irrelevant.
movl $-8192, %eax
andl %esp, %eax
current_thread_info()->task;
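The two-instruction sequence above can be reproduced in ordinary C to see the arithmetic. This is a minimal user-space sketch, assuming 8KB kernel stacks (which is what the $-8192 mask implies): clearing the low 13 bits of the stack pointer gives the lowest address of the stack, where thread_info is stored. The name thread_info_base() and the sample address are illustrative assumptions, not kernel code.

```c
#include <stdint.h>

/* Assumption: 8KB kernel stacks, matching the $-8192 mask in the assembly. */
#define THREAD_SIZE 8192

/*
 * Hypothetical helper: round a stack pointer down to the start of its
 * THREAD_SIZE-aligned stack. ~(THREAD_SIZE - 1) is the same bit pattern
 * as -8192, so this mirrors "movl $-8192, %eax; andl %esp, %eax".
 */
static inline uintptr_t thread_info_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}
```

For instance, a stack pointer of 0xC0123ABC maps to 0xC0122000. The trick works only because each kernel stack is aligned to its own size, so every address inside the stack shares the same high bits.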
Process State
-
TASK_RUNNING The process is runnable; it is either currently running or on a runqueue waiting to run (runqueues are discussed in Chapter 4, "Scheduling"). This is the only possible state for a process executing in user-space; it can also apply to a process in kernel-space that is actively running.
TASK_UNINTERRUPTIBLE This state is identical to TASK_INTERRUPTIBLE except that it does not wake up and become runnable if it receives a signal. This is used in situations where the process must wait without interruption or when the event is expected to occur quite quickly. Because the task does not respond to signals in this state, TASK_UNINTERRUPTIBLE is less often used than TASK_INTERRUPTIBLE[6].
[6] This is why you have those dreaded unkillable processes with state D in ps(1). Because the task will not respond to signals, you cannot send it a SIGKILL signal. Further, even if you could terminate the task, it would not be wise as the task is supposedly in the middle of an important operation and may hold a semaphore.
Figure 3.3. Flow chart of process states.
Manipulating the Current Process State
set_task_state(task, state); /* set task 'task' to state 'state' */
task->state = state;
Process Context
[7] Other than process context there is interrupt context, which we discuss in Chapter 6, "Interrupts and Interrupt Handlers." In interrupt context, the system is not running on behalf of a process, but is executing an interrupt handler. There is no process tied to interrupt handlers and consequently no process context.
The Process Family Tree
struct task_struct *my_parent = current->parent;
struct task_struct *task;
struct list_head *list;

list_for_each(list, &current->children) {
        task = list_entry(list, struct task_struct, sibling);
        /* task now points to one of current's children */
}
struct task_struct *task;

for (task = current; task != &init_task; task = task->parent)
        ;
/* task now points to init */
list_entry(task->tasks.next, struct task_struct, tasks)
list_entry(task->tasks.prev, struct task_struct, tasks)
struct task_struct *task;

for_each_process(task) {
        /* this pointlessly prints the name and PID of each task */
        printk("%s[%d]\n", task->comm, task->pid);
}
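The list_entry() calls above rely on an embedded list node: given a pointer to the node, the macro subtracts the node's offset to recover the enclosing structure. The following is a user-space sketch of that idea only. The struct task type, list_init(), list_add_tail(), and sum_child_pids() are hypothetical stand-ins written for this example; they are not the kernel's definitions.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded node. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, much-reduced stand-in for task_struct. */
struct task {
    int pid;
    struct list_head children;   /* head of this task's list of children */
    struct list_head sibling;    /* this task's link in its parent's list */
};

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the children list the way list_for_each() does and sum the PIDs. */
int sum_child_pids(struct task *parent)
{
    int sum = 0;
    struct list_head *pos;

    for (pos = parent->children.next; pos != &parent->children; pos = pos->next) {
        struct task *child = list_entry(pos, struct task, sibling);
        sum += child->pid;
    }
    return sum;
}
```

Note that the list links children through their sibling member, while the list head lives in the parent's children member, which is why the loop passes sibling to list_entry().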
Process Creation
[8] By exec() I mean any member of the exec() family of functions. The kernel implements the execve() system call on top of which execlp(), execle(), execv(), and execvp() are implemented.
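From user-space, process creation usually looks like the following: fork() creates the child, a member of the exec() family loads a new program into it, and the parent collects the result with waitpid(). This is a minimal sketch; the helper name spawn_and_wait() and the choice of 127 as the "exec failed" status are assumptions made for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program at 'path' with arguments 'argv'
 * in a child process and return its exit status, or -1 on failure.
 */
int spawn_and_wait(const char *path, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;                /* fork() failed */

    if (pid == 0) {
        execv(path, argv);        /* replaces the child's program image */
        _exit(127);               /* reached only if execv() failed */
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as spawn_and_wait("/bin/sh", argv) with argv holding "sh", "-c", "exit 7", and a terminating NULL returns 7.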
Copy-on-Write
fork()
The fork(), vfork(), and __clone() library calls all invoke the clone() system call with the requisite flags (see "The Linux Implementation of Threads" later in this chapter for more about the flags). The clone() system call, in turn, calls do_fork().
The bulk of the work in forking is handled by do_fork(), which is defined in kernel/fork.c. This function calls copy_process(), and then starts the process running. The interesting work is done by copy_process():
-
It calls dup_task_struct(), which creates a new kernel stack, thread_info structure, and task_struct for the new process. The new values are identical to those of the current task. At this point, the child and parent process descriptors are identical.
-
It then checks that the new child will not exceed the resource limits on the number of processes for the current user.
-
Now the child needs to differentiate itself from its parent. Various members of the process descriptor are cleared or set to initial values. Members of the process descriptor that are not inherited are primarily statistical information. The bulk of the data in the process descriptor is shared.
-
Next, the child's state is set to TASK_UNINTERRUPTIBLE, to ensure that it does not yet run.
-
Now, copy_process() calls copy_flags() to update the flags member of the task_struct. The PF_SUPERPRIV flag, which denotes whether a task used super-user privileges, is cleared. The PF_FORKNOEXEC flag, which denotes a process that has not called exec(), is set.
-
Next, it calls get_pid() to assign an available PID to the new task.
-
Depending on the flags passed to clone(), copy_process() then either duplicates or shares open files, filesystem information, signal handlers, process address space, and namespace. These resources are typically shared between threads in a given process; otherwise they are unique and thus copied here.
-
Next, the remaining timeslice between the parent and its child is split between the two (this is discussed in Chapter 4).
-
Finally, copy_process() cleans up and returns to the caller a pointer to the new child.
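The effect of these steps is visible from user-space: the child starts as a copy of its parent, receives its own PID, and from then on writes only to its own copy of the data. The sketch below checks those three properties with plain fork(). The function name child_differs_from_parent() and the marker variable are assumptions made for the example; nothing here inspects the kernel's internal structures.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int marker = 10;

/*
 * Hypothetical check: returns 1 if the child inherited the parent's data,
 * got a distinct PID, and its later write stayed private to the child.
 */
int child_differs_from_parent(void)
{
    int status;
    pid_t parent_pid = getpid();
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        int ok = (marker == 10)               /* data was inherited */
              && (getpid() != parent_pid)     /* child has a new PID */
              && (getppid() == parent_pid);   /* and knows its parent */
        marker = 20;                          /* write lands in the child's copy */
        _exit(ok ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    /* The parent's copy of marker is untouched by the child's write. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0 && marker == 10;
}
```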
Back in do_fork(), if copy_process() returns successfully, the new child is woken up and run. Deliberately, the kernel runs the child process first [9]. In the common case of the child simply calling exec() immediately, this eliminates any copy-on-write overhead that would occur if the parent ran first and began writing to the address space.
vfork()
The vfork() system call has the same effect as fork(), except that the page table entries of the parent process are not copied. Instead, the child executes as the sole thread in the parent's address space, and the parent is blocked until the child either calls exec() or exits. The child is not allowed to write to the address space. This was a welcome optimization in the old days of 3BSD when the call was introduced because at the time copy-on-write pages were not used to implement fork(). Today, with copy-on-write and child-runs-first semantics, the only benefit to vfork() is not copying the parent's page table entries. If Linux one day gains copy-on-write page table entries there will no longer be any benefit [10]. Because the semantics of vfork() are tricky (what, for example, happens if the exec() fails?) it would be nice if vfork() died a slow painful death. It is entirely possible to implement vfork() as a normal fork(); in fact, this is what Linux did until 2.2.
[10] In fact, there are currently patches to add this functionality to Linux. In time, this feature will most likely find its way into the mainline Linux kernel.
The vfork() system call is implemented via a special flag to the clone() system call:
-
In copy_process(), the task_struct member vfork_done is set to NULL.
-
In do_fork(), if the special flag was given, vfork_done is pointed at a specific address.
-
After the child is first run, the parent, instead of returning, waits for the child to signal it through the vfork_done pointer.
-
In the mm_release() function, which is used when a task exits a memory address space, vfork_done is checked to see whether it is NULL. If it is not, the parent is signaled.
-
Back in do_fork(), the parent wakes up and returns.
If this all goes as planned, the child is now executing in a new address space and the parent is again executing in its original address space. The overhead is lower, but the design is not pretty.
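Seen from user-space, the vfork() contract is short: the parent is suspended until the child exits or calls exec(), and the child should do nothing else in between. The following is a minimal sketch of that usage, not a view into the kernel steps listed above; the helper name vfork_exit_status() is an assumption made for the example.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a child with vfork(), have it exit at once
 * with 'code', and return the status the parent collects (-1 on failure).
 */
int vfork_exit_status(int code)
{
    int status;
    pid_t pid = vfork();

    if (pid == -1)
        return -1;

    if (pid == 0)
        _exit(code);   /* child: only _exit() or an exec() call is safe here */

    /* The parent resumes only after the child has exited (or called exec). */
    if (waitpid(pid, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child uses _exit() rather than exit() because it is borrowing the parent's address space, and running the parent's exit handlers or flushing its stdio buffers there would corrupt the parent's state.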
The Linux Implementation of Threads
Threads are a popular modern programming abstraction. They provide multiple threads of execution within the same program in a shared memory address space. They can also share open files and other resources. Threads allow for concurrent programming and, on multiple processor systems, true parallelism.
Linux has a unique implementation of threads. To the Linux kernel, there is no concept of a thread. Linux implements all threads as standard processes. The Linux kernel does not provide any special scheduling semantics or data structures to represent threads. Instead, a thread is merely a process that shares certain resources with other processes. Each thread has a unique task_struct and appears to the kernel as a normal process (which just happens to share resources, such as an address space, with other processes).
This approach to threads contrasts greatly with operating systems such as Microsoft Windows or Sun Solaris, which have explicit kernel support for threads (and sometimes call threads lightweight processes). The name "lightweight process" sums up the difference in philosophies between Linux and other systems. To these other operating systems, threads are an abstraction to provide a lighter, quicker execution unit than the heavy process. To Linux, threads are simply a manner of sharing resources between processes (which are already quite lightweight) [11]. For example, assume you have a process that consists of four threads. On systems with explicit thread support, there might exist one process descriptor that in turn points to the four different threads. The process descriptor describes the shared resources, such as an address space or open files. The threads then describe the resources they alone possess. Conversely, in Linux, there are simply four processes and thus four normal task_struct structures. The four processes are set up to share certain resources.
[11] As an example, benchmark process creation time in Linux versus process (or even thread!) creation time in these other operating systems. The results are quite nice.
Threads are created like normal tasks, with the exception that the clone() system call is passed flags corresponding to specific resources to be shared:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
The previous code results in behavior identical to a normal fork(), except that the address space, filesystem resources, file descriptors, and signal handlers are shared. In other words, the new task and its parent are what are popularly called threads.
In contrast, a normal fork() can be implemented as
clone(SIGCHLD, 0);
And vfork() is implemented as
clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);
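The sharing that these flags request can be observed from user-space with the C library's clone() wrapper. Note that the wrapper's signature differs from the two-argument form shown above: it takes a function for the child to run, the top of a stack for it, the flags, and an argument. The sketch below assumes glibc on Linux; the names shared_value, child_fn(), and clone_and_read() are hypothetical, and SIGCHLD is added to the flags (an addition to the thread example in the text) so that the parent can wait for the child with waitpid().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes clone() and CLONE_* in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* assumption: ample for this child */

static volatile int shared_value;

/* Runs in the new task; its write is visible to the parent via CLONE_VM. */
static int child_fn(void *arg)
{
    shared_value = *(int *)arg;
    return 0;
}

/*
 * Hypothetical helper: create a task that shares the caller's address
 * space, let it store 'value', and return what the parent then reads.
 */
int clone_and_read(int value)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    pid_t pid;
    int ok;

    if (stack == NULL)
        return -1;

    shared_value = 0;

    /* The stack grows downward on most architectures, so pass its top. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                &value);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    ok = (waitpid(pid, NULL, 0) == pid);
    free(stack);                /* safe: the child has exited */
    return ok ? shared_value : -1;
}
```

Dropping CLONE_VM from the flags would give the child a private copy of shared_value, and the parent would read back 0 instead of the stored value.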
The flags provided to clone() help specify the behavior of the new process and detail what resources the parent and child will share. Table 3.1 lists the clone flags, which are defined in <linux/sched.h>, and their effect.
Table 3.1. clone() Flags
Flag                  Meaning
CLONE_FILES           Parent and child share open files.
CLONE_FS              Parent and child share filesystem information.
CLONE_IDLETASK        Set PID to zero (used only by the idle tasks).
CLONE_NEWNS           Create a new namespace for the child.
CLONE_PARENT          Child is to have same parent as its parent.
CLONE_PTRACE          Continue tracing child.
CLONE_SETTID          Write the TID back to user-space.
CLONE_SETTLS          Create a new TLS (thread-local storage) for the child.
CLONE_SIGHAND         Parent and child share signal handlers and blocked signals.
CLONE_SYSVSEM         Parent and child share System V SEM_UNDO semantics.
CLONE_THREAD          Parent and child are in the same thread group.
CLONE_VFORK           vfork() was used and the parent will sleep until the child wakes it.
CLONE_UNTRACED        Do not let the tracing process force CLONE_PTRACE on the child.
CLONE_STOP            Start process in the TASK_STOPPED state.
CLONE_CHILD_CLEARTID  Clear the TID in the child.
CLONE_CHILD_SETTID    Set the TID in the child.
CLONE_PARENT_SETTID   Set the TID in the parent.
CLONE_VM              Parent and child share address space.
Kernel Threads
It is often useful for the kernel to perform some operations in the background. The kernel accomplishes this via kernel threads: standard processes that exist solely in kernel-space. The significant difference between kernel threads and normal processes is that kernel threads do not have an address space (in fact, their mm pointer is NULL). They operate only in kernel-space and do not context switch into user-space. Kernel threads are, however, schedulable and preemptable as normal processes.
Linux delegates several tasks to kernel threads, most notably the pdflush task and the ksoftirqd task. These threads are created on system boot by other kernel threads. Indeed, a kernel thread can be created only by another kernel thread. The interface for spawning a new kernel thread from an existing one is
int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
The new task is created via the usual clone() system call with the specified flags argument. On return, the parent kernel thread exits with a pointer to the child's task_struct. The child executes the function specified by fn with the given argument arg. A special clone flag, CLONE_KERNEL, specifies the usual flags for kernel threads: CLONE_FS, CLONE_FILES, and CLONE_SIGHAND. Most kernel threads pass this for their flags parameter.
Typically, a kernel thread continues executing its initial function forever (or at least until the system reboots, but with Linux you never know). The initial function usually implements a loop in which the kernel thread wakes up as needed, performs its duties, and then returns to sleep.
We will discuss specific kernel threads in more detail in later chapters.
Process Termination
It is sad, but eventually processes must die. When a process terminates, the kernel releases the resources owned by the process and notifies the process's parent of its unfortunate demise.
Typically, process destruction occurs when the process calls the exit() system call, either explicitly when it is ready to terminate or implicitly on return from the main subroutine of any program (that is, the C compiler places a call to exit() after main() returns). A process can also terminate involuntarily. This occurs when the process receives a signal or exception it cannot handle or ignore. Regardless of how a process terminates, the bulk of the work is handled by do_exit(), which completes a number of chores:
-
First, it sets the PF_EXITING flag in the flags member of the task_struct.
-
Second, it calls del_timer_sync() to remove any kernel timers. Upon return, it is guaranteed that no timer is queued and that no timer handler is running.
-
Next, if BSD process accounting is enabled, do_exit() calls acct_process() to write out accounting information.
-
Now it calls __exit_mm() to release the mm_struct held by this process. If no other process is using this address space (in other words, if it is not shared), the kernel deallocates it.
-
Next, it calls exit_sem(). If the process is queued waiting for an IPC semaphore, it is dequeued here.
-
It then calls __exit_files(), __exit_fs(), exit_namespace(), and exit_sighand() to decrement the usage count of objects related to file descriptors, filesystem data, the process namespace, and signal handlers, respectively. If any usage counts reach zero, the object is no longer in use by any process and it is removed.
-
Subsequently, it sets the task's exit code, stored in the exit_code member of the task_struct, to the code provided by exit() or whatever kernel mechanism forced the termination. The exit code is stored here for optional retrieval by the parent.
-
It then calls exit_notify() to send signals to the task's parent, reparent any of the task's children to another thread in their thread group or the init process, and set the task's state to TASK_ZOMBIE.
-
Finally, do_exit() calls schedule() to switch to a new process (see Chapter 4). Because TASK_ZOMBIE tasks are never scheduled, this is the last code the task will ever execute.
The code for do_exit() is defined in kernel/exit.c.
At this point, all objects associated with the task (assuming the task was the sole user) are freed. The task is not runnable (and in fact no longer has an address space in which to run) and is in the TASK_ZOMBIE state. The only memory it occupies is its kernel stack, the thread_info structure, and the task_struct structure. The task exists solely to provide information to its parent. After the parent retrieves the information, or notifies the kernel that it is uninterested, the remaining memory held by the process is freed and returned to the system for use.
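The zombie state can be observed from user-space. The following is a minimal sketch, assuming a Linux system with /proc mounted; proc_state() and observe_zombie() are hypothetical helper names, not kernel or library functions. The parent uses waitid() with WNOWAIT to wait until the child has exited without reaping it, reads the child's state letter from /proc, and only then reaps the child with waitpid().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the one-letter state field from /proc/<pid>/stat.
 * Returns '?' if the file cannot be read or parsed. */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    /* The state letter follows the parenthesized command name. */
    char *close_paren = strrchr(buf, ')');
    if (!close_paren || close_paren[1] != ' ' || close_paren[2] == '\0')
        return '?';
    return close_paren[2];
}

/* Fork a child that exits with 'code'. Store the child's state after it
 * has exited but before it is reaped, then reap it.
 * Returns the exit code seen by the parent, or -1 on error. */
static int observe_zombie(int code, char *state_before_reap)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                /* child: terminate immediately */

    siginfo_t info;
    /* WNOWAIT: block until the child has exited, but leave it unreaped. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) < 0)
        return -1;
    *state_before_reap = proc_state(pid);

    int status;
    if (waitpid(pid, &status, 0) != pid)   /* reap the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling observe_zombie(42, &state) should return 42, and on a typical Linux system state is expected to hold Z, the letter ps(1) shows for a zombie. Once waitpid() returns, the kernel is free to release what remained of the child.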
Removal of the Process Descriptor
After do_exit() completes, the process descriptor for the terminated process still exists but the process is a zombie and is unable to run. As discussed, this allows the system to obtain information about a child process after it has terminated. Consequently, the acts of cleaning up after a process and removing its process descriptor are separate. After the parent has obtained information on its terminated child, or signified to the kernel that it does not care, the child's task_struct is deallocated.
The wait() family of functions is implemented via a single (and complicated) system call, wait4(). The standard behavior is to suspend execution of the calling task until one of its children exits, at which time the function returns with the PID of the exited child. Additionally, a pointer is provided to the function that on return holds the exit code of the terminated child.
When it is time to finally deallocate the process descriptor, release_task() is invoked. It does the following:
-
First, it calls free_uid() to decrement the usage count of the process's user. Linux keeps a per-user cache of information related to how many processes and files a user has opened. If the usage count reaches zero, the user has no more open processes or files and the cache is destroyed.
-
Second, release_task() calls unhash_process() to remove the process from the pidhash and from the task list.
-
Next, if the task was ptraced, release_task() reparents the task to its original parent and removes it from the ptrace list.
-
Ultimately, release_task() calls put_task_struct() to free the pages containing the process's kernel stack and thread_info structure and to return the task_struct to its slab cache.
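Several of the steps above share one pattern: drop a usage count and free the object only when the count reaches zero. The sketch below illustrates that pattern in plain C. The names shared_obj, obj_new(), obj_get(), and obj_put() are hypothetical, and the plain int counter is an assumption made for brevity; it is not safe for concurrent use, unlike the kernel's own counters.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical shared object, standing in for a structure such as the
 * per-user cache or a table of open files. */
struct shared_obj {
    int usage;   /* number of current users */
};

/* Allocate an object with one user. Returns NULL on allocation failure. */
static struct shared_obj *obj_new(void)
{
    struct shared_obj *o = malloc(sizeof(*o));
    if (o)
        o->usage = 1;
    return o;
}

/* Register one more user of the object. */
static struct shared_obj *obj_get(struct shared_obj *o)
{
    o->usage++;
    return o;
}

/* Drop one user. Frees the object and returns true if that was the last. */
static bool obj_put(struct shared_obj *o)
{
    if (--o->usage > 0)
        return false;
    free(o);
    return true;
}
```

With two users, the first obj_put() only decrements the count; the second one frees the object. This is why an exiting task can release its references unconditionally without knowing who else shares the object.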
At this point, the process descriptor and all resources belonging solely to the process have been freed.
The Dilemma of the Parentless Task
If a parent exits before its children, some mechanism must exist to reparent the child tasks to a new process, or else parentless terminated processes would forever remain zombies, wasting system memory. The solution, hinted at previously, is to reparent a task's children on exit to either another process in the current thread group or, if that fails, the init process. In do_exit(), exit_notify() is invoked, which calls forget_original_parent() to perform the reparenting:
struct task_struct *p, *reaper = father;
struct list_head *list;

/* prefer another task in the exiting task's thread group */
if (father->exit_signal != -1)
        reaper = prev_thread(reaper);
else
        reaper = child_reaper;

/* no other thread was found: fall back to init */
if (reaper == father)
        reaper = child_reaper;
This code sets reaper to another task in the process's thread group. If there is not another task in the thread group, it sets reaper to child_reaper, which is the init process. Now that a suitable new parent for the children is found, each child needs to be located and reparented to reaper:
/* ordinary children */
list_for_each(list, &father->children) {
        p = list_entry(list, struct task_struct, sibling);
        reparent_thread(p, reaper, child_reaper);
}

/* children currently being ptraced */
list_for_each(list, &father->ptrace_children) {
        p = list_entry(list, struct task_struct, ptrace_list);
        reparent_thread(p, reaper, child_reaper);
}
This code iterates over two lists: the child list and the ptraced child list, reparenting each child. The rationale behind having both lists is interesting; it is a new feature in the 2.6 kernel. When a task is ptraced, it is temporarily reparented to the debugging process. When the task's parent exits, however, it must be reparented along with its other siblings. In previous kernels, this resulted in a loop over every process in the system looking for children. The solution, as noted previously, is simply to keep a separate list of a process's children that are being ptraced, reducing the search for one's children from every process to just two relatively small lists.
With the process successfully reparented, there is no risk of stray zombie processes. The init process routinely calls wait() on its children, cleaning up any zombies assigned to it.
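The benefit of per-parent child lists can be illustrated with a small user-space model. This is a sketch under simplifying assumptions, not the kernel's implementation: struct task, add_child(), pick_reaper(), and reparent_children() are hypothetical names, and a singly linked list stands in for the kernel's list_head.

```c
#include <stddef.h>

/* Hypothetical, simplified task: only the fields the model needs. */
struct task {
    int pid;
    struct task *parent;
    struct task *children;   /* head of this task's child list */
    struct task *sibling;    /* next child of the same parent */
};

/* Link 'child' at the head of 'parent's child list. */
static void add_child(struct task *parent, struct task *child)
{
    child->parent = parent;
    child->sibling = parent->children;
    parent->children = child;
}

/* Choose the new parent: a thread-group peer if one exists, else init. */
static struct task *pick_reaper(struct task *peer, struct task *init_task)
{
    return peer ? peer : init_task;
}

/* Move every child of 'father' to 'reaper'. Only father's own list is
 * walked; no other task in the system is visited.
 * Returns the number of children moved. */
static int reparent_children(struct task *father, struct task *reaper)
{
    int moved = 0;
    while (father->children) {
        struct task *p = father->children;
        father->children = p->sibling;   /* unlink from the old parent */
        add_child(reaper, p);            /* link under the new parent */
        moved++;
    }
    return moved;
}
```

The cost of reparent_children() depends on the number of children the exiting task has, not on the number of tasks in the system, which is the point of keeping the lists per parent.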
Process Wrap Up
In this chapter, we looked at the famed operating system abstraction of the process. We discussed the generalities of the process, why it is important, and the relationship between processes and threads. We then discussed how Linux stores and represents processes (with task_struct and thread_info), how processes are created (via clone() and fork()), how new executable images are loaded into address spaces (via the exec() family of system calls), the hierarchy of processes, how parents glean information about their deceased children (via the wait() family of system calls), and how processes ultimately die (forcefully or intentionally via exit()).
The process is a fundamental and crucial abstraction, at the heart of every modern operating system, and ultimately the reason we have operating systems altogether (to run programs).
The next chapter discusses process scheduling, which is the delicate and interesting manner in which the kernel decides which processes to run, at what time, and in what order.