Sleeping in the Kernel

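This article discusses how processes in the Linux kernel can sleep safely and be woken up, covering the schedule() function, the lost wake-up problem, wait queues and time-bound sleep.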


Kernel Korner - Sleeping in the Kernel

link:  http://www.linuxjournal.com/article/8144

The old sleep_on() function won't work reliably in an age of SMP systems and hyperthreaded processors. Here's how to make a process sleep in a safe, cross-platform way.

In Linux kernel programming, there are numerous occasions when processes wait until something occurs or when sleeping processes need to be woken up to get some work done. There are different ways to achieve these things.

All of the discussion in this article refers to kernel mode execution. A reference to a process means execution in kernel space in the context of that process.

Some kernel code examples have been reformatted to fit this print format. Line numbers refer to lines in the original file.

The schedule() Function

In Linux, the ready-to-run processes are maintained on a run queue. A ready-to-run process has the state TASK_RUNNING. Once the timeslice of a running process is over, the Linux scheduler picks up another appropriate process from the run queue and allocates CPU power to that process.

A process can also relinquish the CPU voluntarily. A process can call the schedule() function to tell the scheduler that it may schedule some other process on the processor.

Once the process is scheduled back again, execution resumes from the point where the process had stopped, that is, from the call to the schedule() function.

At times, processes want to wait until a certain event occurs, such as a device initialising, I/O completing or a timer expiring. In such a case, the process is said to sleep on that event. A process can go to sleep using the schedule() function. The following code puts the executing process to sleep:

sleeping_task = current;
set_current_state(TASK_INTERRUPTIBLE);
schedule();
func1();
/* The rest of the code */


Now, let's take a look at what is happening in there. In the first statement, we store a reference to this process' task structure. current, which really is a macro, gives a pointer to the executing process' task_struct. set_current_state changes the state of the currently executing process from TASK_RUNNING to TASK_INTERRUPTIBLE.

As mentioned above, the schedule() function normally just schedules another process. But that is all it does only if the state of the task is TASK_RUNNING. When the schedule() function is called with the state as TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, an additional step is performed: the currently executing process is moved off the run queue before another process is scheduled. The effect of this is that the executing process goes to sleep, as it no longer is on the run queue. Hence, the scheduler does not pick it again until it is woken up. And that is how a process can sleep.

Now let's wake it up. Given a reference to a task structure, the process can be woken up by calling:

wake_up_process(sleeping_task);


As you might have guessed, this sets the task state to TASK_RUNNING and puts the task back on the run queue. Of course, the process runs only when the scheduler looks at it the next time around.

So now you know the simplest way of sleeping and waking in the kernel.
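
The state changes described above can be traced with a small userspace model. This is only a sketch: the model_ names are hypothetical stand-ins, not kernel APIs, and the run queue is reduced to a single flag. It shows that schedule() puts the caller to sleep only when its state is no longer running, and that a wake-up restores both the state and the run-queue membership.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TASK_RUNNING and TASK_INTERRUPTIBLE. */
enum model_state { MODEL_RUNNING, MODEL_INTERRUPTIBLE };

struct model_task {
    enum model_state state;
    bool on_runqueue;
};

/* Like set_current_state(): only the state field changes. */
static void model_set_state(struct model_task *t, enum model_state s)
{
    t->state = s;
}

/* Like the part of schedule() described above: a task whose state is
 * not MODEL_RUNNING is taken off the run queue before another task is
 * picked. Returns true if the caller went to sleep. */
static bool model_schedule(struct model_task *t)
{
    if (t->state != MODEL_RUNNING) {
        t->on_runqueue = false;
        return true;
    }
    return false; /* merely scheduled out, still runnable */
}

/* Like wake_up_process(): mark the task runnable and requeue it. */
static void model_wake_up_process(struct model_task *t)
{
    t->state = MODEL_RUNNING;
    t->on_runqueue = true;
}

/* Walks through one sleep and one wake-up; returns true if the task
 * ends up runnable again. */
bool model_sleep_then_wake(void)
{
    struct model_task task = { MODEL_RUNNING, true };

    model_set_state(&task, MODEL_INTERRUPTIBLE);
    bool slept = model_schedule(&task);
    assert(slept && !task.on_runqueue); /* asleep: off the run queue */

    model_wake_up_process(&task);
    return task.state == MODEL_RUNNING && task.on_runqueue;
}
```

Calling model_schedule() on a task that is still MODEL_RUNNING returns false and leaves it on the run queue, which corresponds to the plain voluntary yield described earlier.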

Interruptible and Uninterruptible Sleep

A process can sleep in two different modes, interruptible and uninterruptible. In an interruptible sleep, the process can be woken up to process signals. In an uninterruptible sleep, the process cannot be woken up other than by an explicit wake_up. Interruptible sleep is the preferred way of sleeping, unless there is a situation in which signals cannot be handled at all, such as device I/O.

Lost Wake-Up Problem

Almost always, processes go to sleep after checking some condition. The lost wake-up problem arises out of a race condition that occurs while a process goes to conditional sleep. It is a classic problem in operating systems.

Consider two processes, A and B. Process A, the consumer, processes items from a list, while process B, the producer, adds items to this list. When the list is empty, process A sleeps. Process B wakes A up when it appends anything to the list. The code looks like this:

Process A:
1  spin_lock(&list_lock);
2  if(list_empty(&list_head)) {
3      spin_unlock(&list_lock);
4      set_current_state(TASK_INTERRUPTIBLE);
5      schedule();
6      spin_lock(&list_lock);
7  }
8
9  /* Rest of the code ... */
10 spin_unlock(&list_lock);

Process B:
100  spin_lock(&list_lock);
101  list_add_tail(new_node, &list_head);
102  spin_unlock(&list_lock);
103  wake_up_process(processa_task);


There is one problem with this situation. It may happen that after process A executes line 3 but before it executes line 4, process B is scheduled on another processor. In this timeslice, process B executes all its instructions, 100 through 103. Thus, it performs a wake-up on process A, which has not yet gone to sleep. Now, process A, wrongly assuming that it safely has performed the check for list_empty, sets the state to TASK_INTERRUPTIBLE and goes to sleep.

Thus, a wake-up from process B is lost. This is known as the lost wake-up problem. Process A sleeps, even though there are nodes available on the list.

This problem can be avoided by restructuring the code for process A in the following manner:

Process A:

set_current_state(TASK_INTERRUPTIBLE);
spin_lock(&list_lock);
if (list_empty(&list_head)) {
        spin_unlock(&list_lock);
        schedule();
        spin_lock(&list_lock);
}
set_current_state(TASK_RUNNING);

/* Rest of the code ... */
spin_unlock(&list_lock);


This code avoids the lost wake-up problem. How? We have changed our current state to TASK_INTERRUPTIBLE before we test the condition. So, what has changed? The change is that whenever wake_up_process is called for a process whose state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, and the process has not yet called schedule(), the state of the process is changed back to TASK_RUNNING.

Thus, in the above example, even if a wake-up is delivered by process B at any point after the check for list_empty is made, the state of A automatically is changed to TASK_RUNNING. Hence, the call to schedule() does not put process A to sleep; it merely schedules it out for a while, as discussed earlier. Thus, the wake-up no longer is lost.
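
The two orderings can be replayed deterministically in a single-threaded userspace model. The sketch below is an illustration under simplifying assumptions, not kernel code: the model_ names and the two replay functions are hypothetical, the list is reduced to a counter, and process B is run as one indivisible step placed exactly in the window between A's check and A's call to schedule().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TASK_RUNNING and TASK_INTERRUPTIBLE. */
enum model_state { MODEL_RUNNING, MODEL_INTERRUPTIBLE };

struct model_task {
    enum model_state state;
    bool asleep;
};

/* Like schedule(): the caller sleeps only if it is not MODEL_RUNNING. */
static void model_schedule(struct model_task *t)
{
    if (t->state != MODEL_RUNNING)
        t->asleep = true;
}

/* Process B, lines 100 to 103, run in one go: append a node, wake A. */
static void producer_runs(struct model_task *a, int *list_len)
{
    (*list_len)++;
    a->state = MODEL_RUNNING; /* wake_up_process() */
    a->asleep = false;
}

/* Original ordering: check, B runs, set state, schedule.
 * Returns true if A ends up asleep although the list has a node. */
bool racy_order_loses_wakeup(void)
{
    struct model_task a = { MODEL_RUNNING, false };
    int list_len = 0;

    bool empty = (list_len == 0);      /* A: list_empty() */
    producer_runs(&a, &list_len);      /* B runs in the window */
    assert(list_len == 1);
    if (empty) {
        a.state = MODEL_INTERRUPTIBLE; /* A: set_current_state() */
        model_schedule(&a);            /* A: schedule() */
    }
    return a.asleep && list_len > 0;
}

/* Restructured ordering: set state, check, B runs, schedule. */
bool fixed_order_loses_wakeup(void)
{
    struct model_task a = { MODEL_RUNNING, false };
    int list_len = 0;

    a.state = MODEL_INTERRUPTIBLE;     /* A: set_current_state() */
    bool empty = (list_len == 0);      /* A: list_empty() */
    producer_runs(&a, &list_len);      /* B resets A to MODEL_RUNNING */
    assert(list_len == 1);
    if (empty)
        model_schedule(&a);            /* no sleep: state is running */
    a.state = MODEL_RUNNING;
    return a.asleep && list_len > 0;
}
```

In this model the first function reports a lost wake-up and the second does not, because the producer's wake-up overwrites the sleeping state that A set before its check.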

Here is a code snippet of a real-life example from the Linux kernel (linux-2.6.11/kernel/sched.c: 4254):

4253  /* Wait for kthread_stop */
4254  set_current_state(TASK_INTERRUPTIBLE);
4255  while (!kthread_should_stop()) {
4256          schedule();
4257          set_current_state(TASK_INTERRUPTIBLE);
4258  }
4259  __set_current_state(TASK_RUNNING);
4260 return 0;


This code belongs to the migration_thread. The thread cannot exit until the kthread_should_stop() function returns 1. The thread sleeps for as long as the function returns 0.

As can be seen from the code, the check for the kthread_should_stop condition is made only after the state is TASK_INTERRUPTIBLE. Hence, a wake-up received after the condition check but before the call to the schedule() function is not lost.


Wait Queues

Wait queues are a higher-level mechanism used to put processes to sleep and wake them up. In most instances, you use wait queues. They are needed when more than one process wants to sleep on the occurrence of one or more events.

A wait queue for an event is a list of nodes. Each node points to a process waiting for that event. An individual node in this list is called a wait queue entry. Processes that want to sleep until the event occurs add themselves to this list before going to sleep. On the occurrence of the event, one or more processes on the list are woken up. Upon waking up, the processes remove themselves from the list.

A wait queue can be defined and initialised in the following manner:

wait_queue_head_t my_event;
init_waitqueue_head(&my_event);

The same effect could be achieved by using this macro:

DECLARE_WAIT_QUEUE_HEAD(my_event);

Any process that wants to wait on my_event can use either of the following options:

  1. wait_event(my_event, (event_present == 1) );

  2. wait_event_interruptible(my_event, (event_present == 1) );

Both are macros that take the wait queue head itself, not a pointer to it, as the first argument. The interruptible version (option 2) puts the process into an interruptible sleep, whereas the other (option 1) puts the process into an uninterruptible sleep.

In most instances, a process goes to sleep only after checking some condition for the availability of the resource. To facilitate that, both these macros take an expression as the second argument. The process goes to sleep only if the expression evaluates to false. Care is taken to avoid the lost wake-up problem.
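
The same check-then-sleep discipline exists in userspace, and a userspace analogue may make it easier to see. The sketch below deliberately swaps in POSIX threads: a mutex and a condition variable stand in for the wait queue, and none of the model_ names are kernel APIs. The waiter sleeps only while the condition is false and re-checks it after every wake-up; the waker changes the condition under the same lock, so a wake-up cannot fall between the check and the sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a wait queue plus its condition. */
struct model_event {
    pthread_mutex_t lock;
    pthread_cond_t  waiters;
    bool            event_present;
};

/* Analogue of wait_event(): sleep only while the condition is false,
 * and re-check it after every wake-up. */
static void model_wait_event(struct model_event *ev)
{
    pthread_mutex_lock(&ev->lock);
    while (!ev->event_present)
        pthread_cond_wait(&ev->waiters, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

/* Analogue of setting the condition and then waking every waiter. */
static void model_post_event(struct model_event *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->event_present = true;
    pthread_cond_broadcast(&ev->waiters);
    pthread_mutex_unlock(&ev->lock);
}

static void *waiter_main(void *arg)
{
    model_wait_event(arg);
    return arg;
}

/* Starts one waiter, posts the event and returns 1 if the waiter came
 * back, 0 otherwise. */
int model_event_demo(void)
{
    struct model_event ev;
    pthread_t waiter;
    void *result = NULL;
    int ok = 0;

    pthread_mutex_init(&ev.lock, NULL);
    pthread_cond_init(&ev.waiters, NULL);
    ev.event_present = false;

    if (pthread_create(&waiter, NULL, waiter_main, &ev) == 0) {
        model_post_event(&ev);
        if (pthread_join(waiter, &result) == 0)
            ok = (result == &ev);
    }
    assert(!ok || ev.event_present);

    pthread_cond_destroy(&ev.waiters);
    pthread_mutex_destroy(&ev.lock);
    return ok;
}
```

The demo returns 1 whether the waiter runs before or after the event is posted, which is the property that matters here. On older toolchains the program may need to be built with -pthread.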

Old kernel versions used the functions sleep_on() and interruptible_sleep_on(), but those two functions can introduce bad race conditions and should not be used.

Let's now take a look at some of the calls for waking up processes sleeping on a wait queue:

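  1. wake_up(&my_event);: wakes up every non-exclusive waiter on the wait queue and at most one exclusive waiter. Processes that sleep through wait_event are non-exclusive; exclusive waiters are queued with calls such as prepare_to_wait_exclusive.

  2. wake_up_all(&my_event);: wakes up all the processes on the wait queue.

  3. wake_up_interruptible(&my_event);: behaves like wake_up, but wakes only processes that are in interruptible sleep.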

Wait Queues: Putting It Together

Let us look at a real-life example of how wait queues are used. smbiod is the I/O thread that performs I/O operations for the SMB filesystem. Here is a code snippet for the smbiod thread (linux-2.6.11/fs/smbfs/smbiod.c: 291):

291 static int smbiod(void *unused)
292 {
293     daemonize("smbiod");
294
295     allow_signal(SIGKILL);
296
297     VERBOSE("SMB Kernel thread starting "
                "(%d)...\n", current->pid);
298
299     for (;;) {
300             struct smb_sb_info *server;
301             struct list_head *pos, *n;
302
303             /* FIXME: Use poll? */
304             wait_event_interruptible(smbiod_wait,
305                     test_bit(SMBIOD_DATA_READY,
                                 &smbiod_flags));
...
...             /* Some processing */
312
313             clear_bit(SMBIOD_DATA_READY,
                          &smbiod_flags);
314
...             /* Code to perform the requested I/O */
...
...
337     }
338
339     VERBOSE("SMB Kernel thread exiting (%d)...\n",
                current->pid);
340     module_put_and_exit(0);
341 }
342


As is clear from the code, smbiod is a thread that runs in a continuous loop as it processes I/O requests. When there are no I/O requests to process, the thread goes to sleep on the wait queue smbiod_wait. This is achieved by calling wait_event_interruptible (line 304). This call causes smbiod to sleep only if the SMBIOD_DATA_READY bit is not set. As mentioned earlier, wait_event_interruptible takes care to avoid the lost wake-up problem.

Now, when a process wants to get some I/O done, it sets the SMBIOD_DATA_READY bit in smbiod_flags and wakes up the smbiod thread to perform I/O. This can be seen in the following code snippet (linux-2.6.11/fs/smbfs/smbiod.c: 57):

57 void smbiod_wake_up(void)
58 {
59     if (smbiod_state == SMBIOD_DEAD)
60         return;
61     set_bit(SMBIOD_DATA_READY, &smbiod_flags);
62     wake_up_interruptible(&smbiod_wait);
63 }


wake_up_interruptible wakes up the smbiod thread sleeping on the smbiod_wait wait queue. The function smb_add_request (linux-2.6.11/fs/smbfs/request.c: 279) calls the smbiod_wake_up function when it adds new requests for processing.

Thundering Herd Problem

Another classical operating system problem arises due to the use of the wake_up_all function. Let us consider a scenario in which a set of processes are sleeping on a wait queue, wanting to acquire a lock.

Once the process that has acquired the lock is done with it, it releases the lock and wakes up all the processes sleeping on the wait queue. All the processes try to grab the lock. Eventually, only one of these acquires the lock and the rest go back to sleep.

This behavior is not good for performance. If we already know that only one process is going to resume while the rest of the processes go back to sleep again, why wake them up in the first place? It consumes valuable CPU cycles and incurs context-switching overheads. This problem is called the thundering herd problem. That is why the wake_up_all function should be used carefully, only when you know that it is required. Otherwise, go ahead and use the wake_up function, which wakes up at most one exclusive waiter (a process queued to be woken one at a time).
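
The size of the herd can be counted with a toy model of the wake-up scan. This is a simplified sketch, not the kernel's implementation: the model_ names are hypothetical, the queue is a plain array, and the only rule modelled is that the scan wakes waiters in order and stops once the requested number of exclusive waiters has been woken, with 0 meaning no limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_waiter {
    bool exclusive; /* queued to be woken one at a time */
    bool awake;
};

/* Walks the queue from the head and wakes sleeping waiters. The walk
 * stops once nr_exclusive exclusive waiters have been woken; passing 0
 * means no limit. Returns the number of waiters woken. */
size_t model_wake(struct model_waiter *queue, size_t len, int nr_exclusive)
{
    size_t woken = 0;

    for (size_t i = 0; i < len; i++) {
        if (queue[i].awake)
            continue;
        queue[i].awake = true;
        woken++;
        if (queue[i].exclusive && --nr_exclusive == 0)
            break;
    }
    return woken;
}

enum { MODEL_WAITERS = 4 };

/* Number of waiters woken to hand one lock to one of MODEL_WAITERS
 * exclusive waiters, either waking all of them or just one. */
size_t model_herd_size(bool wake_all)
{
    struct model_waiter queue[MODEL_WAITERS];

    for (size_t i = 0; i < MODEL_WAITERS; i++) {
        queue[i].exclusive = true;
        queue[i].awake = false;
    }

    size_t woken = model_wake(queue, MODEL_WAITERS, wake_all ? 0 : 1);
    assert(woken >= 1);
    return woken;
}
```

With four exclusive waiters, waking all of them costs four wake-ups for a single lock hand-off, while the limited scan costs one. The model also shows why the limit helps only exclusive waiters: non-exclusive entries ahead of the first exclusive one are all woken regardless.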

So, when would the wake_up_all function be used? It is used in scenarios when processes want to take a shared lock on something. For example, processes waiting to read data on a page could all be woken up at the same moment.

Time-Bound Sleep

You frequently may want to delay the execution of your process for a given amount of time. It may be required to allow the hardware to catch up or to carry out an activity after specified time intervals, such as polling a device, flushing data to disk or retransmitting a network request. This can be achieved by the function schedule_timeout(timeout), a variant of schedule(). This function puts the process to sleep until timeout jiffies have elapsed. jiffies is a kernel variable that is incremented for every timer interrupt.

As with schedule(), the state of the process has to be changed to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE before calling this function. If the process is woken up earlier than timeout jiffies have elapsed, the number of jiffies left is returned; otherwise, zero is returned.

Let us take a look at a real-life example (linux-2.6.11/arch/i386/kernel/apm.c: 1415):

1415  set_current_state(TASK_INTERRUPTIBLE);
1416  for (;;) {
1417     schedule_timeout(APM_CHECK_TIMEOUT);
1418     if (exit_kapmd)
1419         break;
1420     /*
1421      * Ok, check all events, check for idle
....      * (and mark us sleeping so as not to
....      * count towards the load average)..
1423      */
1424      set_current_state(TASK_INTERRUPTIBLE);
1425      apm_event_handler();
1426  }


This code belongs to the APM thread. The thread polls the APM BIOS for events at intervals of APM_CHECK_TIMEOUT jiffies. As can be seen from the code, the thread calls schedule_timeout() to sleep for the given duration of time, after which it calls apm_event_handler() to process any events.

You also may use a more convenient API, with which you can specify time in milliseconds and seconds:

  1. msleep(time_in_msec);

  2. msleep_interruptible(time_in_msec);

  3. ssleep(time_in_sec);

msleep(time_in_msec) and msleep_interruptible(time_in_msec) accept the time to sleep in milliseconds, while ssleep(time_in_sec) accepts the time to sleep in seconds. These higher-level routines internally convert the time into jiffies, appropriately change the state of the process and call schedule_timeout(), thus making the process sleep.
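
The conversion step can be sketched as plain arithmetic. The functions below are hypothetical illustrations, not the kernel's code: they take the tick rate as a parameter hz (in the kernel it is the constant HZ) and round up, on the assumption that a sleep should never be shorter than requested. The kernel's own helper, msecs_to_jiffies(), handles more cases than this sketch does.

```c
#include <assert.h>

/* Converts milliseconds to timer ticks, rounding up so that the sleep
 * is never shorter than requested. `hz` is the number of ticks per
 * second. Very large inputs can overflow; this sketch ignores that. */
unsigned long model_msecs_to_ticks(unsigned long msecs, unsigned long hz)
{
    assert(hz > 0);
    return (msecs * hz + 999) / 1000;
}

/* Converts whole seconds to timer ticks. */
unsigned long model_secs_to_ticks(unsigned long secs, unsigned long hz)
{
    assert(hz > 0);
    return secs * hz;
}
```

With 100 ticks per second, one tick lasts 10 milliseconds, so a request for 1 millisecond and a request for 10 milliseconds both become one tick. The granularity of the timer tick is therefore a lower bound on how precisely such a sleep can be timed.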

I hope that you now have a basic understanding of how processes safely can sleep and wake up in the kernel. To understand the internal working of wait queues and advanced uses, look at the implementations of init_waitqueue_head, as well as variants of wait_event and wake_up.

Acknowledgement

Greg Kroah-Hartman reviewed a draft of this article and contributed valuable suggestions.

Kedar Sovani (www.geocities.com/kedarsovani) works for Kernel Corporation as a kernel developer. His areas of interest include security, filesystems and distributed systems.

