Please indicate the source: http://blog.youkuaiyun.com/gaoxiangnumber1
Welcome to my github: https://github.com/gaoxiangnumber1
63.1 Overview
- Traditional blocking I/O model: A process performs I/O on one file descriptor at a time, and each I/O system call blocks until the data is transferred.
- Disk files are a special case. The kernel employs the buffer cache to speed disk I/O requests.
- A write() to a disk file returns as soon as the requested data has been transferred to the kernel buffer cache, rather than waiting until the data is written to disk(unless the O_SYNC flag was specified when opening the file).
- A read() transfers data from the buffer cache to a user buffer, and if the required data is not in the buffer cache, then the kernel puts the process to sleep while a disk read is performed.
- Some applications need to be able to do one or both of the following:
- Check whether I/O is possible on a file descriptor without blocking if it is not possible.
- Monitor multiple file descriptors to see if I/O is possible on any of them.
- The following techniques partially address these needs: nonblocking I/O and the use of multiple processes or threads.
- If we place a file descriptor in nonblocking mode by enabling the O_NONBLOCK open file status flag, then an I/O system call that can’t be immediately completed returns an error instead of blocking. Nonblocking I/O can be employed with pipes, FIFOs, sockets, terminals, pseudo-terminals, and some other types of devices. Nonblocking I/O allows us to periodically check(“poll”) whether I/O is possible on a file descriptor(see the sketch after this list).
- If we don’t want a process to block when performing I/O on a file descriptor, we can create a new process to perform the I/O. The parent process can then carry on to perform other tasks, while the child process blocks until the I/O is complete. If we need to handle I/O on multiple file descriptors, we can create one child for each descriptor. The problems are expense and complexity. Creating and maintaining processes places a load on the system, and the child processes will need to use some form of IPC to inform the parent about the status of I/O operations.
- Using multiple threads instead of processes is less demanding of resources, but the threads will probably still need to communicate information to one another about the status of I/O operations, and the programming can be complex, especially if we are using thread pools to minimize the number of threads used to handle large numbers of simultaneous clients.(One place where threads can be useful is if the application needs to call a third-party library that performs blocking I/O. An application can avoid blocking in this case by making the library call in a separate thread.)
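The following is a minimal sketch of placing a descriptor in nonblocking mode with fcntl(); error handling is pared down, and the helper name set_nonblock() is only for illustration.
#include <fcntl.h>
/* Enable the O_NONBLOCK open file status flag on fd. Returns 0 on success, -1 on error. */
int set_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);                 /* Fetch the current file status flags */
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);  /* Add O_NONBLOCK and write the flags back */
}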
- Because of the limitations of both nonblocking I/O and the use of multiple threads or processes, one of the following alternatives is preferable:
- I/O multiplexing allows a process to simultaneously monitor multiple file descriptors to find out whether I/O is possible on any of them. select() and poll() perform I/O multiplexing.
- Signal-driven I/O is a technique whereby a process requests that the kernel send it a signal when input is available or data can be written on a specified file descriptor. The process can then carry on performing other activities, and is notified when I/O becomes possible via receipt of the signal. When monitoring large numbers of file descriptors, signal-driven I/O provides better performance than select() and poll().(A minimal setup sketch follows this list.)
- epoll is a Linux-specific feature.
Like the I/O multiplexing APIs, epoll allows a process to monitor multiple file descriptors to see if I/O is possible on any of them.
Like signal-driven I/O, epoll provides better performance when monitoring large numbers of file descriptors.
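Referring back to the signal-driven I/O item above, the following is a minimal sketch of how signal-driven I/O is typically enabled on a descriptor. It assumes a handler for SIGIO has already been installed with sigaction(); the helper name enable_sigio() is only for illustration, and error checking is omitted.
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
/* Enable signal-driven I/O: the kernel sends SIGIO to this process when I/O becomes possible on fd. */
void enable_sigio(int fd)
{
    fcntl(fd, F_SETOWN, getpid());                       /* Deliver the signal to this process */
    int flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK);    /* Turn on signal-driven I/O and nonblocking mode */
}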
- I/O multiplexing, signal-driven I/O, and epoll are all methods of achieving the same result: monitoring one or several file descriptors simultaneously to see if they are ready to perform I/O(to be precise, to see whether an I/O system call could be performed without blocking). The transition of a file descriptor into a ready state is triggered by some type of I/O event(the arrival of input, the completion of a socket connection and so on). None of these techniques performs I/O. They merely tell us that a file descriptor is ready.
- One I/O model that we don’t describe in this chapter is POSIX asynchronous I/O(AIO). POSIX AIO allows a process to queue an I/O operation to a file and then later be notified when the operation is complete.
Advantage: The initial I/O call returns immediately, so that the process is not tied up waiting for data to be transferred to the kernel or for the operation to complete. This allows the process to perform other tasks in parallel with the I/O(which may include queuing further I/O requests).
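The following is a hedged sketch of what queuing a read with POSIX AIO looks like(this chapter does not cover AIO further). The buffer and control block must remain valid until the operation completes; error checking is minimal, and on older glibc the program must be linked with -lrt.
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
static char buf[4096];
static struct aiocb cb;
ssize_t queue_and_wait(int fd)
{
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;                 /* Read from this descriptor */
    cb.aio_buf = buf;                   /* Into this buffer */
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    if (aio_read(&cb) == -1)            /* Queue the read; the call returns immediately */
        return -1;
    /* The process can now do other work. Completion can be detected by polling
       aio_error(), by calling aio_suspend(), or via a notification requested in
       cb.aio_sigevent. Busy-waiting here only keeps the sketch short. */
    while (aio_error(&cb) == EINPROGRESS)
        ;
    return aio_return(&cb);             /* Bytes transferred, or -1 on error */
}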
Which technique?
- select() and poll() are standard interfaces that have been present on UNIX for many years.
- Advantage: portability.
- Disadvantage: they don’t scale well when monitoring large numbers(hundreds or thousands) of file descriptors.
- Advantage of epoll: it allows an application to efficiently monitor large numbers of file descriptors.
Disadvantage: it is available only on Linux.
- Signal-driven I/O also allows an application to efficiently monitor large numbers of file descriptors. But epoll provides advantages over signal-driven I/O:
- Avoid the complexities of dealing with signals.
- Ability to specify the kind of monitoring that we want to perform(e.g., ready for reading/writing).
- Ability to select either level-triggered or edge-triggered notification(Section 63.1.1).
- select() and poll() are more portable, while signal-driven I/O and epoll deliver better performance. For some applications, it is worthwhile writing an abstract software layer for monitoring file descriptor events. With such a layer, portable programs can employ epoll on Linux, and fall back to the use of select() or poll() on other systems.
- libevent is a software layer that provides an abstraction for monitoring file descriptor events. It can employ any of these techniques: select(), poll(), signal-driven I/O, or epoll, as well as the Solaris-specific /dev/poll interface or the BSD kqueue interface.
63.1.1 Level-Triggered and Edge-Triggered Notification
- Level-triggered notification: A file descriptor is considered to be ready if it is possible to perform an I/O system call without blocking.
- Edge-triggered notification: Notification is provided if there is I/O activity(e.g., new input) on a file descriptor since it was last monitored.
- epoll can employ both level-triggered notification(the default) and edge-triggered notification.
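As a hedged sketch of how the two models are selected with epoll(the details are in Section 63.4): level-triggered notification is what epoll gives by default, and including EPOLLET in the event mask requests edge-triggered notification. The helper name and the assumption that epfd came from epoll_create1() are only for illustration.
#include <sys/epoll.h>
/* Register fd with the epoll instance epfd for edge-triggered input notification. */
int add_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;      /* Interested in input; EPOLLET = edge-triggered; omit it for level-triggered */
    ev.data.fd = fd;                    /* Returned by epoll_wait() when fd becomes ready */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}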
How does the notification model affect the way we design a program?
- When we employ level-triggered notification, we can check the readiness of a file descriptor at any time. This means that when we determine that a file descriptor is ready(e.g., it has input available), we can perform I/O on the descriptor, and then repeat the monitoring operation to check if the descriptor is still ready(e.g., it still has more input available), in which case we can perform more I/O, and so on.
Because the level-triggered model allows us to repeat the I/O monitoring operation at any time, it is not necessary to perform as much I/O as possible(e.g., read as many bytes as possible) on the file descriptor(or even perform any I/O at all) each time we are notified that a file descriptor is ready.
- When we employ edge-triggered notification, we receive notification only when an I/O event occurs. We don’t receive any further notification until another I/O event occurs. Furthermore, when an I/O event is notified for a file descriptor, we usually don’t know how much I/O is possible(e.g., how many bytes are available for reading). Therefore, programs that employ edge-triggered notification are usually designed according to the following rules:
- After notification of an I/O event, the program should(at some point) perform as much I/O as possible(e.g., read as many bytes as possible) on the corresponding file descriptor. If the program fails to do this, then it might miss the opportunity to perform some I/O, because it would not be aware of the need to operate on the file descriptor until another I/O event occurred. This could lead to spurious data loss or blockages in a program.
We said “at some point” because sometimes it may not be desirable to perform all of the I/O immediately after we determine that the file descriptor is ready. The problem is that we may starve other file descriptors of attention if we perform a large amount of I/O on one file descriptor(Section 63.4.6).
- If the program employs a loop to perform as much I/O as possible on the file descriptor, and the descriptor is marked as blocking, then eventually an I/O system call will block when no more I/O is possible. For this reason, each monitored file descriptor is normally placed in nonblocking mode, and after notification of an I/O event, I/O operations are performed repeatedly until the relevant system call(e.g., read() or write()) fails with the error EAGAIN or EWOULDBLOCK(see the sketch below).
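The following is a minimal sketch of the “read until EAGAIN” pattern described above; it assumes fd has already been placed in nonblocking mode, and the process_input() mentioned in the comment stands in for whatever the application does with the data.
#include <errno.h>
#include <unistd.h>
void drain_input(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process_input(buf, n); -- consume the n bytes just read */
        } else if (n == 0) {
            break;                      /* End of file / peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                      /* No more input for now; wait for the next notification */
        } else if (errno == EINTR) {
            continue;                   /* Interrupted by a signal handler; retry the read() */
        } else {
            break;                      /* Some other error; real code would report it */
        }
    }
}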
63.1.2 Employing Nonblocking I/O with Alternative I/O Models
- Nonblocking I/O(the O_NONBLOCK flag) is often used in conjunction with the I/O models described in this chapter. Examples of why this can be useful are:
- As explained in the previous section, nonblocking I/O is usually employed in conjunction with I/O models that provide edge-triggered notification of I/O events.
- If multiple processes(or threads) are performing I/O on the same open file descriptions, then, from a particular process’s point of view, a descriptor’s readiness may change between the time the descriptor was notified as being ready and the time of the subsequent I/O call. Consequently, a blocking I/O call could block, thus preventing the process from monitoring other file descriptors.(This can occur for all of the I/O models that we describe in this chapter, regardless of whether they employ level-triggered or edge-triggered notification.)
- Even after a level-triggered API such as select() or poll() informs us that a file descriptor for a stream socket is ready for writing, if we write a large enough block of data in a single write() or send(), then the call will nevertheless block.
- In rare cases, level-triggered APIs such as select() and poll() can return spurious readiness notifications—they can falsely inform us that a file descriptor is ready. This could be caused by a kernel bug or be expected behavior in an uncommon scenario.
- Section 16.6 of UNP describes one example of spurious readiness notifications on BSD systems for a listening socket. If a client connects to a server’s listening socket and then resets the connection, a select() performed by the server between these two events will indicate the listening socket as being readable, but a subsequent accept() that is performed after the client’s reset will block.
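For the listening-socket scenario just described, the following minimal sketch shows why the nonblocking flag helps: if the listening socket is nonblocking, an accept() issued after a spurious readiness notification fails with EAGAIN/EWOULDBLOCK instead of blocking the whole server. It assumes listenfd was already created, marked nonblocking, and reported as readable.
#include <errno.h>
#include <sys/socket.h>
/* Try to accept a connection without risking a block. Returns the new descriptor, or -1 if the connection disappeared(or on another error). */
int try_accept(int listenfd)
{
    int connfd = accept(listenfd, NULL, NULL);
    if (connfd == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -1;                      /* The readiness was spurious; go back to monitoring */
    return connfd;
}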
63.2 I/O Multiplexing
- I/O multiplexing allows us to simultaneously monitor multiple file descriptors to see if I/O is possible on any of them. We can perform I/O multiplexing using select() or poll() to monitor file descriptors for regular files, terminals, pseudo-terminals, pipes, FIFOs, sockets, and some types of character devices.
63.2.1 The select() System Call
#include <sys/time.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
Return: number of ready file descriptors, 0 on timeout, -1 on error
- nfds, readfds, writefds, and exceptfds arguments specify the file descriptors that select() is to monitor.
- timeout can be used to set an upper limit on the time for which select() will block.
File descriptor sets
- readfds, writefds, and exceptfds are pointers to file descriptor sets that use the data type fd_set. These arguments are used as follows:
- readfds is the set of file descriptors to be tested to see if input is possible;
- writefds is the set of file descriptors to be tested to see if output is possible;
- exceptfds is the set of file descriptors to be tested to see if an exceptional condition has occurred. An exceptional condition occurs in just two circumstances on Linux:
-1- A state change occurs on a pseudo-terminal slave connected to a master that is in packet mode(Section 64.5).
-2- Out-of-band data is received on a stream socket(Section 61.13.1).
- fd_set data type is implemented as a bit mask. All manipulation of file descriptor sets is done via four macros: FD_ZERO(), FD_SET(), FD_CLR(), and FD_ISSET().
#include <sys/time.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>
void FD_ZERO(fd_set *fdset);
void FD_SET(int fd, fd_set *fdset);
void FD_CLR(int fd, fd_set *fdset);
int FD_ISSET(int fd, fd_set *fdset);
Return: true(1) if fd is in fdset, or false(0) otherwise
- FD_ZERO() initializes the set pointed to by fdset to be empty.
FD_SET() adds the file descriptor fd to the set pointed to by fdset.
FD_CLR() removes the file descriptor fd from the set pointed to by fdset.
FD_ISSET() returns true if the file descriptor fd is a member of the set pointed to by fdset.
- A file descriptor set has a maximum size, FD_SETSIZE, which is 1024 on Linux. If we want to change this limit, we must modify the definition in the glibc header files. If we need to monitor large numbers of descriptors, then using epoll is preferable to the use of select().
- readfds, writefds, and exceptfds are all value-result. Before the call to select(), the fd_set structures pointed to by these arguments must be initialized(using FD_ZERO() and FD_SET()) to contain the set of file descriptors of interest. select() modifies each of these structures and on return, they contain the set of file descriptors that are ready. The structures can then be examined using FD_ISSET().
- If we are not interested in a particular class of events, then the corresponding fd_set argument can be specified as NULL.
- nfds is set to one greater than the highest file descriptor number included in any of the three file descriptor sets. This argument allows select() to be efficient, since the kernel then knows not to check whether file descriptor numbers higher than this value are part of each file descriptor set.
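Putting these pieces together, the following is a minimal, runnable sketch that monitors standard input for up to 10 seconds(the timeout argument is described just below). Note that because readfds and timeout are modified by the call, a program that calls select() in a loop must re-initialize them before each call.
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
int main(void)
{
    fd_set readfds;
    struct timeval timeout;
    FD_ZERO(&readfds);                      /* Start with an empty set */
    FD_SET(STDIN_FILENO, &readfds);         /* Monitor standard input for reading */
    timeout.tv_sec = 10;
    timeout.tv_usec = 0;
    /* nfds is one greater than the highest descriptor number in any of the sets */
    int ready = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &timeout);
    if (ready == -1)
        perror("select");
    else if (ready == 0)
        printf("Timed out\n");
    else if (FD_ISSET(STDIN_FILENO, &readfds))
        printf("stdin is ready for reading\n");
    return 0;
}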
The timeout argument
- timeout can be specified as
- NULL: select() blocks indefinitely;
- A pointer to a timeval structure.
struct timeval
{
time_t tv_sec; /* Seconds */
suseconds_t tv_usec; /* Microseconds(long int) */
};
- If both fields of timeout are 0, then select() doesn’t block; it polls the specified file descriptors to see which ones are ready and returns immediately. Otherwise, timeout specifies an upper limit on the time for which select() is to wait.
- Although the timeval structure affords microsecond precision, the accuracy of the call is limited by the granularity of the software clock(Section 10.6).
- When timeout is NULL, or points to a structure containing nonzero fields, select() blocks until one of the following occurs:
- at least one of the file descriptors specified in readfds, writefds, or exceptfds becomes ready;
- the call is interrupted by a signal handler;
- the amount of time specified by timeout has passed.
- On Linux, if select() returns because one or more file descriptors became ready, and if timeout was non-NULL, then select() updates the structure to which timeout points to indicate how much time remained until the call would have timed out. Most other UNIX systems don’t modify this structure. Portable applications that employ select() within a loop should always ensure that the structure pointed to by timeout is initialized before each select(), and should ignore the information returned in the structure after the call.
- On Linux, if select() is interrupted by a signal handler(so that it fails with the error EINTR), then the structure is