poll and select

Latest recommended article published 2025-03-30 17:36:29

License: CC 4.0 BY-SA

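This article describes the poll mechanism in the Linux kernel in detail: the poll function, its data structures, how a driver implements poll together with its interrupt handler, and an analysis of the poll system call. Through poll_wait the calling process is registered on the driver's wait queues; when no file descriptor is ready it sleeps, and it is woken by an interrupt or another event. The kernel handles the call in do_sys_poll, which involves signal handling, memory allocation, and wait-queue management.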

I. Overview

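An application can use poll, select, or epoll. poll and select were implemented independently by two Unix groups: select was introduced in BSD Unix and poll in System V. epoll was added later to scale the polling facility to thousands of file descriptors. All three system calls are supported through the driver's poll method, whose prototype is: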
unsigned int (*poll) (struct file *filp, poll_table *wait);
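The method does two things:
1. It calls poll_wait on each wait queue that can signal a change in readiness. If no file descriptor is currently ready for I/O, the kernel makes the process wait on the wait queues of all file descriptors passed to the system call.
2. It returns a bit mask describing which operations can be performed immediately without blocking.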

II. Data structures used by poll

1. The poll_table structure

typedef struct poll_table_struct {
	poll_queue_proc _qproc;
	unsigned long _key;
} poll_table;

The most important member of this structure is the _qproc function pointer, which is called from poll_wait.

2. The poll_table_entry structure

struct poll_table_entry {
	struct file *filp;
	unsigned long key;
	wait_queue_t wait;
	wait_queue_head_t *wait_address;
};

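This structure holds a pointer to the open device file, the key copied from poll_table, a wait-queue entry, and a pointer to the wait-queue head that the entry is added to. It plays a central role in everything that follows.

3. The poll_wqueues structure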

struct poll_wqueues {
	poll_table pt;
	struct poll_table_page *table;
	struct task_struct *polling_task;
	int triggered;
	int error;
	int inline_index;
	struct poll_table_entry inline_entries[N_INLINE_POLL_ENTRIES];
};

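This structure embeds the poll_table and an array of the poll_table_entry structures described above; polling_task points to the task_struct of the polling process.

III. The driver's poll function and the interrupt mechanism

1. Implementing the poll function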

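static unsigned int xxx_poll(struct file *fp, poll_table *wait)
{
    unsigned int mask = 0;

    /* Register on button_wait; this call does not sleep by itself. */
    poll_wait(fp, &button_wait, wait);
    /* even_press is the interrupt event flag: 1 = event pending (leave sleep), 0 = nothing yet (go to sleep). */
    /* With no event the application gets 0 (timeout); after a wake-up it gets POLLIN | POLLRDNORM. */
    if (even_press)
        mask |= POLLIN | POLLRDNORM;
    return mask;
}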

2. The interrupt handler

static irqreturn_t  buttons_irq (int irq, void *dev_id) 
{
    ...
    even_press=1;
    wake_up_interruptible(&button_wait);
    ...
}

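3. How the driver mechanism works

When an interrupt occurs, for example because a button is pressed, the CPU immediately runs the interrupt handler. The handler sets the even_press variable and wakes every process waiting on the button_wait wait queue.
When xxx_poll runs, it first calls poll_wait, which in turn calls wait->_qproc. The main effect is to add the current process (the one that issued the poll system call) to the button_wait wait queue. It then checks whether even_press is set. If it is, the interrupt has already occurred, and the function returns the corresponding mask so that the application knows data is readable.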
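
The interplay between poll_wait and _qproc described above can be mimicked in user space. The following is only a minimal sketch: the sim_ types and functions are hypothetical stand-ins invented for illustration, not kernel code, and the wait queue is reduced to a counter. It shows that the driver-style poll function registers first and then reports readiness, and that registration is skipped once _qproc is NULL.

```c
#define _XOPEN_SOURCE 700
#include <stddef.h>
#include <poll.h>

/* Hypothetical user-space stand-ins for the kernel types; not kernel code. */
struct sim_wait_queue { int waiters; };   /* only counts registered waiters */
struct sim_poll_table;
typedef void (*sim_queue_proc)(struct sim_wait_queue *q,
                               struct sim_poll_table *pt);
struct sim_poll_table { sim_queue_proc _qproc; };

/* Plays the role of __pollwait: register the caller on the queue. */
static void sim_pollwait(struct sim_wait_queue *q, struct sim_poll_table *pt)
{
    (void)pt;
    q->waiters++;
}

/* Same shape as poll_wait: forward to _qproc only if one is installed. */
static void sim_poll_wait(struct sim_wait_queue *q, struct sim_poll_table *pt)
{
    if (pt && pt->_qproc && q)
        pt->_qproc(q, pt);
}

static struct sim_wait_queue button_wait;
static int even_press;   /* would be set by the interrupt handler */

/* Same shape as xxx_poll: register first, then report readiness. */
static unsigned int sim_xxx_poll(struct sim_poll_table *pt)
{
    unsigned int mask = 0;

    sim_poll_wait(&button_wait, pt);
    if (even_press)
        mask |= POLLIN | POLLRDNORM;
    return mask;
}
```

Calling sim_xxx_poll with _qproc set to sim_pollwait while even_press is 0 returns 0 and leaves one waiter registered; calling it again with _qproc cleared and even_press set returns POLLIN | POLLRDNORM without registering a second time.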

IV. The poll system call and a closer look at the kernel side

1. The poll function as seen by applications

Prototype of the poll system call

 int poll(struct pollfd fds[], nfds_t nfds, int timeout);
 
 struct pollfd {
 	int fd;
 	short events;
 	short revents;
 };

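In this structure, fd is the file descriptor, events is the set of events to watch for, and revents holds the events that were actually detected. When the state of a descriptor changes, its revents is non-zero.

Parameters of the poll system call

fds: the array of descriptors whose state is to be checked. Unlike select, which overwrites its descriptor sets with the result so that they must be rebuilt before every call, poll leaves the array intact and only updates revents for the entries whose state changed, which is more convenient to work with;
nfds: the total number of struct pollfd elements in the fds array;
timeout: how long poll may block, in milliseconds. A negative value blocks indefinitely and 0 returns immediately.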
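
To make the contrast with select concrete, here is a minimal sketch of a readiness check built on select. The helper name select_readable is made up for this example. Note how the fd_set and the timeout are rebuilt inside the function on every call, because select overwrites the set (and, on Linux, the timeout).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>   /* pipe(), write() for trying the helper out */

/* Returns 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
static int select_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ret;

    /* select() can only handle descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    /* Both arguments may be modified by select(), so build them fresh. */
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ret = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ret < 0)
        return -1;
    return (ret > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}
```

A quick way to try it is to create a pipe, check the read end (it should time out), write one byte to the write end, and check again.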

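Return value of the poll system call

Greater than 0: the state of at least one descriptor in fds has changed; it is readable, writable, or in an error state. The value is the number of descriptors whose state changed. The caller can then walk the fds array, find the entries whose revents is non-zero, and check which events occurred before reading the data.
Equal to 0: no descriptor changed state and the call timed out.
Less than 0: an error occurred, and errno holds the error code.

2. How the kernel implements poll

The system call behind poll is sys_poll, but fs/select.c only contains the following form: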
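
The parameters and return values above can be exercised with a small helper. This is a minimal sketch; poll_one is a name invented for this example, and it watches a single descriptor only.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <unistd.h>   /* pipe(), write() for trying the helper out */

/*
 * Poll a single descriptor for the given events.
 * Returns what poll() returns: >0 if the descriptor changed state,
 * 0 on timeout, <0 on error (errno is set by poll()).
 * *revents receives the reported events, or 0 if nothing was reported.
 */
static int poll_one(int fd, short events, int timeout_ms, short *revents)
{
    struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
    int ret = poll(&pfd, 1, timeout_ms);

    *revents = (ret > 0) ? pfd.revents : 0;
    return ret;
}
```

For example, polling the read end of an empty pipe with a timeout of 0 returns 0; after one byte is written to the write end, the same call returns 1 with POLLIN set in revents.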

 SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
 		int, timeout_msecs)
 {
 	struct timespec end_time, *to = NULL;
  	int ret;

 	if (timeout_msecs >= 0) {
 		to = &end_time;
 		poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
 			NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
 	}

 	ret = do_sys_poll(ufds, nfds, to);

 	if (ret == -EINTR) {
 		struct restart_block *restart_block;

 		restart_block = &current_thread_info()->restart_block;
 		restart_block->fn = do_restart_poll;
 		restart_block->poll.ufds = ufds;
 		restart_block->poll.nfds = nfds;
 
 		if (timeout_msecs >= 0) {
 			restart_block->poll.tv_sec = end_time.tv_sec;
 			restart_block->poll.tv_nsec = end_time.tv_nsec;
 			restart_block->poll.has_timeout = 1;
 		} else
 			restart_block->poll.has_timeout = 0;
 
 		ret = -ERESTART_RESTARTBLOCK;
 	}
 	return ret;
 }

SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds, int, timeout_msecs) is not a plain function name but a macro; its definition can be found in include/linux/syscalls.h. After several layers of expansion it is equivalent to:

asmlinkage long sys_poll(struct pollfd __user *ufds, unsigned int nfds, int timeout_msecs)

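In other words, this is the sys_poll function itself.
Its main tasks are:

1. Convert the timeout. The millisecond value is split into seconds and nanoseconds and handed to poll_select_set_timeout, which stores the resulting end time in a struct timespec.
2. Call do_sys_poll, which does most of the work.
3. Deal with pending signals, depending on what do_sys_poll returns. During the call the kernel checks whether the current process has an unhandled signal. If so, do_sys_poll returns -EINTR so that the signal can be handled; sys_poll then saves the arguments and the end time in restart_block and returns -ERESTART_RESTARTBLOCK, which allows the call to be restarted through do_restart_poll.

The do_sys_poll() function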
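
The timeout arithmetic in step 1 is easy to check in isolation. The sketch below repeats the same split in user space; msecs_to_timespec is a hypothetical helper name, and the two constants are redefined locally with the values their names imply. It only covers the split into seconds and nanoseconds, not what poll_select_set_timeout does with the result afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define MSEC_PER_SEC  1000L
#define NSEC_PER_MSEC 1000000L

/* Split a non-negative millisecond timeout into seconds and nanoseconds,
 * using the same expressions that sys_poll passes on. */
static struct timespec msecs_to_timespec(int timeout_msecs)
{
    struct timespec ts;

    assert(timeout_msecs >= 0);   /* negative means "no timeout" in sys_poll */
    ts.tv_sec = timeout_msecs / MSEC_PER_SEC;
    ts.tv_nsec = NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC);
    return ts;
}
```

With a timeout of 2500 ms this yields tv_sec = 2 and tv_nsec = 500000000.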

 int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 	struct timespec *end_time)
 {
 	struct poll_wqueues table;
  	int err = -EFAULT, fdcount, len, size;
 	/* Allocate small arguments on the stack to save memory and be
 	   faster - use long to make sure the buffer is aligned properly
 	   on 64 bit archs to avoid unaligned access */
 	long stack_pps[POLL_STACK_ALLOC/sizeof(long)];
 	struct poll_list *const head = (struct poll_list *)stack_pps;
  	struct poll_list *walk = head;
  	unsigned long todo = nfds;
 
 	if (nfds > rlimit(RLIMIT_NOFILE))
 		return -EINVAL;

 	len = min_t(unsigned int, nfds, N_STACK_PPS);
 	for (;;) {
 		walk->next = NULL;
 		walk->len = len;
 		if (!len)
 			break;

 		if (copy_from_user(walk->entries, ufds + nfds-todo,
 					sizeof(struct pollfd) * walk->len))
 			goto out_fds;

 		todo -= walk->len;
 		if (!todo)
 			break;

 		len = min(todo, POLLFD_PER_PAGE);
 		size = sizeof(struct poll_list) + sizeof(struct pollfd) * len;
 		walk = walk->next = kmalloc(size, GFP_KERNEL);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
 		}
 	}

 	poll_initwait(&table);
 	fdcount = do_poll(nfds, head, &table, end_time);
 	poll_freewait(&table);

 	for (walk = head; walk; walk = walk->next) {
 		struct pollfd *fds = walk->entries;
 		int j;

 		for (j = 0; j < walk->len; j++, ufds++)
 			if (__put_user(fds[j].revents, &ufds->revents))
 				goto out_fds;
 	}

 	err = fdcount;
 out_fds:
 	walk = head->next;
 	while (walk) {
 		struct poll_list *pos = walk;
 		walk = walk->next;
 		kfree(pos);
 	}

 	return err;
 }

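The function is fairly long, so the notes below follow the line numbers of the listing above:
Line 9 (the stack_pps declaration): allocates an array of POLL_STACK_ALLOC / sizeof(long) longs, that is POLL_STACK_ALLOC bytes in total, where POLL_STACK_ALLOC is 256. The array lives on the stack: poll is designed on the assumption that the number of monitored descriptors is usually small, so little memory is needed, and stack memory is faster to obtain.
Line 14 (the rlimit check): rlimit(RLIMIT_NOFILE) returns the calling process's limit on the number of open file descriptors. RLIMIT_NOFILE itself is only the index of that resource (7 in the generic Linux headers), not the limit. If nfds exceeds the limit, the call fails with -EINVAL.
Line 17 (the min_t call): N_STACK_PPS is (sizeof(stack_pps) - sizeof(struct poll_list)) / sizeof(struct pollfd), that is, how many pollfd entries fit into the stack buffer just allocated. As shown earlier, one pollfd occupies 8 bytes. If the formula looks puzzling, look at the poll_list structure below. Its struct pollfd entries[0] member takes no space in the structure; it is a C idiom (a zero-length array) that refers to the memory right after the structure. Together with the figure that follows, this explains the formula: the header is subtracted first, and the rest is divided by the size of one pollfd. This line therefore checks whether the pre-allocated stack buffer can hold all the pollfd entries passed down from user space.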

 struct poll_list {
 	struct poll_list *next;
 	int len;
 	struct pollfd entries[0];
 };

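The poll_list structure
Line 19: at the start of the for loop, next is set to NULL, so at this point there is only one poll_list.
Line 20: len is set to the number of pollfd entries stored in this block of memory.
Line 24: copies len pollfd entries from user space into kernel space.
Line 29: after the copy and todo -= walk->len, a non-zero todo means the stack buffer was not large enough for all the pollfd entries from user space.
Line 32: POLLFD_PER_PAGE is (PAGE_SIZE - sizeof(struct poll_list)) / sizeof(struct pollfd), the number of pollfd entries that fit into one page. A page is typically 4 KB and a pollfd is 8 bytes, which gives the count by the formula above. This line checks whether one page is enough for all the remaining pollfd entries; if not, the loop goes around again and allocates more.
Lines 33 and 34: the required size is computed and kmalloc allocates that much kernel memory dynamically. The result is assigned to next, so the newly allocated blocks are managed as a linked list, as shown in the figure below. The result is a list whose head is the stack_pps buffer and whose following nodes are blocks of up to one page each. The for loop exits once todo reaches 0.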
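[Figure: the poll_list linked list, with the on-stack stack_pps buffer as its head and the kmalloc'd blocks linked after it]

The poll_initwait() function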
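
The chunking arithmetic can be reproduced in user space. The following sketch assumes the values quoted above (a 256-byte stack buffer and 4 KB pages) and uses hypothetical SIM_ and sim_ names; the structure copies the layout of poll_list but uses a C99 flexible array member in place of entries[0]. It counts how many list nodes a given nfds would need, stack buffer included.

```c
#include <stddef.h>
#include <poll.h>

#define SIM_POLL_STACK_ALLOC 256    /* value quoted in the text */
#define SIM_PAGE_SIZE        4096   /* assumed 4 KB pages */

/* Same layout as poll_list, with a flexible array member. */
struct sim_poll_list {
    struct sim_poll_list *next;
    int len;
    struct pollfd entries[];
};

#define SIM_N_STACK_PPS \
    ((SIM_POLL_STACK_ALLOC - sizeof(struct sim_poll_list)) / sizeof(struct pollfd))
#define SIM_POLLFD_PER_PAGE \
    ((SIM_PAGE_SIZE - sizeof(struct sim_poll_list)) / sizeof(struct pollfd))

/* Number of list nodes (on-stack head included) needed for nfds entries. */
static size_t sim_count_nodes(size_t nfds)
{
    size_t nodes = 1;   /* the on-stack head always exists */
    size_t todo;

    if (nfds <= SIM_N_STACK_PPS)
        return nodes;

    todo = nfds - SIM_N_STACK_PPS;
    while (todo > 0) {
        /* Each further node holds at most one page worth of entries. */
        size_t len = todo < SIM_POLLFD_PER_PAGE ? todo : SIM_POLLFD_PER_PAGE;
        todo -= len;
        nodes++;
    }
    return nodes;
}
```

On a typical 64-bit build the header is 16 bytes, so the two capacities should come out as 30 and 510 entries; on other ABIs the numbers differ slightly, which is why the sketch computes them with sizeof.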

void poll_initwait(struct poll_wqueues *pwq)
{
 	init_poll_funcptr(&pwq->pt, __pollwait);
 	pwq->polling_task = current;
 	pwq->triggered = 0;
 	pwq->error = 0;
 	pwq->table = NULL;
 	pwq->inline_index = 0;
}

static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 {
 	pt->_qproc = qproc;
 	pt->_key   = ~0UL; /* all events enabled */
 }

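1. poll_initwait receives the poll_wqueues and stores the __pollwait function in the _qproc function pointer of its poll_table member.
2. It saves the current process in polling_task of the poll_wqueues and then initializes the remaining members.

The do_poll() function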

 static int do_poll(unsigned int nfds,  struct poll_list *list,
 	   struct poll_wqueues *wait, struct timespec *end_time)
 {
 	poll_table* pt = &wait->pt;
 	ktime_t expire, *to = NULL;
 	int timed_out = 0, count = 0;
 	unsigned long slack = 0;

 	/* Optimise the no-wait case */
 	if (end_time && !end_time->tv_sec && !end_time->tv_nsec) {
 		pt->_qproc = NULL;
 		timed_out = 1;
 	}

 	if (end_time && !timed_out)
 		slack = select_estimate_accuracy(end_time);

 	for (; ; ) {
 		struct poll_list *walk;

 		for (walk = list; walk != NULL; walk = walk->next) {
 			struct pollfd * pfd, * pfd_end;

 			pfd = walk->entries;
 			pfd_end = pfd + walk->len;
 			for (; pfd != pfd_end; pfd++) {
 				/*
 				 * Fish for events. If we found one, record it
 				 * and kill poll_table->_qproc, so we don't
 				 * needlessly register any other waiters after
 				 * this. They'll get immediately deregistered
 				 * when we break out and return.
 				 */
 				if (do_pollfd(pfd, pt)) {
 					count++;
 					pt->_qproc = NULL;
 				}
 			}
 		}
 		/*
 		 * All waiters have already been registered, so don't provide
 		 * a poll_table->_qproc to them on the next loop iteration.
 		 */
 		pt->_qproc = NULL;
 		if (!count) {
 			count = wait->error;
 			if (signal_pending(current))
 				count = -EINTR;
 		}
 		if (count || timed_out)
 			break;

 		/*
 		 * If this is the first loop and we have a timeout
 		 * given, then we convert to ktime_t and set the to
 		 * pointer to the expiry value.
 		 */
 		if (end_time && !to) {
 			expire = timespec_to_ktime(*end_time);
 			to = &expire;
 		}
 
 		if (!poll_schedule_timeout(wait, TASK_INTERRUPTIBLE, to, slack))
 			timed_out = 1;
 	}
 	return count;
 }

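Again the code is long, so the notes follow the line numbers of the listing:
Line 10: checks for the no-wait case. If end_time is zero, _qproc is cleared so that no waiters are registered, and timed_out is set to 1; the for loop below then exits after a single pass and the call returns to user space.
Line 18: an endless for loop.
Line 21: walk starts at the list that was passed in. As analyzed above, this for loop traverses the whole linked list.
Lines 24 and 25: set the start and end pointers for the next for loop.
Line 26: a for loop over every pollfd in the current list node. If this is hard to picture, the two figures above should help.
Line 34: calls do_pollfd(), shown below: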

 static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 {
 	unsigned int mask;
 	int fd;
 	mask = 0;
 	fd = pollfd->fd;
 	if (fd >= 0) {
 		struct fd f = fdget(fd);
 		mask = POLLNVAL;
 		if (f.file) {
 			mask = DEFAULT_POLLMASK;
 			if (f.file->f_op && f.file->f_op->poll) {
 				pwait->_key = pollfd->events|POLLERR|POLLHUP;
 				mask = f.file->f_op->poll(f.file, pwait);
 			}
 			/* Mask out unneeded events. */
 			mask &= pollfd->events | POLLERR | POLLHUP;
 			fdput(f);
 		}
 	}
 	pollfd->revents = mask;
 	return mask;
 }

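This code is not hard to follow. It takes fd from the pollfd, looks up the file pointer (the open device file), and then calls the driver's poll method, which was described earlier. The mask returned by the driver is then ANDed with pollfd->events | POLLERR | POLLHUP. If it contains events the user asked for in pollfd->events, the mask stays non-zero; otherwise, even though the driver reported something, the mask passed up to the caller is 0. POLLERR and POLLHUP are always kept. Finally the filtered mask is stored in pollfd->revents, which is later returned to the user. The driver side of the poll mechanism is not repeated here; see the explanation above. Note, however, that the driver's poll method calls poll_wait, and poll_wait calls the _qproc function of the poll_table. The careful reader will notice that the poll_table passed to the driver is the one analyzed earlier: it is a member of poll_wqueues, and its _qproc function pointer points to __pollwait.

The __pollwait() function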
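
The masking step can be tried out on its own. This is a small sketch; filter_revents is a name made up for this example, and it reproduces only the final AND from do_pollfd(), not the file lookup or the driver call.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>

/* Keep only the events the caller requested, plus POLLERR and POLLHUP,
 * which are reported whether or not they were requested. */
static short filter_revents(short driver_mask, short requested)
{
    return (short)(driver_mask & (requested | POLLERR | POLLHUP));
}
```

A driver mask of POLLOUT filtered against a request for POLLIN gives 0, while POLLHUP passes through even when only POLLIN was requested.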
```
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 						poll_table *p)
 {
 		struct poll_wqueues *pwq = container_of(p, struct poll_wqueues, pt);
 		struct poll_table_entry *entry = poll_get_entry(pwq);
 		if (!entry)
 			return;
 		entry->filp = get_file(filp);
 		entry->wait_address = wait_address;
 		entry->key = p->_key;
 		init_waitqueue_func_entry(&entry->wait, pollwake);
 		entry->wait.private = pwq;
 		add_wait_queue(wait_address, &entry->wait);
 }
```
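This code is not hard to follow either. The container_of macro recovers the enclosing poll_wqueues from the poll_table that was passed in, and poll_get_entry() hands out a poll_table_entry from that poll_wqueues. The key step is adding entry->wait to the wait_address wait queue, which the driver initialized and passed in. entry->wait.private points to the poll_wqueues, whose polling_task was set to the current process in poll_initwait(). In effect the process that issued the system call is attached to the wait queue initialized by the driver, and that wait queue is woken by the driver's interrupt handler.
The poll_get_entry() function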
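
The container_of step is the least obvious part, so here is a user-space sketch of the idea. The macro below is a simplified version written for this example (the kernel's own macro adds type checking), and the demo_ structures are hypothetical stand-ins for poll_table and poll_wqueues.

```c
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_table { unsigned long _key; };

struct demo_wqueues {
    int triggered;
    struct demo_table pt;   /* embedded, like pt in poll_wqueues */
    int error;
};

/* Given a pointer to the embedded table, recover the enclosing wqueues. */
static struct demo_wqueues *demo_table_to_wqueues(struct demo_table *p)
{
    return demo_container_of(p, struct demo_wqueues, pt);
}
```

Passing the address of the pt member of a demo_wqueues object returns the address of that object, which is how __pollwait gets from the poll_table it is handed back to the poll_wqueues that contains it.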
```
 static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 {
 	struct poll_table_page *table = p->table;

 	if (p->inline_index < N_INLINE_POLL_ENTRIES)
 		return p->inline_entries + p->inline_index++;

 	if (!table || POLL_TABLE_FULL(table)) {
 		struct poll_table_page *new_table;

 		new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
 		if (!new_table) {
 			p->error = -ENOMEM;
 			return NULL;
 		}
 		new_table->entry = new_table->entries;
 		new_table->next = table;
 		p->table = new_table;
 		table = new_table;
 	}
 	
 	return table->entry++;
 }
```
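The driver's poll method, and with it poll_wait, is entered many times: every pollfd goes through it on the first pass, and after poll is woken up every pollfd goes through it again. poll_get_entry() itself, however, only runs while _qproc is set, which is the case during the first pass; do_poll() clears _qproc afterwards, and poll_wait does nothing when _qproc is NULL. Each registration takes a fresh entry, first from inline_entries and then from table->entry, and a new page is allocated only when there is no page yet or the current one is full. Because every select or poll call has to register on and scan all descriptors again, both system calls become much less efficient when the number of pollfd entries is large, which is why epoll was introduced.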
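
The allocation pattern of poll_get_entry(), inline slots first and overflow pages afterwards, can be sketched in user space. Everything named demo_ or DEMO_ here is hypothetical, the two capacities are arbitrary small numbers chosen for illustration, and malloc stands in for __get_free_page.

```c
#include <stdlib.h>

#define DEMO_INLINE_ENTRIES 4   /* arbitrary; not the kernel's value */
#define DEMO_PAGE_ENTRIES   8   /* arbitrary entries per overflow page */

struct demo_entry { int id; };

struct demo_page {
    struct demo_page *next;
    struct demo_entry *entry;   /* next free slot in this page */
    struct demo_entry entries[DEMO_PAGE_ENTRIES];
};

struct demo_pool {
    struct demo_page *table;    /* newest overflow page, or NULL */
    int inline_index;
    int pages;                  /* overflow pages allocated so far */
    struct demo_entry inline_entries[DEMO_INLINE_ENTRIES];
};

/* Hand out the next free entry: inline slots first, then overflow pages. */
static struct demo_entry *demo_get_entry(struct demo_pool *p)
{
    struct demo_page *table = p->table;

    if (p->inline_index < DEMO_INLINE_ENTRIES)
        return p->inline_entries + p->inline_index++;

    /* Allocate a page only if there is none yet or the current one is full. */
    if (!table || table->entry == table->entries + DEMO_PAGE_ENTRIES) {
        struct demo_page *new_table = malloc(sizeof(*new_table));

        if (!new_table)
            return NULL;
        new_table->entry = new_table->entries;
        new_table->next = table;
        p->table = new_table;
        p->pages++;
        table = new_table;
    }
    return table->entry++;
}

/* Release every overflow page; the inline entries need no freeing. */
static void demo_free_pool(struct demo_pool *p)
{
    while (p->table) {
        struct demo_page *old = p->table;

        p->table = old->next;
        free(old);
    }
}
```

With these numbers the first four requests are served from the inline array, the fifth triggers the first page allocation, and the thirteenth triggers the second.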

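Now back to do_poll(). If in the first pass of the inner for loops no do_pollfd() call returns a mask that the user asked for, count is not incremented. If the timeout has not expired either, do_poll() calls poll_schedule_timeout(), which sets the current process to TASK_INTERRUPTIBLE and really puts it to sleep. Apart from signals, the process is woken in one of two cases: a device driver's interrupt handler wakes the wait queue, or the timeout expires. Once it gets the CPU back, the process continues after the sleep inside poll_schedule_timeout(), which sets its state back to TASK_RUNNING; its wait-queue entries stay registered until poll_freewait() removes them. Because of the for (;;) loop, every driver's poll method is then called once more. This time a driver may return a non-zero mask, because the interrupt means data has arrived; count is then incremented and the loop breaks. It is also possible that count is still 0 and the loop ends because of the timeout.

Finally, back to do_sys_poll(). Once do_poll() has returned, what remains is copying the results and releasing resources. First comes poll_freewait():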
```
 static void free_poll_entry(struct poll_table_entry *entry)
 {
 	remove_wait_queue(entry->wait_address, &entry->wait);
 	fput(entry->filp);
 }

 void poll_freewait(struct poll_wqueues *pwq)
 {
 	struct poll_table_page * p = pwq->table;
 	int i;
 	for (i = 0; i < pwq->inline_index; i++)
 		free_poll_entry(pwq->inline_entries + i);
 	while (p) {
 		struct poll_table_entry * entry;
 		struct poll_table_page *old;

 		entry = p->entry;
 		do {
 			entry--;
 			free_poll_entry(entry);
 		} while (entry > p->entries);
 		old = p;
 		p = p->next;
 		free_page((unsigned long) old);
 	}
 }
```
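Recall that the wait entries registered during the first pass of do_poll() are still on the drivers' wait queues when it returns. Calling remove_wait_queue(entry->wait_address, &entry->wait) for each entry takes the current process off all of those wait queues, and fput() drops the file reference taken in __pollwait(). After that, the pages allocated for poll_table_page are freed.

Next, the results are copied to user space: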
```
 for (walk = head; walk; walk = walk->next) {
 	struct pollfd *fds = walk->entries;
 	int j;

 	for (j = 0; j < walk->len; j++, ufds++)
 		if (__put_user(fds[j].revents, &ufds->revents))
 			goto out_fds;
 }
```
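There is not much to add here: it is the same pair of nested for loops, and it returns every revents value to the user.

Finally, the dynamically allocated pollfd blocks are freed: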
```
 walk = head->next;
 	while (walk) {
 		struct poll_list *pos = walk;
 		walk = walk->next;
 		kfree(pos);
 	}
```
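The first poll_list lives on the stack, so the loop starts directly at head->next; all the remaining nodes were allocated with kmalloc and are released with kfree one by one. This completes the journey of the poll system call through the kernel, and control returns to user space.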