The I/O Model in HAProxy: A Deep Dive into sepoll

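This article walks through sepoll, one of the I/O models in HAProxy. It is built on epoll and should in theory be more efficient because it issues fewer system calls. sepoll stands for speculative I/O and speeds up reads and writes on newly created connections. The article discusses the pros and cons of sepoll, in particular how it deals with starvation, and then goes through the sepoll processing flow and the poller initialization in detail.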

Data structures

[include/types/fd.h]
/* info about one given fd */
struct fdtab {
	struct {
		int (*f)(int fd);          /* read/write function */
		struct buffer *b;         /* read/write buffer */
	} cb[DIR_SIZE];
	void *owner;               /* the session (or proxy) associated with this fd */
	struct {                    /* used by pollers which support speculative polling */
		unsigned char e;        /* read and write events status. 4 bits*/
		unsigned int s1;         /* Position in spec list+1. 0=not in list. */
	} spec;
	unsigned short flags;         /* various flags precising the exact status of this fd */
	unsigned char state;          /* the state of this fd */
	unsigned char ev;            /* event seen in return of poll() : FD_POLL_* */
};

struct poller {
	void   *private;                 /* any private data for the poller */
	int REGPRM2 (*is_set)(const int fd, int dir); /* check if <fd> is being polled for dir <dir> */
	int  REGPRM2    (*set)(const int fd, int dir);    /* set   polling on <fd> for <dir> */
	int  REGPRM2    (*clr)(const int fd, int dir);       /* clear polling on <fd> for <dir> */
	int  REGPRM2 (*cond_s)(const int fd, int dir); /* set   polling on <fd> for <dir> if unset */
	int  REGPRM2 (*cond_c)(const int fd, int dir); /* clear polling on <fd> for <dir> if set */
	void REGPRM1    (*rem)(const int fd);      /* remove any polling on <fd> */
	void REGPRM1    (*clo)(const int fd);      /* mark <fd> as closed */
	void REGPRM2   (*poll)(struct poller *p, int exp);   /* the poller itself */
	int  REGPRM1   (*init)(struct poller *p);            /* poller initialization */
	void REGPRM1   (*term)(struct poller *p);            /* termination of this poller */
	int  REGPRM1   (*test)(struct poller *p);            /* pre-init check of the poller */
	int  REGPRM1   (*fork)(struct poller *p);            /* post-fork re-opening */
	const char   *name;                                  /* poller name */
	int    pref;                           /* try pollers with higher preference first */
};

[src/fd.c]
struct fdtab *fdtab = NULL;     /* array of all the file descriptors */
struct fdinfo *fdinfo = NULL;   /* less-often used infos for file descriptors */
int maxfd;                      /* # of the highest fd + 1 */
int totalconn;                  /* total # of terminated sessions */
int actconn;                    /* # of active sessions */

struct poller pollers[MAX_POLLERS];
struct poller cur_poller;
int nbpollers = 0;

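When you first look at the fdtab and poller structures and then read ev_epoll.c, you may wonder why they are laid out this way. Reading ev_sepoll.c first clears up most of those questions.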

sepoll

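In HAProxy, the author pushes the epoll model one step further into sepoll (I do not know whether anyone proposed or used this approach before). Although it is built on epoll, this model should in theory be more efficient overall than plain epoll, because it avoids many of the expensive epoll-related system calls.

In the code comments the author calls sepoll speculative I/O. The idea is that a descriptor just returned by accept() can usually be read right away, and a descriptor that has just completed connect() is usually writable. It also helps on connections that are already transferring data: if one side of a connection is already waiting in the epoll queue, the other side still has to react by either sending or receiving data, depending on the fill level of the (read/write) buffers.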
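
To make the idea concrete, here is a minimal sketch of a speculative read. It is my own illustration and not code from HAProxy: speculative_read, make_nonblocking_pair and the spec_result codes are hypothetical names, and the sketch assumes non-blocking stream sockets. The read is attempted right away, and only a "would block" result means that the fd needs to be registered with epoll.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch (not HAProxy names). */
enum spec_result {
	SPEC_DONE,        /* data was read, no polling was needed */
	SPEC_WOULD_BLOCK, /* nothing to read yet: the fd has to be polled */
	SPEC_CLOSED       /* peer closed the connection or a hard error occurred */
};

/* Create a connected pair of non-blocking stream sockets. Returns 0 on success. */
int make_nonblocking_pair(int sv[2])
{
	int i, fl;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	for (i = 0; i < 2; i++) {
		fl = fcntl(sv[i], F_GETFL, 0);
		if (fl < 0 || fcntl(sv[i], F_SETFL, fl | O_NONBLOCK) < 0) {
			close(sv[0]);
			close(sv[1]);
			return -1;
		}
	}
	return 0;
}

/* Try to read immediately instead of registering the fd in epoll first.
 * The number of bytes read is stored in <got>.
 */
enum spec_result speculative_read(int fd, char *buf, size_t len, ssize_t *got)
{
	ssize_t n;

	assert(buf != NULL && got != NULL);
	n = read(fd, buf, len);
	*got = 0;
	if (n > 0) {
		*got = n;
		return SPEC_DONE;
	}
	if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return SPEC_WOULD_BLOCK;
	return SPEC_CLOSED;
}
```

With a socket pair, the first call reports SPEC_WOULD_BLOCK because nothing has been written yet. Once the peer has written two bytes, the same call returns SPEC_DONE, and no epoll_ctl() call was needed in between.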

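The author also describes the downside of sepoll: it can starve the polled events, that is, the events waiting in the epoll queue. (My reading of starvation here is that resources are available but are not given to those who need them. poll exists to handle many descriptors at once. With only a few descriptors it gains very little, since it is a system call itself, so the poll queue has to hold a reasonable number of fds or it is starved.) According to the author, experiments show that when the epoll queue starves, the load shifts to speculative I/O. Every read or write attempt then fails, a vicious circle sets in, and performance drops badly (the spec list holds many descriptors, and polling them over and over is bound to hurt). The fix is to reduce the number of events epoll handles per call. The same cannot be done for the spec list, because experiments show that 2/3 of the fds in the spec list are new and only 1/3 are old.

The author bases this on two facts. First, an fd that is in the spec list must not also be registered in epoll. Second, even under very heavy load we basically never stream in both directions on the same fd at the same time. I understand the second fact as follows: a client sends its whole request before the backend answers, and a server receives the whole request before it sends back the response.

The author says the first fact implies that during starvation no more than half of the fds are in the poll list. Otherwise the spec list would hold fewer fds than the poll list, which means there is no starvation. The second fact implies that we are only interested in half of the maximum number of descriptor events (each descriptor is either reading or writing).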

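Here is one way to see why reducing the number of events the poll list handles per call relieves poll list starvation. Assume every fd is destroyed after one read and one write. By the second fact, the read phase does not shrink the poll list, so it has little effect. The write phase is different: once both the read and the write are done, a large number of fds may be removed at once, and if new fds do not arrive fast enough the list starves. Because the first fact caps how many events can be handled per call, fewer fds are torn down after each read/write round. The fds in the poll list are in effect split into two batches that leave the poll list at different times, so fewer are removed at once and newly arriving fds should be able to keep up.

What about starvation caused by having few fds in the first place, so that the poll list gets very few of them? In that case the spec list is small too, so the performance impact is minor and can basically be ignored.

Finally, the author states that if we can guarantee that the poll list has maxsock/2/2 events during peak load, which means allocating maxsock/4 events to the poll list, starvation will have no effect. The author does not spell out where maxsock/2/2 comes from. Judging from the explanation above, the first division by 2 should mean that a poll list holding at least maxsock/2 fds is not affected by starvation. I am not yet sure about the second division by 2. If it comes from the second fact, it does not seem quite right either: a socket carries two events, so if only one event per socket is handled at a time, the number of events is the same as the number of sockets.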

Now let us look at how sepoll processes events.

[src/ev_sepoll.c]
#define FD_EV_IN_SL	1
#define FD_EV_IN_PL	4

#define FD_EV_IDLE	0
#define FD_EV_SPEC	(FD_EV_IN_SL)
#define FD_EV_WAIT	(FD_EV_IN_PL)
#define FD_EV_STOP	(FD_EV_IN_SL|FD_EV_IN_PL)

/* Those match any of R or W for Spec list or Poll list */
#define FD_EV_RW_SL	(FD_EV_IN_SL | (FD_EV_IN_SL << 1))
#define FD_EV_RW_PL	(FD_EV_IN_PL | (FD_EV_IN_PL << 1))
#define FD_EV_MASK_DIR	(FD_EV_IN_SL|FD_EV_IN_PL)

#define FD_EV_IDLE_R	0
#define FD_EV_SPEC_R	(FD_EV_IN_SL)
#define FD_EV_WAIT_R	(FD_EV_IN_PL)
#define FD_EV_STOP_R	(FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_MASK_R	(FD_EV_IN_SL|FD_EV_IN_PL)

#define FD_EV_IDLE_W	(FD_EV_IDLE_R << 1)
#define FD_EV_SPEC_W	(FD_EV_SPEC_R << 1)
#define FD_EV_WAIT_W	(FD_EV_WAIT_R << 1)
#define FD_EV_STOP_W	(FD_EV_STOP_R << 1)
#define FD_EV_MASK_W	(FD_EV_MASK_R << 1)

#define FD_EV_MASK	(FD_EV_MASK_W | FD_EV_MASK_R)

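These macros show that the read and write states for the spec list occupy the two lowest bits, while the read and write states for the poll list occupy the third and fourth bits.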
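
The bit layout is easier to see with a small decoder. The following is a sketch of mine, not HAProxy code: the macros are copied from the excerpt above, while dir_state, state_name and the values DIR_RD = 0 and DIR_WR = 1 are assumptions made for the illustration. Shifting spec.e right by the direction index lines up the write bits with the read masks.

```c
#include <assert.h>

/* Macros copied from the ev_sepoll.c excerpt. */
#define FD_EV_IN_SL	1
#define FD_EV_IN_PL	4
#define FD_EV_IDLE	0
#define FD_EV_SPEC	(FD_EV_IN_SL)
#define FD_EV_WAIT	(FD_EV_IN_PL)
#define FD_EV_STOP	(FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_MASK_DIR	(FD_EV_IN_SL|FD_EV_IN_PL)

/* Direction indexes: 0 for read and 1 for write is an assumption of this sketch. */
enum { DIR_RD = 0, DIR_WR = 1 };

/* Extract the state of one direction (IDLE, SPEC, WAIT or STOP) from spec.e. */
unsigned int dir_state(unsigned char e, int dir)
{
	assert(dir == DIR_RD || dir == DIR_WR);
	return ((unsigned int)e >> dir) & FD_EV_MASK_DIR;
}

/* Hypothetical helper: readable name of a per-direction state. */
const char *state_name(unsigned int s)
{
	switch (s) {
	case FD_EV_IDLE: return "IDLE";
	case FD_EV_SPEC: return "SPEC";
	case FD_EV_WAIT: return "WAIT";
	default:         return "STOP";
	}
}
```

For example, spec.e = 0x9 has bit 0 and bit 3 set, which decodes to SPEC for the read direction and WAIT for the write direction.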

[src/ev_sepoll.c]_do_poll()
REGPRM2 static void _do_poll(struct poller *p, int exp)
{
	static unsigned int last_skipped;
	static unsigned int spec_processed;
	int status, eo;
	int fd, opcode;
	int count;
	int spec_idx;
	int wait_time;
	int looping = 0;

 re_poll_once:
	/* Here we have two options :
	 * - either walk the list forwards and hope to match more events
	 * - or walk it backwards to minimize the number of changes and
	 *   to make better use of the cache.
	 * Tests have shown that walking backwards improves perf by 0.2%.
	 */

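The fds in the spec list are processed first. The author notes that walking the spec list backwards improves performance by 0.2%. The code comment gives two reasons: fewer changes to the list and better use of the cache. My own explanation is that the spec list always stores the newest fds at the end, and the newest fds are the most likely to be immediately readable or writable.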

[src/ev_sepoll.c]
	status = 0;
	spec_idx = nbspec;
	while (likely(spec_idx > 0)) {
		int done;

		spec_idx--;
		fd = spec_list[spec_idx];
		eo = fdtab[fd].spec.e;  /* save old events */

		if (looping && --fd_created < 0) {
			/* we were just checking the newly created FDs */
			break;
		}

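We fetch the fd and use it to look up its entry in fdtab. If this is the second pass through the loop, its only purpose is to check the fds newly created by accept() on a listening fd. The author keeps a dedicated variable, fd_created, that counts the newly created fds, and the loop exits as soon as those have been handled.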
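
The loop control can be isolated in a few lines. This is a simplified sketch under my own assumptions: walk_spec_list is a hypothetical name, and fd_created is passed as a parameter here, whereas in ev_sepoll.c it is a variable shared with the rest of the poller. It only shows which entries are visited on the first and on the second pass.

```c
#include <assert.h>

/* Walk a spec list from the end. On the second pass (<looping> != 0) only
 * the <fd_created> most recently appended entries are visited. The visited
 * fds are stored in <out> and their number is returned.
 */
int walk_spec_list(const int *spec_list, int nbspec, int looping,
                   int fd_created, int *out)
{
	int spec_idx = nbspec;
	int visited = 0;

	assert(spec_list && out && nbspec >= 0);
	while (spec_idx > 0) {
		spec_idx--;
		if (looping && --fd_created < 0)
			break; /* only the newly created fds were of interest */
		out[visited++] = spec_list[spec_idx];
	}
	return visited;
}
```

With five entries in the list, the first pass visits all five starting from the last one, while a second pass with fd_created equal to 2 visits only the last two.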

[src/ev_sepoll.c]_do_poll()
		/*
		 * Process the speculative events.
		 *
		 * Principle: events which are marked FD_EV_SPEC are processed
		 * with their assigned function. If the function returns 0, it
		 * means there is nothing doable without polling first. We will
		 * then convert the event to a pollable one by assigning them
		 * the WAIT status.
		 */

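The author explains the rule: events marked FD_EV_SPEC are processed by calling their assigned function. If the function returns 0, nothing can be done right now without polling first, so the event is converted to a pollable one by giving it the WAIT status.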

[src/ev_sepoll.c]_do_poll()
#ifdef DEBUG_DEV
		if (fdtab[fd].state == FD_STCLOSE) {
			fprintf(stderr,"fd=%d, fdtab[].ev=%x, fdtab[].spec.e=%x, .s=%d, idx=%d\n",
				fd, fdtab[fd].ev, fdtab[fd].spec.e, fdtab[fd].spec.s1, spec_idx);
		}
#endif
		done = 0;
		fdtab[fd].ev &= FD_POLL_STICKY;
		if ((eo & FD_EV_MASK_R) == FD_EV_SPEC_R) {
			/* The owner is interested in reading from this FD */
			if (fdtab[fd].state != FD_STERROR) {
				/* Pretend there is something to read */
				fdtab[fd].ev |= FD_POLL_IN;
				if (!fdtab[fd].cb[DIR_RD].f(fd))
					fdtab[fd].spec.e ^= (FD_EV_WAIT_R ^ FD_EV_SPEC_R);
				else
					done = 1;
			}
		}
		else if ((eo & FD_EV_MASK_R) == FD_EV_STOP_R) {
			/* This FD was being polled and is now being removed. */
			fdtab[fd].spec.e &= ~FD_EV_MASK_R;
		}

		if ((eo & FD_EV_MASK_W) == FD_EV_SPEC_W) {
			/* The owner is interested in writing to this FD */
			if (fdtab[fd].state != FD_STERROR) {
				/* Pretend there is something to write */
				fdtab[fd].ev |= FD_POLL_OUT;
				if (!fdtab[fd].cb[DIR_WR].f(fd))
					fdtab[fd].spec.e ^= (FD_EV_WAIT_W ^ FD_EV_SPEC_W);
				else
					done = 1;
			}
		}
		else if ((eo & FD_EV_MASK_W) == FD_EV_STOP_W) {
			/* This FD was being polled and is now being removed. */
			fdtab[fd].spec.e &= ~FD_EV_MASK_W;
		}

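For a read event of an fd in the spec list: when the callback returns 0, the FD_EV_SPEC_R state is cleared and replaced by FD_EV_WAIT_R, meaning that the descriptor should move to the poll wait queue. When the callback returns non-zero, the speculative attempt succeeded, so the fd stays in the spec list and the success flag done is set. While the event is handled, fdtab[fd].ev also records which event was processed for that fd.

For an fd marked as stopped (FD_EV_STOP_R), all of its read state bits are cleared.

Write events are handled in the same way as read events.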
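
The XOR in the read branch deserves a closer look. The sketch below uses the macros from the excerpt, but spec_to_wait_r is a hypothetical helper I wrote for illustration. XOR-ing with (FD_EV_WAIT_R ^ FD_EV_SPEC_R) flips exactly the two read bits, so a read state of SPEC becomes WAIT and the write bits are left alone.

```c
#include <assert.h>

/* Macros copied from the ev_sepoll.c excerpt. */
#define FD_EV_IN_SL	1
#define FD_EV_IN_PL	4
#define FD_EV_SPEC_R	(FD_EV_IN_SL)
#define FD_EV_WAIT_R	(FD_EV_IN_PL)
#define FD_EV_MASK_R	(FD_EV_IN_SL|FD_EV_IN_PL)

/* Move the read direction from SPEC to WAIT without touching the write bits. */
unsigned char spec_to_wait_r(unsigned char e)
{
	/* the toggle only makes sense when the read state is SPEC */
	assert((e & FD_EV_MASK_R) == FD_EV_SPEC_R);
	return (unsigned char)(e ^ (FD_EV_WAIT_R ^ FD_EV_SPEC_R));
}
```

For example, 0x3 (SPEC for both directions) becomes 0x6, which is WAIT for reading and still SPEC for writing.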

[src/ev_sepoll.c]_do_poll()
		status += done;
		/* one callback might already have closed the fd by itself */
		if (fdtab[fd].state == FD_STCLOSE)
			continue;

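If either the read or the write succeeded, this speculative attempt counts as a success and is added to status. The fd may also have been closed inside its read or write callback, in which case the remaining steps are skipped.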

[src/ev_sepoll.c]_do_poll()
		/* Now, we will adjust the event in the poll list. Indeed, it
		 * is possible that an event which was previously in the poll
		 * list now goes out, and the opposite is possible too. We can
		 * have opposite changes for READ and WRITE too.
		 */
		if ((eo ^ fdtab[fd].spec.e) & FD_EV_RW_PL) {
			/* poll status changed*/
			if ((fdtab[fd].spec.e & FD_EV_RW_PL) == 0) {
				/* fd removed from poll list */
				opcode = EPOLL_CTL_DEL;
			}
			else if ((eo & FD_EV_RW_PL) == 0) {
				/* new fd in the poll list */
				opcode = EPOLL_CTL_ADD;
			}
			else {
				/* fd status changed */
				opcode = EPOLL_CTL_MOD;
			}

			/* construct the epoll events based on new state */
			ev.events = 0;
			if (fdtab[fd].spec.e & FD_EV_WAIT_R)
				ev.events |= EPOLLIN;

			if (fdtab[fd].spec.e & FD_EV_WAIT_W)
				ev.events |= EPOLLOUT;

			ev.data.fd = fd;
			epoll_ctl(epoll_fd, opcode, fd, &ev);
		}

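The value of this expression follows from the three cases above. In the done case, (eo ^ fdtab[fd].spec.e) & FD_EV_RW_PL is 0, so the branch is not taken. When the callback returned 0, the FD_EV_SPEC state was cleared and FD_EV_WAIT was set, so the result is non-zero and the branch is taken. Inside it the second inner branch applies and the fd is added to epoll (if the other direction was already being polled, the third inner branch issues EPOLL_CTL_MOD instead). In the third case, FD_EV_STOP caused the state bits to be cleared, so the result is non-zero and the branch is taken. If no poll list bits remain in spec.e, the first inner branch applies and the fd is removed from the epoll list.

Once the operation has been chosen, it is applied to the fd in the poll list.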
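
The decision can be written as a pure function of the old and the new state, which makes the cases easy to check one by one. This is a sketch of mine: choose_poll_op and the poll_op enum are hypothetical, and the enum values merely stand in for "no call", EPOLL_CTL_ADD, EPOLL_CTL_DEL and EPOLL_CTL_MOD so that the example does not depend on the epoll headers.

```c
#include <assert.h>

/* Macros copied from the ev_sepoll.c excerpt. */
#define FD_EV_IN_PL	4
#define FD_EV_RW_PL	(FD_EV_IN_PL | (FD_EV_IN_PL << 1))

/* Hypothetical stand-ins for: no epoll_ctl() call, EPOLL_CTL_ADD,
 * EPOLL_CTL_DEL and EPOLL_CTL_MOD.
 */
enum poll_op { OP_NONE, OP_ADD, OP_DEL, OP_MOD };

/* <eo> is the old spec.e value, <en> the new one. */
enum poll_op choose_poll_op(unsigned char eo, unsigned char en)
{
	assert(((eo | en) & ~0xF) == 0); /* spec.e only uses 4 bits */

	if (!((eo ^ en) & FD_EV_RW_PL))
		return OP_NONE;         /* poll list bits unchanged */
	if ((en & FD_EV_RW_PL) == 0)
		return OP_DEL;          /* fd leaves the poll list */
	if ((eo & FD_EV_RW_PL) == 0)
		return OP_ADD;          /* fd enters the poll list */
	return OP_MOD;                  /* polled directions changed */
}
```

A successful speculative read keeps spec.e at 0x1 and yields OP_NONE. A failed one turns 0x1 into 0x4 and yields OP_ADD. A stopped read (0x5 becoming 0x0) yields OP_DEL, and a failed speculative write on an fd whose read side is already polled (0x6 becoming 0xC) yields OP_MOD.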

[src/ev_sepoll.c]_do_poll()
		if (!(fdtab[fd].spec.e & FD_EV_RW_SL)) {
			/* This fd switched to combinations of either WAIT or
			 * IDLE. It must be removed from the spec list.
			 */
			release_spec_entry(fd);
			continue;
		}
	}

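After the poll list has been updated, we check whether the new state of the fd still contains any spec list bits. If it does not, the fd has to be removed from the spec list. This concludes the spec list loop.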
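
The article does not show release_spec_entry itself. Based on the comment on spec.s1 ("Position in spec list+1. 0=not in list."), a plausible implementation removes an entry in constant time by moving the last entry into the freed slot. The sketch below is my assumption of how that can work, not the HAProxy source. The array spec_pos stands in for fdtab[fd].spec.s1, and MAX_FD is an arbitrary size.

```c
#include <assert.h>

#define MAX_FD 64                 /* arbitrary table size for this sketch */

unsigned int spec_pos[MAX_FD];    /* stands in for fdtab[fd].spec.s1 */
int spec_list[MAX_FD];
int nbspec = 0;

/* Append <fd> to the spec list unless it is already there. */
void alloc_spec_entry(int fd)
{
	assert(fd >= 0 && fd < MAX_FD);
	if (spec_pos[fd])
		return;           /* already listed, e.g. for the other direction */
	spec_pos[fd] = (unsigned int)nbspec + 1;
	spec_list[nbspec++] = fd;
}

/* Remove <fd> from the spec list in O(1) by moving the last entry into its slot. */
void release_spec_entry(int fd)
{
	unsigned int pos;

	assert(fd >= 0 && fd < MAX_FD);
	pos = spec_pos[fd];
	if (!pos)
		return;           /* not in the list */
	spec_pos[fd] = 0;
	pos--;
	assert(spec_list[pos] == fd);

	nbspec--;
	if (pos != (unsigned int)nbspec) {
		int last = spec_list[nbspec];

		spec_pos[last] = pos + 1;
		spec_list[pos] = last;
	}
}
```

Under this scheme the backward walk fits well: the entry moved into the freed slot comes from the end of the list, which the loop has already visited, so no entry is skipped or processed twice within one pass.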

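To summarize the flow above: the spec list is walked from the end to the start, and for each fd the callback matching the event of interest is called to do the input or output (all fds are non-blocking). If the call succeeds, the fd stays in the spec list and is counted among the fds processed successfully in the speculative pass. If it fails, the fd has to be moved to the poll list to wait, because nothing can be done in the spec list until data arrives. If the descriptor has been stopped, it is removed from the poll list or from the spec list.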

[src/ev_sepoll.c]_do_poll()
	/* It may make sense to immediately return here if there are enough
	 * processed events, without passing through epoll_wait() because we
	 * have exactly done a poll.
	 * Measures have shown a great performance increase if we call the
	 * epoll_wait() only the second time after speculative accesses have
	 * succeeded. This reduces the number of unsucessful calls to
	 * epoll_wait() by a factor of about 3, and the total number of calls
	 * by about 2.
	 * However, when we do that after having processed too many events,
	 * events waiting in epoll() starve for too long a time and tend to
	 * become themselves eligible for speculative polling. So we try to
	 * limit this practise to reasonable situations.
	*/

	spec_processed += status;

	if (looping) {
		last_skipped++;
		return;
	}

	if (status >= MIN_RETURN_EVENTS && spec_processed < absmaxevents) {
		/* We have processed at least MIN_RETURN_EVENTS, it's worth
		 * returning now without checking epoll_wait().
		 */
		if (++last_skipped <= 1) {
			tv_update_date(0, 1);
			return;
		}
	}
	last_skipped = 0;

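If this is the second pass, that is, we came back to handle newly created fds, last_skipped is incremented and the function returns. Why? As the author explained earlier, the number of poll queue events handled per call should be kept small. Since epoll has already been handled once in this call, it is not checked again.

last_skipped controls whether the function returns early when the number of processed events reaches the minimum needed to return. If the previous return was caused by the second pass, the function does not return early the next time the count reaches that minimum.

If the number of successful speculative events is at least MIN_RETURN_EVENTS and spec_processed is below absmaxevents (the maximum number of events for the poll list), the function returns, provided the skip flag was not already set.

On a first pass through _do_poll where the processed count is not enough to return, or where the early return has just been used, the skip flag is cleared for the next call.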
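
The interplay of last_skipped and spec_processed is easier to follow when it is pulled out of _do_poll. The following sketch is mine: should_skip_epoll and struct skip_state are hypothetical, the two static variables of _do_poll are modelled as struct fields, and the value 25 for MIN_RETURN_EVENTS is an assumption because the excerpt does not show its definition.

```c
#include <assert.h>

#define MIN_RETURN_EVENTS 25      /* assumed value, not shown in the excerpt */

/* Models the two static variables kept by _do_poll(). */
struct skip_state {
	unsigned int last_skipped;
	unsigned int spec_processed;
};

/* Returns 1 if the poller should return without calling epoll_wait(). */
int should_skip_epoll(struct skip_state *st, int status, int looping,
                      int absmaxevents)
{
	assert(st && status >= 0 && absmaxevents > 0);
	st->spec_processed += (unsigned int)status;

	if (looping) {
		/* second pass: epoll_wait() was already called once */
		st->last_skipped++;
		return 1;
	}
	if (status >= MIN_RETURN_EVENTS &&
	    st->spec_processed < (unsigned int)absmaxevents) {
		if (++st->last_skipped <= 1)
			return 1; /* skip epoll_wait() this time */
	}
	st->last_skipped = 0;
	/* the real code resets this counter just before epoll_wait() */
	st->spec_processed = 0;
	return 0;
}
```

Two consecutive calls with 30 successful speculative events show the effect: the first one skips epoll_wait(), the second one does not, so polled events are checked at least every other call.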

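Apart from the final check, which jumps back to the start of _do_poll for another speculative pass if new fds were created, the rest of the flow is basically the same as in the other I/O models, so those models are not covered here.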

[src/ev_sepoll.c]_do_poll()
	if (nbspec || status || run_queue || signal_queue_len) {
		/* Maybe we have processed some events that we must report, or
		 * maybe we still have events in the spec list, or there are
		 * some tasks left pending in the run_queue, so we must not
		 * wait in epoll() otherwise we will delay their delivery by
		 * the next timeout.
		 */
		wait_time = 0;
	}
	else {
		if (!exp)
			wait_time = MAX_DELAY_MS;
		else if (tick_is_expired(exp, now_ms))
			wait_time = 0;
		else {
			wait_time = TICKS_TO_MS(tick_remain(now_ms, exp)) + 1;
			if (wait_time > MAX_DELAY_MS)
				wait_time = MAX_DELAY_MS;
		}
	}

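If the function has not returned by this point, epoll itself has to be handled. Before that, the timeout for the epoll_wait call must be computed.

If the spec list still holds fds (events), or speculative processing succeeded and has to be reported, or the run queue holds runnable tasks, or the signal queue holds pending signals, epoll_wait is called without blocking.

If no expiration time is set, the wait time is set to the maximum the program allows.

If the given expiration time has already passed, epoll_wait is also called without blocking.

If the given expiration time has not been reached yet, the remaining time is computed, and if it exceeds the maximum the program allows it is capped at that maximum.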
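
The four cases can be condensed into one small function. This is a simplified sketch and not the HAProxy code: compute_wait_ms is a hypothetical name, plain millisecond counters replace the tick helpers (tick_is_expired, tick_remain, TICKS_TO_MS), counter wrap-around is ignored, and the value 1000 for MAX_DELAY_MS is an assumption.

```c
#include <assert.h>

#define MAX_DELAY_MS 1000         /* assumed cap, not shown in the excerpt */

/* <busy>   : non-zero if the spec list is not empty, events were processed,
 *            tasks are runnable or signals are pending
 * <exp_ms> : absolute expiration date in ms, 0 means "no expiration"
 * <now_ms> : current date in ms
 * Returns the timeout to pass to epoll_wait().
 */
int compute_wait_ms(int busy, unsigned int exp_ms, unsigned int now_ms)
{
	int wait_time;

	if (busy)
		wait_time = 0;                  /* do not block at all */
	else if (!exp_ms)
		wait_time = MAX_DELAY_MS;       /* nothing scheduled */
	else if (exp_ms <= now_ms)
		wait_time = 0;                  /* already expired */
	else if (exp_ms - now_ms >= (unsigned int)MAX_DELAY_MS)
		wait_time = MAX_DELAY_MS;       /* cap long waits */
	else
		wait_time = (int)(exp_ms - now_ms) + 1;

	assert(wait_time >= 0 && wait_time <= MAX_DELAY_MS);
	return wait_time;
}
```

The extra millisecond mirrors the "+ 1" in the original code, which rounds the wait up so that the poller does not wake up just before the expiration date.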

[src/ev_sepoll.c]_do_poll()
	/* now let's wait for real events. We normally use maxpollevents as a
	 * high limit, unless <nbspec> is already big, in which case we need
	 * to compensate for the high number of events processed there.
	 */
	fd = MIN(absmaxevents, spec_processed);
	fd = MAX(global.tune.maxpollevents, fd);
	fd = MIN(maxfd, fd);
	/* we want to detect if an accept() will create new speculative FDs here */
	/* Note: this shows that the listening fds sit in the epoll wait queue. */
	fd_created = 0;
	spec_processed = 0;
	status = epoll_wait(epoll_fd, epoll_events, fd, wait_time);
	tv_update_date(wait_time, status);

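For the number of events passed to epoll_wait, the author explains that maxpollevents normally serves as the limit, unless the spec list is already very large, in which case the limit is raised to compensate for the large number of events processed there.

After epoll_wait returns, the current time has to be updated. If it returned because of a timeout, the wait time has to be added, otherwise the time is adjusted according to the return value.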
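
The three clamping steps are compact, so a worked example helps. The sketch below wraps them in a hypothetical function, epoll_batch_size, with the globals of the original turned into parameters. The numbers in the usage note are made up.

```c
#include <assert.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Number of events to request from epoll_wait(), following the three
 * clamping steps of the excerpt.
 */
int epoll_batch_size(int absmaxevents, int spec_processed,
                     int maxpollevents, int maxfd)
{
	int n;

	assert(absmaxevents > 0 && maxpollevents > 0 && maxfd > 0);
	n = MIN(absmaxevents, spec_processed); /* compensate for spec work */
	n = MAX(maxpollevents, n);             /* never below the usual limit */
	n = MIN(maxfd, n);                     /* never above the fd count */
	return n;
}
```

With absmaxevents = 5000, maxpollevents = 200 and maxfd = 20000, a quiet spec pass (10 events) still requests 200 events, a busy one (800 events) requests 800, and a very busy one (9000 events) is capped at 5000.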

[src/ev_sepoll.c]_do_poll()
	for (count = 0; count < status; count++) {
		int e = epoll_events[count].events;
		fd = epoll_events[count].data.fd;

		/* it looks complicated but gcc can optimize it away when constants
		 * have same values.
		 */
		DPRINTF(stderr, "%s:%d: fd=%d, ev=0x%08x, e=0x%08x\n",
			__FUNCTION__, __LINE__,
			fd, fdtab[fd].ev, e);

		fdtab[fd].ev &= FD_POLL_STICKY;
		fdtab[fd].ev |= 
			((e & EPOLLIN ) ? FD_POLL_IN  : 0) |
			((e & EPOLLPRI) ? FD_POLL_PRI : 0) |
			((e & EPOLLOUT) ? FD_POLL_OUT : 0) |
			((e & EPOLLERR) ? FD_POLL_ERR : 0) |
			((e & EPOLLHUP) ? FD_POLL_HUP : 0);
		
		if ((fdtab[fd].spec.e & FD_EV_MASK_R) == FD_EV_WAIT_R) {
			if (fdtab[fd].state == FD_STCLOSE || fdtab[fd].state == FD_STERROR)
				continue;
			if (fdtab[fd].ev & (FD_POLL_IN|FD_POLL_HUP|FD_POLL_ERR))
				fdtab[fd].cb[DIR_RD].f(fd);
		}

		if ((fdtab[fd].spec.e & FD_EV_MASK_W) == FD_EV_WAIT_W) {
			if (fdtab[fd].state == FD_STCLOSE || fdtab[fd].state == FD_STERROR)
				continue;
			if (fdtab[fd].ev & (FD_POLL_OUT|FD_POLL_ERR))
				fdtab[fd].cb[DIR_WR].f(fd);
		}
	}

	if (fd_created) {
		/* we have created some fds, certainly in return of an accept(),
		 * and they're marked as speculative. If we can manage to perform
		 * a read(), we're almost sure to collect all the request at once
		 * and avoid several expensive wakeups. So let's try now. Anyway,
		 * if we fail, the tasks are still woken up, and the FD gets marked
		 * for poll mode.
		 */

		looping = 1;
		goto re_poll_once;
	}
}

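As with the spec list, the events that occurred are saved in fdtab[fd].ev.

This completes the handling of the epoll events. Because read events include accept(), new connections may have been created. If new fds were created, we jump back and process them speculatively right away. This second pass only handles the newly created fds.

fd_created was reset to 0 before the epoll_wait call, so where is it modified? In the fd_set function of the poller, which is called from event_accept(). The source of the latter is in src/client.c.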
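
The fd_set function of sepoll is not shown in this article. The following is a simplified sketch of what such a function might look like, written from the state macros above. The name sketch_fd_set, the DIR_RD and DIR_WR values and the exact transitions are my assumptions. The real function presumably also adds the fd to the spec list, which is left out here.

```c
#include <assert.h>

/* Macros copied from the ev_sepoll.c excerpt. */
#define FD_EV_IN_SL	1
#define FD_EV_IN_PL	4
#define FD_EV_IDLE	0
#define FD_EV_STOP	(FD_EV_IN_SL|FD_EV_IN_PL)
#define FD_EV_MASK_DIR	(FD_EV_IN_SL|FD_EV_IN_PL)

enum { DIR_RD = 0, DIR_WR = 1 };  /* assumed direction indexes */

int fd_created = 0;  /* fds that entered the speculative state since the last reset */

/* Enable events on direction <dir> of the state byte <e>.
 * Returns 1 if the state changed, 0 if the direction was already active.
 */
int sketch_fd_set(unsigned char *e, int dir)
{
	unsigned int i;

	assert(e && (dir == DIR_RD || dir == DIR_WR));
	i = ((unsigned int)*e >> dir) & FD_EV_MASK_DIR;

	if (i != FD_EV_STOP) {
		if (i != FD_EV_IDLE)
			return 0;   /* already SPEC or WAIT */
		fd_created++;       /* IDLE -> SPEC: new speculative entry */
	}
	/* toggling the spec list bit turns IDLE into SPEC and STOP into WAIT */
	*e ^= (unsigned char)(FD_EV_IN_SL << dir);
	return 1;
}
```

In this sketch, an fd that has just been accepted starts in the IDLE state, so enabling reads on it puts it in the SPEC state and increments fd_created, which is what triggers the second speculative pass in _do_poll.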

Poller initialization

Now that we have seen how the poller processes events, let us look at how the poller is initialized.

[src/fd.c]
int init_pollers()
{
	int p;
	struct poller *bp;


	do {
		bp = NULL;
		for (p = 0; p < nbpollers; p++)
			if (!bp || (pollers[p].pref > bp->pref))
				bp = &pollers[p];

		if (!bp || bp->pref == 0)
			break;

		if (bp->init(bp)) {
			memcpy(&cur_poller, bp, sizeof(*bp));
			return 1;
		}
	} while (!bp || bp->pref == 0);
	return 0;
}

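The code is simple: it walks the global pollers array to find the poller with the highest pref value and, if its init function succeeds, copies it into cur_poller.

So where do the entries of pollers come from? Looking at the ev_*.c files, each one contains a function like this: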
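
One detail is worth trying out: the do/while condition only repeats the search when the chosen poller ends up with pref == 0. The sketch below reuses the loop of init_pollers with a reduced struct poller and assumes that a failing init function clears pref on its poller, which is what makes the fallback to the next best poller work. The callbacks, the register_poller helper and the pref values are all made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_POLLERS 10

/* Reduced version of struct poller: only the fields the selection needs. */
struct poller {
	int (*init)(struct poller *p);
	const char *name;
	int pref;
};

struct poller pollers[MAX_POLLERS];
struct poller cur_poller;
int nbpollers = 0;

/* Hypothetical init callbacks. The failing one clears pref (an assumption)
 * so that the poller is not selected again.
 */
int init_ok(struct poller *p)   { (void)p; return 1; }
int init_fail(struct poller *p) { p->pref = 0; return 0; }

/* Hypothetical registration helper for this sketch. */
void register_poller(const char *name, int pref, int (*init)(struct poller *))
{
	assert(nbpollers < MAX_POLLERS);
	pollers[nbpollers].name = name;
	pollers[nbpollers].pref = pref;
	pollers[nbpollers].init = init;
	nbpollers++;
}

/* Same selection loop as in the excerpt from src/fd.c. */
int init_pollers(void)
{
	int p;
	struct poller *bp;

	do {
		bp = NULL;
		for (p = 0; p < nbpollers; p++)
			if (!bp || (pollers[p].pref > bp->pref))
				bp = &pollers[p];

		if (!bp || bp->pref == 0)
			break;

		if (bp->init(bp)) {
			memcpy(&cur_poller, bp, sizeof(*bp));
			return 1;
		}
	} while (!bp || bp->pref == 0);
	return 0;
}
```

If three pollers are registered with pref values 150, 400 and 300 and the one with 400 fails to initialize, the loop runs a second time and selects the poller with pref 300.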

__attribute__((constructor))
static void _do_register(void)
{
	...
}

This relies on a GCC feature: functions marked with this attribute are run before main() starts. The pollers array is therefore filled in by the _do_register function of each I/O model.
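
A small self-contained sketch shows the mechanism. It assumes GCC or Clang, since the constructor attribute is a compiler extension. The reduced struct poller, the two registration functions, the pref values and find_poller are hypothetical. In HAProxy each ev_*.c file would hold one such constructor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_POLLERS 10

struct poller {
	const char *name;
	int pref;
};

struct poller pollers[MAX_POLLERS];
int nbpollers = 0;

/* Runs before main(): registers one poller. */
__attribute__((constructor))
static void register_epoll(void)
{
	assert(nbpollers < MAX_POLLERS);
	pollers[nbpollers].name = "epoll";
	pollers[nbpollers].pref = 300;  /* made-up preference */
	nbpollers++;
}

/* A second constructor, as another ev_*.c file would provide. */
__attribute__((constructor))
static void register_select(void)
{
	assert(nbpollers < MAX_POLLERS);
	pollers[nbpollers].name = "select";
	pollers[nbpollers].pref = 150;  /* made-up preference */
	nbpollers++;
}

/* Hypothetical lookup: index of the poller called <name>, or -1. */
int find_poller(const char *name)
{
	int p;

	for (p = 0; p < nbpollers; p++)
		if (strcmp(pollers[p].name, name) == 0)
			return p;
	return -1;
}
```

By the time main() starts, nbpollers is already 2 and both names can be found. The order in which the constructors run should not be relied upon, which fits the design of init_pollers: it selects by pref and not by position in the array.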
