Partly based on: "InnoDB异步IO(AIO)实现详解" by He Dengcheng (何登成)
1. Flow
buf_page_get_gen -> buf_read_page -> buf_read_page_low. In buf_read_page_low:
......
bpage = buf_page_init_for_read(err, mode, space, zip_size, unzip,
tablespace_version, offset); // Sets the io_fix flag to BUF_IO_READ and sets a non-recursive exclusive lock on the buffer frame. The io-handler must take care that the flag is cleared and the lock released later, so the calling thread can wait on this lock and only resume after the read IO completes.
......
if (sync) { /*!< in: TRUE if synchronous aio is desired */
thd_wait_begin(NULL, THD_WAIT_DISKIO);
}
if (zip_size) {
*err = fil_io(OS_FILE_READ | wake_later,
sync, space, zip_size, offset, 0, zip_size,
bpage->zip.data, bpage);
} else {
ut_a(buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
*err = fil_io(OS_FILE_READ | wake_later,
sync, space, 0, offset, 0, UNIV_PAGE_SIZE,
((buf_block_t*) bpage)->frame, bpage);
}
if (sync) {
thd_wait_end(NULL);
}
......
fil_io() -> os_aio() -> os_aio_func()
2. os_aio_func(type, mode, ...)
For mode = OS_AIO_SYNC: on Linux no array is used; os_file_read_func or os_file_write_func is called directly (which is really just a synchronous read/write, or pread/pwrite, an atomic positioned read/write), and os_aio_linux_handle is used to wait for the aio to finish. On Windows, os_aio_sync_array is used, ReadFile or WriteFile is called, and os_aio_windows_handle is used to wait for the aio to finish.
For other modes: one of os_aio_read_array, os_aio_write_array, os_aio_ibuf_array or os_aio_log_array is used (these arrays are created by os_aio_array_create, called from os_aio_init()). On Linux the request is then submitted with os_aio_linux_dispatch(); on Windows it is handled as above.
In this particular run there are 4 read threads, 4 write threads, 1 log thread and 1 insert buffer thread, which means os_aio_read_array is divided into four segments, ....... A sketch of how the array is selected for a request is given below.
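A hedged sketch of how os_aio_func() can be read as choosing the array for a non-sync request, based on the description above. The mode and type constants and the os_aio_*_array globals are the InnoDB names mentioned in this article; the real function also handles OS_AIO_SYNC, wake-up flags and simulated aio.

/* Hedged sketch: selecting the aio array for a request (simplified). */
static os_aio_array_t*
select_aio_array(ulint mode, ulint type)
{
        switch (mode) {
        case OS_AIO_IBUF:
                return(os_aio_ibuf_array);      /* insert buffer IO, 1 segment */
        case OS_AIO_LOG:
                return(os_aio_log_array);       /* redo log IO, 1 segment */
        case OS_AIO_NORMAL:
                return(type == OS_FILE_READ
                       ? os_aio_read_array      /* 4 segments, one per read thread */
                       : os_aio_write_array);   /* 4 segments, one per write thread */
        default:
                return(NULL);                   /* OS_AIO_SYNC is handled separately */
        }
}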
Note: the type of the array is os_aio_array_t, defined as:
typedef struct os_aio_array_struct os_aio_array_t;
struct os_aio_array_struct{
os_mutex_t mutex; /*!< the mutex protecting the aio array */
os_event_t not_full;
/*!< The event which is set to the
signaled state when there is space in
the aio outside the ibuf segment */
os_event_t is_empty;
/*!< The event which is set to the
signaled state when there are no
pending i/os in this array */
ulint n_slots;/*!< Total number of slots in the aio
array. This must be divisible by
n_threads. */
ulint n_segments;
/*!< Number of segments in the aio
array of pending aio requests. A
thread can wait separately for any one
of the segments. */
ulint cur_seg;/*!< We reserve IO requests in round
robin fashion to different segments.
This points to the segment that is to
be used to service next IO request. */
ulint n_reserved;
/*!< Number of reserved slots in the
aio array outside the ibuf segment */
os_aio_slot_t* slots; /*!< Pointer to the slots in the array */
#ifdef __WIN__
HANDLE* handles;
/*!< Pointer to an array of OS native
event handles where we copied the
handles from slots, in the same
order. This can be used in
WaitForMultipleObjects; used only in
Windows */
#endif
#if defined(LINUX_NATIVE_AIO)
io_context_t* aio_ctx;
/* completion queue for IO. There is
one such queue per segment. Each thread
will work on one ctx exclusively. */
struct io_event* aio_events;
/* The array to collect completed IOs.
There is one such event for each
possible pending IO. The size of the
array is equal to n_slots. */
#endif
};
Here os_aio_slot_t* slots is an array of struct os_aio_slot_struct. /** The asynchronous i/o array slot structure */
struct os_aio_slot_struct{
ibool is_read; /*!< TRUE if a read operation */
ulint pos; /*!< index of the slot in the aio
array */
ibool reserved; /*!< TRUE if this slot is reserved */
time_t reservation_time;/*!< time when reserved */
ulint len; /*!< length of the block to read or
write */
byte* buf; /*!< buffer used in i/o */
ulint type; /*!< OS_FILE_READ or OS_FILE_WRITE */
ulint offset; /*!< 32 low bits of file offset in
bytes */
ulint offset_high; /*!< 32 high bits of file offset */
os_file_t file; /*!< file where to read or write */
const char* name; /*!< file name or path */
ibool io_already_done;/*!< used only in simulated aio:
TRUE if the physical i/o already
made and only the slot message
needs to be passed to the caller
of os_aio_simulated_handle */
fil_node_t* message1; /*!< message which is given by the */
void* message2; /*!< the requester of an aio operation
and which can be used to identify
which pending aio operation was
completed */
#ifdef WIN_ASYNC_IO
HANDLE handle; /*!< handle object we need in the
OVERLAPPED struct */
OVERLAPPED control; /*!< Windows control block for the
aio request */
#elif defined(LINUX_NATIVE_AIO)
struct iocb control; /* Linux control block for aio */
int n_bytes; /* bytes written/read. */
int ret; /* AIO return code */
#endif
};
3. os_aio_array_reserve_slot
Before os_aio_linux_dispatch is called, os_aio_array_reserve_slot is called first to locate a free slot in the aio array and do the preparation for the aio request:
If the array is full:
native aio: os_wait_event(array->not_full) waits for the not_full event;
simulated (non-native) aio: os_aio_simulated_wake_handler_threads() wakes the handler threads.
If the array is not full:
Windows: set up the OVERLAPPED structure; ResetEvent(slot->handle);
Linux native aio: set up the iocb structure, calling io_prep_pread(iocb, file, buf, len, aio_offset) or io_prep_pwrite depending on type, and set iocb->data = (void*) slot (see the sketch below).
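A hedged sketch of the Linux native aio branch just described, using the slot fields from the struct above. The real os_aio_array_reserve_slot also records name, len, buf, and the offset split into 32-bit halves; prepare_linux_aio_slot is a hypothetical helper name used only for illustration.

#include <sys/types.h>
#include <libaio.h>

/* Hedged sketch: fill the slot's embedded iocb for a Linux native aio request. */
static void
prepare_linux_aio_slot(os_aio_slot_t* slot, os_file_t file, ulint type,
                       void* buf, ulint len, off_t aio_offset)
{
        struct iocb*    iocb = &slot->control;

        if (type == OS_FILE_READ) {
                io_prep_pread(iocb, file, buf, len, aio_offset);   /* zeroes and fills the iocb */
        } else {
                io_prep_pwrite(iocb, file, buf, len, aio_offset);
        }

        /* Must be set after io_prep_*: stash the slot pointer so that
        io_getevents() can map the completion event back to this slot
        (see os_aio_linux_collect below). */
        iocb->data = (void*) slot;
}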
Note: regarding the os_event_t not_full field of os_aio_array_t above:
os_event_create()
os_event_wait() ---> os_event_wait_low ---> while (condition) { ...; pthread_cond_wait(cond, mutex); ... }
os_event_set()/os_event_reset()
os_event_free()
os_event_t
os_event_create(
/*============*/
const char* name) /*!< in: the name of the event, if NULL
the event is created without a name */
{
os_event_t event;
.... //for windows
{
UT_NOT_USED(name);
event = ut_malloc(sizeof(struct os_event_struct));
os_fast_mutex_init(&(event->os_mutex));
os_cond_init(&(event->cond_var));
event->is_set = FALSE;
/* We return this value in os_event_reset(), which can then
be used to pass to the os_event_wait_low(). The value of zero
is reserved in os_event_wait_low() for the case when the
caller does not want to pass any signal_count value. To
distinguish between the two cases we initialize signal_count
to 1 here. */
event->signal_count = 1;
}
/* The os_sync_mutex can be NULL because during startup an event
can be created [ because it's embedded in the mutex/rwlock ] before
this module has been initialized */
if (os_sync_mutex != NULL) {
os_mutex_enter(os_sync_mutex);
}
/* Put to the list of events */
UT_LIST_ADD_FIRST(os_event_list, os_event_list, event);
os_event_count++;
if (os_sync_mutex != NULL) {
os_mutex_exit(os_sync_mutex);
}
return(event);
}
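As a side note on the wait chain listed above, here is a minimal sketch of the wait/set pattern that os_event_wait_low() and os_event_set() reduce to. The struct and function names are hypothetical; the real InnoDB code additionally tracks signal_count to detect missed wakeups.

#include <pthread.h>
#include <stdbool.h>

/* Hedged sketch of the event wait/set pattern (hypothetical names). */
struct my_event {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        bool            is_set;
};

static void my_event_wait(struct my_event* ev)
{
        pthread_mutex_lock(&ev->mutex);
        while (!ev->is_set) {                   /* loop guards against spurious wakeups */
                pthread_cond_wait(&ev->cond, &ev->mutex);
        }
        pthread_mutex_unlock(&ev->mutex);
}

static void my_event_set(struct my_event* ev)
{
        pthread_mutex_lock(&ev->mutex);
        ev->is_set = true;
        pthread_cond_broadcast(&ev->cond);      /* wake all waiters */
        pthread_mutex_unlock(&ev->mutex);
}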
4. os_aio_linux_dispatch
static
ibool
os_aio_linux_dispatch(
/*==================*/
os_aio_array_t* array, /*!< in: io request array. */
os_aio_slot_t* slot) /*!< in: an already reserved slot. */
{
int ret;
ulint io_ctx_index;
struct iocb* iocb;
ut_ad(slot != NULL);
ut_ad(array);
ut_a(slot->reserved);
/* Find out what we are going to work with.
The iocb struct is directly in the slot.
The io_context is one per segment. */
iocb = &slot->control;
io_ctx_index = (slot->pos * array->n_segments) / array->n_slots;
// one segment, one thread, one aio_ctx (of type io_context_t)
ret = io_submit(array->aio_ctx[io_ctx_index], 1, &iocb); // the 1 is the size of the iocb array
/* io_submit returns number of successfully
queued requests or -errno. */
if (UNIV_UNLIKELY(ret != 1)) {
errno = -ret;
return(FALSE);
}
return(TRUE);
}
If execution must not continue until an asynchronous read/write has completed, it can be done like this:
For the log, for example, rw_lock_x_lock_gen(&(log_sys->checkpoint_lock), LOG_CHECKPOINT) is called before the asynchronous read/write (the call to fil_io) to take an exclusive lock; afterwards rw_lock_s_lock(&(log_sys->checkpoint_lock)); rw_lock_s_unlock(&(log_sys->checkpoint_lock)) are called, so the shared lock can only be acquired, and execution can only continue, once the exclusive lock has been released. When is the exclusive lock released? Presumably after the IO thread has finished the actual read/write.
More concretely:
Take the exclusive lock: innobase_start_or_create_for_mysql -> recv_recovery_from_checkpoint_start_func -> recv_synchronize_groups -> log_groups_write_checkpoint_info -> log_group_checkpoint -> pfs_rw_lock_x_lock_func;
After the exclusive lock is taken, start the asynchronous IO and try to take the shared lock, which blocks;
IO thread: the asynchronous IO writes the data; once the data is written, the exclusive lock is released: io_handler_thread -> fil_aio_wait -> log_io_complete -> log_io_complete_checkpoint -> log_complete_checkpoint (called when log_sys->n_pending_checkpoint_writes == 0; this function is also involved in advancing the LSN) -> pfs_rw_lock_x_unlock_func. Once it is released, the thread blocked above can continue. This guarantees that the log has been written to the file before further operations proceed. A sketch of this handshake is given below.
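A minimal sketch of this handshake. InnoDB's own rw_lock_t allows the x-lock taken before fil_io() to be released later by the IO handler thread; a POSIX rwlock does not permit unlocking from a different thread, so this sketch models the same "block until the IO thread signals completion" behaviour with a POSIX semaphore. The function names are hypothetical.

#include <semaphore.h>

/* Hedged sketch of the checkpoint-write handshake described above. */
static sem_t checkpoint_written;

/* Thread issuing the checkpoint write (corresponds to taking the x-lock
and then waiting on the s-lock): */
static void issue_checkpoint_write(void)
{
        sem_init(&checkpoint_written, 0, 0);    /* "x-lock held": not yet written */
        /* ... call fil_io() here to submit the asynchronous write ... */

        sem_wait(&checkpoint_written);          /* blocks, like rw_lock_s_lock() */
        /* From here on, the checkpoint is known to be on disk. */
}

/* IO handler thread, once the write is complete (corresponds to
rw_lock_x_unlock in log_complete_checkpoint): */
static void on_checkpoint_write_complete(void)
{
        sem_post(&checkpoint_written);
}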
The following is part of the code in fil_aio_wait() (it comes after the call to os_aio_linux_handle):
/* Do the i/o handling */
/* IMPORTANT: since i/o handling for reads will read also the insert
buffer in tablespace 0, you have to be very careful not to introduce
deadlocks in the i/o system. We keep tablespace 0 data files always
open, and use a special i/o thread to serve insert buffer requests. */
if (fil_node->space->purpose == FIL_TABLESPACE) {
srv_set_io_thread_op_info(segment, "complete io for buf page");
buf_page_io_complete(message);
} else {
srv_set_io_thread_op_info(segment, "complete io for log");
log_io_complete(message);
}
5. os_aio_linux_handle
io_handler_thread -> fil_aio_wait -> os_aio_linux_handle(segment, &fil_node, &message, &type)
As can be seen, this function is called from the IO threads and is used to wait for aio completion. Note that the segment parameter above is the number of the segment in the aio array.
1. Loop indefinitely, scanning the array, until a completed I/O operation is found (slot->io_already_done);
2. If no I/O has completed yet but there are pending I/O requests, enter os_aio_linux_collect: it calls io_getevents with the timeout set to OS_AIO_REAP_TIMEOUT, effectively busy-waiting; if io_getevents returns ret > 0 there are completed I/Os, and a few fields are filled in, most importantly slot->io_already_done is set to TRUE. (If the system's I/O is idle, an io_thread spends most of its time inside io_getevents.) A simplified skeleton of the handle loop is sketched below, followed by the actual os_aio_linux_collect code.
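Before the collect code, a hedged skeleton of the loop just described, written against the structs and helpers shown earlier in this article. The real os_aio_linux_handle additionally frees the slot, copies out message1/message2 and the error status, and handles shutdown.

/* Hedged skeleton of the os_aio_linux_handle loop (simplified). */
static os_aio_slot_t*
os_aio_linux_handle_sketch(os_aio_array_t* array, ulint segment)
{
        ulint   seg_size = array->n_slots / array->n_segments;

        for (;;) {
                ulint   i;

                /* 1. Scan this segment's slots for a completed request. */
                os_mutex_enter(array->mutex);
                for (i = segment * seg_size; i < (segment + 1) * seg_size; i++) {
                        os_aio_slot_t*  slot = &array->slots[i];

                        if (slot->reserved && slot->io_already_done) {
                                /* The real code copies out message1/message2,
                                checks slot->ret / slot->n_bytes and frees the
                                slot here. */
                                os_mutex_exit(array->mutex);
                                return(slot);
                        }
                }
                os_mutex_exit(array->mutex);

                /* 2. Nothing completed yet: reap finished IOs from the kernel;
                this sets io_already_done on the corresponding slots. */
                os_aio_linux_collect(array, segment, seg_size);
        }
}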
static
void
os_aio_linux_collect(
/*=================*/
os_aio_array_t* array, /*!< in/out: slot array. */
ulint segment, /*!< in: local segment no. */
ulint seg_size) /*!< in: segment size. */
{
int i;
int ret;
ulint start_pos;
ulint end_pos;
struct timespec timeout;
struct io_event* events;
struct io_context* io_ctx;
/* sanity checks. */
ut_ad(array != NULL);
ut_ad(seg_size > 0);
ut_ad(segment < array->n_segments);
/* Which part of event array we are going to work on. */
events = &array->aio_events[segment * seg_size];
/* Which io_context we are going to use. */
io_ctx = array->aio_ctx[segment];
/* Starting point of the segment we will be working on. */
start_pos = segment * seg_size;
/* End point. */
end_pos = start_pos + seg_size;
retry:
/* Initialize the events. The timeout value is arbitrary.
We probably need to experiment with it a little. */
memset(events, 0, sizeof(*events) * seg_size);
timeout.tv_sec = 0;
timeout.tv_nsec = OS_AIO_REAP_TIMEOUT;
ret = io_getevents(io_ctx, 1, seg_size, events, &timeout); // events carries back the iocbs previously passed in via io_submit
if (ret > 0) {
for (i = 0; i < ret; i++) {
os_aio_slot_t* slot;
struct iocb* control;
control = (struct iocb *)events[i].obj; // events[i].obj is the iocb the caller submitted
ut_a(control != NULL);
slot = (os_aio_slot_t *) control->data; // the data field holds the caller's private pointer, mentioned again at the end of this article
/* Some sanity checks. */
ut_a(slot != NULL);
ut_a(slot->reserved);
/* We are not scribbling previous segment. */
ut_a(slot->pos >= start_pos);
/* We have not overstepped to next segment. */
ut_a(slot->pos < end_pos);
/* Mark this request as completed. The error handling
will be done in the calling function. */
os_mutex_enter(array->mutex);
slot->n_bytes = events[i].res;
slot->ret = events[i].res2;
slot->io_already_done = TRUE; // the key step: this is how os_aio_linux_handle finds a completed IO operation
os_mutex_exit(array->mutex);
}
return;
}
if (UNIV_UNLIKELY(srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS)) {
return;
}
/* This error handling is for any error in collecting the
IO requests. The errors, if any, for any particular IO
request are simply passed on to the calling routine. */
switch (ret) {
case -EAGAIN:
/* Not enough resources! Try again. */
case -EINTR:
/* Interrupted! I have tested the behaviour in case of an
interrupt. If we have some completed IOs available then
the return code will be the number of IOs. We get EINTR only
if there are no completed IOs and we have been interrupted. */
case 0:
/* No pending request! Go back and check again. */
goto retry;
}
/* All other errors should cause a trap for now. */
ut_print_timestamp(stderr);
fprintf(stderr,
" InnoDB: unexpected ret_code[%d] from io_getevents()!\n",
ret);
ut_error;
}
About Linux native aio:
1. The first step of aio is to create an AIO context, which tracks the progress of the asynchronous IOs requested by the process. It is created by int io_setup(unsigned nr_events, aio_context_t *ctxp). In MySQL, os_aio_array_create (seen earlier) creates the array, and along the way os_aio_array_create -> os_aio_linux_create_io_ctx(n/n_segments, &array->aio_ctx[i]) -> io_setup().
2. Asynchronous IO requests are submitted through the io_submit() system call, which in turn calls io_submit_one to allocate a kiocb object for each iocb and add it to the run_list IO request queue of the kioctx AIO context; aio_run_iocb is then called to start the IO.
3. Completed IO requests are collected with io_getevents: its io_event* events parameter carries back the iocbs previously passed in via io_submit, and each iocb's data field holds the caller-supplied pointer. If at least min_nr IO events have completed (or the timeout expires), the completed io_events are copied into events and the number of io_events (or an error) is returned; otherwise the process adds itself to the kioctx wait queue and is suspended. A minimal standalone example of these three steps is given below.
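To make the three steps concrete, here is a minimal standalone example using the libaio wrappers (io_setup/io_prep_pread/io_submit/io_getevents) rather than the raw system calls; the file name is just an example, and it builds with "gcc demo.c -laio".

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t    ctx;
        struct iocb     cb;
        struct iocb*    cbs[1] = { &cb };
        struct io_event ev;
        char            buf[4096];
        int             ret;

        int fd = open("/etc/hostname", O_RDONLY);       /* example file only */
        if (fd < 0) { perror("open"); return 1; }

        memset(&ctx, 0, sizeof(ctx));
        ret = io_setup(8, &ctx);                        /* step 1: create the AIO context */
        if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

        io_prep_pread(&cb, fd, buf, sizeof(buf), 0);    /* fill the iocb: read 4 KB at offset 0 */
        cb.data = (void*) 0x1234;                       /* private pointer, like slot in InnoDB */

        ret = io_submit(ctx, 1, cbs);                   /* step 2: submit; returns #queued or -errno */
        if (ret != 1) { fprintf(stderr, "io_submit: %s\n", strerror(-ret)); return 1; }

        ret = io_getevents(ctx, 1, 1, &ev, NULL);       /* step 3: reap completed IOs (blocking) */
        if (ret == 1) {
                printf("read %ld bytes, data=%p\n", (long) ev.res, ev.data);
        }

        io_destroy(ctx);
        close(fd);
        return 0;
}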
Reference: http://www.cnblogs.com/hustcat/archive/2013/02/05/2893488.html