Partly based on: "InnoDB异步IO(AIO)实现详解" by He Dengcheng (何登成)
1. Flow
buf_page_get_gen -> buf_read_page -> buf_read_page_low. In buf_read_page_low:
......
bpage = buf_page_init_for_read(err, mode, space, zip_size, unzip,
tablespace_version, offset); // Sets the io_fix flag to BUF_IO_READ and sets a non-recursive exclusive lock on the buffer frame. The io-handler must take care that the flag is cleared and the lock released later, so the calling thread can wait on this lock and only resume after the read IO completes.
......
if (sync) { /*!< in: TRUE if synchronous aio is desired */
thd_wait_begin(NULL, THD_WAIT_DISKIO);
}
if (zip_size) {
*err = fil_io(OS_FILE_READ | wake_later,
sync, space, zip_size, offset, 0, zip_size,
bpage->zip.data, bpage);
} else {
ut_a(buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);
*err = fil_io(OS_FILE_READ | wake_later,
sync, space, 0, offset, 0, UNIV_PAGE_SIZE,
((buf_block_t*) bpage)->frame, bpage);
}
if (sync) {
thd_wait_end(NULL);
}
......
fil_io() -> os_aio() -> os_aio_func()
2. os_aio_func(type, mode, ...)
For mode = OS_AIO_SYNC: on Linux no array is used; os_file_read_func or os_file_write_func is called directly (which is really just a synchronous read/write, or pread/pwrite, an atomic positioned read/write), and os_aio_linux_handle is used to wait for the aio to finish. On Windows, os_aio_sync_array is used, ReadFile or WriteFile is called, and os_aio_windows_handle is used to wait for the aio to finish.
For other modes: one of os_aio_read_array, os_aio_write_array, os_aio_ibuf_array or os_aio_log_array is used (these arrays are created by os_aio_array_create, called from os_aio_init()). On Linux the request is then submitted with os_aio_linux_dispatch(); on Windows it is handled as above.
In this particular run there are 4 read threads, 4 write threads, 1 log thread and 1 insert buffer thread, which means os_aio_read_array is divided into four segments, ....... A sketch of how the array is selected for a request is given below.
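A hedged sketch of how os_aio_func() can be read as choosing the array for a non-sync request, based on the description above. The mode and type constants and the os_aio_*_array globals are the InnoDB names mentioned in this article; the real function also handles OS_AIO_SYNC, wake-up flags and simulated aio.

/* Hedged sketch: selecting the aio array for a request (simplified). */
static os_aio_array_t*
select_aio_array(ulint mode, ulint type)
{
        switch (mode) {
        case OS_AIO_IBUF:
                return(os_aio_ibuf_array);      /* insert buffer IO, 1 segment */
        case OS_AIO_LOG:
                return(os_aio_log_array);       /* redo log IO, 1 segment */
        case OS_AIO_NORMAL:
                return(type == OS_FILE_READ
                       ? os_aio_read_array      /* 4 segments, one per read thread */
                       : os_aio_write_array);   /* 4 segments, one per write thread */
        default:
                return(NULL);                   /* OS_AIO_SYNC is handled separately */
        }
}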
Note: the type of the array is os_aio_array_t, defined as:
typedef struct os_aio_array_struct os_aio_array_t;
struct os_aio_array_struct{
os_mutex_t mutex; /*!< the mutex protecting the aio array */
os_event_t not_full;
/*!< The event which is set to the
signaled state when there is space in
the aio outside the ibuf segment */
os_event_t is_empty;
/*!< The event which is set to the
signaled state when there are no
pending i/os in this array */
ulint n_slots;/*!< Total number of slots in the aio
array. This must be divisible by
n_threads. */
ulint n_segments;
/*!< Number of segments in the aio
array of pending aio requests. A
thread can wait separately for any one
of the segments. */
ulint cur_seg;/*!< We reserve IO requests in round
robin fashion to different segments.
This points to the segment that is to
be used to service next IO request. */
ulint n_reserved;
/*!< Number of reserved slots in the
aio array outside the ibuf segment */
os_aio_slot_t* slots; /*!< Pointer to the slots in the array */
#ifdef __WIN__
HANDLE* handles;
/*!< Pointer to an array of OS native
event handles where we copied the
handles from slots, in the same
order. This can be used in
WaitForMultipleObjects; used only in
Windows */
#endif
#if defined(LINUX_NATIVE_AIO)
io_context_t* aio_ctx;
/* completion queue for IO. There is
one such queue per segment. Each thread
will work on one ctx exclusively. */
struct io_event* aio_events;
/* The array to collect completed IOs.
There is one such event for each
possible pending IO. The size of the
array is equal to n_slots. */
#endif
};
Here os_aio_slot_t* slots is an array of struct os_aio_slot_struct. /** The asynchronous i/o array slot structure */
struct os_aio_slot_struct{
ibool is_read; /*!< TRUE if a read operation */
ulint pos; /*!< index of the slot in the aio
array */
ibool reserved; /*!< TRUE if this slot is reserved */
time_t reservation_time;/*!< time when reserved */
ulint len; /*!< length of the block to read or
write */
byte* buf; /*!< buffer used in i/o */
ulint type; /*!< OS_FILE_READ or OS_FILE_WRITE */
ulint offset; /*!< 32 low bits of file offset in
bytes */
ulint offset_high; /*!< 32 high bits of file offset */
os_file_t file; /*!< file where to read or write */
const char* name; /*!< file name or path */
ibool io_already_done;/*!< used only in simulated aio:
TRUE if the physical i/o already
made and only the slot message
needs to be passed to the caller
of os_aio_simulated_handle */
fil_node_t* message1; /*!< message which is given by the */
void* message2; /*!< the requester of an aio operation
and which can be used to identify
which pending aio operation was
completed */
#ifdef WIN_ASYNC_IO
HANDLE handle; /*!< handle object we need in the
OVERLAPPED struct */
OVERLAPPED control; /*!< Windows control block for the
aio request */
#elif defined(LINUX_NATIVE_AIO)
struct iocb control; /* Linux control block for aio */
int n_bytes; /* bytes written/read. */
int ret; /* AIO return code */
#endif
};
3. os_aio_array_reserve_slot
Before os_aio_linux_dispatch is called, os_aio_array_reserve_slot is called first to locate a free slot in the aio array and do the preparation for the aio request:
If the array is full:
native aio: os_wait_event(array->not_full) waits for the not_full event;
simulated (non-native) aio: os_aio_simulated_wake_handler_threads() wakes the handler threads.
If the array is not full:
Windows: set up the OVERLAPPED structure; ResetEvent(slot->handle);
Linux native aio: set up the iocb structure, calling io_prep_pread(iocb, file, buf, len, aio_offset) or io_prep_pwrite depending on type, and set iocb->data = (void*) slot (see the sketch below).
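A hedged sketch of the Linux native aio branch just described, using the slot fields from the struct above. The real os_aio_array_reserve_slot also records name, len, buf, and the offset split into 32-bit halves; prepare_linux_aio_slot is a hypothetical helper name used only for illustration.

#include <sys/types.h>
#include <libaio.h>

/* Hedged sketch: fill the slot's embedded iocb for a Linux native aio request. */
static void
prepare_linux_aio_slot(os_aio_slot_t* slot, os_file_t file, ulint type,
                       void* buf, ulint len, off_t aio_offset)
{
        struct iocb*    iocb = &slot->control;

        if (type == OS_FILE_READ) {
                io_prep_pread(iocb, file, buf, len, aio_offset);   /* zeroes and fills the iocb */
        } else {
                io_prep_pwrite(iocb, file, buf, len, aio_offset);
        }

        /* Must be set after io_prep_*: stash the slot pointer so that
        io_getevents() can map the completion event back to this slot
        (see os_aio_linux_collect below). */
        iocb->data = (void*) slot;
}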
Note: regarding the os_event_t not_full field of os_aio_array_t above:
os_event_create()
os_event_wait() ---> os_event_wait_low ---> while (condition) { ...; pthread_cond_wait(cond, mutex); ... }
os_event_set()/os_event_reset()
os_event_free()
os_event_t
os_event_create(
/*============*/
const char* name) /*!< in: the name of the event, if NULL
the event is created without a name */
{
os_event_t event;
.... //for windows
{
UT_NOT_USED(name);
event = ut_malloc(sizeof(struct os_event_struct));
os_fast_mutex_init(&(event->os_mutex));
os_cond_init(&(event->cond_var));
event->is_set = FALSE;
/* We return this value in os_event_reset(), which can then
be used to pass to the os_event_wait_low(). The value of zero
is reserved in os_event_wait_low() for the case when the
caller does not want to pass any signal_count value. To
distinguish between the two cases we initialize signal_count
to 1 here. */
event->signal_count = 1;
}
/* The os_sync_mutex can be NULL because during startup an event
can be created [ because it's embedded in the mutex/rwlock ] before
this module has been initialized */
if (os_sync_mutex != NULL) {
os_mutex_enter(os_sync_mutex);
}
/* Put to the list of events */
UT_LIST_ADD_FIRST(os_event_list, os_event_list, event);
os_event_count++;
if (os_sync_mutex != NULL) {
os_mutex_exit(os_sync_mutex);
}
return(event);
}
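As a side note on the wait chain listed above, here is a minimal sketch of the wait/set pattern that os_event_wait_low() and os_event_set() reduce to. The struct and function names are hypothetical; the real InnoDB code additionally tracks signal_count to detect missed wakeups.

#include <pthread.h>
#include <stdbool.h>

/* Hedged sketch of the event wait/set pattern (hypothetical names). */
struct my_event {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        bool            is_set;
};

static void my_event_wait(struct my_event* ev)
{
        pthread_mutex_lock(&ev->mutex);
        while (!ev->is_set) {                   /* loop guards against spurious wakeups */
                pthread_cond_wait(&ev->cond, &ev->mutex);
        }
        pthread_mutex_unlock(&ev->mutex);
}

static void my_event_set(struct my_event* ev)
{
        pthread_mutex_lock(&ev->mutex);
        ev->is_set = true;
        pthread_cond_broadcast(&ev->cond);      /* wake all waiters */
        pthread_mutex_unlock(&ev->mutex);
}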
4. os_aio_linux_dispatch
static
ibool
os_aio_linux_dispatch(
/*==================*/
os_aio_array_t* array, /*!< in: io request array. */
os_aio_slot_t* slot) /*!< in: an already reserved slot. */
{
int ret;
ulint io_ctx_index;
struct iocb* iocb;
ut_ad(slot != NULL);
ut_ad(array);
ut_a(slot->reserved);
/* Find out what we are going to work with.
The iocb struct is directly in the slot.
The io_context is one per segment. */
iocb = &slot->control;
io_ctx_index = (slot->pos * array->n_segments) / array->n_slots;
// one segment, one thread, one aio_ctx (of type io_context_t)
ret = io_submit(array->aio_ctx[io_ctx_index], 1, &iocb); // the 1 is the size of the iocb array
/* io_submit returns number of successfully
queued requests or -errno. */
if (UNIV_UNLIKELY(ret != 1)) {
errno = -ret;
return(FALSE);
}
return(TRUE);
}
If execution must not continue until an asynchronous read/write has completed, it can be done like this:
For the log, for example, rw_lock_x_lock_gen(&(log_sys->checkpoint_lock), LOG_CHECKPOINT) is called before the asynchronous read/write (the call to fil_io) to take an exclusive lock; afterwards rw_lock_s_lock(&(log_sys->checkpoint_lock)); rw_lock_s_unlock(&(log_sys->checkpoint_lock)) are called, so the shared lock can only be acquired, and execution can only continue, once the exclusive lock has been released. When is the exclusive lock released? Presumably after the IO thread has finished the actual read/write.
More concretely:
Take the exclusive lock: innobase_start_or_create_for_mysql -> recv_recovery_from_checkpoint_start_func -> recv_synchronize_groups -> log_groups_write_checkpoint_info -> log_group_checkpoint -> pfs_rw_lock_x_lock_func;
After the exclusive lock is taken, start the asynchronous IO and try to take the shared lock, which blocks;
IO thread: the asynchronous IO writes the data; once the data is written, the exclusive lock is released: io_handler_thread -> fil_aio_wait -> log_io_complete -> log_io_complete_checkpoint -> log_complete_checkpoint (called when log_sys->n_pending_checkpoint_writes == 0; this function is also involved in advancing the LSN) -> pfs_rw_lock_x_unlock_func. Once it is released, the thread blocked above can continue. This guarantees that the log has been written to the file before further operations proceed. A sketch of this handshake is given below.
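A minimal sketch of this handshake. InnoDB's own rw_lock_t allows the x-lock taken before fil_io() to be released later by the IO handler thread; a POSIX rwlock does not permit unlocking from a different thread, so this sketch models the same "block until the IO thread signals completion" behaviour with a POSIX semaphore. The function names are hypothetical.

#include <semaphore.h>

/* Hedged sketch of the checkpoint-write handshake described above. */
static sem_t checkpoint_written;

/* Thread issuing the checkpoint write (corresponds to taking the x-lock
and then waiting on the s-lock): */
static void issue_checkpoint_write(void)
{
        sem_init(&checkpoint_written, 0, 0);    /* "x-lock held": not yet written */
        /* ... call fil_io() here to submit the asynchronous write ... */

        sem_wait(&checkpoint_written);          /* blocks, like rw_lock_s_lock() */
        /* From here on, the checkpoint is known to be on disk. */
}

/* IO handler thread, once the write is complete (corresponds to
rw_lock_x_unlock in log_complete_checkpoint): */
static void on_checkpoint_write_complete(void)
{
        sem_post(&checkpoint_written);
}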
The following is part of the code in fil_aio_wait() (it comes after the call to os_aio_linux_handle):
/* Do the i/o handling */
/* IMPORTANT: since i/o handling for reads will read also the insert
buffer in tablespace 0, you have to be very careful not to introduce
deadlocks in the i/o system. We keep tablespace 0 data files always
open, and use a special i/o thread to serve insert buffer requests. */
if (fil_node->space->purpose == FIL_TABLESPACE) {
srv_set_io_thread_op_info(segment, "complete io for buf page");
buf_page_io_complete(message);
} else {
srv_set_io_thread_op_info(segment, "complete io for log");
log_io_complete(message);
}
5. os_aio_linux_handle
io_handler_thread -> fil_aio_wait -> os_aio_linux_handle(segment, &fil_node, &message, &type)
As can be seen, this function is called from the IO threads and is used to wait for aio completion. Note that the segment parameter above is the number of the segment in the aio array.
1. Loop indefinitely, scanning the array, until a completed I/O operation is found (slot->io_already_done);
2. If no I/O has completed yet but there are pending I/O requests, enter os_aio_linux_collect: it calls io_getevents with the timeout set to OS_AIO_REAP_TIMEOUT, effectively busy-waiting; if io_getevents returns ret > 0 there are completed I/Os, and a few fields are filled in, most importantly slot->io_already_done is set to TRUE. (If the system's I/O is idle, an io_thread spends most of its time inside io_getevents.) A simplified skeleton of the handle loop is sketched below, followed by the actual os_aio_linux_collect code.
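Before the collect code, a hedged skeleton of the loop just described, written against the structs and helpers shown earlier in this article. The real os_aio_linux_handle additionally frees the slot, copies out message1/message2 and the error status, and handles shutdown.

/* Hedged skeleton of the os_aio_linux_handle loop (simplified). */
static os_aio_slot_t*
os_aio_linux_handle_sketch(os_aio_array_t* array, ulint segment)
{
        ulint   seg_size = array->n_slots / array->n_segments;

        for (;;) {
                ulint   i;

                /* 1. Scan this segment's slots for a completed request. */
                os_mutex_enter(array->mutex);
                for (i = segment * seg_size; i < (segment + 1) * seg_size; i++) {
                        os_aio_slot_t*  slot = &array->slots[i];

                        if (slot->reserved && slot->io_already_done) {
                                /* The real code copies out message1/message2,
                                checks slot->ret / slot->n_bytes and frees the
                                slot here. */
                                os_mutex_exit(array->mutex);
                                return(slot);
                        }
                }
                os_mutex_exit(array->mutex);

                /* 2. Nothing completed yet: reap finished IOs from the kernel;
                this sets io_already_done on the corresponding slots. */
                os_aio_linux_collect(array, segment, seg_size);
        }
}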
static
void
os_aio_linux_collect(
/*=================*/
os_aio_array_t* array, /*!< in/out: slot array. */
ulint segment, /*!< in: local segment no. */
ulint seg_size) /*!< in: segment size. */
{
int i;
int ret;
ulint start_pos;
ulint end_pos;
struct timespec timeout;
struct io_event* events;
struct io_context* io_ctx;
/* sanity checks. */
ut_ad(array != NULL);
ut_ad(seg_size > 0);
ut_ad(segment < array->n_segments);
/* Which part of event array we are going to work on. */
events = &array->aio_events[segment * seg_size];
/* Which io_context we are going to use. */
io_ctx = array->aio_ctx[segment];
/* Starting point of the segment we will be working on. */
start_pos = segment * seg_size;
/* End point. */
end_pos = start_pos + seg_size;
retry:
/* Initialize the events. The timeout value is arbitrary.
We probably need to experiment with it a little. */
memset(events, 0, sizeof(*events) * seg_size);
timeout.tv_sec = 0;
timeout.tv_nsec = OS_AIO_REAP_TIMEOUT;
ret = io_getevents(io_ctx, 1, seg_size, events, &timeout); // events carries back the iocbs previously passed in via io_submit
if (ret > 0) {
for (i = 0; i < ret; i++) {
os_aio_slot_t* slot;
struct iocb* control;
control = (struct iocb *)events[i].obj; // events[i].obj is the iocb the caller submitted
ut_a(control != NULL);
slot = (os_aio_slot_t *) control->data; // the data field holds the caller's private pointer, mentioned again at the end of this article
/* Some sanity checks. */
ut_a(slot != NULL);
ut_a(slot->reserved);
/* We are not scribbling previous segment. */
ut_a(slot->pos >= start_pos);
/* We have not overstepped to next segment. */
ut_a(slot->pos < end_pos);
/* Mark this request as completed. The error handling
will be done in the calling function. */
os_mutex_enter(array->mutex);
slot->n_bytes = events[i].res;
slot->ret = events[i].res2;
slot->io_already_done = TRUE; // the key step: this is how os_aio_linux_handle finds a completed IO operation
os_mutex_exit(array->mutex);
}
return;
}
if (UNIV_UNLIKELY(srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS)) {
return;
}
/* This error handling is for any error in collecting the
IO requests. The errors, if any, for any particular IO
request are simply passed on to the calling routine. */
switch (ret) {
case -EAGAIN:
/* Not enough resources! Try again. */
case -EINTR:
/* Interrupted! I have tested the behaviour in case of an
interrupt. If we have some completed IOs available then
the return code will be the number of IOs. We get EINTR only
if there are no completed IOs and we have been interrupted. */
case 0:
/* No pending request! Go back and check again. */
goto retry;
}
/* All other errors should cause a trap for now. */
ut_print_timestamp(stderr);
fprintf(stderr,
" InnoDB: unexpected ret_code[%d] from io_getevents()!\n",
ret);
ut_error;
}
About Linux native aio:
1. The first step of aio is to create an AIO context, which tracks the progress of the asynchronous IOs requested by the process. It is created by int io_setup(unsigned nr_events, aio_context_t *ctxp). In MySQL, os_aio_array_create (seen earlier) creates the array, and along the way os_aio_array_create -> os_aio_linux_create_io_ctx(n/n_segments, &array->aio_ctx[i]) -> io_setup().
2. Asynchronous IO requests are submitted through the io_submit() system call, which in turn calls io_submit_one to allocate a kiocb object for each iocb and add it to the run_list IO request queue of the kioctx AIO context; aio_run_iocb is then called to start the IO.
3. Completed IO requests are collected with io_getevents: its io_event* events parameter carries back the iocbs previously passed in via io_submit, and each iocb's data field holds the caller-supplied pointer. If at least min_nr IO events have completed (or the timeout expires), the completed io_events are copied into events and the number of io_events (or an error) is returned; otherwise the process adds itself to the kioctx wait queue and is suspended. A minimal standalone example of these three steps is given below.
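To make the three steps concrete, here is a minimal standalone example using the libaio wrappers (io_setup/io_prep_pread/io_submit/io_getevents) rather than the raw system calls; the file name is just an example, and it builds with "gcc demo.c -laio".

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        io_context_t    ctx;
        struct iocb     cb;
        struct iocb*    cbs[1] = { &cb };
        struct io_event ev;
        char            buf[4096];
        int             ret;

        int fd = open("/etc/hostname", O_RDONLY);       /* example file only */
        if (fd < 0) { perror("open"); return 1; }

        memset(&ctx, 0, sizeof(ctx));
        ret = io_setup(8, &ctx);                        /* step 1: create the AIO context */
        if (ret < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-ret)); return 1; }

        io_prep_pread(&cb, fd, buf, sizeof(buf), 0);    /* fill the iocb: read 4 KB at offset 0 */
        cb.data = (void*) 0x1234;                       /* private pointer, like slot in InnoDB */

        ret = io_submit(ctx, 1, cbs);                   /* step 2: submit; returns #queued or -errno */
        if (ret != 1) { fprintf(stderr, "io_submit: %s\n", strerror(-ret)); return 1; }

        ret = io_getevents(ctx, 1, 1, &ev, NULL);       /* step 3: reap completed IOs (blocking) */
        if (ret == 1) {
                printf("read %ld bytes, data=%p\n", (long) ev.res, ev.data);
        }

        io_destroy(ctx);
        close(fd);
        return 0;
}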
Reference: http://www.cnblogs.com/hustcat/archive/2013/02/05/2893488.html