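As is well known, MySQL's InnoDB engine supports transactions and keeps the redo log and the binlog consistent through a two-phase commit. Let's go straight to the code. The source version used here is Percona Server for MySQL 8.0.26.
First, find the entry point for transaction commit, trans_commit, which calls ha_commit_trans to do the actual work: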
/**
  Commit the current transaction, making its changes permanent.
  @param[in] thd                      Current thread
  @param[in] ignore_global_read_lock  Allow commit to complete even if a
                                      global read lock is active. This can be
                                      used to allow changes to internal tables
                                      (e.g. slave status tables, analyze
                                      table).
  @retval false Success
  @retval true  Failure
*/
bool trans_commit(THD *thd, bool ignore_global_read_lock) {
  ...
  thd->server_status &=
      ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  res = ha_commit_trans(thd, true, ignore_global_read_lock);
  if (res == false)
    if (thd->rpl_thd_ctx.session_gtids_ctx().notify_after_transaction_commit(
            thd))
      LogErr(WARNING_LEVEL, ER_TRX_GTID_COLLECT_REJECT);
  ...
}
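ha_commit_trans is a long function. Here we focus on its two phases: prepare (writing the redo log into the redo log buffer) and commit (group commit of the redo log and the binlog), that is, the calls to tc_log->prepare and tc_log->commit in the code below: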
int ha_commit_trans(THD *thd, bool all, bool ignore_global_read_lock) {
  int error = 0;
  THD_STAGE_INFO(thd, stage_waiting_for_handler_commit);
  bool run_slave_post_commit = false;
  bool need_clear_owned_gtid = false;
  ...
  if (ha_info && !error) {
    uint rw_ha_count = 0;
    bool rw_trans;
    ...
    if (!trn_ctx->no_2pc(trx_scope) && (trn_ctx->rw_ha_count(trx_scope) > 1))
      error = tc_log->prepare(thd, all);
  }
  ...
  if (error || (error = tc_log->commit(thd, all))) {
    ha_rollback_trans(thd, all);
    error = 1;
    goto end;
  }
  ...
}
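The control flow of ha_commit_trans (prepare first, commit only if prepare succeeded, roll back otherwise) can be illustrated with a small toy model. This is only a sketch: Participant and commit_transaction are hypothetical names invented for illustration and do not exist in the MySQL source, and the real function handles many more cases.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for anything that takes part in the two-phase
// commit (a storage engine, the binlog). Not a MySQL type.
struct Participant {
  std::string name;
  bool fail_prepare;  // set to true to simulate a prepare failure
  std::string state;

  explicit Participant(std::string n, bool fail = false)
      : name(std::move(n)), fail_prepare(fail), state("active") {}

  bool prepare() {
    if (fail_prepare) return false;
    state = "prepared";
    return true;
  }
  void commit() {
    assert(state == "prepared");  // phase 2 is only legal after phase 1
    state = "committed";
  }
  void rollback() { state = "rolled back"; }
};

// Returns true if every participant committed, false if the transaction
// was rolled back because some participant failed to prepare.
bool commit_transaction(std::vector<Participant> &participants) {
  // Phase 1: ask every participant to prepare.
  for (Participant &p : participants) {
    if (!p.prepare()) {
      for (Participant &q : participants) q.rollback();
      return false;
    }
  }
  // Phase 2: everyone is prepared, so the commit can proceed.
  for (Participant &p : participants) p.commit();
  return true;
}
```

With two healthy participants the call returns true and both end up committed; if any one of them fails to prepare, all of them are rolled back.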
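The prepare phase
We start with the prepare phase, in which the redo log is written to the redo log buffer. tc_log->prepare resolves to MYSQL_BIN_LOG::prepare in binlog.cc, and the actual work happens in the helper ha_prepare_low: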
int MYSQL_BIN_LOG::prepare(THD *thd, bool all) {
  ...
  int error = ha_prepare_low(thd, all);
  return error;
}
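The prepare step itself is implemented by each storage engine. Note that ht->prepare is a function pointer. Since we are analyzing InnoDB, we need the function that InnoDB registers when it initializes, which is innobase_xa_prepare: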
int ha_prepare_low(THD *thd, bool all) {
  ...
  if (ha_info) {
    for (; ha_info && !error; ha_info = ha_info->next()) {
      int err = 0;
      handlerton *ht = ha_info->ht();
      ...
      if ((err = ht->prepare(ht, thd, all))) {
        char errbuf[MYSQL_ERRMSG_SIZE];
        my_error(ER_ERROR_DURING_COMMIT, MYF(0), err,
                 my_strerror(errbuf, MYSQL_ERRMSG_SIZE, err));
        error = 1;
      }
      ...
    }
  }
  return error;
}
static int innodb_init(void *p) {
  handlerton *innobase_hton = (handlerton *)p;
  innodb_hton_ptr = innobase_hton;
  ...
  innobase_hton->commit = innobase_commit;
  innobase_hton->rollback = innobase_rollback;
  innobase_hton->prepare = innobase_xa_prepare;
  innobase_hton->flush_logs = innobase_flush_logs;
  ...
}
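The registration pattern used by innodb_init and ha_prepare_low is a plain table of function pointers. The following is a minimal sketch of that pattern only; ToyHandlerton, toy_engine_init, and prepare_all are hypothetical names for illustration, and the real handlerton has far more members and different signatures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily reduced stand-in for a handlerton: a table of
// function pointers that an engine fills in when it initializes.
struct ToyHandlerton {
  const char *name = "";
  int (*prepare)(ToyHandlerton *self, std::vector<std::string> *trace) =
      nullptr;
  int (*commit)(ToyHandlerton *self, std::vector<std::string> *trace) =
      nullptr;
};

// Engine-specific callbacks (illustrative only). They record what was
// called so the dispatch can be observed.
static int toy_engine_prepare(ToyHandlerton *self,
                              std::vector<std::string> *trace) {
  trace->push_back(std::string(self->name) + ":prepare");
  return 0;  // 0 means success, as in the server code
}

static int toy_engine_commit(ToyHandlerton *self,
                             std::vector<std::string> *trace) {
  trace->push_back(std::string(self->name) + ":commit");
  return 0;
}

// Plays the role of innodb_init: register the engine's callbacks.
void toy_engine_init(ToyHandlerton *hton) {
  hton->name = "toy_engine";
  hton->prepare = toy_engine_prepare;
  hton->commit = toy_engine_commit;
}

// Plays the role of ha_prepare_low: the server layer calls through the
// pointer without knowing which engine is behind it.
int prepare_all(const std::vector<ToyHandlerton *> &engines,
                std::vector<std::string> *trace) {
  for (ToyHandlerton *ht : engines) {
    assert(ht->prepare != nullptr);  // the engine must have registered it
    if (ht->prepare(ht, trace) != 0) return 1;
  }
  return 0;
}
```

The point of the indirection is that the server layer only needs to know the table layout; which function actually runs is decided once, at engine initialization.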
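The innodb_init excerpt shows the functions registered for prepare, commit, and flush_logs, which will be useful later. Next, look at innobase_xa_prepare. Its core logic follows the call chain trx_prepare_for_mysql -> trx_prepare -> trx_prepare_low. After these three levels of calls we reach trx_prepare_low, where the key step is mtr_commit: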
static lsn_t trx_prepare_low(
    trx_t *trx,               /*!< in/out: transaction */
    trx_undo_ptr_t *undo_ptr, /*!< in/out: pointer to rollback
                              segment scheduled for prepare. */
    bool noredo_logging)      /*!< in: turn-off redo logging. */
{
  if (undo_ptr->insert_undo != nullptr || undo_ptr->update_undo != nullptr) {
    ...
    /*--------------*/
    /* This mtr commit makes the transaction prepared in
    file-based world. */
    mtr_commit(&mtr);
    /*--------------*/
    ...
  }
  return 0;
}
mtr_commit is implemented by mtr_t::commit(). The part to look at is cmd.execute():
/** Commit a mini-transaction. */
void mtr_t::commit() {
  DBUG_EXECUTE_IF("mtr_commit_crash", DBUG_SUICIDE(););
  Command cmd(this);
  if (m_impl.m_n_log_recs > 0 ||
      (m_impl.m_modifications && m_impl.m_log_mode == MTR_LOG_NO_REDO)) {
    ut_ad(!srv_read_only_mode || m_impl.m_log_mode == MTR_LOG_NO_REDO);
    cmd.execute();
  } else {
    cmd.release_all();
    cmd.release_resources();
  }
}
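If you have made it this far, congratulations: this is the important part, where the redo log finally gets written. An mtr (mini-transaction) is an atomic group of page-level changes; search for it if you want a detailed explanation. MySQL 8.0 uses a lock-free, concurrent design for writing to the redo log buffer, which improves redo write performance. The prepare phase only copies the redo records into the redo log buffer, in parallel with other writers and without a global lock, and adds the modified dirty pages to the flush list. After that, the prepare phase is done. The data in the redo log buffer is written to the OS file cache by the background log_writer thread and fsynced to disk by the log_flusher thread.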
void mtr_t::Command::execute() {
  ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);
#ifndef UNIV_HOTBACKUP
  ulint len = prepare_write();  // number of redo log bytes to write
  if (len > 0) {
    mtr_write_log_t write_log;
    write_log.m_left_to_write = len;
    // compute the start and end LSN and reserve space in the redo log buffer
    auto handle = log_buffer_reserve(*log_sys, len);
    // copy the redo records into the reserved redo log buffer space
    m_impl->m_log.for_each_block(write_log);
    // add the modified dirty pages to the head of the flush list; from head
    // to tail the flush list is linked in (roughly) descending LSN order
    add_dirty_blocks_to_flush_list(handle.start_lsn, handle.end_lsn);
  } else {
    DEBUG_SYNC_C("mtr_noredo_before_add_dirty_blocks");
    add_dirty_blocks_to_flush_list(0, 0);
  }
#endif /* !UNIV_HOTBACKUP */
}
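The reserve-then-copy idea in mtr_t::Command::execute can be sketched as follows. This is an illustration under simplifying assumptions, not InnoDB's implementation: ToyLogBuffer and its members are hypothetical, the LSN is treated as a plain byte offset, and the buffer neither wraps around nor waits for free space. The key point is that a single atomic fetch_add hands each writer a disjoint range, so the copies themselves need no lock.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Toy log buffer (hypothetical, for illustration only).
class ToyLogBuffer {
 public:
  struct Handle {
    std::uint64_t start_lsn;
    std::uint64_t end_lsn;
  };

  explicit ToyLogBuffer(std::size_t capacity)
      : buf_(capacity, '\0'), next_lsn_(0) {}

  // Reserve len bytes. The atomic fetch_add makes concurrent calls return
  // non-overlapping ranges without taking a mutex.
  Handle reserve(std::size_t len) {
    const std::uint64_t start = next_lsn_.fetch_add(len);
    assert(start + len <= buf_.size());  // this sketch never wraps around
    return Handle{start, start + len};
  }

  // Copy a record into a previously reserved range. Writers that hold
  // different handles touch disjoint bytes, so they can copy in any order.
  void write(const Handle &h, const std::string &record) {
    assert(record.size() == h.end_lsn - h.start_lsn);
    std::memcpy(&buf_[h.start_lsn], record.data(), record.size());
  }

  std::uint64_t current_lsn() const { return next_lsn_.load(); }

  // Everything reserved so far, for inspection.
  std::string contents() const { return buf_.substr(0, next_lsn_.load()); }

 private:
  std::string buf_;
  std::atomic<std::uint64_t> next_lsn_;
};
```

For example, if one writer reserves 3 bytes and another reserves 5, the second writer can copy its record first and the buffer still ends up with the records in reservation order.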
The commit phase
Next, look at the commit phase, that is, the code behind tc_log->commit, which resolves to MYSQL_BIN_LOG::commit. The core logic is in commit -> ordered_commit. The main stages of ordered_commit are:
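- Stage 0: if necessary, wait for transactions applied by the replication SQL (applier) threads to take their turn. On a replica that must preserve commit order, each applier worker waits here so that transactions enter the flush stage in the same order as they appear in the relay log, which keeps the data consistent.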
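- Stage 1: wait for the redo log to be persisted, and perform the binlog group commit (write the binlog to the file).
- Stage 2: persist the binlog written in stage 1 to disk with an fsync call.
- Stage 3: commit in the storage engine. Following process_commit_stage_queue -> ha_commit_low -> ht->commit, where ht->commit is the innobase_commit registered in innodb_init, and continuing through innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit -> trx_commit_low -> trx_commit_in_memory, we find that the storage engine commit is an in-memory operation: trx_commit_in_memory calls trx_release_impl_and_expl_locks, which sets the transaction state to TRX_STATE_COMMITTED_IN_MEMORY. At this point the transaction is committed. If innodb_flush_log_at_trx_commit is set to 1, trx_flush_log_if_needed additionally writes the redo log up to the committed transaction's LSN and persists it.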
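The stages listed above can be pictured with a toy leader/follower model: the first transaction to join the queue becomes the leader and carries the whole group through flush, sync, and commit, so one fsync serves many transactions. This is a single-threaded sketch under simplifying assumptions. ToyGroupCommit is a hypothetical class, and it uses one queue where the real code has a separate queue and leader election per stage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one group-commit round (not MySQL's real classes).
class ToyGroupCommit {
 public:
  // Join the queue. Returns true when the caller is first in the queue
  // and therefore becomes the leader for this group.
  bool enroll(int trx_id) {
    const bool is_leader = queue_.empty();
    queue_.push_back(trx_id);
    return is_leader;
  }

  // Run by the leader on behalf of every queued transaction.
  void run_as_leader() {
    std::vector<int> group;
    group.swap(queue_);  // take the whole queue; later arrivals form a new group
    assert(!group.empty());

    // Flush stage: each transaction's binlog cache is written to the file.
    written_ += group.size();
    // Sync stage: a single fsync covers the whole group.
    ++fsync_calls_;
    // Commit stage: engine commit, in the order the transactions queued up.
    for (int trx_id : group) commit_order_.push_back(trx_id);
  }

  int fsync_calls() const { return fsync_calls_; }
  std::size_t written() const { return written_; }
  const std::vector<int> &commit_order() const { return commit_order_; }

 private:
  std::vector<int> queue_;
  std::vector<int> commit_order_;
  std::size_t written_ = 0;
  int fsync_calls_ = 0;
};
```

If three transactions enroll before the leader runs, all three are committed with a single simulated fsync. The ordered_commit excerpt that follows shows how the real stages are laid out.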
int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit) {
  ...
  /*
    Stage #0: ensure slave threads commit order as they appear in the slave's
    relay log for transactions flushing to binary log.
    Replication applier threads wait here for their turn, if applicable.
  */
  if (Commit_order_manager::wait_for_its_turn_before_flush_stage(thd) ||
      ending_trans(thd, all) ||
      Commit_order_manager::get_rollback_status(thd)) {
    if (Commit_order_manager::wait(thd)) {
      return thd->commit_error;
    }
  }
  /*
    Stage #1: flushing transactions to binary log
    Two things happen here:
    1. Wait for the redo log to be persisted. As noted earlier, the prepare
       phase only writes the redo log to the buffer; persistence is done
       asynchronously by the log_writer and log_flusher threads, so here we
       only wait for it to finish and update the current LSN position.
    2. Binlog group commit: write to the file system cache. The next stage
       persists it.
  */
  ...
  DEBUG_SYNC(thd, "waiting_in_the_middle_of_flush_stage");
  flush_error = process_flush_stage_queue(&total_bytes, &do_rotate,
                                          &wait_queue);  // persist the redo log
  if (flush_error == 0 && total_bytes > 0)
    flush_error = flush_cache_to_file(&flush_end_pos);
  DBUG_EXECUTE_IF("crash_after_flush_binlog", DBUG_SUICIDE(););
  update_binlog_end_pos_after_sync = (get_sync_period() == 1);
  ...
  /*
    Stage #2: Syncing binary log file to disk
    Persist the binlog with fsync.
  */
  ...
  if (flush_error == 0 && total_bytes > 0) {
    DEBUG_SYNC(thd, "before_sync_binlog_file");
    std::pair<bool, bool> result = sync_binlog_file(false);  // persist the binlog
    sync_error = result.first;
  }
  ...
  /*
    Stage #3: Commit all transactions in order.
    Commit in the storage engine.
  */
commit_stage:
  /* Clone needs binlog commit order. */
  if ((opt_binlog_order_commits || Clone_handler::need_commit_order()) &&
      (sync_error == 0 || binlog_error_action != ABORT_SERVER)) {
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(commit_queue);
    // storage engine commit
    process_commit_stage_queue(thd, commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    ...
  } else {
    if (leave_mutex_before_commit_stage)
      mysql_mutex_unlock(leave_mutex_before_commit_stage);
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(final_queue);
  }
  ...
}
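Stage 3 mentions that the redo log is written and persisted at commit when innodb_flush_log_at_trx_commit is 1. For reference, the documented meaning of the three values of that setting can be summarized in a small function. RedoAction and redo_action_at_commit are hypothetical names used only for this sketch, not identifiers from the InnoDB source.

```cpp
#include <cassert>

// What happens to the redo log when a transaction commits.
enum class RedoAction {
  kNone,          // nothing at commit time
  kWriteOnly,     // written to the OS file cache, not fsynced
  kWriteAndFlush  // written and fsynced
};

// Maps the value of innodb_flush_log_at_trx_commit to the commit-time
// action, following the documented semantics of the setting.
RedoAction redo_action_at_commit(int flush_log_at_trx_commit) {
  // The server only accepts 0, 1, or 2 for this setting.
  assert(flush_log_at_trx_commit >= 0 && flush_log_at_trx_commit <= 2);
  switch (flush_log_at_trx_commit) {
    case 0:
      // The log is written and flushed about once per second in the
      // background; a crash can lose roughly the last second of commits.
      return RedoAction::kNone;
    case 2:
      // Written at each commit, flushed about once per second.
      return RedoAction::kWriteOnly;
    default:
      // 1: written and flushed at each commit (full durability).
      return RedoAction::kWriteAndFlush;
  }
}
```

Only the value 1 makes every committed transaction durable against an operating system crash, which is why the walkthrough above assumes it.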