[MySQL Internals Series] MySQL 8.0 InnoDB Two-Phase Commit: A Source Code Walkthrough

Article link: https://blog.youkuaiyun.com/joychenwenyu/article/details/130248839

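This article walks through the transaction commit flow in MySQL's InnoDB engine: the entry function trans_commit, the prepare and commit phases inside ha_commit_trans, and the writing of the redo log and binlog. In the prepare phase the redo log is written to the redo buffer. In the commit phase the server waits for the redo log to be persisted, performs binlog group commit, and finally commits the transaction in memory in the storage engine layer.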

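As is well known, MySQL's InnoDB engine supports transactions and keeps the redo log and the binlog consistent through two-phase commit. Without further ado, let's read the code. The source used here is Percona Server for MySQL 8.0.26.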

First, find the transaction commit entry function, trans_commit. It calls ha_commit_trans, which contains the actual commit logic:

/**
  Commit the current transaction, making its changes permanent.

  @param[in] thd                       Current thread
  @param[in] ignore_global_read_lock   Allow commit to complete even if a
                                       global read lock is active. This can be
                                       used to allow changes to internal tables
                                       (e.g. slave status tables, analyze
  table).

  @retval false  Success
  @retval true   Failure
*/

bool trans_commit(THD *thd, bool ignore_global_read_lock) { 
  ...

  thd->server_status &=
      ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  res = ha_commit_trans(thd, true, ignore_global_read_lock);
  if (res == false)
    if (thd->rpl_thd_ctx.session_gtids_ctx().notify_after_transaction_commit(
            thd))
      LogErr(WARNING_LEVEL, ER_TRX_GTID_COLLECT_REJECT);
  ...
}

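ha_commit_trans is a long function. Here we focus on its two phases: prepare (writing the redo log to the redo buffer) and commit (group commit of the redo log and binlog). These are the two calls tc_log->prepare and tc_log->commit in the code below: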

int ha_commit_trans(THD *thd, bool all, bool ignore_global_read_lock) {
  int error = 0;
  THD_STAGE_INFO(thd, stage_waiting_for_handler_commit);
  bool run_slave_post_commit = false;
  bool need_clear_owned_gtid = false;
  
  ...

  if (ha_info && !error) {
    uint rw_ha_count = 0;
    bool rw_trans;
    
    ...

    if (!trn_ctx->no_2pc(trx_scope) && (trn_ctx->rw_ha_count(trx_scope) > 1))
      error = tc_log->prepare(thd, all);
  }
    
  ...

  if (error || (error = tc_log->commit(thd, all))) {
    ha_rollback_trans(thd, all);
    error = 1;
    goto end;
  }
  
  ...
}
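The control flow of the excerpt above can be modelled in a few lines. The following is a minimal sketch, not the server's code: ToyParticipant and toy_commit_trans are hypothetical names invented for illustration, and the model assumes only what the excerpt shows, namely that prepare runs when more than one read-write participant is involved, and that any error leads to a rollback and a return value of 1.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one participant of the transaction.
struct ToyParticipant {
  std::string name;
  bool read_write = true;          // did it modify data in this transaction?
  bool fail_on_prepare = false;    // inject a prepare failure
  std::vector<std::string> calls;  // records the calls it received

  int prepare() {
    calls.push_back("prepare");
    return fail_on_prepare ? 1 : 0;
  }
  int commit() {
    calls.push_back("commit");
    return 0;
  }
  void rollback() { calls.push_back("rollback"); }
};

// Toy coordinator with the same shape as the excerpt of ha_commit_trans:
// prepare only when more than one read-write participant is involved,
// then commit; on any error roll everything back and return 1.
int toy_commit_trans(std::vector<ToyParticipant> &participants) {
  assert(!participants.empty());

  int rw_count = 0;
  for (const ToyParticipant &p : participants) {
    if (p.read_write) ++rw_count;
  }

  int error = 0;
  if (rw_count > 1) {
    // Phase 1: stop at the first participant that fails to prepare.
    for (ToyParticipant &p : participants) {
      error = p.prepare();
      if (error != 0) break;
    }
  }

  if (error == 0) {
    // Phase 2: every participant commits.
    for (ToyParticipant &p : participants) {
      error = p.commit();
      if (error != 0) break;
    }
  }

  if (error != 0) {
    for (ToyParticipant &p : participants) p.rollback();
    return 1;
  }
  return 0;
}
```

With two read-write participants, each one sees prepare followed by commit. With a single participant the prepare phase is skipped entirely, which matches the rw_ha_count(trx_scope) > 1 condition in the excerpt.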

The Prepare Phase

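Let's analyze the prepare code first, that is, the path that writes the redo log to the redo buffer. tc_log->prepare resolves to MYSQL_BIN_LOG::prepare in binlog.cc, and the real work is done in its callee ha_prepare_low: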

int MYSQL_BIN_LOG::prepare(THD *thd, bool all) {
  
  ...

  int error = ha_prepare_low(thd, all);

  return error;
}

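The actual prepare step is implemented by each storage engine. Pay attention to what ht->prepare points to: since we are analyzing InnoDB, we need the function that InnoDB registers when it initializes, which is innobase_xa_prepare: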

int ha_prepare_low(THD *thd, bool all) {
  ...
  if (ha_info) {
    for (; ha_info && !error; ha_info = ha_info->next()) {
      int err = 0;
      handlerton *ht = ha_info->ht();
      ...
      if ((err = ht->prepare(ht, thd, all))) {
        char errbuf[MYSQL_ERRMSG_SIZE];
        my_error(ER_ERROR_DURING_COMMIT, MYF(0), err,
                 my_strerror(errbuf, MYSQL_ERRMSG_SIZE, err));
        error = 1;
      }
      ...
    }
  }
  return error;
}

static int innodb_init(void *p) {

  handlerton *innobase_hton = (handlerton *)p;
  innodb_hton_ptr = innobase_hton;
  ...
  innobase_hton->commit = innobase_commit; 
  innobase_hton->rollback = innobase_rollback;
  innobase_hton->prepare = innobase_xa_prepare;
  innobase_hton->flush_logs = innobase_flush_logs;
  ...
}
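The registration pattern above is a plain table of function pointers: the engine fills the table once at init time, and the server later calls through it without knowing which engine is behind it. Here is a small sketch of that idea. ToyThd, ToyHandlerton, and the toy_* functions are hypothetical and far smaller than the real THD and handlerton; only the shape of the mechanism is taken from the excerpts.

```cpp
#include <cassert>

// Hypothetical, heavily reduced stand-ins for THD and handlerton.
struct ToyThd {
  int prepare_calls = 0;
  int commit_calls = 0;
};

struct ToyHandlerton {
  int (*prepare)(ToyHandlerton *hton, ToyThd *thd, bool all) = nullptr;
  int (*commit)(ToyHandlerton *hton, ToyThd *thd, bool all) = nullptr;
};

// Engine-side implementations (illustrative names).
static int toy_engine_prepare(ToyHandlerton *, ToyThd *thd, bool) {
  ++thd->prepare_calls;
  return 0;
}

static int toy_engine_commit(ToyHandlerton *, ToyThd *thd, bool) {
  ++thd->commit_calls;
  return 0;
}

// Plays the role of innodb_init: fill in the function-pointer table.
int toy_engine_init(void *p) {
  ToyHandlerton *hton = static_cast<ToyHandlerton *>(p);
  assert(hton != nullptr);
  hton->prepare = toy_engine_prepare;
  hton->commit = toy_engine_commit;
  return 0;
}

// Server-side caller: it only knows the table, not the engine behind it.
int toy_prepare_low(ToyHandlerton *ht, ToyThd *thd, bool all) {
  if (ht->prepare == nullptr) return 0;  // this engine has no prepare step
  return ht->prepare(ht, thd, all);
}
```

Calling toy_prepare_low before toy_engine_init is a no-op in this sketch. After initialization the same call reaches toy_engine_prepare, in the same way that ht->prepare reaches innobase_xa_prepare.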

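The code above shows the functions registered for prepare, commit, and flush_logs, which will be useful later. Next we continue into innobase_xa_prepare. Its core logic is in trx_prepare_for_mysql -> trx_prepare -> trx_prepare_low. After these three levels of calls we reach trx_prepare_low, where the call to focus on is mtr_commit: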

static lsn_t trx_prepare_low(
    trx_t *trx,               /*!< in/out: transaction */
    trx_undo_ptr_t *undo_ptr, /*!< in/out: pointer to rollback
                              segment scheduled for prepare. */
    bool noredo_logging)      /*!< in: turn-off redo logging. */
{
  if (undo_ptr->insert_undo != nullptr || undo_ptr->update_undo != nullptr) {
    ...
    /*--------------*/
    /* This mtr commit makes the transaction prepared in
    file-based world. */
    mtr_commit(&mtr);
    /*--------------*/
    ...
  }

  return 0;
}

mtr_commit is implemented by mtr_t::commit(). Let's go straight to cmd.execute():

/** Commit a mini-transaction. */
void mtr_t::commit() {
  DBUG_EXECUTE_IF("mtr_commit_crash", DBUG_SUICIDE(););

  Command cmd(this);

  if (m_impl.m_n_log_recs > 0 ||
      (m_impl.m_modifications && m_impl.m_log_mode == MTR_LOG_NO_REDO)) {
    ut_ad(!srv_read_only_mode || m_impl.m_log_mode == MTR_LOG_NO_REDO);

    cmd.execute();
  } else {
    cmd.release_all();
    cmd.release_resources();
  }
}

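If you have made it this far, congratulations: this is the key part, where the redo log finally gets written. An mtr (mini-transaction) is an atomic group of page-level changes; search for it if you want a detailed analysis. MySQL 8.0 uses a lock-free, parallel design for writing the redo buffer, which improves redo write performance. The prepare phase only copies the redo records into the redo buffer, in parallel and without locks, and adds the modified dirty pages to the flush list. That is the end of the prepare phase. The data in the redo buffer is then handled by background threads: log_writer writes it to the log file (into the file system cache), and log_flusher calls fsync to flush it to disk.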

void mtr_t::Command::execute() {
  ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);

#ifndef UNIV_HOTBACKUP
  ulint len = prepare_write();  // number of redo log bytes to write

  if (len > 0) {
    mtr_write_log_t write_log;
    write_log.m_left_to_write = len;

    // compute the start and end LSN and reserve space in the redo buffer
    auto handle = log_buffer_reserve(*log_sys, len); 

    // copy the redo records into the reserved redo buffer space
    m_impl->m_log.for_each_block(write_log);

    // add the modified dirty pages to the head of the flush list; from head
    // to tail, the flush list is linked in descending LSN order
    add_dirty_blocks_to_flush_list(handle.start_lsn, handle.end_lsn);
  } else {
    DEBUG_SYNC_C("mtr_noredo_before_add_dirty_blocks");

    add_dirty_blocks_to_flush_list(0, 0);
  }
#endif /* !UNIV_HOTBACKUP */
}
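To make the "reserve a range, then copy" idea concrete, here is a minimal sketch of lock-free space reservation with a single atomic counter. It illustrates the general technique only and is not InnoDB's implementation: ToyLogBuffer and its members are hypothetical, and the sketch ignores log block headers, buffer wrap-around, and the bookkeeping needed to know when all earlier ranges have been completely written.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical miniature of a redo buffer; not InnoDB's log_t.
struct ToyLogBuffer {
  struct Handle {
    std::uint64_t start_lsn;
    std::uint64_t end_lsn;
  };

  explicit ToyLogBuffer(std::size_t capacity) : bytes(capacity), next_lsn(0) {}

  // Reserve [start_lsn, end_lsn) with one atomic add. Concurrent callers
  // receive disjoint ranges without taking a mutex.
  Handle reserve(std::uint64_t len) {
    const std::uint64_t start = next_lsn.fetch_add(len);
    assert(start + len <= bytes.size());  // the toy buffer never wraps
    return Handle{start, start + len};
  }

  // Copy a record into its reserved range. Ranges are disjoint, so copies
  // made by different callers never touch the same bytes.
  void write(const Handle &h, const char *data) {
    std::memcpy(bytes.data() + h.start_lsn, data, h.end_lsn - h.start_lsn);
  }

  std::vector<char> bytes;
  std::atomic<std::uint64_t> next_lsn;
};
```

Because each reservation is a single fetch_add, the order in which callers later copy their bytes does not matter: a record reserved second can be written first and still lands at its own offset. The cost is that a reader of the buffer must separately track which ranges are already filled, which this sketch leaves out.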

The Commit Phase

Next we analyze the commit phase, that is, the code behind tc_log->commit. tc_log->commit resolves to MYSQL_BIN_LOG::commit, and the core logic is in commit -> ordered_commit. Here are the key stages of ordered_commit:

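stage0: If necessary, wait for the transactions being replayed by the SQL thread to commit. As the code comment says, this stage makes the slave threads commit in the order in which their transactions appear in the relay log. It is easy to see why this matters: if a node has just been promoted from slave to master, there may be replication lag and relay log entries that have not been replayed yet, so a new commit has to wait until the relay log has been replayed in order to keep the data consistent.
stage1: Wait for the redo log to be persisted, then perform binlog group commit.
stage2: Persist the binlog data that has already been written to the file (file system cache) to disk by calling fsync.
stage3: Storage engine commit. Tracing the code through process_commit_stage_queue -> ha_commit_low -> ht->commit, where ht->commit is the innobase_commit registered in innodb_init, and on through innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit -> trx_commit_low -> trx_commit_in_memory, we find that the storage engine commit is an in-memory operation: trx_commit_in_memory calls trx_release_impl_and_expl_locks, which sets the transaction state to TRX_STATE_COMMITTED_IN_MEMORY. At this point the transaction commit is complete. If innodb_flush_log_at_trx_commit is set to 1, trx_flush_log_if_needed additionally writes the redo log up to the LSN of the committed transaction and persists it.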

int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit) {
  ...
  /*
    Stage #0: ensure slave threads commit order as they appear in the slave's
              relay log for transactions flushing to binary log.
    Wait for transactions replayed by the SQL thread to commit, if there are
    any.
  */
  if (Commit_order_manager::wait_for_its_turn_before_flush_stage(thd) ||
      ending_trans(thd, all) ||
      Commit_order_manager::get_rollback_status(thd)) {
    if (Commit_order_manager::wait(thd)) {
      return thd->commit_error;
    }
  }

  /*
    Stage #1: flushing transactions to binary log
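    Two things happen here:
    1. Wait for the redo log to be persisted. As noted earlier, the prepare
       phase only writes the redo log to the buffer; persistence is done by
       the asynchronous log_writer and log_flusher threads, so this stage only
       needs to wait for that to finish and update the current lsn position.
    2. Binlog group commit: write to the file system cache. The next stage
       makes it durable.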
  */
  ...
  DEBUG_SYNC(thd, "waiting_in_the_middle_of_flush_stage");
  flush_error =
      process_flush_stage_queue(&total_bytes, &do_rotate, &wait_queue);  // persist the redo log

  if (flush_error == 0 && total_bytes > 0)
    flush_error = flush_cache_to_file(&flush_end_pos);
  DBUG_EXECUTE_IF("crash_after_flush_binlog", DBUG_SUICIDE(););

  update_binlog_end_pos_after_sync = (get_sync_period() == 1);
  ...
  /*
    Stage #2: Syncing binary log file to disk
    Persist the binlog by calling fsync.
  */
  ...
  if (flush_error == 0 && total_bytes > 0) {
    DEBUG_SYNC(thd, "before_sync_binlog_file");
    std::pair<bool, bool> result = sync_binlog_file(false);  // persist the binlog
    sync_error = result.first;
  }
  ...
  /*
    Stage #3: Commit all transactions in order.
    Storage engine commit.
  */
commit_stage:
  /* Clone needs binlog commit order. */
  if ((opt_binlog_order_commits || Clone_handler::need_commit_order()) &&
      (sync_error == 0 || binlog_error_action != ABORT_SERVER)) {
    
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(commit_queue);

    // storage engine commit
    process_commit_stage_queue(thd, commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    ...
  } else {
    if (leave_mutex_before_commit_stage)
      mysql_mutex_unlock(leave_mutex_before_commit_stage);
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(final_queue);
  }
  ...
}
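The point of splitting the commit into stages is that one pass through the pipeline can serve a whole group of transactions. The sketch below models only that property, under simplifying assumptions: a single caller processes a ready-made queue, there are no stage mutexes or leader/follower handoff, and strings stand in for the file system cache and the disk. ToyTrx and ToyGroupCommit are hypothetical names, not MySQL types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical transaction: an id plus the binlog bytes it produced.
struct ToyTrx {
  int id = 0;
  std::string binlog_events;
  bool committed = false;
};

// Toy three-stage pipeline applied to one group of queued transactions.
struct ToyGroupCommit {
  std::string file_cache;  // written but not yet synced
  std::string on_disk;     // considered durable
  int sync_calls = 0;
  std::vector<int> commit_order;

  // Stage 1: append the events of every queued transaction to the cache.
  void flush_stage(const std::vector<ToyTrx *> &queue) {
    for (const ToyTrx *trx : queue) file_cache += trx->binlog_events;
  }

  // Stage 2: one sync makes the whole group durable.
  void sync_stage() {
    on_disk += file_cache;
    file_cache.clear();
    ++sync_calls;
  }

  // Stage 3: engine commit, in the same order as the queue.
  void commit_stage(const std::vector<ToyTrx *> &queue) {
    for (ToyTrx *trx : queue) {
      trx->committed = true;
      commit_order.push_back(trx->id);
    }
  }

  void ordered_commit(const std::vector<ToyTrx *> &queue) {
    assert(!queue.empty());
    flush_stage(queue);
    sync_stage();
    commit_stage(queue);
  }
};
```

For a queue of three transactions, this model performs one sync instead of three and commits the transactions in queue order, which is the ordering guarantee that the opt_binlog_order_commits branch of the excerpt is concerned with.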