[MySQL Internals Series] MySQL 8.0 InnoDB Two-Phase Transaction Commit: A Source Code Walkthrough

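This article walks through the transaction commit path of the MySQL InnoDB engine: the entry function trans_commit, the prepare and commit phases inside ha_commit_trans, and the writing of the redo log and the binlog. In the prepare phase the redo log is written to the redo log buffer. In the commit phase the server waits for the redo log to become durable, performs binlog group commit, and finally commits the transaction in memory at the storage engine layer.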

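As is well known, MySQL's InnoDB engine supports transactions and keeps the redo log and the binlog consistent through two-phase commit. Without further ado, let's read the code. The source version used here is Percona MySQL 8.0.26: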

First, locate the entry function for transaction commit, trans_commit. It calls ha_commit_trans, which contains the actual commit logic:

/**
  Commit the current transaction, making its changes permanent.

  @param[in] thd                       Current thread
  @param[in] ignore_global_read_lock   Allow commit to complete even if a
                                       global read lock is active. This can be
                                       used to allow changes to internal tables
                                       (e.g. slave status tables, analyze
  table).

  @retval false  Success
  @retval true   Failure
*/

bool trans_commit(THD *thd, bool ignore_global_read_lock) { 
  ...

  thd->server_status &=
      ~(SERVER_STATUS_IN_TRANS | SERVER_STATUS_IN_TRANS_READONLY);
  DBUG_PRINT("info", ("clearing SERVER_STATUS_IN_TRANS"));
  res = ha_commit_trans(thd, true, ignore_global_read_lock);
  if (res == false)
    if (thd->rpl_thd_ctx.session_gtids_ctx().notify_after_transaction_commit(
            thd))
      LogErr(WARNING_LEVEL, ER_TRX_GTID_COLLECT_REJECT);
  ...
}

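ha_commit_trans is a long function. Here we focus on two phases: prepare (writing the redo log to the redo log buffer) and commit (group commit of the redo log and the binlog). These correspond to the two calls tc_log->prepare and tc_log->commit in the excerpt below: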

int ha_commit_trans(THD *thd, bool all, bool ignore_global_read_lock) {
  int error = 0;
  THD_STAGE_INFO(thd, stage_waiting_for_handler_commit);
  bool run_slave_post_commit = false;
  bool need_clear_owned_gtid = false;
  
  ...

  if (ha_info && !error) {
    uint rw_ha_count = 0;
    bool rw_trans;
    
    ...

    if (!trn_ctx->no_2pc(trx_scope) && (trn_ctx->rw_ha_count(trx_scope) > 1))
      error = tc_log->prepare(thd, all);
  }
    
  ...

  if (error || (error = tc_log->commit(thd, all))) {
    ha_rollback_trans(thd, all);
    error = 1;
    goto end;
  }
  
  ...
}
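
The control flow above can be condensed into a toy model. The sketch below is not MySQL code: Participant, two_phase_commit, and the trace vector are hypothetical names made up for illustration. It only shows the shape of the protocol, assuming each participant exposes prepare, commit, and rollback callbacks: prepare every participant first, commit only if all of them succeed, and roll back otherwise.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a resource taking part in the commit
// (for example a storage engine or the binary log). Not a MySQL type.
struct Participant {
  std::string name;
  std::function<bool()> prepare;  // returns true on success
  std::function<void()> commit;
  std::function<void()> rollback;
};

// Toy coordinator: returns 0 on success and 1 on failure, mirroring the
// int error convention of ha_commit_trans in the excerpt above.
int two_phase_commit(std::vector<Participant> &parts,
                     std::vector<std::string> &trace) {
  // Phase 1: ask every participant to prepare; stop at the first failure.
  for (auto &p : parts) {
    if (!p.prepare()) {
      trace.push_back("prepare failed: " + p.name);
      // A failed prepare aborts the whole transaction on every participant.
      for (auto &q : parts) {
        q.rollback();
        trace.push_back("rollback: " + q.name);
      }
      return 1;
    }
    trace.push_back("prepared: " + p.name);
  }
  // Phase 2: everything is prepared, so the commit can proceed.
  for (auto &p : parts) {
    p.commit();
    trace.push_back("committed: " + p.name);
  }
  return 0;
}
```

In the real server the two participants are roughly the InnoDB redo log and the binlog, and the coordinator role is played by tc_log; the toy version leaves out crash recovery entirely.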

The Prepare Phase

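Let's start with the prepare phase, in which the redo log is written to the redo log buffer. tc_log->prepare resolves to MYSQL_BIN_LOG::prepare in binlog.cc, and the actual work happens in the helper function ha_prepare_low: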

int MYSQL_BIN_LOG::prepare(THD *thd, bool all) {
  
  ...

  int error = ha_prepare_low(thd, all);

  return error;
}

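The prepare step itself is implemented by each storage engine. Note what ht->prepare points to: since we are analyzing InnoDB, we need the function that InnoDB registers when it initializes, namely innobase_xa_prepare: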

int ha_prepare_low(THD *thd, bool all) {
  ...
  if (ha_info) {
    for (; ha_info && !error; ha_info = ha_info->next()) {
      int err = 0;
      handlerton *ht = ha_info->ht();
      ...
      if ((err = ht->prepare(ht, thd, all))) {
        char errbuf[MYSQL_ERRMSG_SIZE];
        my_error(ER_ERROR_DURING_COMMIT, MYF(0), err,
                 my_strerror(errbuf, MYSQL_ERRMSG_SIZE, err));
        error = 1;
      }
      ...
    }
  }
  return error;
}

static int innodb_init(void *p) {

  handlerton *innobase_hton = (handlerton *)p;
  innodb_hton_ptr = innobase_hton;
  ...
  innobase_hton->commit = innobase_commit; 
  innobase_hton->rollback = innobase_rollback;
  innobase_hton->prepare = innobase_xa_prepare;
  innobase_hton->flush_logs = innobase_flush_logs;
  ...
}
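
To make the indirection concrete, here is a minimal sketch of the same pattern: a table of function pointers that an engine fills in at init time, and a loop that calls prepare through that table. All names (ToyHandlerton, toy_engine_init, toy_prepare_low, and the callbacks) are hypothetical and the callback bodies are invented; only the overall shape follows ha_prepare_low and innodb_init above.

```cpp
#include <vector>

// Hypothetical, heavily simplified analogue of handlerton: a table of
// function pointers that each storage engine fills in at init time.
struct ToyHandlerton {
  int (*prepare)(ToyHandlerton *ht, bool all) = nullptr;
  int (*commit)(ToyHandlerton *ht, bool all) = nullptr;
  int prepare_calls = 0;  // bookkeeping for this example only
};

// Engine-specific callbacks, standing in for innobase_xa_prepare and
// innobase_commit. Returning 0 means success.
int toy_engine_prepare(ToyHandlerton *ht, bool /*all*/) {
  ++ht->prepare_calls;
  return 0;
}
int toy_engine_commit(ToyHandlerton * /*ht*/, bool /*all*/) { return 0; }

// A callback that always fails, used to exercise the error path.
int toy_failing_prepare(ToyHandlerton *ht, bool /*all*/) {
  ++ht->prepare_calls;
  return 5;  // arbitrary non-zero error code
}

// Analogue of innodb_init: register the callbacks on the handlerton.
void toy_engine_init(ToyHandlerton *ht) {
  ht->prepare = toy_engine_prepare;
  ht->commit = toy_engine_commit;
}

// Analogue of the loop in ha_prepare_low: call prepare through the
// function pointer for each engine and stop after the first error.
int toy_prepare_low(const std::vector<ToyHandlerton *> &engines, bool all) {
  int error = 0;
  for (auto it = engines.begin(); it != engines.end() && !error; ++it) {
    ToyHandlerton *ht = *it;
    if (ht->prepare != nullptr && ht->prepare(ht, all) != 0) error = 1;
  }
  return error;
}
```

The server layer never names InnoDB directly; it only sees the pointer table, which is why finding the registered function in innodb_init is the necessary step when tracing the call.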

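The snippet above shows the functions registered for prepare, commit, and flush_logs, which will be useful later. Next we look at innobase_xa_prepare. Its core logic lies in trx_prepare_for_mysql -> trx_prepare -> trx_prepare_low. After these three levels of calls we arrive at trx_prepare_low, where the key step is mtr_commit: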

static lsn_t trx_prepare_low(
    trx_t *trx,               /*!< in/out: transaction */
    trx_undo_ptr_t *undo_ptr, /*!< in/out: pointer to rollback
                              segment scheduled for prepare. */
    bool noredo_logging)      /*!< in: turn-off redo logging. */
{
  if (undo_ptr->insert_undo != nullptr || undo_ptr->update_undo != nullptr) {
    ...
    /*--------------*/
    /* This mtr commit makes the transaction prepared in
    file-based world. */
    mtr_commit(&mtr);
    /*--------------*/
    ...
  }

  return 0;
}

mtr_commit is implemented by mtr_t::commit(); the part to look at is cmd.execute():

/** Commit a mini-transaction. */
void mtr_t::commit() {
  DBUG_EXECUTE_IF("mtr_commit_crash", DBUG_SUICIDE(););

  Command cmd(this);

  if (m_impl.m_n_log_recs > 0 ||
      (m_impl.m_modifications && m_impl.m_log_mode == MTR_LOG_NO_REDO)) {
    ut_ad(!srv_read_only_mode || m_impl.m_log_mode == MTR_LOG_NO_REDO);

    cmd.execute();
  } else {
    cmd.release_all();
    cmd.release_resources();
  }
}

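If you have made it this far, congratulations: this is the key part, where the redo log finally gets written. An mtr (mini-transaction) is an atomic group of page-level modifications; a detailed treatment is easy to find elsewhere. In MySQL 8.0, writes to the redo log buffer use a lock-free, concurrent design that improves redo write performance. The prepare phase only copies the redo records into the redo log buffer, concurrently and without locking, and attaches the modified dirty pages to the flush list. At that point the prepare phase is done. The data in the redo log buffer is later written to the file system cache and fsynced to disk by the background log_writer and log_flusher threads.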

void mtr_t::Command::execute() {
  ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);

#ifndef UNIV_HOTBACKUP
  ulint len = prepare_write();  // number of redo log bytes to write

  if (len > 0) {
    mtr_write_log_t write_log;
    write_log.m_left_to_write = len;

    // compute the start LSN and end LSN, and reserve redo log buffer space
    auto handle = log_buffer_reserve(*log_sys, len); 

    // copy the redo records into the reserved redo log buffer space
    m_impl->m_log.for_each_block(write_log);

    // add the modified dirty pages to the head of the flush list, which is
    // linked from head to tail in order of decreasing LSN
    add_dirty_blocks_to_flush_list(handle.start_lsn, handle.end_lsn);
  } else {
    DEBUG_SYNC_C("mtr_noredo_before_add_dirty_blocks");

    add_dirty_blocks_to_flush_list(0, 0);
  }
#endif /* !UNIV_HOTBACKUP */
}
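
The reserve-then-copy idea can be illustrated with a small sketch. This is a simplified model, not InnoDB's implementation: ToyLogBuffer, ToyHandle, reserve, and copy_in are hypothetical names, the buffer never wraps around, and nothing tracks which ranges have finished copying. It assumes only that a single atomic counter hands out LSN ranges, so each caller gets a disjoint slice of the buffer without taking a mutex.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical handle describing a reserved range [start_lsn, end_lsn).
struct ToyHandle {
  std::uint64_t start_lsn;
  std::uint64_t end_lsn;
};

class ToyLogBuffer {
 public:
  explicit ToyLogBuffer(std::size_t capacity) : buf_(capacity, '\0') {}

  // Reserve len bytes. fetch_add is atomic, so concurrent callers always
  // receive non-overlapping ranges, and no lock is needed.
  ToyHandle reserve(std::uint64_t len) {
    std::uint64_t start = next_lsn_.fetch_add(len, std::memory_order_relaxed);
    return ToyHandle{start, start + len};
  }

  // Copy a record into its reserved range. Because ranges are disjoint,
  // copies from different callers may finish in any order.
  // Toy simplification: fail instead of waiting when the buffer is full.
  bool copy_in(const ToyHandle &h, const char *data) {
    if (h.end_lsn > buf_.size()) return false;
    std::memcpy(buf_.data() + h.start_lsn, data,
                static_cast<std::size_t>(h.end_lsn - h.start_lsn));
    return true;
  }

  std::uint64_t reserved_up_to() const { return next_lsn_.load(); }
  char at(std::uint64_t lsn) const { return buf_[lsn]; }

 private:
  std::vector<char> buf_;
  std::atomic<std::uint64_t> next_lsn_{0};
};
```

In the excerpt above, log_buffer_reserve plays the role of reserve and for_each_block(write_log) plays the role of copy_in. What the sketch leaves out is everything that makes the real design hard: waiting for free space, and letting the writer thread know up to which LSN the buffer has no gaps.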

The Commit Phase

Next comes the commit phase, that is, the code behind the tc_log->commit call seen earlier. tc_log->commit resolves to MYSQL_BIN_LOG::commit, whose core logic is in commit -> ordered_commit. The key stages of ordered_commit are:

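  • Stage 0: if necessary, wait for transactions replayed by the SQL (applier) threads to commit in order. This stage matters for replication applier threads: a replayed transaction waits for its turn so that transactions are flushed to the binlog in the same order as they appear in the relay log, which keeps the data consistent.
  • Stage 1: wait for the redo log to become durable, then perform binlog group commit.
  • Stage 2: persist the already written binlog to disk with an fsync call.
  • Stage 3: commit in the storage engine. Following process_commit_stage_queue -> ha_commit_low -> ht->commit, where ht->commit is the innobase_commit registered in innodb_init, and then innobase_commit -> innobase_commit_low -> trx_commit_for_mysql -> trx_commit -> trx_commit_low -> trx_commit_in_memory, we find that the engine-level commit is an in-memory operation: trx_commit_in_memory calls trx_release_impl_and_expl_locks, which sets the transaction state to TRX_STATE_COMMITTED_IN_MEMORY. At that point the transaction commit is complete. If innodb_flush_log_at_trx_commit is set to 1, trx_flush_log_if_needed additionally writes the redo log up to the committed transaction's LSN and makes it durable.
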
int MYSQL_BIN_LOG::ordered_commit(THD *thd, bool all, bool skip_commit) {
  ...
  /*
    Stage #0: ensure slave threads commit order as they appear in the slave's
              relay log for transactions flushing to binary log.
    Wait for transactions replayed by the SQL thread to commit, if any.
  */
  if (Commit_order_manager::wait_for_its_turn_before_flush_stage(thd) ||
      ending_trans(thd, all) ||
      Commit_order_manager::get_rollback_status(thd)) {
    if (Commit_order_manager::wait(thd)) {
      return thd->commit_error;
    }
  }

  /*
    Stage #1: flushing transactions to binary log
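    Two things happen here:
    1. Wait for the redo log to become durable. As noted earlier, the prepare
       phase only writes the redo log to the buffer; persistence is done
       asynchronously by the log_writer and log_flusher threads, so here we
       only wait for it to finish and update the current LSN position.
    2. Binlog group commit: write to the file system cache. The next stage
       makes it durable.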
  */
  ...
  DEBUG_SYNC(thd, "waiting_in_the_middle_of_flush_stage");
  flush_error =
      process_flush_stage_queue(&total_bytes, &do_rotate, &wait_queue);  // make redo durable

  if (flush_error == 0 && total_bytes > 0)
    flush_error = flush_cache_to_file(&flush_end_pos);
  DBUG_EXECUTE_IF("crash_after_flush_binlog", DBUG_SUICIDE(););

  update_binlog_end_pos_after_sync = (get_sync_period() == 1);
  ...
  /*
    Stage #2: Syncing binary log file to disk
    Call fsync to make the binlog durable
  */
  ...
  if (flush_error == 0 && total_bytes > 0) {
    DEBUG_SYNC(thd, "before_sync_binlog_file");
    std::pair<bool, bool> result = sync_binlog_file(false);  // make binlog durable
    sync_error = result.first;
  }
  ...
  /*
    Stage #3: Commit all transactions in order.
    Commit in the storage engine layer
  */
commit_stage:
  /* Clone needs binlog commit order. */
  if ((opt_binlog_order_commits || Clone_handler::need_commit_order()) &&
      (sync_error == 0 || binlog_error_action != ABORT_SERVER)) {
    
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(commit_queue);

    // commit in the storage engine
    process_commit_stage_queue(thd, commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    ...
  } else {
    if (leave_mutex_before_commit_stage)
      mysql_mutex_unlock(leave_mutex_before_commit_stage);
    if (flush_error == 0 && sync_error == 0)
      sync_error = call_after_sync_hook(final_queue);
  }
  ...
}
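
The three stages after stage 0 can be sketched as a toy model that shows why grouping helps: many transactions share a single sync. Everything here is hypothetical (ToyTrx, ToyGroupCommit, and the string members standing in for the file system cache and the disk). The sketch assumes the queue of transactions has already been formed and ignores redo log handling, error paths, and binlog rotation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical transaction: its pending binlog events and a commit flag.
struct ToyTrx {
  std::string binlog_events;
  bool committed;
};

struct ToyGroupCommit {
  std::string file_cache;  // stand-in for data written but not yet synced
  std::string durable;     // stand-in for data that has reached disk
  int sync_calls = 0;

  // Stage 1 (flush): append every queued transaction's events to the
  // file cache and report how many bytes were written.
  std::size_t flush_stage(const std::vector<ToyTrx *> &queue) {
    std::size_t total_bytes = 0;
    for (const ToyTrx *t : queue) {
      file_cache += t->binlog_events;
      total_bytes += t->binlog_events.size();
    }
    return total_bytes;
  }

  // Stage 2 (sync): one sync makes the whole batch durable.
  void sync_stage() {
    durable = file_cache;
    ++sync_calls;
  }

  // Stage 3 (commit): mark each transaction committed, in queue order.
  void commit_stage(const std::vector<ToyTrx *> &queue) {
    for (ToyTrx *t : queue) t->committed = true;
  }

  // Run the stages in order for one group of transactions.
  void ordered_commit(const std::vector<ToyTrx *> &queue) {
    if (flush_stage(queue) > 0) sync_stage();
    commit_stage(queue);
  }
};
```

With three transactions in the queue, this model performs one sync instead of three, which is the point of group commit. The engine-level commit comes last, after the binlog is durable, matching the stage order in ordered_commit above.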