MySQL redo log

This post introduces the redo log of the MySQL InnoDB engine, which is the key difference between a database system and writing to the file system directly. The redo log is used during crash recovery to correct data written by incomplete transactions. It discusses the three modes of the innodb_flush_log_at_trx_commit parameter and their trade-offs between data safety and performance, and also covers when redo is written to disk, circular writing, the log file structure, and the relationship with the Mini Transaction (MTR).


Study notes. For the MTR part, see the Alibaba monthly reports linked at the end.

REDO LOG

The first thing to understand: the redo log is the most fundamental difference between a database system and managing data by writing directly to the file system.

The redo log is a disk-based data structure used during crash recovery to correct data written by incomplete transactions.
By default the redo log files live in the MySQL data directory, with names like ib_logfile0 and ib_logfile1. Like Oracle, InnoDB writes them in a circular fashion, appending log records sequentially by LSN.
The redo log is specific to the InnoDB engine.

Like most relational databases, InnoDB records physical changes to its data files and guarantees that the log is always written first, the so-called WAL (Write-Ahead Logging), which is the standard way to implement a transaction log.

Before 5.6.3 the combined redo log size could not exceed 4GB; since 5.6.3 it can be up to 512GB.

mysql> show variables like 'innodb_log_file_size';
+----------------------+------------+
| Variable_name        | Value      |
+----------------------+------------+
| innodb_log_file_size | 2147483648 |
+----------------------+------------+
1 row in set (0.01 sec)

mysql> show variables like 'innodb_log_files_in';
Empty set (0.00 sec)

mysql> show variables like 'innodb_log_files_in%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| innodb_log_files_in_group | 3     |
+---------------------------+-------+

None of these parameters are dynamic; changing them requires a restart. The default value of innodb_log_files_in_group is 2 and the maximum is 100.

How redo is written

Redo passes through three states in MySQL:

1. In the redo log buffer, which is in MySQL's memory, similar to Oracle's log buffer
2. Written out but not flushed, i.e. sitting in the file system page cache
3. Flushed, i.e. persisted durably to disk

InnoDB controls this redo log write policy with the innodb_flush_log_at_trx_commit parameter, one of the most important InnoDB parameters.

innodb_flush_log_at_trx_commit explained

See the manual for the full description.

The possible values are 0, 1 and 2:

1 is the safest: on every transaction commit the redo log is persisted directly to disk.
2: on every commit the redo log is only written into the file system page cache; flushing to disk happens about once per second.
0: the log is written and flushed about once per second; nothing is written at commit time.

For the three settings above:

Safety increases in the order 0 -> 2 -> 1, corresponding to: possible loss on a mysqld crash -> possible loss on an OS crash -> transaction-safe.

That is, with 0 a crash of the MySQL process can lose data; with 2 a host crash can lose the transactions still sitting in the file system page cache; with 1 committed transactions are safe. Oracle behaves like setting 1.

For transactional systems the usual recommendation is the "double 1" mode (innodb_flush_log_at_trx_commit=1 together with sync_binlog=1). If there are performance problems during special events or business peaks it can be changed to 2; 0 is not recommended.
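
To make the three settings concrete, here is a self-contained model of the decision made at commit time. This is not the InnoDB source: the helper functions and the function name below are hypothetical stand-ins, and details such as innodb_flush_method handling are omitted. Compare trx_flush_log_if_needed_low() in storage/innobase/trx/trx0trx.cc, which calls log_write_up_to().

#include <cstdint>

typedef std::uint64_t lsn_t;

// Hypothetical stand-ins for "write() the log buffer" and "write() + fsync()".
void write_log_to_page_cache(lsn_t up_to_lsn);
void write_and_fsync_log(lsn_t up_to_lsn);

void on_transaction_commit(lsn_t commit_lsn, int innodb_flush_log_at_trx_commit)
{
    switch (innodb_flush_log_at_trx_commit) {
    case 0:
        // Do nothing at commit; the background master thread writes and
        // flushes the log buffer roughly once per second, so a mysqld
        // crash can lose about the last second of committed transactions.
        break;
    case 2:
        // Write the log buffer into the OS page cache at commit; the fsync
        // happens roughly once per second. Safe against a mysqld crash,
        // but an OS/host crash can lose what is still in the page cache.
        write_log_to_page_cache(commit_lsn);
        break;
    case 1:
    default:
        // Write and fsync the redo log on every commit: transaction-safe
        // ("double 1" pairs this with sync_binlog=1).
        write_and_fsync_log(commit_lsn);
        break;
    }
}
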
When redo is written to disk
The redo log buffer is running low on space
A transaction commits
The master thread's once-per-second background write
A binlog switch
Server shutdown
A checkpoint

Circular writing of the redo log


The redo log files are written in a circular fashion; before a region is overwritten, the dirty pages covered by it must already have been flushed to disk. Under very heavy load redo can be generated extremely fast, forcing frequent dirty-page flushing and hurting performance. Typically, once the un-checkpointed log exceeds about 76% of the total log file size, InnoDB considers the situation unsafe and starts to forcibly preflush dirty pages, which can make large numbers of user threads wait.
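
As a rough worked example using the configuration shown earlier: innodb_log_file_size = 2147483648 bytes (2GB) and innodb_log_files_in_group = 3 give about 6GB of redo in total, so by the 76% figure above InnoDB would start forced preflushing once roughly 0.76 × 6GB ≈ 4.5GB of log has not yet been checkpointed.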

Redo log file structure

The first 2048 bytes of the log file store management information about the log contents and the state of the whole database; from the 2KB offset onward the file is ordinary log storage. (There are some differences between 5.6 and 5.7.)

The main fields are:

  1. LOG_GROUP_ID: the log group this file belongs to, 4 bytes, currently always 0;
  2. LOG_FILE_START_LSN: the LSN at which the log recorded in this file starts, 8 bytes;
  3. LOG_FILE_WAS_CREATED_BY_HOT_BACKUP: space reserved for backup tools, 32 bytes;
  4. LOG_CHECKPOINT_1 / LOG_CHECKPOINT_2: two fields recording InnoDB checkpoint information, starting at the second and fourth blocks of the file header respectively; only the first log file of the group uses them. From the 2KB offset onward come the sequentially written log blocks.

After that come the 512-byte redo log blocks: each block has a 12-byte block header and a 4-byte block trailer, with the actual redo records in between.
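
As a sketch of that layout, the snippet below decodes the header and trailer of one 512-byte log block. The struct and function names are my own, not InnoDB code; the field offsets follow the LOG_BLOCK_* constants in storage/innobase/include/log0log.h of 5.7, and all on-disk fields are big-endian.

#include <cstdint>

struct LogBlockInfo {
    std::uint32_t hdr_no;          // block number; the high bit is the "flush bit"
    std::uint16_t data_len;        // how many bytes of this block are used
    std::uint16_t first_rec_group; // offset of the first mtr starting in this block (0 = none)
    std::uint32_t checkpoint_no;   // low 4 bytes of the checkpoint number
    std::uint32_t checksum;        // stored in the 4-byte block trailer
};

static std::uint32_t read_be32(const unsigned char* p)
{
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16)
         | (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

static std::uint16_t read_be16(const unsigned char* p)
{
    return std::uint16_t((p[0] << 8) | p[1]);
}

LogBlockInfo parse_log_block(const unsigned char block[512])
{
    LogBlockInfo b;
    b.hdr_no          = read_be32(block + 0);       // LOG_BLOCK_HDR_NO
    b.data_len        = read_be16(block + 4);       // LOG_BLOCK_HDR_DATA_LEN
    b.first_rec_group = read_be16(block + 6);       // LOG_BLOCK_FIRST_REC_GROUP
    b.checkpoint_no   = read_be32(block + 8);       // LOG_BLOCK_CHECKPOINT_NO
    b.checksum        = read_be32(block + 512 - 4); // block trailer (LOG_BLOCK_CHECKSUM)
    return b;
}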

MTR

Any discussion of redo has to mention the MTR (Mini transaction).

What is an MTR

The MTR is one of the most important mechanisms in InnoDB for guaranteeing that physical page writes are complete and durable; it is the smallest transactional unit of operations on the physical data files. A single transaction can contain multiple MTRs.

An mtr is itself a transaction: it begins with the initialization of the mtr_t structure. The main structure is in
mysql-5.7.25\storage\innobase\include\mtr0mtr.h; the version used here is 5.7.25, the latest at the time:

struct mtr_t {

	/** State variables of the mtr */
	struct Impl {

		/** memo stack for locks etc. */
		mtr_buf_t	m_memo;     // the latches/locks held by this mtr

		/** mini-transaction log */
		mtr_buf_t	m_log;      // dynamic buffer holding the redo records generated while this mtr accesses and modifies pages

		/** true if mtr has made at least one buffer pool page dirty */
		bool		m_made_dirty; // whether at least one page has been dirtied

		/** true if inside ibuf changes */
		bool		m_inside_ibuf;  // whether we are inside insert buffer changes

		/** true if the mini-transaction modified buffer pool pages */
		bool		m_modifications;  // whether buffer pool pages were modified

		/** Count of how many page initial log records have been
		written to the mtr log */
		ib_uint32_t	m_n_log_recs;   // number of redo records produced

		/** specifies which operations should be logged; default
		value MTR_LOG_ALL */
		mtr_log_t	m_log_mode;    // logging mode of the mtr
#ifdef UNIV_DEBUG
		/** Persistent user tablespace associated with the
		mini-transaction, or 0 (TRX_SYS_SPACE) if none yet */
		ulint		m_user_space_id;   // id of the tablespace being modified
#endif /* UNIV_DEBUG */
		/** User tablespace that is being modified by the
		mini-transaction */
		fil_space_t*	m_user_space;  // user tablespace being modified
		/** Undo tablespace that is being modified by the
		mini-transaction */
		fil_space_t*	m_undo_space;  // undo tablespace being modified
		/** System tablespace if it is being modified by the
		mini-transaction */
		fil_space_t*	m_sys_space;   // system tablespace being modified

		/** State of the transaction */
		mtr_state_t	m_state;  // state of the mtr

		/** Flush Observer */
		FlushObserver*	m_flush_observer;

#ifdef UNIV_DEBUG
		/** For checking corruption. */
		ulint		m_magic_n;
#endif /* UNIV_DEBUG */

		/** Owning mini-transaction */
		mtr_t*		m_mtr;
	};

The corresponding states, modes and locks (the locks here are in-memory latches) can be found in
E:\workplace\mysql-5.7.25\storage\innobase\include\mtr0types.h, which describes each enum value.

This is where the mtr is started:


/** Start a mini-transaction.
@param sync     true if it is a synchronous mini-transaction
@param read_only    true if read only mini-transaction */
void
mtr_t::start(bool sync, bool read_only)
{
    UNIV_MEM_INVALID(this, sizeof(*this));
    UNIV_MEM_INVALID(&m_impl, sizeof(m_impl));
    m_sync = sync;
    m_commit_lsn = 0;
    new(&m_impl.m_log) mtr_buf_t();
    new(&m_impl.m_memo) mtr_buf_t();
    m_impl.m_mtr = this;
    m_impl.m_log_mode = MTR_LOG_ALL;
    m_impl.m_inside_ibuf = false;
    m_impl.m_modifications = false;
    m_impl.m_made_dirty = false;
    m_impl.m_n_log_recs = 0;
    m_impl.m_state = MTR_STATE_ACTIVE;
    ut_d(m_impl.m_user_space_id = TRX_SYS_SPACE);
    m_impl.m_user_space = NULL;
    m_impl.m_undo_space = NULL;
    m_impl.m_sys_space = NULL;
    m_impl.m_flush_observer = NULL;
    ut_d(m_impl.m_magic_n = MTR_MAGIC_N);
}

start() initializes the member variables: for example, m_log_mode defaults to MTR_LOG_ALL (all operations write redo), and the state is set to MTR_STATE_ACTIVE. The two placement-new calls
new(&m_impl.m_log) mtr_buf_t();
new(&m_impl.m_memo) mtr_buf_t();
initialize the dynamic buffers in place.
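
For readers who have not seen the new(&obj) T() syntax: it is C++ placement new, which constructs an object in storage that already exists instead of allocating new memory. A minimal self-contained illustration (the types here are stand-ins, not InnoDB code):

#include <new>      // placement new
#include <cstdio>

struct Buf {                 // stand-in for mtr_buf_t
    int used;
    Buf() : used(0) { }
};

struct Holder {
    alignas(Buf) unsigned char storage[sizeof(Buf)];

    void start() {
        // Construct a Buf inside pre-existing storage, just like
        // new(&m_impl.m_log) mtr_buf_t() in mtr_t::start().
        // (Buf has a trivial destructor, so no explicit cleanup here.)
        new (storage) Buf();
    }
};

int main() {
    Holder h;
    h.start();
    std::printf("constructed in place\n");
    return 0;
}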

mtr_commit:


void
mtr_t::commit()
{
    ut_ad(is_active());
    ut_ad(!is_inside_ibuf());
    ut_ad(m_impl.m_magic_n == MTR_MAGIC_N);
    m_impl.m_state = MTR_STATE_COMMITTING;
    /* This is a dirty read, for debugging. */
    ut_ad(!recv_no_log_write);
    Command cmd(this);
    if (m_impl.m_modifications
     && (m_impl.m_n_log_recs > 0
        || m_impl.m_log_mode == MTR_LOG_NO_REDO)) {
        ut_ad(!srv_read_only_mode
         || m_impl.m_log_mode == MTR_LOG_NO_REDO);
        cmd.execute();
    } else {
        cmd.release_all();
        cmd.release_resources();
    }
}

The execute() called in it looks like this:


void
mtr_t::Command::execute()
{
    ut_ad(m_impl->m_log_mode != MTR_LOG_NONE);
    if (const ulint len = prepare_write()) {
        finish_write(len);
    }
    if (m_impl->m_made_dirty) {
        log_flush_order_mutex_enter();
    }
    /* It is now safe to release the log mutex because the
    flush_order mutex will ensure that we are the first one
    to insert into the flush list. */
    log_mutex_exit();
    m_impl->m_mtr->m_commit_lsn = m_end_lsn;
    release_blocks();
    if (m_impl->m_made_dirty) {
        log_flush_order_mutex_exit();
    }
    release_latches();
    release_resources();
}

This step writes the redo log records into the log buffer, adds the dirty pages to the flush list, and then releases everything.

release_latches(): releases the latches
release_resources():


    /* Reset the mtr buffers */
    m_impl->m_log.erase();
    m_impl->m_memo.erase();
    m_impl->m_state = MTR_STATE_COMMITTED;

Releases the resources (the memo and the log buffers) and sets the state to MTR_STATE_COMMITTED.
As for innodb_flush_log_at_trx_commit mentioned above: execute() only copies the mtr log into the public log buffer; whether that buffer is written and flushed to disk according to this parameter is decided when the user transaction commits.

The path of a single insert:

Entry point: row_ins_clust_index_entry_low()


dberr_t
row_ins_clust_index_entry_low(
/*==========================*/
    ulint       flags,  /*!< in: undo logging and locking flags */
    ulint       mode,   /*!< in: BTR_MODIFY_LEAF or BTR_MODIFY_TREE,
                depending on whether we wish optimistic or
                pessimistic descent down the index tree */
    dict_index_t*   index,  /*!< in: clustered index */
    ulint       n_uniq, /*!< in: 0 or index->n_uniq */
    dtuple_t*   entry,  /*!< in/out: index entry to insert */
    ulint       n_ext,  /*!< in: number of externally stored columns */
    que_thr_t*  thr,    /*!< in: query thread */
    bool        dup_chk_only)
                /*!< in: if true, just do duplicate check
                and return. don't execute actual insert. */
{
    btr_pcur_t  pcur;
    btr_cur_t*  cursor;
    dberr_t     err     = DB_SUCCESS;
    big_rec_t*  big_rec     = NULL;
    mtr_t       mtr;
    mem_heap_t* offsets_heap    = NULL;
    ulint offsets_[REC_OFFS_NORMAL_SIZE];
    ulint* offsets = offsets_;
    rec_offs_init(offsets_);
    DBUG_ENTER("row_ins_clust_index_entry_low");
    ut_ad(dict_index_is_clust(index));
    ut_ad(!dict_index_is_unique(index)
     || n_uniq == dict_index_get_n_unique(index));
    ut_ad(!n_uniq || n_uniq == dict_index_get_n_unique(index));
    ut_ad(!thr_get_trx(thr)->in_rollback);
    // Start an mtr and initialize its fields
    mtr_start(&mtr);
    // Record which tablespace is being modified
    mtr.set_named_space(index->space);
    // Temporary tables live only for the session, so redo logging is disabled for them
    if (dict_table_is_temporary(index->table)) {
        /* Disable REDO logging as the lifetime of temp-tables is
        limited to server or connection lifetime and so REDO
        information is not needed on restart for recovery.
        Disable locking as temp-tables are local to a connection. */
        ut_ad(flags & BTR_NO_LOCKING_FLAG);
        ut_ad(!dict_table_is_intrinsic(index->table)
         || (flags & BTR_NO_UNDO_LOG_FLAG));
        mtr.set_log_mode(MTR_LOG_NO_REDO);
    }
    if (mode == BTR_MODIFY_LEAF && dict_index_is_online_ddl(index)) {
        mode = BTR_MODIFY_LEAF | BTR_ALREADY_S_LATCHED;
        mtr_s_lock(dict_index_get_lock(index), &mtr);
    }
    /* Note that we use PAGE_CUR_LE as the search mode, because then
    the function will return in both low_match and up_match of the
    cursor sensible values */
    // Find the insert position for this record in the index
    btr_pcur_open(index, entry, PAGE_CUR_LE, mode, &pcur, &mtr);
    cursor = btr_pcur_get_btr_cur(&pcur);
    cursor->thr = thr;
    ut_ad(!dict_table_is_intrinsic(index->table)
     || cursor->page_cur.block->made_dirty_with_no_latch);
#ifdef UNIV_DEBUG
    {
        page_t* page = btr_cur_get_page(cursor);
        rec_t*  first_rec = page_rec_get_next(
            page_get_infimum_rec(page));
        ut_ad(page_rec_is_supremum(first_rec)
         || rec_n_fields_is_sane(index, first_rec, entry));
    }
#endif /* UNIV_DEBUG */
    /* Allowing duplicates in clustered index is currently enabled
    only for intrinsic table and caller understand the limited
    operation that can be done in this case. */
    ut_ad(!index->allow_duplicates
     || (index->allow_duplicates
         && dict_table_is_intrinsic(index->table)));
    if (!index->allow_duplicates
     && n_uniq
     && (cursor->up_match >= n_uniq || cursor->low_match >= n_uniq)) {
        if (flags
         == (BTR_CREATE_FLAG | BTR_NO_LOCKING_FLAG
            | BTR_NO_UNDO_LOG_FLAG | BTR_KEEP_SYS_FLAG)) {
            /* Set no locks when applying log
            in online table rebuild. Only check for duplicates. */
            err = row_ins_duplicate_error_in_clust_online(
                n_uniq, entry, cursor,
                &offsets, &offsets_heap);
            switch (err) {
            case DB_SUCCESS:
                break;
            default:
                ut_ad(0);
                /* fall through */
            case DB_SUCCESS_LOCKED_REC:
            case DB_DUPLICATE_KEY:
                thr_get_trx(thr)->error_info = cursor->index;
            }
        } else {
            /* Note that the following may return also
            DB_LOCK_WAIT */
            err = row_ins_duplicate_error_in_clust(
                flags, cursor, entry, thr, &mtr);
        }
        if (err != DB_SUCCESS) {
err_exit:
            mtr_commit(&mtr);
            goto func_exit;
        }
    }
    if (dup_chk_only) {
        mtr_commit(&mtr);
        goto func_exit;
    }
    /* Note: Allowing duplicates would qualify for modification of
    an existing record as the new entry is exactly same as old entry.
    Avoid this check if allow duplicates is enabled. */
    if (!index->allow_duplicates && row_ins_must_modify_rec(cursor)) {
        /* There is already an index entry with a long enough common
        prefix, we must convert the insert into a modify of an
        existing record */
        mem_heap_t* entry_heap  = mem_heap_create(1024);
        /* If the existing record is being modified and the new record
        doesn't fit the provided slot then existing record is added
        to free list and new record is inserted. This also means
        cursor that we have cached for SELECT is now invalid. */
        if(index->last_sel_cur) {
            index->last_sel_cur->invalid = true;
        }
        err = row_ins_clust_index_entry_by_modify(
            &pcur, flags, mode, &offsets, &offsets_heap,
            entry_heap, entry, thr, &mtr);
        if (err == DB_SUCCESS && dict_index_is_online_ddl(index)) {
            row_log_table_insert(btr_cur_get_rec(cursor), entry,
                     index, offsets);
        }
        mtr_commit(&mtr);
        mem_heap_free(entry_heap);
    } else {
        rec_t*  insert_rec;
        if (mode != BTR_MODIFY_TREE) {
            ut_ad((mode & ~BTR_ALREADY_S_LATCHED)
             == BTR_MODIFY_LEAF);
            err = btr_cur_optimistic_insert(
                flags, cursor, &offsets, &offsets_heap,
                entry, &insert_rec, &big_rec,
                n_ext, thr, &mtr);
        } else {
            if (buf_LRU_buf_pool_running_out()) {
                err = DB_LOCK_TABLE_FULL;
                goto err_exit;
            }
            DEBUG_SYNC_C("before_insert_pessimitic_row_ins_clust");
            // Try an optimistic insert first; fall back to a pessimistic insert if that fails
            err = btr_cur_optimistic_insert(
                flags, cursor,
                &offsets, &offsets_heap,
                entry, &insert_rec, &big_rec,
                n_ext, thr, &mtr);
            if (err == DB_FAIL) {
                err = btr_cur_pessimistic_insert(
                    flags, cursor,
                    &offsets, &offsets_heap,
                    entry, &insert_rec, &big_rec,
                    n_ext, thr, &mtr);
            }
        }
        if (big_rec != NULL) {
            mtr_commit(&mtr);
            /* Online table rebuild could read (and
            ignore) the incomplete record at this point.
            If online rebuild is in progress, the
            row_ins_index_entry_big_rec() will write log. */
            DBUG_EXECUTE_IF(
                "row_ins_extern_checkpoint",
                log_make_checkpoint_at(
                    LSN_MAX, TRUE););
            err = row_ins_index_entry_big_rec(
                entry, big_rec, offsets, &offsets_heap, index,
                thr_get_trx(thr)->mysql_thd,
                __FILE__, __LINE__);
            dtuple_convert_back_big_rec(index, entry, big_rec);
        } else {
            if (err == DB_SUCCESS
             && dict_index_is_online_ddl(index)) {
                row_log_table_insert(
                    insert_rec, entry, index, offsets);
            }
           
            mtr_commit(&mtr);
        }
    }
func_exit:
    if (offsets_heap != NULL) {
        mem_heap_free(offsets_heap);
    }
    btr_pcur_close(&pcur);
    DBUG_RETURN(err);
}

The call path
row_ins_clust_index_entry_low -> mtr_start / set_named_space -> btr_pcur_open -> btr_pcur_open_low -> btr_cur_search_to_nth_level (find the insert position) -> btr_cur_optimistic_insert (optimistic insert) -> btr_cur_ins_lock_and_undo (write undo) -> page_cur_tuple_insert (insert the record) -> page_cur_insert_rec_low -> page_cur_insert_rec_write_log (write the redo record) -> mach_write_to_2 (record the offset on the page) -> memcpy / mlog_close (copy the log into the mtr log, as I understand it) -> mtr_commit (copy the mtr log into the log buffer and release resources)


When you follow btr_cur_ins_lock_and_undo -> trx_undo_report_row_operation further down, there is another mtr inside it, which is committed (mtr_commit) as soon as the undo record has been written, because an mtr is a physical transaction and must guarantee the completeness of each single physical change.
From this flow you can see that undo is written first, then redo.
The path above mixes steps and call relationships; these are rough notes. After one read-through many of the branch conditions are still unclear to me and many branches were not followed; I will study them in more detail when I have time.

References:
http://mysql.taobao.org/monthly/2014/12/01/
http://mysql.taobao.org/monthly/2015/05/01/
