0. Introduction
Having covered MySQL's physical storage, this article looks at MySQL's memory management: the basic memory allocation structures and algorithms, the implementation details of the buffer pool, plus brief introductions to the adaptive hash index and the log buffer. The next article will describe how data moves from physical storage into memory.
1. Principles
At the operating system level, small memory allocations are commonly managed with buckets of different sizes, where each bucket is a linked list chaining together free buffers of the same size, as the sketch below illustrates.
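A minimal sketch of this bucket idea (a generic C++ illustration, not MySQL code; all names are made up and sizes above 1024 bytes are ignored for brevity):

#include <array>
#include <cstddef>
#include <cstdlib>

// Bucket i chains free buffers of size (8 << i); allocation pops from the
// matching bucket, free pushes the buffer back onto it for reuse.
struct FreeNode { FreeNode *next; };

constexpr int kBuckets = 8;                          // 8, 16, 32, ..., 1024 bytes
std::array<FreeNode *, kBuckets> buckets = {};

int bucket_index(size_t size) {
  // Assumes size <= 1024 (the largest bucket) for brevity.
  int i = 0;
  for (size_t cap = 8; cap < size; cap <<= 1) ++i;
  return i;
}

void *bucket_alloc(size_t size) {
  int i = bucket_index(size);
  if (buckets[i] != nullptr) {                       // reuse a cached buffer
    FreeNode *node = buckets[i];
    buckets[i] = node->next;
    return node;
  }
  return std::malloc(static_cast<size_t>(8) << i);   // nothing cached: ask the OS
}

void bucket_free(void *ptr, size_t size) {
  int i = bucket_index(size);
  auto *node = static_cast<FreeNode *>(ptr);         // reuse the buffer itself as the list node
  node->next = buckets[i];
  buckets[i] = node;
}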
For MySQL, memory management has two main parts: basic memory allocation and the various buffers. The basic allocation layer is essentially a wrapper around the standard allocation primitives such as malloc, free, and calloc (that is, it manages dynamically allocated memory), while the various pools (a layer above basic allocation, each dedicated to holding a particular kind of data), such as the Buffer Pool, cache data and index pages to reduce disk I/O.
2. Code Analysis
2.1 Basic Dynamic Memory Allocation
InnoDB wraps MySQL's basic dynamic memory allocation. The wrappers cover the commonly used malloc/free-style functions as well as the allocator used inside containers (which has two modes, with and without memory tracking, selected by a macro); see mem_heap_alloc and mem_heap_allocator for the common allocation code.
The difference between tracked and untracked memory is whether instrumentation is registered at startup and whether the corresponding functions are triggered on each allocation. The function list is shown below: the register function is called at startup, and when a memory event occurs (e.g., inside my_malloc) the matching hook is invoked through PSI_MEMORY_CALL:
/* Dispatch macro: PSI_MEMORY_CALL(memory_alloc) expands to pfs_memory_alloc_vc */
#define PSI_MEMORY_CALL(M) pfs_##M##_vc

/* Register a group of memory instrumentation keys under a category at startup */
void pfs_register_memory_vc(const char *category,
                            struct PSI_memory_info_v1 *info, int count);

/* Account an allocation / reallocation / ownership claim / free against a key */
PSI_memory_key pfs_memory_alloc_vc(PSI_memory_key key, size_t size,
                                   PSI_thread **owner);
PSI_memory_key pfs_memory_realloc_vc(PSI_memory_key key, size_t old_size,
                                     size_t new_size, PSI_thread **owner);
PSI_memory_key pfs_memory_claim_vc(PSI_memory_key key, size_t size,
                                   PSI_thread **owner, bool claim);
void pfs_memory_free_vc(PSI_memory_key key, size_t size, PSI_thread *owner);
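As a rough illustration of how an allocation wrapper ties into these hooks, here is a sketch that assumes the declarations above are in scope; traced_malloc is a made-up name, and the real my_malloc does additional bookkeeping around the raw allocation:

#include <cstddef>
#include <cstdlib>

// Hypothetical instrumented wrapper around malloc. The PSI_memory_key says
// which consumer the memory is charged to in performance_schema.
void *traced_malloc(PSI_memory_key key, size_t size) {
  PSI_thread *owner = nullptr;
  // Account the allocation; the hook returns the (possibly adjusted) key.
  key = PSI_MEMORY_CALL(memory_alloc)(key, size, &owner);
  void *ptr = std::malloc(size);
  if (ptr == nullptr) {
    // The allocation failed, so undo the accounting.
    PSI_MEMORY_CALL(memory_free)(key, size, owner);
  }
  return ptr;
}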
With the overall classification in mind, let's look at the data structure. The main structure here is mem_heap_t. Internally it maintains a linked list of blocks, keeping both head and tail pointers for fast access, and each node of the list is itself a mem_heap_t structure. When a mem_heap is created, an initial piece of memory is allocated as the first block. mem_heap_alloc serves the basic allocation requests: it first tries to carve the requested amount out of the current block; if that is not possible it creates a new block whose size is at least twice that of the last block, up to a fixed upper limit, and newly created blocks are always appended to the tail of the list. mem_heap_t also records the block size, the total size of all blocks in the list, the allocation mode (type), and similar information; the sketch below illustrates the basic scheme.
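A minimal, self-contained sketch of this grow-by-doubling block scheme (a simplified model, not the actual InnoDB mem_heap_t code; SimpleHeap and friends are made up, and the real code also caps the block size at an upper limit and distinguishes allocation types):

#include <algorithm>
#include <cstddef>
#include <cstdlib>

// A heap owns a list of blocks; allocation bumps a cursor inside the last
// block, and when the request does not fit, a new block of at least twice the
// previous size is appended to the tail of the list.
struct SimpleBlock {
  SimpleBlock *next = nullptr;
  size_t size = 0;              // usable bytes in buf
  size_t used = 0;              // bytes already handed out
  unsigned char *buf = nullptr;
};

struct SimpleHeap {
  SimpleBlock *head = nullptr;
  SimpleBlock *tail = nullptr;
  size_t total_size = 0;        // sum of all block sizes
};

static SimpleBlock *add_block(SimpleHeap *heap, size_t size) {
  auto *block = new SimpleBlock();
  block->size = size;
  block->buf = static_cast<unsigned char *>(std::malloc(size));
  if (heap->tail != nullptr) heap->tail->next = block; else heap->head = block;
  heap->tail = block;
  heap->total_size += size;
  return block;
}

// Allocate n bytes from the heap, growing it when the last block is full.
void *simple_heap_alloc(SimpleHeap *heap, size_t n) {
  SimpleBlock *last = heap->tail;
  if (last == nullptr || last->used + n > last->size) {
    // New block: at least double the previous one, and large enough for n.
    size_t prev = (last != nullptr) ? last->size : 64;
    last = add_block(heap, std::max(2 * prev, n));
  }
  void *ptr = last->buf + last->used;
  last->used += n;
  return ptr;
}

// The whole heap is released at once, mirroring the mem_heap usage pattern of
// freeing everything in a single call.
void simple_heap_free(SimpleHeap *heap) {
  for (SimpleBlock *b = heap->head; b != nullptr;) {
    SimpleBlock *next = b->next;
    std::free(b->buf);
    delete b;
    b = next;
  }
  heap->head = heap->tail = nullptr;
  heap->total_size = 0;
}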
Besides the heap-style allocation above, InnoDB also manages memory for compressed pages, using a buddy allocator that supports 1K, 2K, 4K, 8K, and 16K pages. Each size class keeps its own free list, similar to the OS bucket scheme above: an allocation first tries zip_free[i]; if that list is empty, a block is taken from a larger zip_free list and split, with one half used and the other half put back into zip_free[i]. On free, the buddy (the other half of the original split) is checked; if it is also free, the two halves are merged and returned to the next larger list.
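A compact sketch of the buddy free lists (illustration only, not the InnoDB buf_buddy code; in the real code the largest class is fed from buffer pool blocks rather than simply failing):

#include <algorithm>
#include <array>
#include <cstddef>
#include <list>

// Class i holds free blocks of size (1024 << i): 1K, 2K, 4K, 8K, 16K.
constexpr int kClasses = 5;
constexpr size_t kMinSize = 1024;

std::array<std::list<char *>, kClasses> zip_free;

// Allocate from class i: take a free block, or split a block from a larger
// class, keeping one half and returning the other half to zip_free[i].
char *buddy_alloc(int i) {
  if (!zip_free[i].empty()) {
    char *block = zip_free[i].front();
    zip_free[i].pop_front();
    return block;
  }
  if (i + 1 >= kClasses) return nullptr;     // nothing larger to split
  char *bigger = buddy_alloc(i + 1);
  if (bigger == nullptr) return nullptr;
  // The second half becomes the buddy and goes back onto the free list.
  zip_free[i].push_back(bigger + (kMinSize << i));
  return bigger;
}

// Free into class i: if the buddy is also free, merge and push one level up.
void buddy_free(char *block, int i, char *area_base) {
  size_t size = kMinSize << i;
  // The buddy is the neighbour whose offset differs only in the size bit.
  size_t offset = static_cast<size_t>(block - area_base);
  char *buddy = area_base + (offset ^ size);
  for (auto it = zip_free[i].begin(); it != zip_free[i].end(); ++it) {
    if (*it == buddy && i + 1 < kClasses) {
      zip_free[i].erase(it);
      buddy_free(std::min(block, buddy), i + 1, area_base);
      return;
    }
  }
  zip_free[i].push_back(block);
}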
2.2 Buffer Pool
In MySQL the Buffer Pool caches data and index pages to reduce disk I/O and improve query performance. Modifications are applied to pages in the Buffer Pool first, and background threads write the modified data back to disk later.
The Buffer Pool's memory is managed in chunks; each chunk is divided into blocks of UNIV_PAGE_SIZE (16K), and at initialization these blocks are added to the free list for later use.
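Roughly, the initialization looks like the following sketch (simplified; the real code also allocates a buf_block_t control structure per frame and aligns the chunk memory to the page size):

#include <cstddef>
#include <cstdlib>
#include <list>

constexpr size_t kPageSize = 16 * 1024;     // stands in for UNIV_PAGE_SIZE

struct Chunk {
  unsigned char *mem = nullptr;             // the one large allocation
  size_t n_blocks = 0;                      // how many page frames it holds
};

std::list<unsigned char *> free_list;       // free page frames

Chunk init_chunk(size_t chunk_bytes) {
  Chunk chunk;
  chunk.mem = static_cast<unsigned char *>(std::malloc(chunk_bytes));
  chunk.n_blocks = chunk_bytes / kPageSize;
  // Every frame in the chunk starts out on the free list.
  for (size_t i = 0; i < chunk.n_blocks; ++i) {
    free_list.push_back(chunk.mem + i * kPageSize);
  }
  return chunk;
}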
Next, let's look at buf_pool_t, the core structure that manages the Buffer Pool. Its fields fall roughly into locks, statistics, memory lists, and members related to the LRU and flush algorithms:
struct buf_pool_t {
/** @name General fields */
/** @{ */
/** protects (de)allocation of chunks*/
BufListMutex chunks_mutex;
/** LRU list mutex */
BufListMutex LRU_list_mutex;
/** free and withdraw list mutex */
BufListMutex free_list_mutex;
/** buddy allocator mutex */
BufListMutex zip_free_mutex;
/** zip_hash mutex */
BufListMutex zip_hash_mutex;
/** Flush state protection mutex */
ib_mutex_t flush_state_mutex;
/** Zip mutex of this buffer pool instance, protects compressed only pages (of
type buf_page_t, not buf_block_t) */
BufPoolZipMutex zip_mutex;
/** Array index of this buffer pool instance */
ulint instance_no;
/** Current pool size in bytes */
ulint curr_pool_size;
/** Reserve this much of the buffer pool for "old" blocks */
ulint LRU_old_ratio;
#ifdef UNIV_DEBUG
/** Number of frames allocated from the buffer pool to the buddy system.
Protected by zip_hash_mutex. */
ulint buddy_n_frames;
#endif
/** Number of buffer pool chunks */
volatile ulint n_chunks;
/** New number of buffer pool chunks */
volatile ulint n_chunks_new;
/** buffer pool chunks */
buf_chunk_t *chunks;
/** old buffer pool chunks to be freed after resizing buffer pool */
buf_chunk_t *chunks_old;
/** Current pool size in pages */
ulint curr_size;
/** Previous pool size in pages */
ulint old_size;
/** Size in pages of the area which the read-ahead algorithms read
if invoked */
page_no_t read_ahead_area;
/** Hash table of buf_page_t or buf_block_t file pages, buf_page_in_file() ==
true, indexed by (space_id, offset). page_hash is protected by an array of
mutexes. */
hash_table_t *page_hash;
/** Hash table of buf_block_t blocks whose frames are allocated to the zip
buddy system, indexed by block->frame */
hash_table_t *zip_hash;
/** Number of pending read operations. Accessed atomically */
std::atomic<ulint> n_pend_reads;
/** number of pending decompressions. Accessed atomically. */
std::atomic<ulint> n_pend_unzip;
/** when buf_print_io was last time called. Accesses not protected. */
std::chrono::steady_clock::time_point last_printout_time;
/** Statistics of buddy system, indexed by block size. Protected by zip_free
mutex, except for the used field, which is also accessed atomically */
buf_buddy_stat_t buddy_stat[BUF_BUDDY_SIZES_MAX + 1];
/** Current statistics */
buf_pool_stat_t stat;
/** Old statistics */
buf_pool_stat_t old_stat;
/** @} */
/** @name Page flushing algorithm fields */
/** @{ */
/** Mutex protecting the flush list access. This mutex protects flush_list,
flush_rbt and bpage::list pointers when the bpage is on flush_list. It also
protects writes to bpage::oldest_modification and flush_list_hp */
BufListMutex flush_list_mutex;
/** "Hazard pointer" used during scan of flush_list while doing flush list
batch. Protected by flush_list_mutex */
FlushHp flush_hp;
/** Entry pointer to scan the oldest page except for system temporary */
FlushHp oldest_hp;
/** Base node of the modified block list */
UT_LIST_BASE_NODE_T(buf_page_t, list) flush_list;
/** This is true when a flush of the given type is being initialized.
Protected by flush_state_mutex. */
bool init_flush[BUF_FLUSH_N_TYPES];
/** This is the number of pending writes in the given flush type. Protected
by flush_state_mutex. */
ulint n_flush[BUF_FLUSH_N_TYPES];
/** This is in the set state when there is no flush batch of the given type
running. Protected by flush_state_mutex. */
os_event_t no_flush[BUF_FLUSH_N_TYPES];
/** A red-black tree, used only during recovery, to speed up insertions into
the flush_list in the correct (ascending oldest_modification) position */
ib_rbt_t *flush_rbt;
/** A sequence number counting blocks removed from the end of the LRU list;
used to judge whether an access should make a page "young" again */
ulint freed_page_clock;
/** Set to false when an LRU scan for free block fails. This flag is used to
avoid repeated scans of LRU list when we know that there is no free block
available in the scan depth for eviction. Set to true whenever we flush a
batch from the buffer pool. Accessed protected by memory barriers. */
bool try_LRU_scan;
/** Page Tracking start LSN. */
lsn_t track_page_lsn;
/** Maximum LSN for which write io has already started. */
lsn_t max_lsn_io;
/** @} */
/** @name LRU replacement algorithm fields */
/** @{ */
/** Base node of the free block list */
UT_LIST_BASE_NODE_T(buf_page_t, list) free;
/** Base node of the withdraw block list; used only while shrinking the
buffer pool. Protected by free_list_mutex */
UT_LIST_BASE_NODE_T(buf_page_t, list) withdraw;
/** Target length of withdraw block list, when withdrawing */
ulint withdraw_target;
/** "hazard pointer" used during scan of LRU while doing
LRU list batch. Protected by buf_pool::LRU_list_mutex */
LRUHp lru_hp;
/** Iterator used to scan the LRU list when searching for
replaceable victim. Protected by buf_pool::LRU_list_mutex. */
LRUItr lru_scan_itr;
/** Iterator used to scan the LRU list when searching for
single page flushing victim. Protected by buf_pool::LRU_list_mutex. */
LRUItr single_scan_itr;
/** Base node of the LRU list */
UT_LIST_BASE_NODE_T(buf_page_t, LRU) LRU;
/** Pointer to the first block of the "old" sublist within the LRU list; NULL
if the LRU list is too short to be split */
buf_page_t *LRU_old;
/** Length of the LRU list from the LRU_old pointer onward, i.e. the size of
the old sublist */
ulint LRU_old_len;
/** Base node of the unzip_LRU list. The list is protected by the
LRU_list_mutex. */
UT_LIST_BASE_NODE_T(buf_block_t, unzip_LRU) unzip_LRU;
#if defined UNIV_DEBUG || defined UNIV_BUF_DEBUG
/** Unmodified compressed pages */
UT_LIST_BASE_NODE_T(buf_page_t, list) zip_clean;
#endif /* UNIV_DEBUG || UNIV_BUF_DEBUG */
/** Buddy free lists */
UT_LIST_BASE_NODE_T(buf_buddy_free_t, list) zip_free[BUF_BUDDY_SIZES_MAX];
/** Array of sentinel records ("buffer pool watches") used by purge to detect
whether a watched page has been read into the buffer pool */
buf_page_t *watch;
};
2.2.1 Memory Lists
1) free list: the free list. Blocks are taken from here when memory is needed; if it is empty, pages are evicted from the LRU list or dirty pages are flushed from the flush list to refill it.
2) LRU list: the data list. It has two parts, an old sublist and a young sublist. A page read in for the first time goes into the old sublist, and only a later access that meets certain conditions promotes it to the young sublist; this prevents a full table scan from evicting large numbers of frequently used pages (see the sketch after this list).
3) flush list: holds pages that have been modified but not yet flushed, ordered by modification time; it is traversed when flushing is needed.
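A toy version of the midpoint-insertion LRU described in 2), written as two explicit lists for clarity (the real InnoDB code keeps a single list plus an LRU_old pointer, and promotion also depends on how long the page has stayed in the old sublist, see innodb_old_blocks_time):

#include <cstdint>
#include <list>
#include <unordered_map>

// New pages enter at the head of the "old" sublist; a later access promotes
// the page to the head of the "young" sublist, so one full scan cannot push
// hot pages out. Eviction takes victims from the tail of the old sublist.
struct MidpointLRU {
  std::list<uint64_t> young;                       // hot pages, MRU at front
  std::list<uint64_t> old_list;                    // newly read pages
  std::unordered_map<uint64_t, bool> is_young;     // page id -> which sublist

  void access(uint64_t page_id) {
    auto it = is_young.find(page_id);
    if (it == is_young.end()) {
      // First access: insert at the head of the old sublist.
      old_list.push_front(page_id);
      is_young[page_id] = false;
    } else if (!it->second) {
      // Subsequent access: promote to the young sublist.
      old_list.remove(page_id);
      young.push_front(page_id);
      it->second = true;
    } else {
      // Already young: move to the front ("make young").
      young.remove(page_id);
      young.push_front(page_id);
    }
  }

  bool evict(uint64_t *victim) {
    std::list<uint64_t> &src = old_list.empty() ? young : old_list;
    if (src.empty()) return false;
    *victim = src.back();
    src.pop_back();
    is_young.erase(*victim);
    return true;
  }
};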
2.3 Log Buffer
The Log Buffer exists to batch log writes to disk, reducing disk I/O and improving write efficiency. Its core structure is log_t, which is essentially a contiguous byte array together with the bookkeeping around it, such as its size and the LSN up to which the log has been written to disk.
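A toy log buffer capturing just that idea (not the InnoDB log_t code; real writes go through block-aligned I/O and fsync, and are triggered when the buffer fills or a transaction commits):

#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyLogBuffer {
  std::vector<uint8_t> buf;            // the contiguous in-memory byte array
  uint64_t lsn = 0;                    // next byte position to be assigned
  uint64_t written_to_disk_lsn = 0;    // everything below this is on disk

  explicit ToyLogBuffer(size_t capacity) { buf.reserve(capacity); }

  // Append a record and return the LSN at its end.
  uint64_t append(const void *rec, size_t len) {
    const auto *p = static_cast<const uint8_t *>(rec);
    buf.insert(buf.end(), p, p + len);
    lsn += len;
    return lsn;
  }

  // Flush the whole buffer in one batched write (stand-in for write + fsync).
  void flush() {
    written_to_disk_lsn = lsn;
    buf.clear();
  }
};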
2.4 Adaptive Hash Index
The adaptive hash index is a hash index that InnoDB builds automatically for hot data. A normal InnoDB index is a B+ tree, so every lookup must walk from the root down to a leaf, which carries some traversal overhead; building a hash table keyed on an index key prefix removes much of that cost. Its core structure is btr_search_sys_t, which contains the hash table whose entries point into index pages; it can be thought of as a hash table built over frequently accessed index pages.
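A toy sketch of the idea (not btr_search_sys_t; the threshold and names here are made up):

#include <string>
#include <unordered_map>

// Once a key prefix has been looked up often enough, cache a direct pointer
// to where the record lives so the next lookup skips the B+-tree descent.
struct Record;   // whatever the leaf-page record is

struct ToyAdaptiveHash {
  std::unordered_map<std::string, const Record *> table;  // key prefix -> record
  std::unordered_map<std::string, int> hits;              // lookup counters

  const Record *lookup(const std::string &key_prefix) {
    auto it = table.find(key_prefix);
    return it == table.end() ? nullptr : it->second;
  }

  // Called after a successful B+-tree search: start caching the position once
  // the same prefix has been searched often enough.
  void maybe_cache(const std::string &key_prefix, const Record *rec) {
    if (++hits[key_prefix] >= 16) {
      table[key_prefix] = rec;
    }
  }
};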
2.5 Change Buffer
The Change Buffer is mainly used to optimize writes to non-unique secondary indexes. By deferring and batching these writes it reduces random I/O and improves write performance, especially under workloads with many inserts, updates, and deletes. The idea is to record the pending changes grouped by the index page they target, so that all changes to one page are collected together; when that page is later loaded into the buffer pool, the buffered changes are merged into it in one go and the result is flushed to disk in batches.
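A toy sketch of the buffering-and-merge idea (not the real InnoDB change buffer, which persists its buffered changes in its own structure inside the system tablespace):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Buffer secondary-index changes grouped by the page they target, and apply
// them all at once when that page is finally read in.
struct BufferedChange {
  enum class Op { Insert, DeleteMark, Delete } op;  // roughly the real operation set
  std::vector<uint8_t> index_entry;                 // the secondary index record
};

struct ToyChangeBuffer {
  // page id -> pending changes for that page
  std::map<uint64_t, std::vector<BufferedChange>> pending;

  void buffer(uint64_t page_id, BufferedChange change) {
    pending[page_id].push_back(std::move(change));
  }

  // When the page is loaded into the buffer pool, merge everything queued for
  // it; the merged page is then flushed like any other dirty page.
  void merge_on_read(uint64_t page_id /*, Page &page */) {
    auto it = pending.find(page_id);
    if (it == pending.end()) return;
    for (const BufferedChange &c : it->second) {
      // apply c to the in-memory page here
      (void)c;
    }
    pending.erase(it);
  }
};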