Reference: http://blog.jcole.us/2013/01/04/page-management-in-innodb-space-files/ by Jeremy Cole.
1. Index file segment structure
From the buffer pool organization covered in the seventh article of this series, we already know that an index page (specifically, the root page) carries two Fseg Headers, one for the leaf pages and one for the non-leaf pages.
In a newly created table, the only page that exists will be the root page, which is also a leaf page, but is present in the “internal” file segment (so that it doesn’t have to be moved later). The “leaf” file segment INODE lists and fragment array will all be empty. The “internal” file segment INODE lists will all be empty, and the single root page will be in the fragment array.
The figure below shows the segment usage at one point in time (note: in the figure, the line drawn above 5 in the Extent column should actually sit above 4):
The index root page points to two inodes (file segments), each of which has a fragment array (pointing to up to 32 individual pages from a fragment list), as well as several lists of whole extents, which are linked together using list pointers in the extent descriptors. The extent descriptors are used both to reference an extent and to keep track of free pages within an extent.
A multi-level index tree (overly simplified) in InnoDB looks like:
Since the root page is allocated when the index is first created, and that page number is stored in the data dictionary, the root page can never be relocated or removed. Once the root page fills up, it will need to be split, forming a small tree of a root page plus two leaf pages.
However, the root page itself cannot actually be split, since it cannot be relocated. Instead, a new, empty page is allocated, the records in the root are moved there (the root is "raised" a level), and that new page is split into two. The root page then does not need to be split again until the level immediately below it has enough pages that the root becomes full of child page pointers (called "node pointers"), which in practice often means several hundred to more than a thousand.
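To make the two Fseg Headers in the root page concrete, here is a minimal stand-alone sketch (not InnoDB code) that reads an index root page out of a tablespace file and prints both segment inode pointers. It assumes the default uncompressed 16 KB page size, the 38-byte FIL header, the leaf/non-leaf FSEG headers at offsets 36 and 46 inside the index page header, and the 4+4+2 byte FSEG header layout (space id, inode page number, inode byte offset); adjust these if your build differs.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define PAGE_SIZE         16384
#define FIL_PAGE_DATA     38                      /* end of the FIL header */
#define PAGE_BTR_SEG_LEAF (FIL_PAGE_DATA + 36)    /* leaf segment header */
#define PAGE_BTR_SEG_TOP  (FIL_PAGE_DATA + 46)    /* non-leaf segment header */

static unsigned read_u32(const unsigned char* p)  /* big-endian */
{
        return ((unsigned)p[0] << 24) | ((unsigned)p[1] << 16)
             | ((unsigned)p[2] << 8) | (unsigned)p[3];
}

static void print_fseg(const char* name, const unsigned char* h)
{
        printf("%s: space=%u, inode page=%u, inode offset=%u\n",
               name, read_u32(h), read_u32(h + 4),
               (unsigned)((h[8] << 8) | h[9]));
}

int main(int argc, char** argv)
{
        unsigned char   page[PAGE_SIZE];
        FILE*           f;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <tablespace file> <root page no>\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "rb");
        if (f == NULL) { perror("fopen"); return 1; }
        fseek(f, atol(argv[2]) * PAGE_SIZE, SEEK_SET);
        if (fread(page, 1, PAGE_SIZE, f) != PAGE_SIZE) {
                fprintf(stderr, "short read\n");
                fclose(f);
                return 1;
        }
        print_fseg("PAGE_BTR_SEG_LEAF (leaf segment)    ", page + PAGE_BTR_SEG_LEAF);
        print_fseg("PAGE_BTR_SEG_TOP  (non-leaf segment)", page + PAGE_BTR_SEG_TOP);
        fclose(f);
        return 0;
}

For a file-per-table .ibd the clustered index root is typically page 3; for tables stored inside ibdata1, use the PAGE_NO recorded in SYS_INDEXES (for example page 8 for SYS_TABLES in the dict_boot() walkthrough below).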
2. dict_ind_init()
Create dummy indexes for infimum and supremum records.
void
dict_ind_init(void)
/*===============*/
{
        dict_table_t*   table;

        /* create dummy table and index for REDUNDANT infimum and supremum */
        table = dict_mem_table_create("SYS_DUMMY1", DICT_HDR_SPACE, 1, 0);
        // Creates a table memory object; DICT_HDR_SPACE means space 0, and
        // the argument 1 is the number of columns (one column).
        dict_mem_table_add_col(table, NULL, NULL, DATA_CHAR,
                               DATA_ENGLISH | DATA_NOT_NULL, 8);

        dict_ind_redundant = dict_mem_index_create("SYS_DUMMY1", "SYS_DUMMY1",
                                                   DICT_HDR_SPACE, 0, 1);
        // Creates and initializes a dict_index_t; note that its page number
        // is left undefined, i.e. there is no root page.
        dict_index_add_col(dict_ind_redundant, table,
                           dict_table_get_nth_col(table, 0), 0);
        dict_ind_redundant->table = table;

        /* create dummy table and index for COMPACT infimum and supremum */
        table = dict_mem_table_create("SYS_DUMMY2",
                                      DICT_HDR_SPACE, 1, DICT_TF_COMPACT);
        dict_mem_table_add_col(table, NULL, NULL, DATA_CHAR,
                               DATA_ENGLISH | DATA_NOT_NULL, 8);
        dict_ind_compact = dict_mem_index_create("SYS_DUMMY2", "SYS_DUMMY2",
                                                 DICT_HDR_SPACE, 0, 1);
        dict_index_add_col(dict_ind_compact, table,
                           dict_table_get_nth_col(table, 0), 0);
        dict_ind_compact->table = table;

        /* avoid ut_ad(index->cached) in dict_index_get_n_unique_in_tree */
        dict_ind_redundant->cached = dict_ind_compact->cached = TRUE;
}
3. dict_boot(): the data dictionary
Initializes the data dictionary memory structures when the database is started.
Call flow: innobase_init() -----> dict_boot()
void
dict_boot(void)
/*===========*/
{
        dict_table_t*   table;
        dict_index_t*   index;
        dict_hdr_t*     dict_hdr;
        mem_heap_t*     heap;
        mtr_t           mtr;
        ulint           error;

        mtr_start(&mtr);

        /* Create the hash tables etc. */
        dict_init();    // Initializes the global dict_sys_t* dict_sys.

        heap = mem_heap_create(450);

        mutex_enter(&(dict_sys->mutex));

        /* Get the dictionary header */
        dict_hdr = dict_hdr_get(&mtr);
        // Reads the data dictionary header from page 7 of space 0; it sits
        // right after the 38-byte FIL header. The layout of this header is
        // listed after this function.
        /* Because we only write new row ids to disk-based data structure
        (dictionary header) when it is divisible by
        DICT_HDR_ROW_ID_WRITE_MARGIN, in recovery we will not recover
        the latest value of the row id counter. Therefore we advance
        the counter at the database startup to avoid overlapping values.
        Note that when a user after database startup first time asks for
        a new row id, then because the counter is now divisible by
        ..._MARGIN, it will immediately be updated to the disk-based
        header. */

        dict_sys->row_id = DICT_HDR_ROW_ID_WRITE_MARGIN
                + ut_uint64_align_up(mach_read_from_8(dict_hdr + DICT_HDR_ROW_ID),
                                     DICT_HDR_ROW_ID_WRITE_MARGIN);
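        // Worked example (assuming DICT_HDR_ROW_ID_WRITE_MARGIN == 256):
        // if the on-disk header stores row id 1000, the counter restarts at
        // 256 + ut_uint64_align_up(1000, 256) = 256 + 1024 = 1280, which is
        // safely above any row id handed out before the last header write.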
        /* Insert into the dictionary cache the descriptions of the basic
        system tables */
        /*-------------------------*/
        table = dict_mem_table_create("SYS_TABLES", DICT_HDR_SPACE, 8, 0);

        dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0);
        /* ROW_FORMAT = (N_COLS >> 31) ? COMPACT : REDUNDANT */
        dict_mem_table_add_col(table, heap, "N_COLS", DATA_INT, 0, 4);
        /* TYPE is either DICT_TABLE_ORDINARY, or (TYPE & DICT_TF_COMPACT)
        and (TYPE & DICT_TF_FORMAT_MASK) are nonzero and TYPE = table->flags */
        dict_mem_table_add_col(table, heap, "TYPE", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "MIX_ID", DATA_BINARY, 0, 0);
        /* MIX_LEN may contain additional table flags when
        ROW_FORMAT!=REDUNDANT. Currently, these flags include
        DICT_TF2_TEMPORARY. */
        dict_mem_table_add_col(table, heap, "MIX_LEN", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "CLUSTER_NAME", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "SPACE", DATA_INT, 0, 4);
        table->id = DICT_TABLES_ID;

        dict_table_add_to_cache(table, heap);
        dict_sys->sys_tables = table;
        mem_heap_empty(heap);

        index = dict_mem_index_create("SYS_TABLES", "CLUST_IND",
                                      DICT_HDR_SPACE,
                                      DICT_UNIQUE | DICT_CLUSTERED, 1);
        // Creates the clustered index of SYS_TABLES.
        dict_mem_index_add_field(index, "NAME", 0);

        index->id = DICT_TABLES_ID;

        error = dict_index_add_to_cache(table, index,
                                        mtr_read_ulint(dict_hdr
                                                       + DICT_HDR_TABLES,
                                                       MLOG_4BYTES, &mtr),
                                        FALSE);
        // In this example DICT_HDR_TABLES holds page 8 (pages 0-7 are
        // reserved, so it is quite "natural" that page 8 becomes the root of
        // the SYS_TABLES index), i.e. page 8 stores the root of the table
        // index tree. The records in this B-tree presumably hold each
        // table's name, table id and other information.
        // dict_index_add_to_cache() adds the index to the dictionary cache
        // and also links it into table->indexes.
        ut_a(error == DB_SUCCESS);
        /*-------------------------*/
        index = dict_mem_index_create("SYS_TABLES", "ID_IND",
                                      DICT_HDR_SPACE, DICT_UNIQUE, 1);
        // Creates a secondary index on SYS_TABLES for lookups by table ID.
        dict_mem_index_add_field(index, "ID", 0);

        index->id = DICT_TABLE_IDS_ID;

        error = dict_index_add_to_cache(table, index,
                                        mtr_read_ulint(dict_hdr
                                                       + DICT_HDR_TABLE_IDS,
                                                       MLOG_4BYTES, &mtr),
                                        FALSE);
        // In this example DICT_HDR_TABLE_IDS holds page 9.
        ut_a(error == DB_SUCCESS);
        /*-------------------------*/
        table = dict_mem_table_create("SYS_COLUMNS", DICT_HDR_SPACE, 7, 0);

        dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "MTYPE", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "PRTYPE", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "LEN", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "PREC", DATA_INT, 0, 4);
        table->id = DICT_COLUMNS_ID;

        dict_table_add_to_cache(table, heap);
        dict_sys->sys_columns = table;
        mem_heap_empty(heap);

        index = dict_mem_index_create("SYS_COLUMNS", "CLUST_IND",
                                      DICT_HDR_SPACE,
                                      DICT_UNIQUE | DICT_CLUSTERED, 2);
        // Creates the clustered index of SYS_COLUMNS.
        dict_mem_index_add_field(index, "TABLE_ID", 0);
        dict_mem_index_add_field(index, "POS", 0);

        index->id = DICT_COLUMNS_ID;

        error = dict_index_add_to_cache(table, index,
                                        mtr_read_ulint(dict_hdr
                                                       + DICT_HDR_COLUMNS,
                                                       MLOG_4BYTES, &mtr),
                                        FALSE);
        // In this example DICT_HDR_COLUMNS holds page 10.
        ut_a(error == DB_SUCCESS);
        /*-------------------------*/
        table = dict_mem_table_create("SYS_INDEXES", DICT_HDR_SPACE, 7, 0);

        dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "N_FIELDS", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "TYPE", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "SPACE", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "PAGE_NO", DATA_INT, 0, 4);

        /* The '+ 2' below comes from the fields DB_TRX_ID, DB_ROLL_PTR */
#if DICT_SYS_INDEXES_PAGE_NO_FIELD != 6 + 2
#error "DICT_SYS_INDEXES_PAGE_NO_FIELD != 6 + 2"
#endif
#if DICT_SYS_INDEXES_SPACE_NO_FIELD != 5 + 2
#error "DICT_SYS_INDEXES_SPACE_NO_FIELD != 5 + 2"
#endif
#if DICT_SYS_INDEXES_TYPE_FIELD != 4 + 2
#error "DICT_SYS_INDEXES_TYPE_FIELD != 4 + 2"
#endif
#if DICT_SYS_INDEXES_NAME_FIELD != 2 + 2
#error "DICT_SYS_INDEXES_NAME_FIELD != 2 + 2"
#endif

        table->id = DICT_INDEXES_ID;
        dict_table_add_to_cache(table, heap);
        dict_sys->sys_indexes = table;
        mem_heap_empty(heap);

        index = dict_mem_index_create("SYS_INDEXES", "CLUST_IND",
                                      DICT_HDR_SPACE,
                                      DICT_UNIQUE | DICT_CLUSTERED, 2);
        // Creates the clustered index of SYS_INDEXES.
        dict_mem_index_add_field(index, "TABLE_ID", 0);
        dict_mem_index_add_field(index, "ID", 0);

        index->id = DICT_INDEXES_ID;

        error = dict_index_add_to_cache(table, index,
                                        mtr_read_ulint(dict_hdr
                                                       + DICT_HDR_INDEXES,
                                                       MLOG_4BYTES, &mtr),
                                        FALSE);
        // In this example DICT_HDR_INDEXES holds page 11.
        ut_a(error == DB_SUCCESS);
        /*-------------------------*/
        table = dict_mem_table_create("SYS_FIELDS", DICT_HDR_SPACE, 3, 0);

        dict_mem_table_add_col(table, heap, "INDEX_ID", DATA_BINARY, 0, 0);
        dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4);
        dict_mem_table_add_col(table, heap, "COL_NAME", DATA_BINARY, 0, 0);
        table->id = DICT_FIELDS_ID;

        dict_table_add_to_cache(table, heap);
        dict_sys->sys_fields = table;
        mem_heap_free(heap);

        index = dict_mem_index_create("SYS_FIELDS", "CLUST_IND",
                                      DICT_HDR_SPACE,
                                      DICT_UNIQUE | DICT_CLUSTERED, 2);
        // Creates the clustered index of SYS_FIELDS.
        dict_mem_index_add_field(index, "INDEX_ID", 0);
        dict_mem_index_add_field(index, "POS", 0);

        index->id = DICT_FIELDS_ID;

        error = dict_index_add_to_cache(table, index,
                                        mtr_read_ulint(dict_hdr
                                                       + DICT_HDR_FIELDS,
                                                       MLOG_4BYTES, &mtr),
                                        FALSE);
        // In this example DICT_HDR_FIELDS holds page 12.
        ut_a(error == DB_SUCCESS);
        mtr_commit(&mtr);

        /*-------------------------*/
        /* Initialize the insert buffer table and index for each tablespace */
        ibuf_init_at_db_start();
        // Initializes the insert buffer; this also calls
        // dict_index_add_to_cache(), with the insert buffer root at page 4.

        /* Load definitions of other indexes on system tables */
        dict_load_sys_table(dict_sys->sys_tables);
        dict_load_sys_table(dict_sys->sys_columns);
        dict_load_sys_table(dict_sys->sys_indexes);
        dict_load_sys_table(dict_sys->sys_fields);
        // SYS_INDEXES does not seem to contain rows for these table IDs;
        // these calls merely make sure that no extra indexes on the system
        // tables are recorded there. The indexes built above have already
        // been added to the dictionary cache.

        mutex_exit(&(dict_sys->mutex));
}
/* Dictionary header offsets */
#define DICT_HDR_ROW_ID         0       /* The latest assigned row id */
#define DICT_HDR_TABLE_ID       8       /* The latest assigned table id */
#define DICT_HDR_INDEX_ID       16      /* The latest assigned index id */
#define DICT_HDR_MAX_SPACE_ID   24      /* The latest assigned space id, or 0 */
#define DICT_HDR_MIX_ID_LOW     28      /* Obsolete, always DICT_HDR_FIRST_ID */
#define DICT_HDR_TABLES         32      /* Root of the table index tree */
#define DICT_HDR_TABLE_IDS      36      /* Root of the table index tree
                                        based on id */
#define DICT_HDR_COLUMNS        40      /* Root of the column index tree */
#define DICT_HDR_INDEXES        44      /* Root of the index index tree */
#define DICT_HDR_FIELDS         48      /* Root of the index field index tree */
#define DICT_HDR_FSEG_HEADER    56      /* Segment header for the tablespace
                                        segment into which the dictionary
                                        header is created */
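Using the offsets above, the dictionary header can be inspected directly from the system tablespace. The following stand-alone sketch (not InnoDB code) reads page 7 of ibdata1 and prints the five index root page numbers; the 16 KB page size, the page-7 location and the 38-byte FIL header offset are the defaults assumed here. With the example values noted in the comments of dict_boot() above, it would print 8, 9, 10, 11 and 12.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE        16384
#define DICT_HDR_PAGE_NO 7      /* the dictionary header page in space 0 */
#define FIL_PAGE_DATA    38     /* the header starts after the FIL header */

static unsigned read_u32(const unsigned char* p)        /* big-endian */
{
        return ((unsigned)p[0] << 24) | ((unsigned)p[1] << 16)
             | ((unsigned)p[2] << 8) | (unsigned)p[3];
}

int main(void)
{
        unsigned char   page[PAGE_SIZE];
        FILE*           f = fopen("ibdata1", "rb");

        if (f == NULL) { perror("fopen"); return 1; }
        fseek(f, (long) DICT_HDR_PAGE_NO * PAGE_SIZE, SEEK_SET);
        if (fread(page, 1, PAGE_SIZE, f) != PAGE_SIZE) { fclose(f); return 1; }

        const unsigned char* hdr = page + FIL_PAGE_DATA;
        printf("DICT_HDR_TABLES    (SYS_TABLES CLUST_IND root): %u\n", read_u32(hdr + 32));
        printf("DICT_HDR_TABLE_IDS (SYS_TABLES ID_IND root)   : %u\n", read_u32(hdr + 36));
        printf("DICT_HDR_COLUMNS   (SYS_COLUMNS root)         : %u\n", read_u32(hdr + 40));
        printf("DICT_HDR_INDEXES   (SYS_INDEXES root)         : %u\n", read_u32(hdr + 44));
        printf("DICT_HDR_FIELDS    (SYS_FIELDS root)          : %u\n", read_u32(hdr + 48));

        fclose(f);
        return 0;
}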
Because the data dictionary contains a set of tables that record the information of all other tables, and their root pages are stored in the dictionary header, we can quickly find the information of any table we want starting from these root pages (for example, obtaining the root page location of the table we are looking for from SYS_INDEXES). Presumably the lookup first gets the TABLE_ID from SYS_TABLES by table name, and then gets the table's root page number from SYS_INDEXES by TABLE_ID. (dict_table_get_low() obtains a table by table_name and loads it into the dictionary if necessary.)
For further background: http://www.mysqlperformanceblog.com/2013/04/22/how-to-recover-table-structure-from-innodb-dictionary/
4. dict_table_add_to_cache()
void
dict_table_add_to_cache(
/*====================*/
        dict_table_t*   table,  /*!< in: table */
        mem_heap_t*     heap)   /*!< in: temporary heap */
{
        ulint   fold;
        ulint   id_fold;
        ulint   i;
        ulint   row_len;

        /* The lower limit for what we consider a "big" row */
#define BIG_ROW_SIZE 1024

        ut_ad(mutex_own(&(dict_sys->mutex)));

        dict_table_add_system_columns(table, heap);
        // Adds the three system columns "DB_ROW_ID", "DB_TRX_ID" and
        // "DB_ROLL_PTR" to the table object; this step is essential.

        table->cached = TRUE;

        fold = ut_fold_string(table->name);
        id_fold = ut_fold_ull(table->id);

        row_len = 0;
        for (i = 0; i < table->n_def; i++) {    // computes row_len, the total length of all columns
                ulint col_len = dict_col_get_max_size(
                        dict_table_get_nth_col(table, i));

                row_len += col_len;

                /* If we have a single unbounded field, or several gigantic
                fields, mark the maximum row size as BIG_ROW_SIZE. */
                if (row_len >= BIG_ROW_SIZE || col_len >= BIG_ROW_SIZE) {
                        row_len = BIG_ROW_SIZE;
                        break;
                }
        }

        table->big_rows = row_len >= BIG_ROW_SIZE;
        /* Look for a table with the same name: error if such exists */
        {
                dict_table_t*   table2;
                HASH_SEARCH(name_hash, dict_sys->table_hash, fold,
                            dict_table_t*, table2, ut_ad(table2->cached),
                            ut_strcmp(table2->name, table->name) == 0);
                ut_a(table2 == NULL);
        }

        /* Look for a table with the same id: error if such exists */
        {
                dict_table_t*   table2;
                HASH_SEARCH(id_hash, dict_sys->table_id_hash, id_fold,
                            dict_table_t*, table2, ut_ad(table2->cached),
                            table2->id == table->id);
                ut_a(table2 == NULL);
        }

        /* Add table to hash table of tables */
        HASH_INSERT(dict_table_t, name_hash, dict_sys->table_hash, fold,
                    table);

        /* Add table to hash table of tables based on table id */
        HASH_INSERT(dict_table_t, id_hash, dict_sys->table_id_hash, id_fold,
                    table);

        /* Add table to LRU list of tables */
        UT_LIST_ADD_FIRST(table_LRU, dict_sys->table_LRU, table);

        dict_sys->size += mem_heap_get_size(table->heap)
                + strlen(table->name) + 1;
}
Summary:
dict_table_add_to_cache() means that a dict_table_t* has been built and linked into dict_sys. When a table is needed, InnoDB first looks it up in the hash tables of dict_sys; if found, the table's indexes and other metadata can be used directly. If not found, the location of the table's information in the file is determined via dict_sys->sys_tables (of type dict_table_t*) and sys_indexes, the definition is read from disk, and then dict_table_add_to_cache() is called.
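A toy model of that lookup-or-load flow, for illustration only: the fixed array below stands in for dict_sys's name/id hash tables, and load_from_data_dictionary() stands in for the real scan of SYS_TABLES/SYS_INDEXES; none of these helper names exist in InnoDB.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct toy_table { char name[64]; unsigned root_page; };

#define CACHE_SIZE 16
static struct toy_table* cache[CACHE_SIZE];
static int n_cached = 0;

/* Stand-in for reading the table definition from SYS_TABLES/SYS_INDEXES. */
static struct toy_table* load_from_data_dictionary(const char* name)
{
        struct toy_table* t = malloc(sizeof(*t));
        snprintf(t->name, sizeof(t->name), "%s", name);
        t->root_page = 8;       /* pretend SYS_INDEXES reported page 8 */
        return t;
}

static struct toy_table* get_table(const char* name)
{
        int i;
        for (i = 0; i < n_cached; i++) {        /* 1. look in the cache */
                if (strcmp(cache[i]->name, name) == 0) return cache[i];
        }
        struct toy_table* t = load_from_data_dictionary(name);  /* 2. load */
        if (n_cached < CACHE_SIZE) cache[n_cached++] = t;       /* 3. cache it */
        return t;
}

int main(void)
{
        struct toy_table* a = get_table("test/t1");     /* loaded from "disk" */
        struct toy_table* b = get_table("test/t1");     /* served from cache */
        printf("root page = %u, cache hit = %s\n",
               a->root_page, a == b ? "yes" : "no");
        return 0;
}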