PostgreSQL B+ Tree Index: Insertion

Article link: https://blog.youkuaiyun.com/obvious__/article/details/120364158

B+ Tree Index: Insertion

Prerequisites

"B+ Tree Index: Query"

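Overview

Once the B+ tree lookup logic is understood, insertion is fairly straightforward: locate the insert position, then insert the key. When a node fills up it splits. Node splitting is a complex process that a separate article will cover. This article addresses the following questions:

The basic insertion flow.
What gets stored in the index? Is the tid treated as an index column?
How is the insert position located?
How are unique and primary-key indexes handled?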

Basic Insertion Flow

The entry point for index insertion in PostgreSQL is btinsert, implemented as follows:

bool
btinsert(Relation rel, Datum *values, bool *isnull,
		 ItemPointer ht_ctid, Relation heapRel,
		 IndexUniqueCheck checkUnique)
{
	bool		result;
	IndexTuple	itup;

	/* generate an index tuple
	 * step1: build an IndexTuple
	 */
	itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
	itup->t_tid = *ht_ctid;

	/* step2: insert the IndexTuple into the B+ tree index */
	result = _bt_doinsert(rel, itup, checkUnique, heapRel);

	pfree(itup);

	return result;
}

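This code is very simple, with just two steps:

step1: build an IndexTuple
step2: insert the IndexTuple into the B+ tree index

IndexTuple, as the name suggests, is an index tuple. It is defined as follows: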

typedef struct IndexTupleData
{
	ItemPointerData t_tid;		/* reference TID to heap tuple */

	/* ---------------
	 * t_info is laid out in the following fashion:
	 *
	 * 15th (high) bit: has nulls
	 * 14th bit: has var-width attributes
	 * 13th bit: unused
	 * 12-0 bit: size of tuple
	 * ---------------
	 */

	unsigned short t_info;		/* various info about tuple */

} IndexTupleData;				/* MORE DATA FOLLOWS AT END OF STRUCT */
typedef IndexTupleData *IndexTuple;

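The struct itself declares only two fields:

t_tid: the tuple id of the heap tuple, analogous to Oracle's rowid.
t_info: flag bits plus the tuple size; for example, the high bit records whether any indexed column is null.

If the index tuple contains nulls, IndexTupleData is followed by a bitmap that marks which index columns are null. The bitmap struct is: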

typedef struct IndexAttributeBitMapData
{
	bits8		bits[(INDEX_MAX_KEYS + 8 - 1) / 8];
}	IndexAttributeBitMapData;

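Note that because IndexTupleData does not record the number of index columns, the bitmap size does not vary with the number of columns: it is always INDEX_MAX_KEYS bits (INDEX_MAX_KEYS is the upper limit on the number of index columns).

After that comes the actual column data. It is laid out the same way as a heap tuple's data, and both are filled in by heap_fill_tuple. This answers the second question. An index tuple is laid out as:

IndexTupleData + IndexAttributeBitMapData + index column data

If no column is null, IndexAttributeBitMapData is omitted. In the code shown here, the tid is not treated as an index column.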
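To make the layout concrete, here is a minimal sketch that decodes t_info and computes where the column data begins. The mask values follow the bit layout in the struct comment above. The helper names (idx_tuple_size, idx_has_nulls, idx_has_varwidth, idx_data_offset) are hypothetical, and the 8-byte header, INDEX_MAX_KEYS = 32 and 8-byte alignment are assumptions: they are common defaults, but they are build-time settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit layout of t_info, following the struct comment above. */
#define IDX_SIZE_MASK 0x1FFF	/* bits 12-0: size of tuple */
#define IDX_VAR_MASK  0x4000	/* bit 14: has var-width attributes */
#define IDX_NULL_MASK 0x8000	/* bit 15: has nulls */

/* Assumed build-time constants (common defaults, not guaranteed). */
#define IDX_MAX_KEYS    32
#define IDX_HEADER_SIZE 8		/* 6-byte t_tid + 2-byte t_info */
#define IDX_ALIGN       8		/* assumed MAXALIGN boundary */

static size_t idx_tuple_size(uint16_t t_info)
{
	return t_info & IDX_SIZE_MASK;
}

static int idx_has_nulls(uint16_t t_info)
{
	return (t_info & IDX_NULL_MASK) != 0;
}

static int idx_has_varwidth(uint16_t t_info)
{
	return (t_info & IDX_VAR_MASK) != 0;
}

/* Round len up to the next multiple of IDX_ALIGN. */
static size_t idx_align(size_t len)
{
	return (len + IDX_ALIGN - 1) & ~(size_t) (IDX_ALIGN - 1);
}

/*
 * Offset of the column data from the start of the tuple: the header,
 * plus the fixed-size null bitmap when the "has nulls" bit is set,
 * rounded up to the alignment boundary.
 */
static size_t idx_data_offset(uint16_t t_info)
{
	size_t		off = IDX_HEADER_SIZE;

	if (idx_has_nulls(t_info))
		off += (IDX_MAX_KEYS + 8 - 1) / 8;	/* bitmap is always full size */
	return idx_align(off);
}
```

Under these assumptions a tuple without nulls starts its data at offset 8, while a tuple with nulls pays for the full 4-byte bitmap plus padding and starts its data at offset 16, regardless of how many columns the index has.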

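Should the tid be an index column?

One alternative design is to treat the tid as an index column. The benefit is that, since a tid is globally unique, every index effectively becomes a unique index, and index deletion can descend directly to the entry being removed. Without the tid as a column, an index containing many duplicate keys forces deletion to scan all the duplicates to find the entry with the matching tid. The drawback is that insertion into primary-key and unique indexes becomes more expensive: the uniqueness check must ignore the tid, but locating the insert position must include it.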
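The trade-off can be illustrated with two comparators over a toy entry type. This is only a sketch of the alternative design described above, not PostgreSQL code: ToyTid, ToyEntry, cmp_key_only, cmp_key_tid and find_exact are all hypothetical names, and keys are plain ints.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t block; uint16_t offset; } ToyTid;
typedef struct { int key; ToyTid tid; } ToyEntry;

/* Order by user key only: duplicates compare equal (used for uniqueness). */
static int cmp_key_only(const ToyEntry *a, const ToyEntry *b)
{
	return (a->key > b->key) - (a->key < b->key);
}

/* Order by user key, then by tid as a tiebreaker: no two entries are equal. */
static int cmp_key_tid(const void *pa, const void *pb)
{
	const ToyEntry *a = pa;
	const ToyEntry *b = pb;
	int			c = cmp_key_only(a, b);

	if (c != 0)
		return c;
	if (a->tid.block != b->tid.block)
		return (a->tid.block > b->tid.block) - (a->tid.block < b->tid.block);
	return (a->tid.offset > b->tid.offset) - (a->tid.offset < b->tid.offset);
}

/*
 * Locate the exact entry to delete in an array sorted by (key, tid).
 * Returns its index, or -1 if it is not present.
 */
static long find_exact(const ToyEntry *entries, size_t n, ToyEntry target)
{
	const ToyEntry *hit = bsearch(&target, entries, n,
								  sizeof(ToyEntry), cmp_key_tid);

	return hit ? (long) (hit - entries) : -1;
}
```

With the tid in the ordering, find_exact reaches one specific duplicate by binary search instead of scanning every entry with the same key. The uniqueness check, however, still has to use cmp_key_only, which is the extra cost mentioned above.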

_bt_doinsert

Once the index tuple (itup) has been built, _bt_doinsert is called to insert it into the index. Here is its implementation:

bool
_bt_doinsert(Relation rel, IndexTuple itup,
			 IndexUniqueCheck checkUnique, Relation heapRel)
{
	bool		is_unique = false;
	int			natts = rel->rd_rel->relnatts;
	ScanKey		itup_scankey;
	BTStack		stack;
	Buffer		buf;
	OffsetNumber offset;

	/* we need an insertion scan key to do our search, so build one 
	 * step1: build the scan key
	 */
	itup_scankey = _bt_mkscankey(rel, itup);

top:
	/* find the first page containing this key 
	 * step2: find the leaf page to insert into
	 */
	stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE, NULL);

	offset = InvalidOffsetNumber;

	/* trade in our read lock for a write lock 
	 * step3: trade the shared lock for an exclusive lock
	 */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	LockBuffer(buf, BT_WRITE);

	/*
	 * If the page was split between the time that we surrendered our read
	 * lock and acquired our write lock, then this page may no longer be the
	 * right place for the key we want to insert.  In this case, we need to
	 * move right in the tree.  See Lehman and Yao for an excruciatingly
	 * precise description.
	 * step4: move right
	 */
	buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
						true, stack, BT_WRITE, NULL);

	if (checkUnique != UNIQUE_CHECK_NO)
	{
		/* uniqueness check (elided here) */
	}

	if (checkUnique != UNIQUE_CHECK_EXISTING)
	{
		/*
		 * The only conflict predicate locking cares about for indexes is when
		 * an index tuple insert conflicts with an existing lock.  Since the
		 * actual location of the insert is hard to predict because of the
		 * random search used to prevent O(N^2) performance when there are
		 * many duplicate entries, we can just use the "first valid" page.
		 */
		CheckForSerializableConflictIn(rel, NULL, buf);
		/* do the insertion 
		 * step5: find the insert location
		 */
		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
						  stack, heapRel);
		/* step6: perform the insertion */
		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
	}
	else
	{
		/* just release the buffer */
		_bt_relbuf(rel, buf);
	}

	/* be tidy */
	_bt_freestack(stack);
	_bt_freeskey(itup_scankey);

	return is_unique;
}

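The code above breaks down into roughly six steps:

step1: build the scan key

The scan key is used to locate the insert position.
step2: find the leaf page to insert into

As with a lookup, the first task is to descend to the page that will receive the insertion.
step3: trade the shared lock for an exclusive lock

The page returned by step2 is held under a shared lock. Since we are about to modify it, the shared lock has to be replaced with an exclusive lock.
step4: move right

Step3 unlocks the page before re-locking it exclusively. With multiple backend processes, another process can lock the page during that window and insert into it, and that insertion may split the page. So once the current process holds the exclusive lock, it must perform a move right to find the page the key really belongs on.
step5: find the insert location

With the page settled, the next task is to decide where within that page the index tuple should be inserted.
step6: perform the insertion

With the position known, the actual insertion takes place.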
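The move-right rule in step4 can be sketched in a few lines. This is a toy model under loud assumptions: ToyPage and toy_move_right are hypothetical, keys are ints, pages live in an array whose indexes stand in for block numbers, and no locking is modeled. The real _bt_moveright also handles incomplete splits and deleted pages.

```c
#include <assert.h>

/* Toy page: only the fields the move-right decision needs. */
typedef struct
{
	int			high_key;	/* upper bound of keys on this page */
	int			next;		/* index of the right sibling, -1 if rightmost */
} ToyPage;

/*
 * Starting at page cur, follow right links while the key is greater than
 * the page's high key. A concurrent split between unlock and re-lock moves
 * the upper part of the key range to a new right sibling, so the key may
 * no longer belong on the page we originally found.
 */
static int toy_move_right(const ToyPage *pages, int cur, int key)
{
	assert(cur >= 0);
	while (pages[cur].next != -1 && key > pages[cur].high_key)
		cur = pages[cur].next;
	return cur;
}
```

A key equal to the high key stays on the current page here. Moving further right in that case is the separate optimization performed by _bt_findinsertloc.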

_bt_findinsertloc

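In step5, PostgreSQL calls _bt_findinsertloc to get the index tuple's insert position within the page. When I first studied this step it puzzled me: why not just binary-search the page by calling _bt_binsrch directly? Why such a complicated function? The reason is that PostgreSQL applies an optimization to reduce the likelihood of page splits. The optimization rests on the following fact:

In PostgreSQL, a page's high_key equals its right sibling's min_key; that is, the upper bound of the keys on a page equals the smallest key on its right sibling. The split algorithm establishes this. It follows that if the index tuple being inserted is equal to the current page's high_key, it can go either on the current page or on the right sibling. So if inserting into the current page would cause a split, the right sibling can be tried instead, and if the right sibling's high_key also equals the index tuple, we can keep moving right. _bt_findinsertloc therefore breaks down into roughly four steps:

step1: check whether the current page has enough free space
- Enough: go to step4.
- Not enough: go to step2.
step2: check whether the current page can be vacuumed (it is a leaf and has LP_DEAD items)
- Yes: vacuum the page to reclaim space; if there is now enough room, go to step4, otherwise go to step3.
- No: go to step3.
step3: compare the current page's high_key with the index tuple
- Equal: move to the right sibling (lock the right sibling exclusively first, then release the current page) and return to step1. To avoid walking a long run of duplicates on every insert, the code also gives up moving right with a probability of about 1%.
- Not equal, or the page is the rightmost: go to step4.
step4: binary-search within the page

Only after the first three steps has the page that will receive the index tuple really been determined; binary search then locates the insert position within that page.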
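The page-selection loop can be modeled as follows. This is a simplified sketch, not the real function: ToyLeaf and toy_find_insert_page are hypothetical, vacuuming is reduced to adding a precomputed dead_space to free_space, locking is not modeled, and the random 1% early exit is left out so the behavior is deterministic.

```c
#include <assert.h>

typedef struct
{
	int			free_space;	/* bytes currently free */
	int			dead_space;	/* bytes reclaimable from LP_DEAD items */
	int			high_key;	/* upper bound of keys on this page */
	int			next;		/* index of the right sibling, -1 if rightmost */
} ToyLeaf;

/* Return the index of the page that should receive an item of size itemsz. */
static int toy_find_insert_page(ToyLeaf *pages, int cur, int key, int itemsz)
{
	assert(cur >= 0 && itemsz > 0);

	/* step1: loop only while the current page lacks room */
	while (pages[cur].free_space < itemsz)
	{
		/* step2: reclaim dead items before considering a move */
		if (pages[cur].dead_space > 0)
		{
			pages[cur].free_space += pages[cur].dead_space;
			pages[cur].dead_space = 0;
			if (pages[cur].free_space >= itemsz)
				break;			/* enough room now */
		}

		/* step3: moving right is legal only if key equals the high key */
		if (pages[cur].next == -1 || key != pages[cur].high_key)
			break;				/* stay here; the caller will have to split */

		cur = pages[cur].next;
	}
	return cur;
}
```

When the loop exits on a page that is still full, the caller simply proceeds and the split happens in the insertion step; the function only tries to avoid that outcome where the ordering rules allow it.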

The implementation of _bt_findinsertloc is as follows:

static void
_bt_findinsertloc(Relation rel,
				  Buffer *bufptr,
				  OffsetNumber *offsetptr,
				  int keysz,
				  ScanKey scankey,
				  IndexTuple newtup,
				  BTStack stack,
				  Relation heapRel)
{
	Buffer		buf = *bufptr;
	Page		page = BufferGetPage(buf);
	Size		itemsz;
	BTPageOpaque lpageop;
	bool		movedright,
				vacuumed;
	OffsetNumber newitemoff;
	OffsetNumber firstlegaloff = *offsetptr;

	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);

	itemsz = IndexTupleDSize(*newtup);
	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
								 * need to be consistent */

	if (itemsz > BTMaxItemSize(page))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
			errmsg("index row size %zu exceeds maximum %zu for index \"%s\"",
				   itemsz, BTMaxItemSize(page),
				   RelationGetRelationName(rel)),
		errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
				"Consider a function index of an MD5 hash of the value, "
				"or use full text indexing."),
				 errtableconstraint(heapRel,
									RelationGetRelationName(rel))));

	movedright = false;
	vacuumed = false;
	/* step1: check whether the current page has enough free space */
	while (PageGetFreeSpace(page) < itemsz)
	{
		Buffer		rbuf;
		BlockNumber rblkno;

		/*
		 * before considering moving right, see if we can obtain enough space
		 * by erasing LP_DEAD items
		 */
		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
		{
			/* step2: the page is a leaf with LP_DEAD items, so vacuum it */
			_bt_vacuum_one_page(rel, buf, heapRel);

			/*
			 * remember that we vacuumed this page, because that makes the
			 * hint supplied by the caller invalid
			 */
			vacuumed = true;

			if (PageGetFreeSpace(page) >= itemsz)
				break;			/* OK, now we have enough space */
		}

		/*
		 * nope, so check conditions (b) and (c) enumerated above
		 * step3: compare the current page's high_key with the index tuple
		 */
		if (P_RIGHTMOST(lpageop) ||
			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
			random() <= (MAX_RANDOM_VALUE / 100))
			break;

		rbuf = InvalidBuffer;
		
		/* move right */
		rblkno = lpageop->btpo_next;
		for (;;)
		{
			rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
			page = BufferGetPage(rbuf);
			lpageop = (BTPageOpaque) PageGetSpecialPointer(page);

			/*
			 * If this page was incompletely split, finish the split now. We
			 * do this while holding a lock on the left sibling, which is not
			 * good because finishing the split could be a fairly lengthy
			 * operation.  But this should happen very seldom.
			 */
			if (P_INCOMPLETE_SPLIT(lpageop))
			{
				_bt_finish_split(rel, rbuf, stack);
				rbuf = InvalidBuffer;
				continue;
			}

			if (!P_IGNORE(lpageop))
				break;
			if (P_RIGHTMOST(lpageop))
				elog(ERROR, "fell off the end of index \"%s\"",
					 RelationGetRelationName(rel));

			rblkno = lpageop->btpo_next;
		}
		_bt_relbuf(rel, buf);
		buf = rbuf;
		movedright = true;
		vacuumed = false;
	}

	/* step4: binary-search within the page */
	if (movedright)
		newitemoff = P_FIRSTDATAKEY(lpageop);
	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
		newitemoff = firstlegaloff;
	else
		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);

	*bufptr = buf;
	*offsetptr = newitemoff;
}

_bt_insertonpg

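Once the insert position has been located, the last step is to call _bt_insertonpg to perform the actual insertion. _bt_insertonpg breaks down into roughly three steps:

step1: check whether the page needs to be split
- Yes: enter the split path
- No: go to step2
step2: perform the insertion
step3: write the XLOG record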
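The first two steps can be sketched on a toy page that holds a sorted array of int keys. ToyItems, TOY_CAPACITY and toy_insert_on_page are hypothetical; a fixed item count stands in for the free-space test, and the WAL step is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CAPACITY 4

typedef struct
{
	int			nitems;
	int			items[TOY_CAPACITY];	/* kept in sorted order */
} ToyItems;

/*
 * Insert key at offset off. Returns 1 on success, or 0 when the page is
 * full and the caller would have to take the split path instead.
 */
static int toy_insert_on_page(ToyItems *page, int off, int key)
{
	assert(off >= 0 && off <= page->nitems);

	/* step1: does the item fit? */
	if (page->nitems == TOY_CAPACITY)
		return 0;

	/* step2: shift the tail one slot right and place the new item */
	memmove(&page->items[off + 1], &page->items[off],
			(size_t) (page->nitems - off) * sizeof(int));
	page->items[off] = key;
	page->nitems++;
	return 1;
}
```

In the real function the page modification and the WAL record are written inside one critical section, so a failure between the two cannot leave a changed page without a matching log record.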

static void
_bt_insertonpg(Relation rel,
			   Buffer buf,
			   Buffer cbuf,
			   BTStack stack,
			   IndexTuple itup,
			   OffsetNumber newitemoff,
			   bool split_only_page)
{
	Page		page;
	BTPageOpaque lpageop;
	OffsetNumber firstright = InvalidOffsetNumber;
	Size		itemsz;

	page = BufferGetPage(buf);
	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);

	/* child buffer must be given iff inserting on an internal page */
	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));

	/* The caller should've finished any incomplete splits already. */
	if (P_INCOMPLETE_SPLIT(lpageop))
		elog(ERROR, "cannot insert to incompletely split page %u",
			 BufferGetBlockNumber(buf));

	itemsz = IndexTupleDSize(*itup);
	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
								 * need to be consistent */

	/*
	 * Do we need to split the page to fit the item on it?
	 *
	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
	 * so this comparison is correct even though we appear to be accounting
	 * only for the item and not for its line pointer.
	 * step1: check whether the page needs to be split
	 */
	if (PageGetFreeSpace(page) < itemsz)
	{
		/* split path (omitted) */
	}
	else
	{
		/* step2: perform the insertion */
		Buffer		metabuf = InvalidBuffer;
		Page		metapg = NULL;
		BTMetaPageData *metad = NULL;
		OffsetNumber itup_off;
		BlockNumber itup_blkno;

		itup_off = newitemoff;
		itup_blkno = BufferGetBlockNumber(buf);

		/*
		 * If we are doing this insert because we split a page that was the
		 * only one on its tree level, but was not the root, it may have been
		 * the "fast root".  We need to ensure that the fast root link points
		 * at or above the current page.  We can safely acquire a lock on the
		 * metapage here --- see comments for _bt_newroot().
		 */
		if (split_only_page)
		{
			Assert(!P_ISLEAF(lpageop));

			metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
			metapg = BufferGetPage(metabuf);
			metad = BTPageGetMeta(metapg);

			if (metad->btm_fastlevel >= lpageop->btpo.level)
			{
				/* no update wanted */
				_bt_relbuf(rel, metabuf);
				metabuf = InvalidBuffer;
			}
		}

		/* Do the update.  No ereport(ERROR) until changes are logged */
		START_CRIT_SECTION();

		if (!_bt_pgaddtup(page, itemsz, itup, newitemoff))
			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
				 itup_blkno, RelationGetRelationName(rel));

		MarkBufferDirty(buf);

		if (BufferIsValid(metabuf))
		{
			metad->btm_fastroot = itup_blkno;
			metad->btm_fastlevel = lpageop->btpo.level;
			MarkBufferDirty(metabuf);
		}

		/* clear INCOMPLETE_SPLIT flag on child if inserting a downlink */
		if (BufferIsValid(cbuf))
		{
			Page		cpage = BufferGetPage(cbuf);
			BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);

			Assert(P_INCOMPLETE_SPLIT(cpageop));
			cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
			MarkBufferDirty(cbuf);
		}

		/* XLOG stuff 
		 * step3: write the WAL record
		 */
		if (RelationNeedsWAL(rel))
		{
			xl_btree_insert xlrec;
			xl_btree_metadata xlmeta;
			uint8		xlinfo;
			XLogRecPtr	recptr;
			IndexTupleData trunctuple;

			xlrec.offnum = itup_off;

			XLogBeginInsert();
			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);

			if (P_ISLEAF(lpageop))
				xlinfo = XLOG_BTREE_INSERT_LEAF;
			else
			{
				/*
				 * Register the left child whose INCOMPLETE_SPLIT flag was
				 * cleared.
				 */
				XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);

				xlinfo = XLOG_BTREE_INSERT_UPPER;
			}

			if (BufferIsValid(metabuf))
			{
				xlmeta.root = metad->btm_root;
				xlmeta.level = metad->btm_level;
				xlmeta.fastroot = metad->btm_fastroot;
				xlmeta.fastlevel = metad->btm_fastlevel;

				XLogRegisterBuffer(2, metabuf, REGBUF_WILL_INIT);
				XLogRegisterBufData(2, (char *) &xlmeta, sizeof(xl_btree_metadata));

				xlinfo = XLOG_BTREE_INSERT_META;
			}

			/* Read comments in _bt_pgaddtup */
			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
			if (!P_ISLEAF(lpageop) && newitemoff == P_FIRSTDATAKEY(lpageop))
			{
				trunctuple = *itup;
				trunctuple.t_info = sizeof(IndexTupleData);
				XLogRegisterBufData(0, (char *) &trunctuple,
									sizeof(IndexTupleData));
			}
			else
				XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));

			recptr = XLogInsert(RM_BTREE_ID, xlinfo);

			if (BufferIsValid(metabuf))
			{
				PageSetLSN(metapg, recptr);
			}
			if (BufferIsValid(cbuf))
			{
				PageSetLSN(BufferGetPage(cbuf), recptr);
			}

			PageSetLSN(page, recptr);
		}

		END_CRIT_SECTION();

		/* release buffers */
		if (BufferIsValid(metabuf))
			_bt_relbuf(rel, metabuf);
		if (BufferIsValid(cbuf))
			_bt_relbuf(rel, cbuf);
		_bt_relbuf(rel, buf);
	}
}

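Handling Unique Indexes

Finally, let's look at how unique indexes are handled. In a database, whatever operation is performed on an index, index blocks are protected by only two kinds of locks: shared and exclusive. Strictly speaking, these should not be called locks at all but latches.

("Latch" has a couple of common Chinese translations, none of them satisfying, so the English word is used throughout.)

In a database, the difference between a lock and a latch is that a lock is tied to a transaction and must follow the two-phase locking protocol (2PL), which in practice means a lock is held until the transaction ends. A latch, by contrast, synchronizes access to a critical resource, for example preventing two processes from writing the same index page at once, and can be released as soon as the write to the page is done. The page handling shown earlier follows exactly this pattern. That raises a question. Suppose there is a table test, created as follows: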

create table test(id int primary key);

Now two transactions, T1 and T2, insert into test in the following order:

T1	T2
begin;
insert into test values(1);	begin;
	insert into test values(1);
rollback;
	commit;

Since T1 and T2 both insert 1 into test, the two entries will clearly land in the same block. Call it block1. The actual insertion sequence is then:

T1	T2
latch block1
insert 1
unlatch block1
	latch block1
	insert 1 (error?)
	unlatch block1
rollback;	commit;

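Because the latch on block1 does not have to be held until T1 ends, T2 can also write 1 into block1 while T1 is still in progress. But this is a primary-key index, and block1 already contains a 1. Can T2 write another 1? If it can, the primary key is violated. If it cannot, should it fail immediately? If it does, then once T1 rolls back there is no 1 in the database: T2, having received a primary-key violation, queries the table and finds, to its bewilderment, that no 1 exists. So at this point there is no way to decide whether T2's insertion may proceed; the decision has to wait until T1 ends. This is what happens in practice too: if T1 has executed insert 1 and its transaction is still open, T2's insert 1 blocks until T1 finishes.

So how is this implemented? Could the latch on block1 be replaced with a lock? Clearly not: block1 would then be held exclusively by T1, and every later transaction needing to write block1 would have to wait for T1 to commit. Here is how PostgreSQL implements it: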

TransactionId xwait;
uint32 speculativeToken;

/* step1: binary-search for the index tuple's position */
offset = _bt_binsrch(rel, buf, natts, itup_scankey, false);
/* step2: check whether the index tuple is unique */
xwait  = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
                         checkUnique, &is_unique, &speculativeToken);

if (TransactionIdIsValid(xwait))
{
    /* Have to wait for the other guy ... */
    _bt_relbuf(rel, buf);

    /*
	 * If it's a speculative insertion, wait for it to finish (ie. to
	 * go ahead with the insertion, or kill the tuple).  Otherwise
	 * wait for the transaction to finish as usual.
	 * step3: wait for xwait to finish
	 */
    if (speculativeToken)
        SpeculativeInsertionWait(xwait, speculativeToken);
    else
        XactLockTableWait(xwait, rel, &itup->t_tid, XLTW_InsertIndex);

    /* start over... 
     * step4: once xwait has finished, goto top and redo _bt_search to relocate the insert position.
     */
    _bt_freestack(stack);
    goto top;
}

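The flow above has four main steps:

step1: binary-search for the index tuple's position
step2: check whether the index tuple is unique

Check whether an equal index tuple already exists. If so, follow its tid to the heap tuple, obtain the id and status of the transaction that owns that tuple, and go to step3.
step3: wait for the transaction to end

If the transaction found in step2 has not finished, call XactLockTableWait to wait for it, then go to step4.
step4: relocate

Once the transaction has ended, call _bt_search again to relocate the insert position and redo the uniqueness check.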
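The decision made on each pass through this loop can be summarized as a small function. It is a deliberately simplified sketch: ToyXactStatus, ToyAction and toy_unique_decision are hypothetical, and it looks only at the state of the transaction that inserted the conflicting tuple. The real _bt_check_unique also has to consider cases such as the conflicting row having been deleted.

```c
#include <assert.h>

typedef enum
{
	TOY_XACT_IN_PROGRESS,
	TOY_XACT_COMMITTED,
	TOY_XACT_ABORTED
} ToyXactStatus;

typedef enum
{
	TOY_ACT_INSERT,				/* no conflict: go ahead and insert */
	TOY_ACT_WAIT_AND_RETRY,		/* wait for the other transaction, then restart */
	TOY_ACT_UNIQUE_VIOLATION	/* report a duplicate-key error */
} ToyAction;

/*
 * Decide what an inserter should do, given whether an equal key was found
 * and the state of the transaction that inserted it.
 */
static ToyAction toy_unique_decision(int duplicate_found, ToyXactStatus inserter)
{
	if (!duplicate_found)
		return TOY_ACT_INSERT;

	switch (inserter)
	{
		case TOY_XACT_IN_PROGRESS:
			return TOY_ACT_WAIT_AND_RETRY;	/* outcome not yet known */
		case TOY_XACT_COMMITTED:
			return TOY_ACT_UNIQUE_VIOLATION;	/* the duplicate is real */
		case TOY_XACT_ABORTED:
			return TOY_ACT_INSERT;	/* the duplicate never became visible */
	}
	return TOY_ACT_INSERT;		/* not reached; keeps compilers quiet */
}
```

In the T1/T2 example, T2's first pass sees a duplicate whose inserter is still in progress and waits. After T1 rolls back, T2 restarts from the top, finds the inserter aborted, and is free to insert.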