SQLite使用C开发,存储引擎部分使用的是B树。
对于存储的最小粒度页(page)来说,里面包含数据项和指向其他页的指针。
对于搜索磁盘上某一个key的值,需要读取树中O(log(M))的页。
** The basic idea is that each page of the file contains N database
** entries and N+1 pointers to subpages.
**
** ----------------------------------------------------------------
** | Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N-1) | Ptr(N) |
** ----------------------------------------------------------------
在树的实现上,一个单独的数据文件可以包含一颗或多颗独立的B树。
对于表中任何数据条目来说,对应的key和数据组成了payload,一定数据量的payload可以直接在页上面存放。
如果页上存放payload的空间不足,则会讲多余的数据存放到overflow页中。payload和前一项指针组成了Cell。
每个页有一个小的头,包含了Ptr(N)指针和其他信息,例如key和数据的大小。
页的种类,有btree、freelist、overflow和pointer-map。
每个btree页分成三部分,头部、cell指针数组和cell内容区。
第一个页还包含一个100字节的文件头。在磁盘上存储的格式如下。
**
** |----------------|
** | file header | 100 bytes. Page 1 only.
** |----------------|
** | page header | 8 bytes for leaves. 12 bytes for interior nodes
** |----------------|
** | cell pointer | | 2 bytes per cell. Sorted order.
** | array | | Grows downward
** | | v
** |----------------|
** | unallocated |
** | space |
** |----------------| ^ Grows upwards
** | cell content | | Arbitrary order interspersed with freeblocks.
** | area | | and free space fragments.
** |----------------|
页头部如下
** OFFSET SIZE DESCRIPTION
** 0 1 Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf
** 1 2 byte offset to the first freeblock
** 3 2 number of cells on this page
** 5 2 first byte of the cell content area
** 7 1 number of fragmented free bytes
** 8 4 Right child (the Ptr(N) value). Omitted on leaves.
cell指针数组包含0个户多个2字节的指针,这些指针式cell内容区中对应的偏移量。
cell指针按照顺序存储,这样做的好处是,新的cell可以很容易添加而不会导致页的分裂。
cell的内容如下
** SIZE DESCRIPTION
** 4 Page number of the left child. Omitted if leaf flag is set.
** var Number of bytes of data. Omitted if the zerodata flag is set.
** var Number of bytes of key. Or the key itself if intkey flag is set.
** * Payload
** 4 First page of the overflow chain. Omitted if no overflow
Freelist页有两类,trunk页和leaf页。
MemPage是单独数据库页加载到内存中的对象。对象中的信息来源于磁盘上页的内容。
struct MemPage {
u8 isInit; /* True if previously initialized. MUST BE FIRST! */
u8 intKey; /* True if table b-trees. False for index b-trees */
u8 intKeyLeaf; /* True if the leaf of an intKey table */
Pgno pgno; /* Page number for this page */
/* Only the first 8 bytes (above) are zeroed by pager.c when a new page
** is allocated. All fields that follow must be initialized before use */
u8 leaf; /* True if a leaf page */
u8 hdrOffset; /* 100 for page 1. 0 otherwise */
u8 childPtrSize; /* 0 if leaf==1. 4 if leaf==0 */
u8 max1bytePayload; /* min(maxLocal,127) */
u8 nOverflow; /* Number of overflow cell bodies in aCell[] */
u16 maxLocal; /* Copy of BtShared.maxLocal or BtShared.maxLeaf */
u16 minLocal; /* Copy of BtShared.minLocal or BtShared.minLeaf */
u16 cellOffset; /* Index in aData of first cell pointer */
int nFree; /* Number of free bytes on the page. -1 for unknown */
u16 nCell; /* Number of cells on this page, local and ovfl */
u16 maskPage; /* Mask for page offset */
u16 aiOvfl[4]; /* Insert the i-th overflow cell before the aiOvfl-th
** non-overflow cell */
u8 *apOvfl[4]; /* Pointers to the body of overflow cells */
BtShared *pBt; /* Pointer to BtShared that this page is part of */
u8 *aData; /* Pointer to disk image of the page data */
u8 *aDataEnd; /* One byte past the end of the entire page - not just
** the usable space, the entire page. Used to prevent
** corruption-induced buffer overflow. */
u8 *aCellIdx; /* The cell index area */
u8 *aDataOfst; /* Same as aData for leaves. aData+4 for interior */
DbPage *pDbPage; /* Pager page handle */
u16 (*xCellSize)(MemPage*,u8*); /* cellSizePtr method */
void (*xParseCell)(MemPage*,u8*,CellInfo*); /* btreeParseCell method */
};
Btree的句柄
当打开一个数据库文件时,数据库的连接包含了指向这个对象的指针。
struct Btree {
sqlite3 *db; /* The database connection holding this btree */
BtShared *pBt; /* Sharable content of this btree */
u8 inTrans; /* TRANS_NONE, TRANS_READ or TRANS_WRITE */
u8 sharable; /* True if we can share pBt with another db */
u8 locked; /* True if db currently has pBt locked */
u8 hasIncrblobCur; /* True if there are one or more Incrblob cursors */
int wantToLock; /* Number of nested calls to sqlite3BtreeEnter() */
int nBackup; /* Number of backup operations reading this btree */
u32 iBDataVersion; /* Combines with pBt->pPager->iDataVersion */
Btree *pNext; /* List of other sharable Btrees from the same db */
Btree *pPrev; /* Back pointer of the same list */
#ifdef SQLITE_DEBUG
u64 nSeek; /* Calls to sqlite3BtreeMovetoUnpacked() */
#endif
#ifndef SQLITE_OMIT_SHARED_CACHE
BtLock lock; /* Object used to lock page 1 */
#endif
};
BtShared数据文件实例
一个数据库在同一时间可以被两个或更多的数据库连接打开。当这些连接共享相同的数据库文件时,每个连接有它自己的私有Btree对象,每个Btree
会指向BtShared对象。
struct BtShared {
Pager *pPager; /* The page cache */
sqlite3 *db; /* Database connection currently using this Btree */
BtCursor *pCursor; /* A list of all open cursors */
MemPage *pPage1; /* First page of the database */
u8 openFlags; /* Flags to sqlite3BtreeOpen() */
#ifndef SQLITE_OMIT_AUTOVACUUM
u8 autoVacuum; /* True if auto-vacuum is enabled */
u8 incrVacuum; /* True if incr-vacuum is enabled */
u8 bDoTruncate; /* True to truncate db on commit */
#endif
u8 inTransaction; /* Transaction state */
u8 max1bytePayload; /* Maximum first byte of cell for a 1-byte payload */
u8 nReserveWanted; /* Desired number of extra bytes per page */
u16 btsFlags; /* Boolean parameters. See BTS_* macros below */
u16 maxLocal; /* Maximum local payload in non-LEAFDATA tables */
u16 minLocal; /* Minimum local payload in non-LEAFDATA tables */
u16 maxLeaf; /* Maximum local payload in a LEAFDATA table */
u16 minLeaf; /* Minimum local payload in a LEAFDATA table */
u32 pageSize; /* Total number of bytes on a page */
u32 usableSize; /* Number of usable bytes on each page */
int nTransaction; /* Number of open transactions (read + write) */
u32 nPage; /* Number of pages in the database */
void *pSchema; /* Pointer to space allocated by sqlite3BtreeSchema() */
void (*xFreeSchema)(void*); /* Destructor for BtShared.pSchema */
sqlite3_mutex *mutex; /* Non-recursive mutex required to access this object */
Bitvec *pHasContent; /* Set of pages moved to free-list this transaction */
#ifndef SQLITE_OMIT_SHARED_CACHE
int nRef; /* Number of references to this structure */
BtShared *pNext; /* Next on a list of sharable BtShared structs */
BtLock *pLock; /* List of locks held on this shared-btree struct */
Btree *pWriter; /* Btree with currently open write transaction */
#endif
u8 *pTmpSpace; /* Temp space sufficient to hold a single cell */
int nPreformatSize; /* Size of last cell written by TransferRow() */
};
BtCursor游标
游标时数据库文件中,指向特定b-tree的条目的指针,对应MemPage和MemPage.aCell[]中的索引。
数据库可以同时被多个数据连接共享,但是游标不可以共享。每个游标和BtCursor.pBtree.db关联。
struct BtCursor {
u8 eState; /* One of the CURSOR_XXX constants (see below) */
u8 curFlags; /* zero or more BTCF_* flags defined below */
u8 curPagerFlags; /* Flags to send to sqlite3PagerGet() */
u8 hints; /* As configured by CursorSetHints() */
int skipNext; /* Prev() is noop if negative. Next() is noop if positive.
** Error code if eState==CURSOR_FAULT */
Btree *pBtree; /* The Btree to which this cursor belongs */
Pgno *aOverflow; /* Cache of overflow page locations */
void *pKey; /* Saved key that was cursor last known position */
/* All fields above are zeroed when the cursor is allocated. See
** sqlite3BtreeCursorZero(). Fields that follow must be manually
** initialized. */
#define BTCURSOR_FIRST_UNINIT pBt /* Name of first uninitialized field */
BtShared *pBt; /* The BtShared this cursor points to */
BtCursor *pNext; /* Forms a linked list of all cursors */
CellInfo info; /* A parse of the cell we are pointing at */
i64 nKey; /* Size of pKey, or last integer key */
Pgno pgnoRoot; /* The root page of this tree */
i8 iPage; /* Index of current page in apPage */
u8 curIntKey; /* Value of apPage[0]->intKey */
u16 ix; /* Current index for apPage[iPage] */
u16 aiIdx[BTCURSOR_MAX_DEPTH-1]; /* Current index in apPage[i] */
struct KeyInfo *pKeyInfo; /* Arg passed to comparison function */
MemPage *pPage; /* Current page */
MemPage *apPage[BTCURSOR_MAX_DEPTH-1]; /* Stack of parents of current page */
};
866

被折叠的 条评论
为什么被折叠?



