The In-Memory Structure of Lucene's Inverted Index

License: CC 4.0 BY-SA

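Based on Lucene 6.6, this article describes the in-memory structure of the inverted index in detail: the basic concepts, variable-length integers, slice chains, and the core information an inverted index stores. By walking through how terms and docIds are stored, the data structures involved (postingsArray and the block pools), and the processing flow, it shows how Lucene efficiently stores and looks up the mapping between terms and docIds.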

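Introduction

Lucene's index format is a well-worn topic, and there is material about it online. Most of it is dated, though (largely based on 3.x or 4.x), and hard to match against the current code. This article walks through the topic again on the basis of Lucene 6.6, which also helps me understand and remember it.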

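Basic concepts

These ideas are easy to grasp and are clearly visible when reading the code.

To speed up indexing, Lucene indexes with multiple threads at once. Whatever a thread flushes is a complete index segment.

Within each indexing thread, the data is split by field. Each field has its own independent in-memory structure that records every term occurring in that field.

Each term belongs to exactly one field (terms with the same text in different fields are different terms). A term is an indivisible unit: it is the output of analysis (tokenization) and the thing a query is matched against. Every term has to carry its complete inverted-index information.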

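Background

Variable-length integers (vInt): Lucene stores variable-length integers in a continuation-bit format. In each byte, the low 7 bits hold data and the highest bit says whether another byte follows. For example, 127 is stored as the single byte 0x7f, whereas 128 is stored as the two bytes 0x80, 0x01: the high bit of 0x80 is 1, meaning another byte follows, and 0x01 has its high bit clear, meaning it is the last byte. Together they represent the integer 128.
Slice chains: a slice is the basic unit of allocation inside the bytePool. Every slice starts with a length of 5 bytes. When a slice runs out of room, the last 4 of its 5 bytes are turned into a pointer to the next-level slice, which is allocated in the bytePool. The allocation code in ByteBlockPool spells this out quite clearly.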
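The vInt rule above can be tried out with a small standalone sketch. It does not use Lucene's own classes; the class name VIntSketch and its methods are made up for illustration, and it only handles non-negative values.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: a standalone vInt encoder/decoder, not Lucene's DataOutput.
final class VIntSketch {
    // 7 data bits per byte; the high bit is set when another byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads one vInt from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        for (int v : new int[] {127, 128, 300}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(v)) {
                hex.append(String.format("%02x ", b & 0xFF));
            }
            System.out.println(v + " -> " + hex.toString().trim());
        }
    }
}
```

Running main prints the byte sequence for each sample value, so the 127 and 128 cases described above can be checked directly.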

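What an inverted index has to store

Only the core information is discussed here; the rest follows by analogy.

The term value itself.
The docIds the term occurs in.
The number of times the term occurs in each document (freq, used for scoring).
The positions of the term in the analyzed document (pos, used for phrase queries).
Other data (handled like pos).

The logical structure looks like this: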

|+ field(name,type)
    |+ term
        |+ docId & termFreq 
            |+ [position,offset,payload]
        |+ docId & termFreq 
            |+ [position,offset,payload]
    |+ term
    |+...
|+ field2(name,type)
|+ ...
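As a point of comparison, the same logical tree can be written as plain nested collections. This is a naive sketch, not how Lucene lays the data out in memory; the class LogicalIndexSketch and its members are hypothetical names, and the sample tokens in main are the ones from the hands-on example later in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: field -> term -> postings, with positions per posting.
final class LogicalIndexSketch {
    static final class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        int termFreq() {
            return positions.size();
        }
    }

    final Map<String, Map<String, List<Posting>>> fields = new TreeMap<>();

    // Adds the already-tokenized value of one field of one document.
    // Documents must be added in increasing docId order.
    void add(String field, int docId, String... tokens) {
        Map<String, List<Posting>> terms = fields.computeIfAbsent(field, f -> new TreeMap<>());
        for (int pos = 0; pos < tokens.length; pos++) {
            List<Posting> postings = terms.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
            if (postings.isEmpty() || postings.get(postings.size() - 1).docId != docId) {
                postings.add(new Posting(docId));
            }
            postings.get(postings.size() - 1).positions.add(pos);
        }
    }

    public static void main(String[] args) {
        LogicalIndexSketch index = new LogicalIndexSketch();
        index.add("name", 0, "lucene1");
        index.add("name", 1, "lucene2", "lucene2");
        index.add("name", 2, "lucene2", "lucene2", "test", "lucene2", "lucene2");
        for (Posting p : index.fields.get("name").get("lucene2")) {
            System.out.println("doc=" + p.docId + " freq=" + p.termFreq() + " pos=" + p.positions);
        }
    }
}
```

A structure like this is easy to read but spends an object per posting and per position, which is exactly the overhead the byte-level layout described in the rest of the article avoids.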

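How terms are stored

We skip the analysis step here and assume all tokens are already available.

Storing terms raises two questions:

What structure a term is stored in.
How duplicate terms are handled.

With these two points in mind, Lucene uses the following storage structure (the add() method of BytesRefHash):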

public int add(BytesRef bytes) {
    assert bytesStart != null : "Bytesstart is null - not initialized";
    final int length = bytes.length;
    // Find the term's slot in the hash table; the hash algorithm is not covered here.
    final int hashPos = findHash(bytes);
    // ids maps hashPos to the termId stored in that slot.
    int e = ids[hashPos];
    
    // -1 means this is a new term
    if (e == -1) {
      // In the ByteBlockPool a term is stored as: length + term bytes.
      // The length prefix takes at most 2 bytes in a vInt-like encoding, so reserve 2 + bytes.length bytes.
      final int len2 = 2 + bytes.length;
      if (len2 + pool.byteUpto > BYTE_BLOCK_SIZE) {
        if (len2 > BYTE_BLOCK_SIZE) {
          throw new MaxBytesLengthExceededException("bytes can be at most "
              + (BYTE_BLOCK_SIZE - 2) + " in length; got " + bytes.length);
        }
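        // Growing the pool is not covered here.
        pool.nextBuffer();
      }
      final byte[] buffer = pool.buffer;
      // Current write position in the pool's buffer
      final int bufferUpto = pool.byteUpto;
      // bytesStart records where each termId starts in the pool; count is the total number of terms.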
      if (count >= bytesStart.length) {
        bytesStart = bytesStartArray.grow();
        assert count < bytesStart.length + 1 : "count: " + count + " len: "
            + bytesStart.length;
      }
      // Assign the termId
      e = count++;
    
      // Record where this termId starts in the ByteBlockPool.
      bytesStart[e] = bufferUpto + pool.byteOffset;

      // A length below 128 fits in a one-byte vInt.
      if (length < 128) {
        // 1 byte to store length
        buffer[bufferUpto] = (byte) length;
        pool.byteUpto += length + 1;
        assert length >= 0: "Length must be positive: " + length;
        System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 1,
            length);
      } else {
        // 2 byte to store length
        buffer[bufferUpto] = (byte) (0x80 | (length & 0x7f));
        buffer[bufferUpto + 1] = (byte) ((length >> 7) & 0xff);
        pool.byteUpto += length + 2;
        System.arraycopy(bytes.bytes, bytes.offset, buffer, bufferUpto + 2,
            length);
      }
      assert ids[hashPos] == -1;
      // Record that hashPos maps to termId e.
      ids[hashPos] = e;
      // Rehash; not covered here.
      if (count == hashHalfSize) {
        rehash(2 * hashSize, true);
      }
      return e;
    }
    // Not a new term: return -(termId + 1).
    return -(e + 1);
  }

At this point the term itself has been recorded. Next we need to work out how to associate the term with docIds.
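The contract of add() (a new termId for an unseen term, -(termId + 1) for a known one, and a length-prefixed copy of the bytes in the pool) can be reproduced in a much simplified form. The sketch below is an assumption-laden stand-in: a java.util.HashMap replaces the open-addressing ids[] table, a single growable byte array replaces ByteBlockPool, and the class TermDictSketch with its methods is a made-up name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics the return contract and the byte layout of add().
final class TermDictSketch {
    private final Map<String, Integer> ids = new HashMap<>();   // stand-in for ids[]
    private final List<Integer> bytesStart = new ArrayList<>(); // termId -> start offset
    private byte[] pool = new byte[64];                         // stand-in for ByteBlockPool
    private int upto = 0;

    // Returns the new termId, or -(termId + 1) if the term already exists.
    int add(String term) {
        Integer existing = ids.get(term);
        if (existing != null) {
            return -(existing + 1);
        }
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        int length = bytes.length;
        if (length > 0x7FFF) {
            throw new IllegalArgumentException("term too long for a 2-byte length prefix");
        }
        if (upto + 2 + length > pool.length) {
            pool = Arrays.copyOf(pool, Math.max(pool.length * 2, upto + 2 + length));
        }
        int termId = bytesStart.size();
        bytesStart.add(upto);
        if (length < 128) {
            pool[upto++] = (byte) length;                     // 1-byte length
        } else {
            pool[upto++] = (byte) (0x80 | (length & 0x7f));   // low 7 bits + continuation bit
            pool[upto++] = (byte) ((length >> 7) & 0xff);     // remaining bits
        }
        System.arraycopy(bytes, 0, pool, upto, length);
        upto += length;
        ids.put(term, termId);
        return termId;
    }

    int start(int termId) {
        return bytesStart.get(termId);
    }

    byte byteAt(int index) {
        return pool[index];
    }
}
```

For example, adding "lucene1", then "lucene2", then "lucene1" again returns 0, 1 and -1. In this sketch the second term starts at offset 8 because the pool holds nothing but terms; in Lucene the postings slices share the same pool, so the real offsets differ.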

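How docIds are stored

Throughout indexing, all terms of a field share the same memory pools. When storing docIds, we have to account for the fact that one term can occur in many documents and therefore maps to many docIds.

The whole per-term processing lives in TermsHashPerField. Its add() method shows that storing the term is only the first step of indexing a term.

Data structures

With the term stored, looking up its termId is easy. Getting from a termId to its docIds is a separate mapping, and TermsHashPerField has a few data structures for it that play a central role while terms are indexed.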

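postingsArray

This structure holds three important arrays, each recording different information:

textStarts: records where the term's own bytes start in the ByteBlockPool. TermsHashPerField does not read it directly while indexing; it is handed to BytesRefHash, which fills it in as its bytesStart array.
intStarts: records where the termId's entries are located in the intPool. What exactly the intPool holds is explained below.
byteStarts: records where the termId's [docId,freq] stream starts in the ByteBlockPool. Note that these are [docId,freq] pairs, laid out in the bytePool as [docId,freq][docId,freq][docId,freq]... This start value plus the initial slice length is where the position (posi) data starts.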

BlockPool

TermsHashPerField has three block pools:

IntBlockPool intPool;
ByteBlockPool bytePool;
ByteBlockPool termBytePool;

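The intPool records where each termId's data is located in the bytePool. There are two kinds of entries:

The end position + 1 of the [docId,freq] chain, that is, the next position to write.
If positions and similar data are kept, the end position + 1 of that data.

Why are these two kinds of data written to separate places? A [docId,freq] pair is only known once the whole document has been processed, and only then is it actually written to the bytePool. Position data, by contrast, is known as each term occurrence is processed and can be written to the bytePool immediately. Hence the two separate streams.

bytePool and termBytePool hold the actual inverted-index data. The code makes it easy to see that the two references point to the same object.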

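The flow in detail

I first describe in words what is about to happen; we then follow the code.

New term

Slices are allocated in the bytePool for the term's upcoming [docId,freq] data and for its position data, and each slice's start position is written to the intPool as the end position of the respective stream (nothing has been stored yet, so start and end coincide). The two streams live in separate slices in the bytePool.
FreqProxTermsWriterPerField.newTerm() is called. It sets the term's lastDocID to the current docId, sets freq to 1, and sets the doc code to docId << 1. The left shift leaves the low bit at 0, which means a freq value follows; addTerm() handles the other case. The optimization exists because most terms occur only once in a document, so spending a separate int on freq would be wasteful.
Position data is then written to the bytePool, and the position stream's end position in the intPool is advanced.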

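Existing term

FreqProxTermsWriterPerField.addTerm() is called. It first checks whether the current docId equals the docId last seen for this term. If they are equal, the same document produced the same term again: freq is incremented and the docId is left alone. If they differ, the previous document is complete and everything recorded for it should be flushed into the pool. The steps below follow the second case.
Since the docId differs, the previous document has just finished and everything currently recorded belongs to it. If the term frequency is 1, there is no need to write freq: the low bit of the doc code is set to 1 and only the doc code is written. Otherwise the doc code is written as is (its low bit is 0, as set in newTerm()), followed by freq.
With that written, the previous doc is done and processing of the current document starts. termFreq is set to 1, since this is the term's first occurrence in the current document. The doc code is then set to the delta from the previous docId, shifted left by one so the low bit is 0, just as in newTerm().
Position data is then written, as in newTerm().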
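The deferred write described in these steps can be modelled for a single term in a few lines. The sketch below is a simplification under stated assumptions: a ByteArrayOutputStream stands in for the term's [docId,freq] slice chain, positions are left out, and the class DocFreqStreamSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: the doc-code / freq bookkeeping for one term.
final class DocFreqStreamSketch {
    private final ByteArrayOutputStream stream = new ByteArrayOutputStream();
    private int lastDocId = -1; // -1 means the term has not been seen yet
    private int lastDocCode;
    private int termFreq;

    // Called once per occurrence of the term; docIds must not decrease.
    void occurrence(int docId) {
        if (lastDocId == -1) {
            // New term: nothing is written yet, the data is only remembered.
            lastDocId = docId;
            lastDocCode = docId << 1;
            termFreq = 1;
        } else if (docId != lastDocId) {
            // The previous doc is complete: write it out now.
            if (termFreq == 1) {
                writeVInt(lastDocCode | 1);   // low bit 1: freq is 1 and is omitted
            } else {
                writeVInt(lastDocCode);       // low bit 0: freq follows
                writeVInt(termFreq);
            }
            lastDocCode = (docId - lastDocId) << 1; // delta-encoded
            lastDocId = docId;
            termFreq = 1;
        } else {
            termFreq++; // same doc again
        }
    }

    private void writeVInt(int i) {
        while ((i & ~0x7F) != 0) {
            stream.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        stream.write(i);
    }

    // Bytes written so far; the most recent doc is still pending.
    byte[] flushed() {
        return stream.toByteArray();
    }
}
```

Feeding it the occurrences doc 1, doc 1, doc 2 leaves the two bytes 2, 2 in the stream (doc code 1 << 1, then freq 2), while the data for doc 2 stays pending in the fields.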

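By now we have a rough picture of how a term is tied to its docIds and how all of this is stored. Descriptions only go so far, so let's look at how the code actually does it: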

The add() method in TermsHashPerField:

// Add the term and get back its termId
int termID = bytesHash.add(termAtt.getBytesRef());

// A non-negative termId means this is a new term.
if (termID >= 0) {// New posting

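      // This call appears to have no effect here
      bytesHash.byteStart(termID);
      // numPostingInt is the number of ints reserved for this term in the intPool; switch to a new buffer if the current one is too small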
      if (numPostingInt + intPool.intUpto > IntBlockPool.INT_BLOCK_SIZE) {
        intPool.nextBuffer();
      }
      
      // Likewise for the bytePool: the check reserves room for numPostingInt slices, each with an initial size of FIRST_LEVEL_SIZE.
      if (ByteBlockPool.BYTE_BLOCK_SIZE - bytePool.byteUpto < numPostingInt*ByteBlockPool.FIRST_LEVEL_SIZE) {
        bytePool.nextBuffer();
      }
          
      intUptos = intPool.buffer;
      intUptoStart = intPool.intUpto;
      intPool.intUpto += streamCount;
      
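      // intStarts records where this term's entries begin in the intPool
      postingsArray.intStarts[termID] = intUptoStart + intPool.intOffset;

      // Allocate one slice per stream and record its write position (streamCount is the number of streams; in the 6.6 source numPostingInt is initialized to 2 * streamCount)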
      for(int i=0;i<streamCount;i++) {
        final int upto = bytePool.newSlice(ByteBlockPool.FIRST_LEVEL_SIZE);
        intUptos[intUptoStart+i] = upto + bytePool.byteOffset;
      }
      // Record where the [docId,freq] stream starts. The intPool holds the end position, but nothing has been written yet, so start == end
      postingsArray.byteStarts[termID] = intUptos[intUptoStart];

      // Call newTerm(), implemented by FreqProxTermsWriterPerField
      newTerm(termID);

    } else {
      termID = (-termID)-1;
      int intStart = postingsArray.intStarts[termID];
      // Locate the intPool buffer and the offset within it for this term
      intUptos = intPool.buffers[intStart >> IntBlockPool.INT_BLOCK_SHIFT];
      intUptoStart = intStart & IntBlockPool.INT_BLOCK_MASK;
      // Call addTerm(), implemented by FreqProxTermsWriterPerField
      addTerm(termID);
    }
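The else branch above splits a global intPool offset into a buffer index and an offset inside that buffer with a shift and a mask. A minimal sketch of that arithmetic follows. The class BlockAddressSketch is a made-up name, and the shift value 13 (8192 ints per buffer) is what IntBlockPool uses in 6.x to the best of my knowledge; treat it as an assumption.

```java
// Illustrative only: global offset <-> (buffer index, offset in buffer).
final class BlockAddressSketch {
    static final int INT_BLOCK_SHIFT = 13;                  // assumed value, see lead-in
    static final int INT_BLOCK_SIZE = 1 << INT_BLOCK_SHIFT; // 8192
    static final int INT_BLOCK_MASK = INT_BLOCK_SIZE - 1;

    // Which buffer of the pool the global offset falls into.
    static int bufferIndex(int globalOffset) {
        return globalOffset >> INT_BLOCK_SHIFT;
    }

    // Position inside that buffer.
    static int offsetInBuffer(int globalOffset) {
        return globalOffset & INT_BLOCK_MASK;
    }

    // Inverse mapping, comparable to intUptoStart + intPool.intOffset.
    static int globalOffset(int bufferIndex, int offsetInBuffer) {
        return (bufferIndex << INT_BLOCK_SHIFT) + offsetInBuffer;
    }

    public static void main(String[] args) {
        int global = globalOffset(1, 3);
        System.out.println(global + " -> buffer " + bufferIndex(global)
            + ", offset " + offsetInBuffer(global));
    }
}
```

Because every buffer has a power-of-two size, the split needs no division, and a single int per term is enough to address any entry in the pool.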

The newTerm() method of FreqProxTermsWriterPerField:

void newTerm(final int termID) {
    final FreqProxPostingsArray postings = freqProxPostingsArray;
    
    // The last docId seen for this term is the current docId
    postings.lastDocIDs[termID] = docState.docID;
    // Without freqs, only the docId chain needs to be maintained
    if (!hasFreq) {
      assert postings.termFreqs == null;
      postings.lastDocCodes[termID] = docState.docID;
    } else {
      // Record the doc code, shifted left by one; a low bit of 0 means freq follows
      postings.lastDocCodes[termID] = docState.docID << 1;
      postings.termFreqs[termID] = 1;
      // Write positions and related data
      if (hasProx) {
        writeProx(termID, fieldState.position);
        if (hasOffsets) {
          writeOffsets(termID, fieldState.offset);
        }
      } else {
        assert !hasOffsets;
      }
    }
    fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
    fieldState.uniqueTermCount++;
  }

The addTerm() method of FreqProxTermsWriterPerField:

void addTerm(final int termID) {
    final FreqProxPostingsArray postings = freqProxPostingsArray;

    assert !hasFreq || postings.termFreqs[termID] > 0;
    
    // The no-freq case is simple and is not covered in detail.
    if (!hasFreq) {
      assert postings.termFreqs == null;
      if (docState.docID != postings.lastDocIDs[termID]) {
        // New document; now encode docCode for previous doc:
        assert docState.docID > postings.lastDocIDs[termID];
        writeVInt(0, postings.lastDocCodes[termID]);
        postings.lastDocCodes[termID] = docState.docID - postings.lastDocIDs[termID];
        postings.lastDocIDs[termID] = docState.docID;
        fieldState.uniqueTermCount++;
      }
    } else if (docState.docID != postings.lastDocIDs[termID]) {
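      // The current docId differs from the last one seen, so the previous doc is complete and its data must be written out now.
      // If freq is 1, set the low bit of lastDocCodes to 1, meaning no freq follows; this saves the byte that would hold freq.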
      if (1 == postings.termFreqs[termID]) {
        writeVInt(0, postings.lastDocCodes[termID]|1);
      } else {
        // Otherwise write both the doc code and freq; the low bit of the doc code is 0 here.
        writeVInt(0, postings.lastDocCodes[termID]);
        writeVInt(0, postings.termFreqs[termID]);
      }
      // The previous doc is done; start recording the new doc, much as newTerm() does.
      postings.termFreqs[termID] = 1;
      fieldState.maxTermFrequency = Math.max(1, fieldState.maxTermFrequency);
      // The docId chain is delta-encoded, again to save memory.
      postings.lastDocCodes[termID] = (docState.docID - postings.lastDocIDs[termID]) << 1;
      postings.lastDocIDs[termID] = docState.docID;
      if (hasProx) {
        writeProx(termID, fieldState.position);
        if (hasOffsets) {
          postings.lastOffsets[termID] = 0;
          writeOffsets(termID, fieldState.offset);
        }
      } else {
        assert !hasOffsets;
      }
      fieldState.uniqueTermCount++;
    } else {
      // Reaching this branch means the same term occurred again in the same field of the same doc; only positions and related data need to be written
      fieldState.maxTermFrequency = Math.max(fieldState.maxTermFrequency, ++postings.termFreqs[termID]);
      if (hasProx) {
        writeProx(termID, fieldState.position-postings.lastPositions[termID]);
        if (hasOffsets) {
          writeOffsets(termID, fieldState.offset);
        }
      }
    }
  }

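At this point all the information for a doc has been linked together and written to memory. What remains is flushing it to disk files at the right moment, which this article does not cover. To make the above concrete, let's take a small index and look at the layout of the pools described so far.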
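As a cross-check on the encoding that newTerm() and addTerm() produce, the [docId,freq] stream can be decoded back into pairs. The sketch below assumes the stream has already been gathered into one contiguous byte array (in the pool it is a chain of slices) and that freqs are stored; DocFreqDecoderSketch is a hypothetical name, not a Lucene class.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: decodes doc codes and freqs written as vInts.
final class DocFreqDecoderSketch {
    private final byte[] stream;
    private int cursor = 0;

    DocFreqDecoderSketch(byte[] stream) {
        this.stream = stream;
    }

    private int readVInt() {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = stream[cursor++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Returns {docId, freq} pairs in stream order.
    List<int[]> decodeAll() {
        List<int[]> postings = new ArrayList<>();
        int docId = 0;
        while (cursor < stream.length) {
            int code = readVInt();
            docId += code >>> 1;                          // upper bits: docId delta
            int freq = (code & 1) != 0 ? 1 : readVInt();  // low bit 1: freq is 1
            postings.add(new int[] {docId, freq});
        }
        return postings;
    }
}
```

Decoding the two bytes 2, 2 yields a single pair with docId 1 and freq 2. The first doc code of a term is the docId itself shifted left, which behaves like a delta from 0, so the same loop handles it.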

Hands-on example

We use the following small index and look at what its in-memory structure actually is.

    private Document getDocument(String value) throws Exception {
        Document doc = new Document();
        FieldType fieldType = new FieldType();
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        fieldType.setTokenized(true);
        Field pathField = new Field("name", value, fieldType);
        // Add the field to the document
        doc.add(pathField);
        return doc;
    }

    // Build the index
    public void writeToIndex() throws Exception {
        // Documents to index
        Document document = getDocument("lucene1");
        writer.addDocument(document);
        // breakpoint1
        document = getDocument("lucene2 lucene2");
        writer.addDocument(document);
        // breakpoint2
        document = getDocument("lucene2 lucene2 test lucene2 lucene2");
        writer.addDocument(document);
        // breakpoint3
    }

breakpoint1

Index	postingsArray.byteStarts	intPool	bytePool
0	8	8	7
1	0	14	108
2	0	0	117
3	0	0	99
4	0	0	101
5	0	0	110
6	0	0	101
7	0	0	49
8	0	0	0
9	0	0	0
10	0	0	0
11	0	0	0
12	0	0	16
13	0	0	0
14	0	0	0
15	0	0	0
16	0	0	0
17	0	0	16

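At this breakpoint only one term has appeared: lucene1, with termId 0.

textStarts[0] = 0 means the term's text starts at index 0 of the bytePool. bytePool[0] = 7 is the term's length, and bytePool indices 1~7 hold the term's bytes.

Indices 8~12 are the first slice, used for [docId,freq]. The final byte, 16, is the end-of-slice marker and shows that the slice has not been extended.

Indices 13~17 are the second slice, used for position data. Its final byte, 16, again shows that the slice has not been extended.

Next, intStarts[0] = 0 means the term's entries begin at index 0 of the intPool. Because position data is stored, the term occupies two entries there: intPool[0] and intPool[1] are the end position + 1 of this term's [docId,freq] data and position data in the bytePool, respectively.

byteStarts[0] = 8 means the term's [docId,freq] data starts at byte 8 of the bytePool.

intPool[0] = 8 is the end position + 1 of the [docId,freq] data in the bytePool. There is clearly one doc, so why does intPool[0] say the [docId,freq] data ends at 8, the same as byteStarts[0], as if nothing were stored? Although the first doc has been fully processed, no other doc containing lucene1 has arrived, so this pair has not been written to the bytePool yet. It still sits in the term's lastDocCodes and termFreqs arrays.

intPool[1] = 14 means the position data ends at 14, so one byte has been written, at index 13. The number of entries in this stream can be derived from the [docId,freq] data: every occurrence of the term stores one entry, so there are sum(freq) of them. The stored value, bytePool[13], is 0. It has two parts: the lowest bit is 0, meaning no further data follows, and the remaining 7 bits are 0, meaning the term is the first token (position 0) of the field's analyzed value.

That covers everything at breakpoint1.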
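The chain of lookups used in this walkthrough (termId, then intStarts, then intPool, then bytePool) can be condensed into a few accessor methods. The sketch below works on plain arrays filled with the breakpoint1 values; PointerLookupSketch and its method names are made up for illustration, and it assumes the term has two streams whose first slices were allocated back to back.

```java
// Illustrative only: how the three arrays point into the bytePool.
final class PointerLookupSketch {
    static final int FIRST_LEVEL_SIZE = 5; // initial slice length

    private final int[] intStarts;
    private final int[] byteStarts;
    private final int[] intPool;

    PointerLookupSketch(int[] intStarts, int[] byteStarts, int[] intPool) {
        this.intStarts = intStarts;
        this.byteStarts = byteStarts;
        this.intPool = intPool;
    }

    // Where the term's [docId,freq] stream begins.
    int docFreqStart(int termId) {
        return byteStarts[termId];
    }

    // Next write position (end position + 1) of the [docId,freq] stream.
    int docFreqWritePos(int termId) {
        return intPool[intStarts[termId]];
    }

    // The position slice is allocated directly after the first slice.
    int proxStart(int termId) {
        return byteStarts[termId] + FIRST_LEVEL_SIZE;
    }

    // Next write position of the position stream.
    int proxWritePos(int termId) {
        return intPool[intStarts[termId] + 1];
    }

    public static void main(String[] args) {
        PointerLookupSketch bp1 = new PointerLookupSketch(
            new int[] {0}, new int[] {8}, new int[] {8, 14});
        System.out.println("doc/freq: start=" + bp1.docFreqStart(0)
            + " next=" + bp1.docFreqWritePos(0));
        System.out.println("prox:     start=" + bp1.proxStart(0)
            + " next=" + bp1.proxWritePos(0));
    }
}
```

The difference between a write position and its start equals the number of bytes written only while the stream still fits in its first slice; once a slice has grown, the stream continues elsewhere in the pool.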

breakpoint2

Index	postingsArray.textStarts	postingsArray.intStarts	postingsArray.byteStarts	intPool	bytePool
0	0	0	8	8	7
1	18	2	26	14	108
2	0	0	0	26	117
3	0	0	0	33	99
4	0	0	0	0	101
5	0	0	0	0	110
6	0	0	0	0	101
7	0	0	0	0	49
8	0	0	0	0	0
9	0	0	0	0	0
10	0	0	0	0	0
11	0	0	0	0	0
12	0	0	0	0	16
13	0	0	0	0	0
14	0	0	0	0	0
15	0	0	0	0	0
16	0	0	0	0	0
17	0	0	0	0	16
18	0	0	0	0	7
19	0	0	0	0	108
20	0	0	0	0	117
21	0	0	0	0	99
22	0	0	0	0	101
23	0	0	0	0	110
24	0	0	0	0	101
25	0	0	0	0	50
26	0	0	0	0	0
27	0	0	0	0	0
28	0	0	0	0	0
29	0	0	0	0	0
30	0	0	0	0	16
31	0	0	0	0	0
32	0	0	0	0	2
33	0	0	0	0	0
34	0	0	0	0	0
35	0	0	0	0	16

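At this breakpoint lucene2 has termId 1.

textStarts[1] = 18 means the term's text starts at index 18 of the bytePool. bytePool[18] = 7 is the term's length, and bytePool indices 19~25 hold the term's bytes.

Indices 26~30 are the first slice, used for [docId,freq]. The final byte, 16, shows that the slice has not been extended.

Indices 31~35 are the second slice, used for position data. Its final byte, 16, likewise shows that the slice has not been extended.

Next, intStarts[1] = 2 means the term's entries begin at index 2 of the intPool. Because position data is stored, the term occupies two entries there: intPool[2] and intPool[3] are the end position + 1 of this term's [docId,freq] data and position data in the bytePool, respectively.

byteStarts[1] = 26 means the term's [docId,freq] data starts at byte 26 of the bytePool.

intPool[2] = 26 is the end position + 1 of the [docId,freq] data in the bytePool. It equals byteStarts[1] for the same reason as with lucene1.

intPool[3] = 33 means the position data ends at 33. bytePool[31] = 0 means the first occurrence is at position 0 in the token list, with no other data following. bytePool[32] = 2 is a delta of 1 shifted left by one, so the second occurrence is at position 1, again with no other data following.

That covers everything at breakpoint2.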
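The position bytes read off the table above (0 and 2) can be turned back into positions mechanically. The following sketch decodes a position stream given the freq of each document. It assumes the stream is contiguous and carries no payloads, so every code has its low bit clear; ProxStreamSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: decodes delta-encoded positions, one list per document.
final class ProxStreamSketch {
    static List<List<Integer>> decode(byte[] stream, int[] freqs) {
        List<List<Integer>> result = new ArrayList<>();
        int cursor = 0;
        for (int freq : freqs) {
            List<Integer> positions = new ArrayList<>();
            int position = 0; // deltas restart at 0 for every document
            for (int i = 0; i < freq; i++) {
                // Read one vInt.
                int code = 0;
                int shift = 0;
                byte b;
                do {
                    b = stream[cursor++];
                    code |= (b & 0x7F) << shift;
                    shift += 7;
                } while ((b & 0x80) != 0);
                if ((code & 1) != 0) {
                    throw new UnsupportedOperationException("payloads are not handled in this sketch");
                }
                position += code >>> 1;
                positions.add(position);
            }
            result.add(positions);
        }
        return result;
    }

    public static void main(String[] args) {
        // lucene2 at breakpoint2: bytes 0, 2 with freq 2.
        System.out.println(decode(new byte[] {0, 2}, new int[] {2}));
    }
}
```

The freqs have to come from the [docId,freq] stream, which is why the position stream on its own does not say where one document ends and the next begins.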

breakpoint3

Index	postingsArray.textStarts	postingsArray.intStarts	postingsArray.byteStarts	intPool	bytePool
0	0	0	8	8	7
1	18	2	26	14	108
2	36	4	41	28	117
3	0	0	0	56	99
4	0	0	0	41	101
5	0	0	0	47	110
6	0	0	0	0	101
7	0	0	0	0	49
8	0	0	0	0	0
9	0	0	0	0	0
10	0	0	0	0	0
11	0	0	0	0	0
12	0	0	0	0	16
13	0	0	0	0	0
14	0	0	0	0	0
15	0	0	0	0	0
16	0	0	0	0	0
17	0	0	0	0	16
18	0	0	0	0	7
19	0	0	0	0	108
20	0	0	0	0	117
21	0	0	0	0	99
22	0	0	0	0	101
23	0	0	0	0	110
24	0	0	0	0	101
25	0	0	0	0	50
26	0	0	0	0	2
27	0	0	0	0	2
28	0	0	0	0	0
29	0	0	0	0	0
30	0	0	0	0	16
31	0	0	0	0	0
32	0	0	0	0	0
33	0	0	0	0	0
34	0	0	0	0	0
35	0	0	0	0	51
36	0	0	0	0	4
37	0	0	0	0	116
38	0	0	0	0	101
39	0	0	0	0	115
40	0	0	0	0	116
41	0	0	0	0	0
42	0	0	0	0	0
43	0	0	0	0	0
44	0	0	0	0	0
45	0	0	0	0	16
46	0	0	0	0	4
47	0	0	0	0	0
48	0	0	0	0	0
49	0	0	0	0	0
50	0	0	0	0	16
51	0	0	0	0	2
52	0	0	0	0	0
53	0	0	0	0	2
54	0	0	0	0	4
55	0	0	0	0	2
56	0	0	0	0	0
57	0	0	0	0	0
58	0	0	0	0	0
59	0	0	0	0	0
60	0	0	0	0	0
61	0	0	0	0	0
62	0	0	0	0	0
63	0	0	0	0	0
64	0	0	0	0	17
65	0	0	0	0	0
66	0	0	0	0	0
67	0	0	0	0	0

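At this breakpoint lucene2 is a term that has been seen before, so the data of the previous doc (docId 1) is flushed into the bytePool. test is a new term, so it is stored separately and gets its own slices.

This field yields 5 tokens in total: lucene2, lucene2, test, lucene2, lucene2. Let's go through them one by one and see how each is written into the bytePool.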

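The first lucene2

First, this turns out to be an existing term with termId = 1. addTerm() sees that the last docId was 1 and the current docId is 2, so it first flushes the previous doc's data into the bytePool.
The previous docId is 1 and termFreq = 2, so freq has to follow. The docId shifted left by one is written to the bytePool as is, followed by freq. freq is written as a vInt, but since freq = 2 it fits in one byte, so the value written is 2.
The current write position is looked up in the intPool: intPool[2] = 26. Byte 26 therefore gets 2, which encodes docId 1 with freq following, and byte 27 gets 2, meaning freq = 2. The end position of the [docId,freq] stream is then set to 28.
Finally lastDocID and related state are updated, and the position data for the new occurrence is written.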

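The second lucene2

Nothing special here: an ordinary addTerm() that increments freq and writes position data. The position stream now occupies indices 31~34 with the values 0, 2, 0, 2.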

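test

A new term appears and is handled like the earlier new terms: the term's text is written (bytePool indices 36~40), a slice is allocated for [docId,freq] (41~45), and a slice for position data is allocated and written to (46~50). The value written is 4: the lowest bit is 0, meaning no other data follows, and shifting right by one gives 2, meaning the term is at position 2 in the token stream. The position stream therefore ends at 47, while the [docId,freq] data has not been flushed to the bytePool yet, so its end position is still 41.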

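The third lucene2

addTerm() runs as usual, but when the position data is about to be written, the write position is 35. The value 16 there marks the end of the slice, so nothing can be written at that position and the slice has to grow. A new slice is allocated, starting at 51. The four bytes 32~35 together become the pointer to 51, so 32~34 become 0 and 35 holds 51, and the values originally at 32~34 are copied to 51~53, which therefore hold 2, 0, 2. The new token is at position 3 and the previous lucene2 was at position 1. With delta encoding the value is 2, and shifting left by one leaves the low bit at 0, meaning no other data follows. The value written at index 54 is therefore 4.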
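The slice growth just described can be replayed with a toy allocator. The sketch below follows the same steps (copy the last three data bytes forward, overwrite the last four bytes of the old slice with the new address, mark the new slice's level) on a single fixed buffer. SlicePoolSketch is a made-up class, it has no bounds checks or multiple buffers, and the two level tables are, as far as I can tell, the ones ByteBlockPool uses; treat them as an assumption.

```java
// Illustrative only: a single-buffer imitation of slice allocation and growth.
final class SlicePoolSketch {
    static final int[] NEXT_LEVEL = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9};
    static final int[] LEVEL_SIZE = {5, 14, 20, 30, 40, 40, 80, 80, 120, 200};

    final byte[] buffer = new byte[256];
    int byteUpto = 0;

    // Allocates a first-level slice; the last byte is the end marker 16.
    int newSlice(int size) {
        int start = byteUpto;
        byteUpto += size;
        buffer[byteUpto - 1] = 16;
        return start;
    }

    // Grows the slice whose end marker sits at 'upto'; returns the next write position.
    int allocSlice(int upto) {
        int newLevel = NEXT_LEVEL[buffer[upto] & 15];
        int newSize = LEVEL_SIZE[newLevel];
        int newStart = byteUpto;
        byteUpto += newSize;
        // Copy forward the 3 data bytes that the address will overwrite.
        buffer[newStart] = buffer[upto - 3];
        buffer[newStart + 1] = buffer[upto - 2];
        buffer[newStart + 2] = buffer[upto - 1];
        // Write the forwarding address into the last 4 bytes of the old slice.
        buffer[upto - 3] = (byte) (newStart >>> 24);
        buffer[upto - 2] = (byte) (newStart >>> 16);
        buffer[upto - 1] = (byte) (newStart >>> 8);
        buffer[upto] = (byte) newStart;
        // End marker of the new slice: 16 | level.
        buffer[byteUpto - 1] = (byte) (16 | newLevel);
        return newStart + 3;
    }

    // Writes one byte, growing the slice when the end marker is hit.
    int writeByte(int upto, byte b) {
        if (buffer[upto] != 0) {
            upto = allocSlice(upto);
        }
        buffer[upto] = b;
        return upto + 1;
    }

    public static void main(String[] args) {
        SlicePoolSketch pool = new SlicePoolSketch();
        pool.byteUpto = 31;                 // pretend bytes 0~30 are already in use
        int upto = pool.newSlice(5);        // position slice at 31~35
        for (byte b : new byte[] {0, 2, 0, 2}) {
            upto = pool.writeByte(upto, b);
        }
        pool.byteUpto = 51;                 // pretend "test" and its slices took 36~50
        upto = pool.writeByte(upto, (byte) 4);
        upto = pool.writeByte(upto, (byte) 2);
        System.out.println("next write position: " + upto);
    }
}
```

The scheme depends on the pool being zero-filled: data bytes may be 0, but an end marker never is, so a non-zero byte at the write position is the signal to grow.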

The fourth lucene2

Same as the second lucene2: 2 is written directly at index 55, and the end position of the position stream becomes 56.

That covers everything at breakpoint3.

The End

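With that, we have covered how Lucene's inverted index is built and what it looks like in memory. Complex structures are complex for a reason: Lucene's design here exists to save memory and make use of every byte, so that more fits in memory.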
