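A Detailed Look at How Lucene Creates and Updates an Index

This article takes a close look at how Lucene builds an index: the usage scenario, creating the IndexWriter, building a Document, and updating a Document. It focuses on the roles of key classes such as DocumentsWriterPerThreadPool and ThreadState, and on why Lucene updates a Document by deleting it first and then adding it again.

This document walks through the overall flow by which Lucene writes business data to disk. It does not cover how each Field of a Document is stored (that is covered in a separate wiki page).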

I. The Lucene indexing API

 
    Directory dire = NIOFSDirectory.open(FileSystems.getDefault().getPath(indexDirectory));
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(64); // RAM buffer size in MB that triggers a flush (the default is 16 MB)
    indexWriter = new IndexWriter(dire, iwc);
    Document doc = createDocument(artiste, skuId);
    indexWriter.addDocument(doc);
    indexWriter.commit();
    indexWriter.close();

II. Creating the IndexWriter

NIOFSDirectory.open()

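open() here is the static method inherited from FSDirectory, so it picks the implementation for the platform. On a 64-bit JRE it returns an MMapDirectory, which uses memory-mapped files to read the index (writes still go through ordinary file output).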

 
    // IndexWriterConfig properties and their default values
    this.analyzer = analyzer;
    ramBufferSizeMB = IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB; // by default, exceeding 16 MB triggers a flush to disk
    maxBufferedDocs = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS; // disabled by default, so flushes are triggered by RAM usage
    maxBufferedDeleteTerms = IndexWriterConfig.DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    mergedSegmentWarmer = null;
    delPolicy = new KeepOnlyLastCommitDeletionPolicy(); // deletion policy
    commit = null;
    useCompoundFile = IndexWriterConfig.DEFAULT_USE_COMPOUND_FILE_SYSTEM;
    openMode = OpenMode.CREATE_OR_APPEND; // IndexWriter open mode
    similarity = IndexSearcher.getDefaultSimilarity(); // similarity scoring; mostly matters when setting up the searcher, since scores are computed at query time
    mergeScheduler = new ConcurrentMergeScheduler(); // each segment merge is handed to its own thread
    writeLockTimeout = IndexWriterConfig.WRITE_LOCK_TIMEOUT; // timeout when waiting for the write lock
    indexingChain = DocumentsWriterPerThread.defaultIndexingChain;
    codec = Codec.getDefault();
    if (codec == null) {
      throw new NullPointerException();
    }
    infoStream = InfoStream.getDefault();
    mergePolicy = new TieredMergePolicy(); // merge policy
    flushPolicy = new FlushByRamOrCountsPolicy(); // flush policy
    readerPooling = IndexWriterConfig.DEFAULT_READER_POOLING;
    indexerThreadPool = new DocumentsWriterPerThreadPool(IndexWriterConfig.DEFAULT_MAX_THREAD_STATES); // pool used for concurrent index writing
    perThreadHardLimitMB = IndexWriterConfig.DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB;

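The IndexWriter can be tuned through these properties; IndexWriterConfig exposes a rich set of configuration options.

III. Creating the Document

This step is fairly simple: the business fields are assembled into a Document. A Document consists of multiple Fields.

A Field generally has four attributes:

name: the name of the field
value: the value of the field
Whether the value is stored in the index files: if it is stored, the value can be read back from the Document at search time
Whether the value is indexed: if the field is indexed, it can be used as a search condition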
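To make the difference between "stored" and "indexed" concrete, here is a minimal sketch using only the JDK. It is a toy model, not Lucene's Field class: ToyField, ToyFieldIndex and their methods are hypothetical names introduced for illustration, and matching is by exact value with no analysis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a field with the four attributes listed above (not Lucene's Field).
final class ToyField {
    final String name;
    final String value;
    final boolean stored;   // can the value be read back from a hit?
    final boolean indexed;  // can the field be used as a search condition?

    ToyField(String name, String value, boolean stored, boolean indexed) {
        this.name = name;
        this.value = value;
        this.stored = stored;
        this.indexed = indexed;
    }
}

// Toy index: one map for stored values, one map from "name:value" to doc ids.
final class ToyFieldIndex {
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Adds a document and returns its id (ids are assigned in insertion order).
    int add(List<ToyField> doc) {
        int docId = storedFields.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : doc) {
            if (f.stored) {
                stored.put(f.name, f.value);
            }
            if (f.indexed) {
                postings.computeIfAbsent(f.name + ":" + f.value, k -> new ArrayList<>()).add(docId);
            }
        }
        storedFields.add(stored);
        return docId;
    }

    // Only indexed fields can match; a non-indexed field yields an empty result.
    List<Integer> search(String name, String value) {
        return postings.getOrDefault(name + ":" + value, Collections.emptyList());
    }

    // Only stored fields can be read back; returns null for a non-stored field.
    String storedValue(int docId, String name) {
        return storedFields.get(docId).get(name);
    }
}
```

In this model a field that is indexed but not stored can be searched but comes back as null from storedValue, while a field that is stored but not indexed can be read back but never matches a search.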

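IV. Adding the Document

Adding a Document actually calls updateDocument. Unlike MySQL, where a record can be updated in place, Lucene cannot modify a Document directly, so it has to delete the record (the Document) first and then add the Document again. The Term parameter below is a search condition: Documents that match it are the ones that get updated.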

 
    public void updateDocument(Term term, Iterable<? extends IndexableField> doc) throws IOException {
      ensureOpen();
      try {
        boolean success = false;
        try {
          if (docWriter.updateDocument(doc, analyzer, term)) {
            processEvents(true, false);
          }
          success = true;
        } finally {
          if (!success) {
            if (infoStream.isEnabled("IW")) {
              infoStream.message("IW", "hit exception updating document");
            }
          }
        }
      } catch (AbortingException | OutOfMemoryError tragedy) {
        tragicEvent(tragedy, "updateDocument");
      }
    }

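1 Lucene usage scenarios

Why can't Lucene update a Document in place? Consider the following points:

Lucene is at heart a retrieval-oriented, read-oriented system. To make retrieval convenient, index construction relies on many storage designs optimized for reads. In short, ease of writing and updating is sacrificed for read performance.
This implies the setting Lucene is meant for: it suits (and excels at) workloads with frequent reads and infrequent writes.

That is why adding a Document above ends up as updating a Document. updateDocument consists of two sequential operations:

(1) Search first: if any Document matches the condition, delete it

(2) Then add the new Document to memory; if no Document matched, this is simply a plain add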
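The two steps can be sketched with a small append-only model. This is only an illustration of the delete-then-add idea under simplifying assumptions (exact-match terms, deletions applied immediately, everything in memory); ToyAppendOnlyIndex and its methods are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy append-only index: documents are never modified in place.
final class ToyAppendOnlyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet(); // bit i set = doc i is marked deleted

    // A plain add is an update with no delete condition.
    void addDocument(Map<String, String> doc) {
        updateDocument(null, null, doc);
    }

    // Step 1: mark every live doc matching field=value as deleted.
    // Step 2: append the new doc. Old entries are never rewritten.
    void updateDocument(String field, String value, Map<String, String> doc) {
        if (field != null) {
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i) && value.equals(docs.get(i).get(field))) {
                    deleted.set(i);
                }
            }
        }
        docs.add(new HashMap<>(doc));
    }

    // Returns the live (not deleted) docs whose field equals value.
    List<Map<String, String>> search(String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && value.equals(docs.get(i).get(field))) {
                hits.add(docs.get(i));
            }
        }
        return hits;
    }

    int liveDocs() {
        return docs.size() - deleted.cardinality();
    }

    int totalDocs() {
        return docs.size();
    }
}
```

After one add and one update on the same key, the model holds two physical entries but only one live document, which is the trade-off described above: reads stay simple while updates cost a delete plus an append.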

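2 Some important base classes

Before reading the code of docWriter.updateDocument(doc, analyzer, term), let us look at a few of Lucene's own classes, analyzed below:

2.1 DocumentsWriterPerThreadPool

A pool of DocumentsWriterPerThread instances implemented inside Lucene (not a thread pool in the strict sense). Its main job is to reuse DocumentsWriterPerThread instances (more precisely, to reuse ThreadState instances). It can be loosely thought of as a thread pool.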

2.2 ThreadState

 
    /**
     * {@link ThreadState} references and guards a
     * {@link DocumentsWriterPerThread} instance that is used during indexing to build an in-memory index segment.
     */
    final static class ThreadState extends ReentrantLock {
      DocumentsWriterPerThread dwpt;
      // TODO this should really be part of DocumentsWriterFlushControl
      // write access guarded by DocumentsWriterFlushControl
      volatile boolean flushPending = false;
      // TODO this should really be part of DocumentsWriterFlushControl
      // write access guarded by DocumentsWriterFlushControl
      long bytesUsed = 0;
      // guarded by Reentrant lock
      private boolean isActive = true;

      ThreadState(DocumentsWriterPerThread dpwt) {
        this.dwpt = dpwt;
      }
      // ... (remaining members omitted)
    }

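It is essentially a lock (it extends ReentrantLock, a reentrant mutual-exclusion lock rather than a read-write lock) that works together with DocumentsWriterPerThread to carry out the write of a Document.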

2.3 DocumentsWriterPerThread

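Loosely speaking, this is the writer "thread" for a Document (strictly, it is a per-thread writer object rather than a java.lang.Thread). The pool ensures that DocumentsWriterPerThread instances are reused.

2.4 DocumentsWriterFlushControl

Controls the flush operations that DocumentsWriterPerThread performs during indexing.

2.5 FlushPolicy

The flush policy, which decides when buffered documents should be flushed.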
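As a rough illustration of what a "flush by RAM or by count" decision looks like, here is a minimal sketch built on the two thresholds named in the configuration (ramBufferSizeMB and maxBufferedDocs). It is not the code of FlushByRamOrCountsPolicy: ToyFlushPolicy, its DISABLED constant and the use of >= for the comparison are assumptions made for this example.

```java
// Toy flush decision based on two thresholds; either one can be disabled.
final class ToyFlushPolicy {
    static final int DISABLED = -1; // hypothetical marker meaning "do not check this threshold"

    private final double ramBufferSizeMB;
    private final int maxBufferedDocs;

    ToyFlushPolicy(double ramBufferSizeMB, int maxBufferedDocs) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Returns true when either enabled threshold has been reached.
    boolean shouldFlush(long bytesUsed, int docsInRam) {
        boolean byRam = ramBufferSizeMB != DISABLED
            && bytesUsed >= (long) (ramBufferSizeMB * 1024 * 1024);
        boolean byCount = maxBufferedDocs != DISABLED
            && docsInRam >= maxBufferedDocs;
        return byRam || byCount;
    }
}
```

With the defaults described earlier (a 16 MB RAM buffer and the document count disabled), only the RAM check can trigger a flush in this model, regardless of how many documents are buffered.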

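Once ThreadState is understood, the rest is straightforward; you can even treat it as a lock-guarded writer thread. In fact, a ThreadState holds a reference to a DocumentsWriterPerThread instance. When the pool is initialized it creates 8 ThreadState objects (they are not initialized at that point: no DocumentsWriterPerThread has been created yet, and the concrete writer is initialized lazily). From then on, those 8 ThreadState objects are reused as much as possible.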

 
    DocumentsWriterPerThreadPool(int maxNumThreadStates) { // maxNumThreadStates defaults to 8
      if (maxNumThreadStates < 1) {
        throw new IllegalArgumentException("maxNumThreadStates must be >= 1 but was: " + maxNumThreadStates);
      }
      threadStates = new ThreadState[maxNumThreadStates];
      numThreadStatesActive = 0;
      for (int i = 0; i < threadStates.length; i++) {
        threadStates[i] = new ThreadState(null);
      }
      freeList = new ThreadState[maxNumThreadStates];
    }
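The combination described above (a fixed set of lock-like states created up front, with the writer behind each one created lazily and then reused) can be sketched with JDK classes only. This is a simplified model, not Lucene's pool: ToyWriter, ToyState, ToyStatePool and the LIFO free list are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for the per-thread writer; only counts buffered docs.
final class ToyWriter {
    int docsInRam;
}

// Like the ThreadState shown above, the state is itself a ReentrantLock guarding a writer.
final class ToyState extends ReentrantLock {
    ToyWriter writer; // null until first use (lazy initialization)

    boolean isInitialized() {
        return writer != null;
    }
}

final class ToyStatePool {
    private final Deque<ToyState> free = new ArrayDeque<>();
    private int writersCreated;

    // All states are created eagerly, but none has a writer yet.
    ToyStatePool(int maxStates) {
        if (maxStates < 1) {
            throw new IllegalArgumentException("maxStates must be >= 1 but was: " + maxStates);
        }
        for (int i = 0; i < maxStates; i++) {
            free.push(new ToyState());
        }
    }

    // Blocks until a state is free, then hands it out locked by the caller.
    synchronized ToyState getAndLock() {
        while (free.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a free state", e);
            }
        }
        ToyState state = free.pop();
        state.lock();
        return state;
    }

    // Creates the writer on first use only; later calls reuse it.
    synchronized void ensureInitialized(ToyState state) {
        if (!state.isInitialized()) {
            state.writer = new ToyWriter();
            writersCreated++;
        }
    }

    // Must be called by the thread that obtained the state.
    synchronized void release(ToyState state) {
        state.unlock();
        free.push(state);
        notify();
    }

    synchronized int writersCreated() {
        return writersCreated;
    }
}
```

A caller would obtain a state, call ensureInitialized, write through state.writer, and release the state in a finally block. In this model a single-threaded caller keeps getting the same state back, so only one writer is ever created even though the pool holds several states.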

3 docWriter.updateDocument

Having covered the base classes, return to updateDocument above; the key line is this one.

 
    docWriter.updateDocument(doc, analyzer, term)

    boolean updateDocument(final Iterable<? extends IndexableField> doc, final Analyzer analyzer,
        final Term delTerm) throws IOException, AbortingException {
      boolean hasEvents = preUpdate();
      final ThreadState perThread = flushControl.obtainAndLock();
      final DocumentsWriterPerThread flushingDWPT;
      try {
        if (!perThread.isActive()) {
          ensureOpen();
          assert false : "perThread is not active but we are still open";
        }
        ensureInitialized(perThread); // this is where the concrete DocumentsWriterPerThread is actually initialized
        assert perThread.isInitialized();
        final DocumentsWriterPerThread dwpt = perThread.dwpt;
        final int dwptNumDocs = dwpt.getNumDocsInRAM();
        try {
          dwpt.updateDocument(doc, analyzer, delTerm); // the DocumentsWriterPerThread performs the actual update
        } catch (AbortingException ae) {
          flushControl.doOnAbort(perThread);
          dwpt.abort();
          throw ae;
        } finally {
          // We don't know whether the document actually
          // counted as being indexed, so we must subtract here to
          // accumulate our separate counter:
          numDocsInRAM.addAndGet(dwpt.getNumDocsInRAM() - dwptNumDocs);
        }
        final boolean isUpdate = delTerm != null;
        flushingDWPT = flushControl.doAfterDocument(perThread, isUpdate);
      } finally {
        perThreadPool.release(perThread); // return the ThreadState to the pool so it can be reused
      }
      return postUpdate(flushingDWPT, hasEvents);
    }

4 docWriter.updateDocument step by step

Obtain a ThreadState from the pool

 
    ThreadState obtainAndLock() {
      final ThreadState perThread = perThreadPool.getAndLock(Thread
          .currentThread(), documentsWriter); // take a ThreadState from the pool
      boolean success = false;
      try {
        if (perThread.isInitialized()
            && perThread.dwpt.deleteQueue != documentsWriter.deleteQueue) {
          // There is a flush-all in process and this DWPT is
          // now stale -- enroll it for flush and try for
          // another DWPT:
          addFlushableState(perThread);
        }
        success = true;
        // simply return the ThreadState even in a flush all case since we already hold the lock
        return perThread;
      } finally {
        if (!success) { // make sure we unlock if this fails
          perThreadPool.release(perThread);
        }
      }
    }
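The success flag in obtainAndLock is a general idiom: the lock is handed to the caller on the normal path and released only if something throws in between. A minimal sketch of that idiom with a plain ReentrantLock follows; ToyObtain and the staleCheck parameter are hypothetical names standing in for the pool and the stale-DWPT check.

```java
import java.util.concurrent.locks.ReentrantLock;

final class ToyObtain {
    // Locks, runs a check that may throw, and unlocks only if the check failed.
    // On success the lock stays held and ownership passes to the caller.
    static ReentrantLock obtainAndLock(ReentrantLock lock, Runnable staleCheck) {
        lock.lock();
        boolean success = false;
        try {
            staleCheck.run(); // stands in for the work done between locking and returning
            success = true;
            return lock;
        } finally {
            if (!success) { // make sure we unlock if this fails
                lock.unlock();
            }
        }
    }
}
```

The caller is then responsible for releasing the lock in its own finally block, which matches how updateDocument releases the ThreadState back to the pool after the write.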