Core classes of the indexing process
- IndexWriter
- IndexWriter is the central component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, or updates indexed document information.
- Directory
- Describes where a Lucene index is stored. Directory is an abstract class; FSDirectory is a concrete subclass that stores the index in the file system.
- Analyzer
- Before text is indexed, it passes through an Analyzer, which extracts the tokens to be indexed from the text and discards the rest as noise.
- Document
- A Document object represents a collection of Fields. You can think of a Document as a virtual document, such as a web page, an e-mail message, or a text file.
- For each file we index, we create a Document instance, add the individual Fields to it, and then add the Document to the index; this completes the indexing of that document.
- Field
- Each document in the index contains one or more named fields, each embodied in a Field instance.
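IndexWriter's delete and update operations mentioned above do not appear in the Indexer listing that follows, so here is a minimal sketch of them against an in-memory index. This is not from the original article: it assumes Lucene 3.0 (lucene-core on the classpath), and `WriterOpsDemo` is a made-up class name.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch (not from the original article): IndexWriter can also
// update and delete documents, keyed by a Term such as a unique id field.
public class WriterOpsDemo {
  public static int[] run() throws Exception {
    Directory dir = new RAMDirectory();            // in-memory index, for the demo only
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);                       // add a document

    Document newDoc = new Document();
    newDoc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.updateDocument(new Term("id", "1"), newDoc); // atomic delete-then-add
    writer.commit();
    int afterUpdate = writer.numDocs();            // still one live document

    writer.deleteDocuments(new Term("id", "1"));   // delete by term
    writer.commit();
    int afterDelete = writer.numDocs();            // zero live documents
    writer.close();
    return new int[] { afterUpdate, afterDelete };
  }

  public static void main(String[] args) throws Exception {
    int[] counts = run();
    System.out.println("after update: " + counts[0] + ", after delete: " + counts[1]);
  }
}
```

updateDocument is deleteDocuments(term) followed by addDocument(doc), performed atomically, which is why an id-like NOT_ANALYZED field is the usual key.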
The Indexer class below indexes every file ending in .txt under the "src/lia/meetlucene/data" directory.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.FileReader;
// From chapter 1
/**
* This code was originally written for
* Erik's Lucene intro java.net article
*/
public class Indexer {

  public static void main(String[] args) throws Exception {
    String indexDir = "indexes/MeetLucene";       //1 Create index in this directory
    String dataDir = "src/lia/meetlucene/data";   //2 Index *.txt files from this directory

    long start = System.currentTimeMillis();
    Indexer indexer = new Indexer(indexDir);
    int numIndexed;
    try {
      numIndexed = indexer.index(dataDir, new TextFilesFilter());
    } finally {
      indexer.close();
    }
    long end = System.currentTimeMillis();

    System.out.println("Indexing " + numIndexed + " files took "
        + (end - start) + " milliseconds");
  }

  private IndexWriter writer;

  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir,                 //3 Create Lucene IndexWriter
        new StandardAnalyzer(Version.LUCENE_30),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);
  }

  public void close() throws IOException {
    writer.close();                               //4 Close IndexWriter
  }

  public int index(String dataDir, FileFilter filter) throws Exception {
    File[] files = new File(dataDir).listFiles();
    for (File f : files) {
      if (!f.isDirectory() &&
          !f.isHidden() &&
          f.exists() &&
          f.canRead() &&
          (filter == null || filter.accept(f))) {
        indexFile(f);
      }
    }
    return writer.numDocs();                      //5 Return number of documents indexed
  }

  private static class TextFilesFilter implements FileFilter {
    public boolean accept(File path) {
      return path.getName().toLowerCase()         //6 Index .txt files only, via FileFilter
          .endsWith(".txt");
    }
  }

  protected Document getDocument(File f) throws Exception {
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));    //7 Index file content
    doc.add(new Field("filename", f.getName(),            //8 Index file name
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("fullpath", f.getCanonicalPath(),   //9 Index file full path
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }

  private void indexFile(File f) throws Exception {
    System.out.println("Indexing " + f.getCanonicalPath());
    Document doc = getDocument(f);
    writer.addDocument(doc);                      //10 Add document to Lucene index
  }
}
Output:
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache1.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/cpl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/epl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/freebsd.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl3.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl3.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mit.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla_eula_firefox3.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla_eula_thunderbird2.txt
Indexing 16 files took 702 milliseconds
Core classes of the search process
- IndexSearcher
- IndexSearcher searches indexes created by IndexWriter: it exposes several search methods and is the central link to the index. Think of it as a class that opens an index in read-only mode.
- The simplest search method takes a single Query object and an int topN count as parameters and returns a TopDocs object. A typical use looks like this:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.io.IOException;
public class Fragments {
  public void simpleSearch() throws IOException {
    Directory dir = FSDirectory.open(new File("/tmp/index"));
    IndexSearcher searcher = new IndexSearcher(dir);
    Query q = new TermQuery(new Term("contents", "lucene"));
    TopDocs hits = searcher.search(q, 10);
    searcher.close();
  }
}
- Term
- A Term is the basic unit of search. Like a Field object, it consists of a pair of strings: the name of the field and the term text (the field's value).
- Query
- Lucene ships with a number of concrete Query subclasses:
- TermQuery
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- TermRangeQuery
- NumericRangeQuery
- FilteredQuery
- SpanQuery
- TermQuery
- TermQuery is the most basic query type Lucene provides: it matches documents whose specified field contains a particular value.
- TopDocs
- TopDocs is a simple container of pointers to the top N ranked search results, i.e. the documents that match the query. For each of the top N results, it records the int docID and the float score.
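To make the docID/score pairing concrete, here is a small self-contained sketch (not from the original article; it assumes Lucene 3.0, and `TopDocsDemo` plus the sample texts are made up for illustration) that indexes three tiny documents in memory and prints what a TopDocs actually holds:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch (not from the original article): what a TopDocs contains.
public class TopDocsDemo {
  public static int run() throws Exception {
    Directory dir = new RAMDirectory();           // in-memory index, for the demo only
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (String text : new String[] {
        "lucene in action", "lucene for search", "unrelated text"}) {
      Document doc = new Document();
      doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    TopDocs hits = searcher.search(new TermQuery(new Term("contents", "lucene")), 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      // TopDocs holds only pointers: an int docID plus a float score per hit
      System.out.println("docID=" + sd.doc + " score=" + sd.score);
    }
    int total = hits.totalHits;
    searcher.close();
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("totalHits=" + run());     // two of the three documents match
  }
}
```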
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import java.io.File;
import java.io.IOException;
// From chapter 1
/**
* This code was originally written for
* Erik's Lucene intro java.net article
*/
public class Searcher {

  public static void main(String[] args) throws IllegalArgumentException,
      IOException, ParseException {
    String indexDir = "indexes/MeetLucene";       //1 Index directory to search
    String q = "patent";                          //2 Query string to search for
    search(indexDir, q);
  }

  public static void search(String indexDir, String q)
      throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));   //3 Open index
    IndexSearcher is = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_30, //4 Parse query string
        "contents",
        new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse(q);

    long start = System.currentTimeMillis();
    TopDocs hits = is.search(query, 10);          //5 Search index
    long end = System.currentTimeMillis();

    System.err.println("Found " + hits.totalHits + //6 Write search stats
        " document(s) (in " + (end - start) +
        " milliseconds) that matched query '" +
        q + "':");

    for (ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = is.doc(scoreDoc.doc);        //7 Retrieve matching document
      System.out.println(doc.get("fullpath"));    //8 Display its full path
    }
    is.close();                                   //9 Close IndexSearcher
  }
}
Result: of the 16 indexed files, 8 contain the word "patent":
Found 8 document(s) (in 22 milliseconds) that matched query 'patent':
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/cpl1.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/epl1.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl3.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
This article walked through Lucene's indexing and search processes, covering the roles and usage of core classes such as IndexWriter and IndexSearcher, with sample code showing how to build an index and run a basic search.