Lucene in Action, Chapter 1

Article link: https://blog.youkuaiyun.com/skywalkerVVV/article/details/8438438

This article describes how Lucene works as an indexing engine: the indexing process, the key classes and how to use them, and how to run a search.


Lucene is an indexing engine, similar to Sphinx. It is not a complete search engine; it is only the foundation for building one.

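The core of Lucene is a single JAR of about 1 MB. Many add-on modules make it more powerful, for example spellchecker (spell checking) and highlighter (highlighting the matched text in results).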

Below is an example of indexing and searching with Lucene.

The main classes used for indexing are:

Document

Field

Analyzer

IndexWriter

Directory

The indexing process is shown in the figure below:

Document and Field

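Every document to be indexed is represented by a Document, and each Document is made up of several Fields. By analogy, a Document is one row of a database table and a Field is one column of that row. A Field consists of a name and a value. Several Fields in the same Document may share a name; their values are indexed in the order in which they were added to the Document, and they come back in that order at search time as well.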

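Field.Store.YES and Field.Store.NO control whether the value is stored in the index. Field.Index.ANALYZED and Field.Index.NOT_ANALYZED control whether the value is tokenized by the Analyzer before it is indexed.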

protected Document getDocument(File f) throws Exception
{
    Document doc = new Document();

    // Give each document named fields so that a search can choose which field to query.
    // "contents" is read from the file; it is analyzed but not stored.
    doc.add(new Field("contents", new FileReader(f)));

    // Field.Index.ANALYZED and NOT_ANALYZED say whether the value is analyzed (tokenized).
    doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("fullpath", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

    return doc;
}
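
To make the row-and-column analogy and the two flags more concrete, here is a minimal plain-Java sketch that does not use Lucene at all. ToyDocument, ToyField, and getValues are hypothetical names invented for this illustration; the real Field class has more options than the two booleans modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Document; not the real API.
class ToyDocument {

    // One name/value pair plus the two flags discussed above.
    static final class ToyField {
        final String name;
        final String value;
        final boolean stored;   // plays the role of Field.Store.YES / NO
        final boolean analyzed; // plays the role of Field.Index.ANALYZED / NOT_ANALYZED

        ToyField(String name, String value, boolean stored, boolean analyzed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.analyzed = analyzed;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    void add(ToyField field) {
        fields.add(field);
    }

    // Returns the stored values of every field with this name, in insertion order.
    // Values of fields that were not stored cannot be read back.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (ToyField f : fields) {
            if (f.stored && f.name.equals(name)) {
                values.add(f.value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add(new ToyField("author", "Alice", true, false));
        doc.add(new ToyField("author", "Bob", true, false));
        doc.add(new ToyField("contents", "some long text", false, true));
        System.out.println(doc.getValues("author"));
        System.out.println(doc.getValues("contents"));
    }
}
```

In this sketch the two "author" fields share a name and are returned in the order they were added, while "contents" returns nothing because it was not stored.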

Analyzer (tokenizer)

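An Analyzer extracts tokens from text and discards everything else; the tokens are what actually gets indexed. The Analyzer works on the field values of the Document described above.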
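
As a rough illustration of "extract tokens and drop the rest", here is a toy tokenizer in plain Java. It is only a sketch under simple assumptions (lowercase everything, split on anything that is not a letter or digit, drop a tiny made-up stop-word list). ToyAnalyzer is a hypothetical name, and this is not how StandardAnalyzer is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer; not the algorithm used by Lucene's StandardAnalyzer.
class ToyAnalyzer {

    // A tiny, made-up stop-word list used only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            // Skip the empty leading piece split() can produce, and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The CPU is fast, the disk is slow."));
    }
}
```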

IndexWriter

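IndexWriter uses the analyzer and the Documents to build the index; it can add documents to the index, update them, and remove them. When an IndexWriter is created you must give it an Analyzer and a Directory, that is, which analyzer to tokenize with and where to store the index.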

this.writer = new IndexWriter(dir,new StandardAnalyzer(Version.LUCENE_36),true,IndexWriter.MaxFieldLength.UNLIMITED);
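
The add, update, and remove operations can be pictured with a small in-memory inverted index. The sketch below is plain Java and does not use Lucene; ToyIndexWriter and its methods are hypothetical names, documents are identified by an int id chosen by the caller, and tokenization is a bare whitespace split. Update is modelled as delete followed by add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory inverted index; a sketch, not Lucene's IndexWriter.
class ToyIndexWriter {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // id -> terms of that document, kept so a delete can undo the add
    private final Map<Integer, List<String>> docs = new HashMap<>();

    // Assumes the id is not already in the index; use updateDocument otherwise.
    void addDocument(int id, String text) {
        List<String> terms = analyze(text);
        docs.put(id, terms);
        for (String term : terms) {
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(id);
        }
    }

    void deleteDocument(int id) {
        List<String> terms = docs.remove(id);
        if (terms == null) {
            return; // nothing indexed under this id
        }
        for (String term : terms) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    // Update is simply delete followed by add.
    void updateDocument(int id, String text) {
        deleteDocument(id);
        addDocument(id, text);
    }

    Set<Integer> lookup(String term) {
        Set<Integer> ids = postings.get(term);
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }

    int numDocs() {
        return docs.size();
    }

    // Deliberately naive analysis: lowercase and split on whitespace.
    private static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String s : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!s.isEmpty()) {
                terms.add(s);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument(1, "cpu cache");
        writer.addDocument(2, "cpu disk");
        writer.updateDocument(2, "memory");
        System.out.println(writer.lookup("cpu"));
    }
}
```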

Directory

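Directory represents the location where the index is stored. Directory is an abstract class, and FSDirectory acts as a factory: calling FSDirectory.open returns a Directory instance. For example: Directory dir = FSDirectory.open(new File(indexDir));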

The classes used for searching are:

IndexSearcher

Query

Term

TermQuery

TopDocs

IndexSearcher

IndexSearcher is to searching what IndexWriter is to indexing: it is the core of search. IndexSearcher takes a Directory argument that tells it where the index is.

Directory dir = FSDirectory.open(new File(indexDir));

IndexSearcher is = new IndexSearcher(dir);

Query

Query is an abstract class. It tells IndexSearcher how to search, that is, what to search for.

It has many subclasses, for example:

// "contents" is the name of the field to search; it was created at index time. QueryParser analyzes (tokenizes) the query text.

QueryParser parser = new QueryParser(Version.LUCENE_36,"contents",new StandardAnalyzer(Version.LUCENE_36));

//Query query = parser.parse(q);

// The value of a Term is not analyzed.

Query query = new TermQuery(new Term("fullpath","D:\\work\\myself\\java books\\lucene\\charpter1\\testTxtFile\\name2 - 副本 (10).txt"));

Term

A Term is the basic unit of search. Like a Field, it has a name and a value. For example:

new Term("fullpath","D:\\work\\myself\\java books\\lucene\\charpter1\\testTxtFile\\name2 - 副本 (10).txt")

Here fullpath is the name of the Field to search, and the second argument is the string to search for.

TermQuery

TermQuery is a subclass of Query and the most basic query type. It matches documents in which the given field contains the given term.
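
Because the value of a Term is not analyzed, a term query only matches when its text equals an indexed term exactly. The plain-Java sketch below illustrates this for one analyzed and one un-analyzed field. ToyTermQuery is a hypothetical name, the path and file name are made up, and the tokenization is a simplification rather than StandardAnalyzer's behaviour.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical model of exact term matching for a single document; not Lucene's API.
class ToyTermQuery {

    // field name -> the terms indexed for that field
    private final Map<String, Set<String>> indexed = new HashMap<>();

    // Analyzed field: the value is lowercased and split into several terms.
    void indexAnalyzed(String field, String value) {
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms(field).add(t);
            }
        }
    }

    // Un-analyzed field: the whole value becomes one single term.
    void indexNotAnalyzed(String field, String value) {
        terms(field).add(value);
    }

    // A term query does no analysis: the text must equal an indexed term.
    boolean matches(String field, String text) {
        Set<String> t = indexed.get(field);
        return t != null && t.contains(text);
    }

    private Set<String> terms(String field) {
        return indexed.computeIfAbsent(field, k -> new HashSet<String>());
    }

    public static void main(String[] args) {
        ToyTermQuery doc = new ToyTermQuery();
        String path = "C:\\data\\Notes One.txt"; // made-up example path
        doc.indexNotAnalyzed("fullpath", path);
        doc.indexAnalyzed("filename", "Notes One.txt");
        System.out.println(doc.matches("fullpath", path));
        System.out.println(doc.matches("fullpath", "notes"));
        System.out.println(doc.matches("filename", "notes"));
        System.out.println(doc.matches("filename", "Notes One.txt"));
    }
}
```

In this model the un-analyzed field matches only the complete original string, while the analyzed field matches only its individual lowercase tokens.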

TopDocs

TopDocs is a container that holds pointers to the top N search results.

TopDocs hits = is.search(query, 20);

for (ScoreDoc doc : hits.scoreDocs)
{
    // Fetch the matching document by its id
    Document d = is.doc(doc.doc);
    System.out.println(d.get("fullpath"));
}
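
The idea of "pointers to the top N results" can be sketched in plain Java as a list of (document id, score) pairs, selected with a bounded min-heap. This is one common way to keep only the best N hits and is offered as an illustration, not as a description of Lucene's internals. ToyTopDocs, Hit, and topN are hypothetical names, and the scores are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical top-N selection; a sketch, not Lucene's TopDocs implementation.
class ToyTopDocs {

    // A pointer to a result: the document id and its score, not the document itself.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    static List<Hit> topN(int[] docIds, float[] scores, int n) {
        // Min-heap on score: the weakest hit kept so far sits on top.
        PriorityQueue<Hit> heap =
                new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));
        for (int i = 0; i < docIds.length; i++) {
            heap.add(new Hit(docIds[i], scores[i]));
            if (heap.size() > n) {
                heap.poll(); // evict the current weakest hit
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        // Best score first.
        Collections.sort(result, (a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    public static void main(String[] args) {
        int[] ids = {10, 11, 12, 13};
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        for (Hit h : topN(ids, scores, 2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```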

--------------------------------------------Indexing------------------------------------------

package charpter1;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TxtIndexer
{
    private IndexWriter writer;

    public TxtIndexer(String indexDir) throws IOException
    {
        Directory dir = FSDirectory.open(new File(indexDir));
        // The third argument (true) creates a new index, replacing any existing one.
        this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36), true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws CorruptIndexException, IOException
    {
        this.writer.close();
    }

    // Indexes every readable, non-hidden file in dir that the filter accepts
    // and returns the number of documents in the index.
    public int index(String dir, FileFilter filter) throws Exception
    {
        File[] files = new File(dir).listFiles();
        if (files == null)
        {
            throw new IOException(dir + " is not a readable directory");
        }
        for (File f : files)
        {
            if (!f.isDirectory() && !f.isHidden() && f.canRead() && f.exists() && (filter == null || filter.accept(f)))
            {
                indexFile(f);
            }
        }
        return this.writer.numDocs();
    }

    protected Document getDocument(File f) throws Exception
    {
        Document doc = new Document();
        // Give each document named fields so that a search can choose which field to query.
        doc.add(new Field("contents", new FileReader(f)));
        // Field.Index.ANALYZED and NOT_ANALYZED say whether the value is analyzed (tokenized).
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fullpath", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // Accepts only files whose name ends with .txt (case-insensitive).
    private static class TxtFileFilter implements FileFilter
    {
        @Override
        public boolean accept(File file)
        {
            return file.getName().toLowerCase().endsWith(".txt");
        }
    }

    private void indexFile(File f) throws Exception
    {
        System.out.println("indexing " + f.getCanonicalPath());
        Document doc = this.getDocument(f);
        // Add the document to the index
        this.writer.addDocument(doc);
    }

    public static void main(String[] args)
    {
        String dataDir = "D:\\work\\myself\\java books\\lucene\\charpter1\\testTxtFile"; // directory with the files to index
        String indexDir = "."; // directory where the index is written

        long start = System.currentTimeMillis();
        int numIndexed = 0;
        TxtIndexer indexer = null;
        try
        {
            indexer = new TxtIndexer(indexDir);
            numIndexed = indexer.index(dataDir, new TxtFileFilter());
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            // indexer is still null if the constructor threw
            if (indexer != null)
            {
                try
                {
                    indexer.close();
                }
                catch (IOException e)
                {
                    e.printStackTrace();
                }
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }
}

-------------------------------Searching----------------------------------------

package charpter1;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TxtSearcher
{
    public static void main(String[] args) throws IOException, ParseException
    {
        String indexDir = "."; // directory that holds the index
        String queryString = "cpu";
        search(indexDir, queryString);
    }

    public static void search(String indexDir, String q) throws IOException, ParseException
    {
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir);

        // "contents" is the name of the field to search; it was created at index time.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", new StandardAnalyzer(Version.LUCENE_36));
        // Uncomment to search "contents" for q instead of running the TermQuery below:
        //Query query = parser.parse(q);

        // Exact-match lookup on the un-analyzed "fullpath" field. Note that q is not used
        // by this query, although it still appears in the message printed below.
        Query query = new TermQuery(new Term("fullpath", "D:\\work\\myself\\java books\\lucene\\charpter1\\testTxtFile\\name2 - 副本 (10).txt"));

        long start = System.currentTimeMillis();
        // Search the index for the top 20 hits
        TopDocs hits = is.search(query, 20);
        long end = System.currentTimeMillis();

        System.err.println("Found " + hits.totalHits +
                " document(s) (in " + (end - start) +
                " milliseconds) that matched query '" +
                q + "':");

        for (ScoreDoc doc : hits.scoreDocs)
        {
            // Fetch the matching document by its id
            Document d = is.doc(doc.doc);
            System.out.println(d.get("fullpath"));
        }

        is.close();
    }
}
