Core classes of the indexing process
- IndexWriter
- IndexWriter is the central component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, or updates indexed document information.
- Directory
- Describes where a Lucene index is stored. Directory is an abstract class; FSDirectory is a concrete subclass that stores the index in the file system.
- Analyzer
- Before text is indexed, it passes through an Analyzer, which extracts the tokens to be indexed from the text and discards the rest as noise.
- Document
- A Document object represents a collection of Fields. You can think of a Document as a virtual document, such as a web page, an e-mail message, or a text file.
- For each file we index, we create a Document instance, add the individual Fields to it, and then add the Document to the index; this completes the indexing of that document.
- Field
- Each document in the index contains one or more named fields, each embodied in a Field instance.
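IndexWriter's delete and update operations mentioned above do not appear in the Indexer listing that follows, so here is a minimal sketch of them against an in-memory index. This is not from the original article: it assumes Lucene 3.0 (lucene-core on the classpath), and `WriterOpsDemo` is a made-up class name.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch (not from the original article): IndexWriter can also
// update and delete documents, keyed by a Term such as a unique id field.
public class WriterOpsDemo {
  public static int[] run() throws Exception {
    Directory dir = new RAMDirectory();            // in-memory index, for the demo only
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);                       // add a document

    Document newDoc = new Document();
    newDoc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.updateDocument(new Term("id", "1"), newDoc); // atomic delete-then-add
    writer.commit();
    int afterUpdate = writer.numDocs();            // still one live document

    writer.deleteDocuments(new Term("id", "1"));   // delete by term
    writer.commit();
    int afterDelete = writer.numDocs();            // zero live documents
    writer.close();
    return new int[] { afterUpdate, afterDelete };
  }

  public static void main(String[] args) throws Exception {
    int[] counts = run();
    System.out.println("after update: " + counts[0] + ", after delete: " + counts[1]);
  }
}
```

updateDocument is deleteDocuments(term) followed by addDocument(doc), performed atomically, which is why an id-like NOT_ANALYZED field is the usual key.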
The Indexer class below indexes every file ending in .txt under the "src/lia/meetlucene/data" directory.
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.FileReader;
// From chapter 1
/**
* This code was originally written for
* Erik's Lucene intro java.net article
*/
public class Indexer {

  public static void main(String[] args) throws Exception {
    String indexDir = "indexes/MeetLucene";       //1 Create index in this directory
    String dataDir = "src/lia/meetlucene/data";   //2 Index *.txt files from this directory

    long start = System.currentTimeMillis();
    Indexer indexer = new Indexer(indexDir);
    int numIndexed;
    try {
      numIndexed = indexer.index(dataDir, new TextFilesFilter());
    } finally {
      indexer.close();
    }
    long end = System.currentTimeMillis();

    System.out.println("Indexing " + numIndexed + " files took "
        + (end - start) + " milliseconds");
  }

  private IndexWriter writer;

  public Indexer(String indexDir) throws IOException {
    Directory dir = FSDirectory.open(new File(indexDir));
    writer = new IndexWriter(dir,                 //3 Create Lucene IndexWriter
        new StandardAnalyzer(Version.LUCENE_30),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);
  }

  public void close() throws IOException {
    writer.close();                               //4 Close IndexWriter
  }

  public int index(String dataDir, FileFilter filter) throws Exception {
    File[] files = new File(dataDir).listFiles();
    for (File f : files) {
      if (!f.isDirectory() &&
          !f.isHidden() &&
          f.exists() &&
          f.canRead() &&
          (filter == null || filter.accept(f))) {
        indexFile(f);
      }
    }
    return writer.numDocs();                      //5 Return number of documents indexed
  }

  private static class TextFilesFilter implements FileFilter {
    public boolean accept(File path) {
      return path.getName().toLowerCase()         //6 Index .txt files only, via FileFilter
          .endsWith(".txt");
    }
  }

  protected Document getDocument(File f) throws Exception {
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));    //7 Index file content
    doc.add(new Field("filename", f.getName(),            //8 Index file name
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("fullpath", f.getCanonicalPath(),   //9 Index file full path
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }

  private void indexFile(File f) throws Exception {
    System.out.println("Indexing " + f.getCanonicalPath());
    Document doc = getDocument(f);
    writer.addDocument(doc);                      //10 Add document to Lucene index
  }
}
Output:
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache1.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/cpl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/epl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/freebsd.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl1.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl3.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl3.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mit.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla_eula_firefox3.txt
Indexing /Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla_eula_thunderbird2.txt
Indexing 16 files took 702 milliseconds
Core classes of the search process
- IndexSearcher
- IndexSearcher searches indexes created by IndexWriter: it exposes several search methods and is the central link to the index. Think of it as a class that opens an index in read-only mode.
- The simplest search method takes a single Query object and an int topN count as parameters and returns a TopDocs object. A typical use looks like this:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.io.File;
import java.io.IOException;
public class Fragments {
  public void simpleSearch() throws IOException {
    Directory dir = FSDirectory.open(new File("/tmp/index"));
    IndexSearcher searcher = new IndexSearcher(dir);
    Query q = new TermQuery(new Term("contents", "lucene"));
    TopDocs hits = searcher.search(q, 10);
    searcher.close();
  }
}
- Term
- A Term is the basic unit of search. Like a Field object, it consists of a pair of strings: the name of the field and the term text (the field's value).
- Query
- Lucene ships with a number of concrete Query subclasses:
- TermQuery
- BooleanQuery
- PhraseQuery
- PrefixQuery
- PhrasePrefixQuery
- TermRangeQuery
- NumericRangeQuery
- FilteredQuery
- SpanQuery
- TermQuery
- TermQuery is the most basic query type Lucene provides: it matches documents whose specified field contains a particular value.
- TopDocs
- TopDocs is a simple container of pointers to the top N ranked search results, i.e. the documents that match the query. For each of the top N results, it records the int docID and the float score.
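To make the docID/score pairing concrete, here is a small self-contained sketch (not from the original article; it assumes Lucene 3.0, and `TopDocsDemo` plus the sample texts are made up for illustration) that indexes three tiny documents in memory and prints what a TopDocs actually holds:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch (not from the original article): what a TopDocs contains.
public class TopDocsDemo {
  public static int run() throws Exception {
    Directory dir = new RAMDirectory();           // in-memory index, for the demo only
    IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (String text : new String[] {
        "lucene in action", "lucene for search", "unrelated text"}) {
      Document doc = new Document();
      doc.add(new Field("contents", text, Field.Store.YES, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();

    IndexSearcher searcher = new IndexSearcher(dir);
    TopDocs hits = searcher.search(new TermQuery(new Term("contents", "lucene")), 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      // TopDocs holds only pointers: an int docID plus a float score per hit
      System.out.println("docID=" + sd.doc + " score=" + sd.score);
    }
    int total = hits.totalHits;
    searcher.close();
    return total;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("totalHits=" + run());     // two of the three documents match
  }
}
```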
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import java.io.File;
import java.io.IOException;
// From chapter 1
/**
* This code was originally written for
* Erik's Lucene intro java.net article
*/
public class Searcher {

  public static void main(String[] args) throws IllegalArgumentException,
      IOException, ParseException {
    String indexDir = "indexes/MeetLucene";       //1 Index directory to search
    String q = "patent";                          //2 Query string to search for
    search(indexDir, q);
  }

  public static void search(String indexDir, String q)
      throws IOException, ParseException {
    Directory dir = FSDirectory.open(new File(indexDir));   //3 Open index
    IndexSearcher is = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_30, //4 Parse query string
        "contents",
        new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse(q);

    long start = System.currentTimeMillis();
    TopDocs hits = is.search(query, 10);          //5 Search index
    long end = System.currentTimeMillis();

    System.err.println("Found " + hits.totalHits + //6 Write search stats
        " document(s) (in " + (end - start) +
        " milliseconds) that matched query '" +
        q + "':");

    for (ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = is.doc(scoreDoc.doc);        //7 Retrieve matching document
      System.out.println(doc.get("fullpath"));    //8 Display its full path
    }
    is.close();                                   //9 Close IndexSearcher
  }
}
Result: of the 16 indexed files, 8 contain the word "patent":
Found 8 document(s) (in 22 milliseconds) that matched query 'patent':
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/cpl1.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/mozilla1.1.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/epl1.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl3.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/apache2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/gpl2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lpgl2.0.txt
/Users/zhangjun/Downloads/lia2e/src/lia/meetlucene/data/lgpl2.1.txt
This article walked through Lucene's indexing and search processes, covering the roles and usage of core classes such as IndexWriter and IndexSearcher, with sample code showing how to build an index and run a basic search.