Lucene in Action, Chapter 2 (1) (index in depth)

Latest recommended article published on 2020-12-09 10:49:03


Licensed under CC 4.0 BY-SA


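This article describes Lucene's indexing and document handling in detail: how a Field is processed, the flexible schema of a Document, how the inverted index works, how index segments are managed, and the concrete operations for modifying an index. Examples show how to add, delete, and update documents in Lucene and how to search a specific field.

Lucene indexing

A Document is the atomic unit of indexing and searching in Lucene. Each Document contains one or more Fields, and each Field holds the actual content.

I. A Field can be handled in three ways: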

new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS);

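1. Whether to index it.

This is expressed with Field.Index.ANALYZED, Field.Index.NO, and so on.

ANALYZED
The value is run through the analyzer, and the resulting tokens are indexed.

ANALYZED_NO_NORMS
A variant of ANALYZED. ANALYZED stores norms (which record index-time boost information) in the index, whereas ANALYZED_NO_NORMS does not, which saves memory at search time.

NO
The field is not indexed, so it cannot be searched.

NOT_ANALYZED
The value is not run through the analyzer; it is indexed as a single token. This is commonly used for exact matching, for example file names and IDs.

NOT_ANALYZED_NO_NORMS
Likewise, a variant of NOT_ANALYZED that does not store norms.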
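
The difference between ANALYZED and NOT_ANALYZED can be illustrated without Lucene. The following is a minimal sketch in plain Java. The class IndexModeSketch and its lowercase-and-split "analyzer" are hypothetical stand-ins for illustration only; a real Lucene analyzer such as StandardAnalyzer does more than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class IndexModeSketch {

    // Toy stand-in for an analyzer (an assumption, not Lucene's behavior):
    // lowercase the value and split it on anything that is not a letter or digit.
    public static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<String>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // NOT_ANALYZED: the whole value becomes a single token, unchanged.
    public static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(analyzed("Den Haag"));    // prints [den, haag]
        System.out.println(notAnalyzed("Den Haag")); // prints [Den Haag]
    }
}
```

With the analyzed form, a search for "haag" can match the field; with the single-token form, only the exact value "Den Haag" matches, which is why the latter suits IDs and file names.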

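2. If it is indexed, whether to store term vectors.

Terms are the tokens the analyzer produces. When term vectors are enabled for a field, Lucene stores, per document, the unique terms that field contains (a term that occurs several times is stored only once), and optionally the position and character offsets of each occurrence within the field. This information can be used later, for example to highlight a matched term.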
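
To make positions and offsets concrete, here is a small sketch of what a term vector with positions and offsets records for one field value. TermVectorSketch, Occurrence, and the regex-based tokenizer are hypothetical names and simplifications chosen for this illustration; they do not reproduce Lucene's internal representation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermVectorSketch {

    // One occurrence of a term: its token position and its character offsets in the field.
    public static final class Occurrence {
        public final int position;
        public final int startOffset;
        public final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " [" + startOffset + "," + endOffset + ")";
        }
    }

    // Each unique term is stored once, together with every place it occurs.
    public static Map<String, List<Occurrence>> build(String fieldValue) {
        Map<String, List<Occurrence>> vector = new LinkedHashMap<String, List<Occurrence>>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(fieldValue);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            List<Occurrence> occurrences = vector.get(term);
            if (occurrences == null) {
                occurrences = new ArrayList<Occurrence>();
                vector.put(term, occurrences);
            }
            occurrences.add(new Occurrence(position++, m.start(), m.end()));
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(build("to be or not to be"));
    }
}
```

A highlighter can use the offsets to wrap the matched characters of the original text without re-analyzing it.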

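3. Whether the field's value is stored in the index.

This is expressed with Field.Store.YES or Field.Store.NO.

II. The flexible schema of a Document:

This differs from a database table, where every row (the counterpart of a document here) must have the same columns.

Lucene has no such requirement: documents do not need to share the same fields. For example, doc1 may have field1 and field2 while doc2 has field1, field2, and field3, and both can be added to the same index.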

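III. Lucene's inverted index:

What is an inverted index?

Lucene uses the terms produced by analysis as the lookup keys. Instead of answering "which words does this document contain?", it answers "which documents contain this word?"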
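
The idea can be sketched with a map from each term to the set of document ids that contain it. This is a toy model under stated assumptions: InvertedIndexSketch and its lowercase-and-split tokenization are hypothetical and far simpler than Lucene's on-disk postings format. The sample texts are the "contents" values used later in this article.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // term -> ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<String, SortedSet<Integer>>();

    // Tokenize the text (toy analyzer) and record the document id under every term.
    public void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Answers "which documents contain this word?" with a single map lookup.
    public SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : docs;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Amsterdam has lots of bridges");
        index.add(1, "Venice has lots of canals");
        System.out.println(index.docsContaining("lots"));
        System.out.println(index.docsContaining("canals"));
    }
}
```

The lookup cost depends on the number of distinct terms, not on the number of documents scanned, which is the point of inverting the document-to-words relationship.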

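IV. Lucene's index segments

Every Lucene index consists of one or more index segments.

Each index segment is a standalone index that holds only a subset of all the documents.

A new segment is created whenever the IndexWriter flushes its buffered changes (added documents and pending deletions) to the index directory.

At search time, each segment is searched separately, and the per-segment results are merged into a single result for the user.

Each segment consists of multiple files, for example _0.fdt, _0.fdx, and so on.

The files are named in the form _X.<ext>, where X is the segment name and <ext> identifies which part of the index the file holds.

For example, the figure below shows two segments, segment0 and segment1.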

[Figure: index directory containing the files of segment0 and segment1]

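The segments_<NUM> file is very important. It holds references to all the other segments. Lucene opens this file first and then opens each segment it points to (a segment is itself an index).

Take segments_2 in the figure: the 2 is the "generation", and the IndexWriter increments this number by one on every commit.

When there are too many segments, opening the index consumes a lot of resources (for example file descriptors). The IndexWriter uses a MergeScheduler to merge segments at appropriate times and so reduce the number of segments.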
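
As a rough illustration of per-segment search and of merging, the sketch below models each segment as its own small term-to-documents map. SegmentSketch and both of its methods are hypothetical; the real merge policy, scheduling, and file handling in Lucene are considerably more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentSketch {

    // Search every segment separately, then combine the per-segment hits into one result.
    public static List<Integer> search(List<Map<String, List<Integer>>> segments, String term) {
        List<Integer> hits = new ArrayList<Integer>();
        for (Map<String, List<Integer>> segment : segments) {
            List<Integer> segmentHits = segment.get(term);
            if (segmentHits != null) {
                hits.addAll(segmentHits);
            }
        }
        Collections.sort(hits);
        return hits;
    }

    // Merge several segments into one, so that later searches touch fewer structures.
    public static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> segments) {
        Map<String, List<Integer>> merged = new TreeMap<String, List<Integer>>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> e : segment.entrySet()) {
                List<Integer> docs = merged.get(e.getKey());
                if (docs == null) {
                    docs = new ArrayList<Integer>();
                    merged.put(e.getKey(), docs);
                }
                docs.addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> segment0 = new TreeMap<String, List<Integer>>();
        segment0.put("lots", new ArrayList<Integer>(Collections.singletonList(0)));
        Map<String, List<Integer>> segment1 = new TreeMap<String, List<Integer>>();
        segment1.put("lots", new ArrayList<Integer>(Collections.singletonList(1)));

        List<Map<String, List<Integer>>> segments = new ArrayList<Map<String, List<Integer>>>();
        segments.add(segment0);
        segments.add(segment1);

        System.out.println(search(segments, "lots"));
        System.out.println(merge(segments));
    }
}
```

Searching the single merged segment returns the same hits as searching the two original segments, but with only one structure to open.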

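V. Modifying the index:

First decide how the IndexWriter should open the index. Here are two of its constructors.

IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)

IndexWriter(Directory d, Analyzer a, IndexWriter.MaxFieldLength mfl)

With the first constructor and create = true, every IndexWriter you open creates a new index and overwrites any existing one, so documents added to the index earlier are not kept.

With create = false, or with the second constructor (which has no create parameter: it first checks whether an index already exists, opens it if so, and otherwise creates a new one), you can modify an existing index.

Adding a document to the index: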

Document doc = new Document();
doc.add(new Field("id", ids[i],
        Field.Store.YES,
        Field.Index.NOT_ANALYZED));
doc.add(new Field("country", unindexed[i],
        Field.Store.YES,
        Field.Index.NO));
doc.add(new Field("contents", unstored[i],
        Field.Store.NO,
        Field.Index.ANALYZED));
doc.add(new Field("city", text[i],
        Field.Store.YES,
        Field.Index.ANALYZED));
writer.addDocument(doc);

Deleting documents

The following delete methods are available:

deleteDocuments(Term) deletes all documents containing the provided term.

deleteDocuments(Term[]) deletes all documents containing any of the terms in the provided array.

deleteDocuments(Query) deletes all documents matching the provided query.

deleteDocuments(Query[]) deletes all documents matching any of the queries in the provided array.

deleteAll() deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.

public void deleteDocuments(Term[] terms) throws CorruptIndexException, IOException {
    // deletes all documents containing any of the terms in the provided array.
    this.writer.deleteDocuments(terms);
    System.out.println("docs = " + writer.numDocs());
}

Updating documents

Remember that the basic unit of index operations is the document: an update can only replace a whole document, not a single field.

In fact, an update is a delete followed by an add.
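
The delete-then-add behavior can be sketched with a toy in-memory index. UpdateSketch and its methods are hypothetical and only mirror the shape of the IndexWriter calls named in this article; each document is modeled as a map of field names to values, which also shows that documents need not share the same fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class UpdateSketch {

    // Each document is a free-form map of field name -> value.
    private final List<Map<String, String>> docs = new ArrayList<Map<String, String>>();

    public void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    // Remove every document whose given field has exactly the given value.
    public void deleteDocuments(String field, String value) {
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
            }
        }
    }

    // An update replaces whole documents: delete the matches, then add the new document.
    public void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    public int numDocs() {
        return docs.size();
    }

    // Returns the first document whose field equals the value, or null if none matches.
    public Map<String, String> find(String field, String value) {
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) {
                return doc;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        Map<String, String> doc1 = new HashMap<String, String>();
        doc1.put("id", "1");
        doc1.put("city", "Amsterdam");
        index.addDocument(doc1);

        Map<String, String> replacement = new HashMap<String, String>();
        replacement.put("id", "1");
        replacement.put("city", "Den Haag");
        index.updateDocument("id", "1", replacement);

        System.out.println(index.numDocs() + " " + index.find("id", "1"));
    }
}
```

Because the old document is removed as a whole, any field that should survive the update has to be supplied again in the replacement document.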

The update methods are:

updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.

updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer’s default analyzer.

-------------------------------------------------------------------------

package charpter2;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ChangeIndex {

    private IndexWriter writer;

    protected String[] ids = {"1", "2"};
    protected String[] unindexed = {"Netherlands", "Italy"};
    protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
    protected String[] text = {"Amsterdam", "Venice"};

    Directory dir = null;

    public ChangeIndex(String indexDir) throws IOException {
        dir = FSDirectory.open(new File(indexDir));
        // To modify an existing index, "create" must not be true.
        // IndexWriter(Directory d, Analyzer a, boolean create, IndexWriter.MaxFieldLength mfl)
        // The constructor used here has no "create" flag: it opens the index if it exists, otherwise creates it.
        this.writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void addDocuments() throws CorruptIndexException, IOException {
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new Field("id", ids[i],
                    Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new Field("country", unindexed[i],
                    Field.Store.YES,
                    Field.Index.NO));
            doc.add(new Field("contents", unstored[i],
                    Field.Store.NO,
                    Field.Index.ANALYZED));
            doc.add(new Field("city", text[i],
                    Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        System.out.println("docs = " + writer.numDocs());
    }

    public void deleteDocuments(Term[] terms) throws CorruptIndexException, IOException {
        // deletes all documents containing any of the terms in the provided array.
        this.writer.deleteDocuments(terms);
        System.out.println("docs = " + writer.numDocs());
    }

    public void updateDocuments(Term term) throws CorruptIndexException, IOException {
        Document doc = new Document();
        doc.add(new Field("id", "1",
                Field.Store.YES,
                Field.Index.NOT_ANALYZED));
        doc.add(new Field("country", "Netherlands",
                Field.Store.YES,
                Field.Index.NO));
        doc.add(new Field("contents",
                "Den Haag has a lot of museums",
                Field.Store.YES,
                Field.Index.ANALYZED));
        doc.add(new Field("city", "Den Haag",
                Field.Store.YES,
                Field.Index.ANALYZED));
        // delete every document matching the term, then add the replacement
        writer.updateDocument(term, doc);
        System.out.println("docs = " + writer.numDocs());
    }

    public void search(String fieldName, String q) throws CorruptIndexException, IOException, ParseException {
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_36, fieldName, new StandardAnalyzer(Version.LUCENE_36));
        Query query = parser.parse(q);
        TopDocs hits = searcher.search(query, 20);
        System.out.println("search result:");
        for (ScoreDoc doc : hits.scoreDocs) {
            // fetch the matching document
            Document d = searcher.doc(doc.doc);
            // prints null if the field was indexed with Field.Store.NO
            System.out.println(d.get(fieldName));
        }
        searcher.close();
    }

    public void commit() throws CorruptIndexException, IOException {
        this.writer.commit();
    }

    // release the writer (and its write lock) when done
    public void close() throws CorruptIndexException, IOException {
        this.writer.close();
    }

    public static void main(String[] args) throws IOException, ParseException {
        ChangeIndex ci = new ChangeIndex("charpter2-1");

        // test add index
        ci.addDocuments();
        ci.commit();

        // test delete index
        // the terms to delete
        //Term[] terms = {new Term("id", "1"), new Term("id", "10")};
        //ci.deleteDocuments(terms);

        // test update index
        System.out.println("before update");
        ci.search("contents", "Haag");
        ci.updateDocuments(new Term("id", "1"));
        ci.commit();
        System.out.println("after update");
        ci.search("contents", "Haag");

        ci.close();
    }
}
