Lucene indexing
A Document is the atomic unit of indexing and searching in Lucene. Each Document contains a number of Fields, and each Field holds the actual content.
1. For each Field there are three things we can control:
new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED,TermVector.WITH_POSITIONS);
1) Whether and how the field is indexed, expressed with Field.Index.ANALYZED, Field.Index.NO, and so on. The available options are:
ANALYZED
NOT_ANALYZED
ANALYZED_NO_NORMS
NOT_ANALYZED_NO_NORMS
NO
2) Whether term vectors are stored for the field, expressed with the TermVector argument in the constructor above (e.g. TermVector.WITH_POSITIONS).
3) Whether the field's value itself is stored in the index, expressed with Field.Store.YES or Field.Store.NO.
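To make the options concrete, here is a small sketch (not from the original text; the field names and values are made up) of typical Store/Index choices on a Lucene 3.x Field:
// Hypothetical document illustrating the Store / Index combinations listed above.
Document doc = new Document();
// Exact-match key: NOT_ANALYZED keeps the whole value as one token instead of running it through an analyzer.
doc.add(new Field("isbn", "978-1933988177", Field.Store.YES, Field.Index.NOT_ANALYZED));
// Full-text content: ANALYZED splits the value into searchable terms; Store.NO means the raw text is not kept in the index.
doc.add(new Field("body", "Amsterdam has lots of bridges", Field.Store.NO, Field.Index.ANALYZED));
// Display-only value: stored so it can be returned with search hits, but Index.NO makes it unsearchable.
doc.add(new Field("coverUrl", "http://example.com/cover.jpg", Field.Store.YES, Field.Index.NO));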
2. The Document's flexible schema:
Unlike a database table, where every row (the analogue of a document here) must have the same columns, Lucene does not require every document to have the same fields.
For example, doc1 may have field1 and field2 while doc2 has field1, field2, and field3, and both can still be added to the same index.
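As a minimal sketch of this (hypothetical field names; writer is assumed to be an already-open IndexWriter), two documents with different sets of fields can go into the same index:
Document book = new Document();
book.add(new Field("title", "Lucene in Action", Field.Store.YES, Field.Index.ANALYZED));
book.add(new Field("isbn", "978-1933988177", Field.Store.YES, Field.Index.NOT_ANALYZED));
Document webpage = new Document();
webpage.add(new Field("title", "Apache Lucene", Field.Store.YES, Field.Index.ANALYZED));
webpage.add(new Field("url", "http://lucene.apache.org", Field.Store.YES, Field.Index.NOT_ANALYZED));
// "isbn" exists only on the first document and "url" only on the second; both adds succeed.
writer.addDocument(book);
writer.addDocument(webpage);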
3. Lucene's inverted index:
4. Lucene's index segments:
Every Lucene index consists of one or more index segments.
Each index segment is a standalone index in its own right, containing a subset of all the documents.
A new segment is created whenever the IndexWriter flushes changes to the index.
During search, the query runs against each segment separately, and the per-segment results are then merged into one overall result for the user.
Each segment consists of multiple files, for example _0.fdt, _0.fdx, and so on,
following the naming pattern _X.<ext>.
For example, the figure below shows two segments, segment 0 and segment 1.
The segments_<NUM> file is very important: it holds references to the other segments. Lucene opens this file first and then opens the segments it points to (each segment is itself an index).
In the figure, for instance, the file is segments_2, where the 2 is the "generation"; the IndexWriter increments this number by one on every commit.
When there are too many segments, opening the index consumes a lot of resources (file descriptors, for example), so the IndexWriter uses a MergeScheduler to merge segments at appropriate times and keep their number down.
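As a rough sketch of how this looks on disk (not from the original; the index path is assumed), listing the index directory shows the segment files, and forceMerge (available since Lucene 3.5) can explicitly merge segments down on top of what the MergeScheduler does in the background:
Directory dir = FSDirectory.open(new File("charpter2-1"));
// Prints files such as segments_2, _0.fdt, _0.fdx, _1.fdt, ... depending on the state of the index.
for (String name : dir.listAll()) {
System.out.println(name);
}
// Optionally merge down to a single segment; normally the MergeScheduler handles merging automatically.
// writer.forceMerge(1);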
5. Concrete operations for modifying an index:
First decide how the IndexWriter is constructed; the two relevant constructors are sketched below.
If you use the first constructor with create = true, a brand-new empty index is created every time the IndexWriter is opened, so documents added in earlier runs are wiped out and addDocument appears to have no lasting effect.
Use create = false, or the second constructor (which has no create flag: it first checks whether an index already exists, opens it if so, and creates a new one otherwise), to modify an existing index.
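For reference, these are the two (now-deprecated) Lucene 3.x constructor signatures the text refers to; treat them as a sketch of that older API:
// (1) Explicit create flag: true recreates (wipes) any index in dir, false appends to the existing one.
IndexWriter(Directory dir, Analyzer analyzer, boolean create, IndexWriter.MaxFieldLength mfl)
// (2) No create flag: opens the existing index if one is present, otherwise creates a new one.
IndexWriter(Directory dir, Analyzer analyzer, IndexWriter.MaxFieldLength mfl)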
Adding a document to the index:
Document doc = new Document();
doc.add(new Field("id", ids[i],
Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("country", unindexed[i],
Field.Store.YES,
Field.Index.NO));
doc.add(new Field("contents", unstored[i],
Field.Store.NO,
Field.Index.ANALYZED));
doc.add(new Field("city", text[i],
Field.Store.YES,
Field.Index.ANALYZED));
writer.addDocument(doc);
Deleting documents
The following delete methods are available:
deleteAll() deletes all documents in the index. This is exactly the same as closing the writer and opening a new writer with create=true, without having to close your writer.
public void deleteDocuments(Term [] terms) throws CorruptIndexException, IOException
{
// deletes all documents containing any of the terms in the provided array.
this.writer.deleteDocuments(terms);
System.out.println("docs = " + writer.numDocs());
}
Updating documents
Remember that the basic unit of an index operation is a Document: update replaces a whole document, it cannot update an individual field.
Under the hood, update is a delete followed by an add.
The specific update methods are:
updateDocument(Term, Document) first deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.
updateDocument(Term, Document, Analyzer) does the same but uses the provided analyzer instead of the writer’s default analyzer.
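For example (a usage sketch, assuming doc and writer as in the class below), the analyzer-taking overload would be called like this:
// Delete the document whose "id" term is "1" and add doc, analyzing it with the supplied analyzer.
writer.updateDocument(new Term("id", "1"), doc, new StandardAnalyzer(Version.LUCENE_36));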
-------------------------------------------------------------------------
package charpter2;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
public class ChangeIndex {
private IndexWriter writer;
protected String[] ids = {"1", "2"};
protected String[] unindexed = {"Netherlands", "Italy"};
protected String[] unstored = {"Amsterdam has lots of bridges","Venice has lots of canals"};
protected String[] text = {"Amsterdam", "Venice"};
Directory dir = null;
public ChangeIndex(String indexDir) throws IOException
{
// open (or create) the index directory on disk; without this, dir stays null and the IndexWriter constructor fails
this.dir = FSDirectory.open(new File(indexDir));
this.writer = new IndexWriter(dir,new StandardAnalyzer(Version.LUCENE_36),IndexWriter.MaxFieldLength.UNLIMITED);
}
public void addDocuments() throws CorruptIndexException, IOException
{
for (int i = 0; i < ids.length; i++)
{
Document doc = new Document();
doc.add(new Field("id", ids[i],
Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("country", unindexed[i],
Field.Store.YES,
Field.Index.NO));
doc.add(new Field("contents", unstored[i],
Field.Store.NO,
Field.Index.ANALYZED));
doc.add(new Field("city", text[i],
Field.Store.YES,
Field.Index.ANALYZED));
writer.addDocument(doc);
}
System.out.println("docs = " + writer.numDocs());
}
public void deleteDocuments(Term [] terms) throws CorruptIndexException, IOException
{
// deletes all documents containing any of the terms in the provided array.
this.writer.deleteDocuments(terms);
System.out.println("docs = " + writer.numDocs());
}
public void updateDocuments(Term term) throws CorruptIndexException, IOException
{
Document doc = new Document();
doc.add(new Field("id", "1",
Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands",
Field.Store.YES,
Field.Index.NO));
doc.add(new Field("contents",
"Den Haag has a lot of museums",
Field.Store.YES,
Field.Index.ANALYZED));
doc.add(new Field("city", "Den Haag",
Field.Store.YES,
Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"),
doc);
System.out.println("docs = " + writer.numDocs());
}
public void search(String fieldName,String q) throws CorruptIndexException, IOException, ParseException
{
IndexSearcher searcher = new IndexSearcher(dir);
// use the fieldName argument as the default search field instead of hard-coding "contents"
QueryParser parser = new QueryParser(Version.LUCENE_36,fieldName,new StandardAnalyzer(Version.LUCENE_36));
Query query = parser.parse(q);
TopDocs hits = searcher.search(query, 20);
System.out.println("search result:");
for(ScoreDoc doc : hits.scoreDocs)
{
// fetch the document that was hit
Document d = searcher.doc(doc.doc);
System.out.println(d.get("contents"));
}
searcher.close();
}
public void commit() throws CorruptIndexException, IOException
{
this.writer.commit();
}
public static void main(String[] args) throws IOException, ParseException {
ChangeIndex ci = new ChangeIndex("charpter2-1");
//test add index
ci.addDocuments();
ci.commit();
//test delete index
// the term to delete
//Term [] terms = {new Term("id","1"),new Term("id","10")};
//ci.deleteDocuments(terms);
//test update index
System.out.println("before udpate");
ci.search("contents", "Haag");
ci.updateDocuments(new Term("id","1"));
ci.commit();
System.out.println("after udpate");
ci.search("contents", "Haag");
}
}