This post again follows the video series mentioned in the previous one, but instead of indexing the txt files under a folder, it indexes some content built by hand. The goal is to work through creating, deleting, merging, updating, and restoring an index. The key points: creating, deleting, and updating the index all go through IndexWriter, while querying the index and restoring deleted documents go through IndexReader; also note that updating an index is really a delete followed by a re-add.
Before the source code, first get familiar with the on-disk file layout of an index:
----------------------
_2_1.del
_2.fdt
_2.fdx
_2.fnm
_2.frq
_2.nrm
_2.prx
_2.tii
_2.tis
segments_4
segments.gen
----------------------
1. index
a. *.fnm file
This file records the properties of each field:
name(STR), isIndex, omitNorms, storePayloads, omitTermFreqAndPositions
storeTermVector, storePositionWithTermVector, storeOffsetWithTermVector(BYTE)
The last three properties are mainly used for highlighting (similar to our in-house DI highlighting): at index time they record, for every term cut from each field of each DOC, its related
attributes: occurrence count, token-based POS, character-based OFF
isIndex - whether the field is indexed; only indexed fields can be searched
omitNorms - drop the scoring (norm) factor
.nrm file
If a field has omitNorms=false, the norm (scoring factor) of every DOC for that field is saved. The file layout is:
[N, R, M, -1][bytes size=doc num][bytes size=doc num][bytes size=doc num]
a header marker, followed by one byte array per omitNorms=false field (three in this example), each holding the norm of every document for that field
At index time a boost can be set on the DOC, and another boost on each field of the DOC
norm[doc][field] = score(doc) * score(field) | float -> byte, so precision suffers when this is used carelessly
At search time the final score is multiplied by norm[doc][field]
storePayloads - payloads store per-occurrence attributes of a TERM in a DOC; they can be factors used later for scoring, or
any other attribute, and at search time you can choose whether and how to use this data. The key point is that every occurrence of a TERM in a DOC
can carry its own payload, whose value is set in the analysis (tokenizing) chain at index time. Lucene leaves a lot of flexibility here
omitTermFreqAndPositions - some fields do not need the FREQ/POS information of a TERM within a DOC at search time; being able
to locate the DOC from the TERM is enough. For such fields it is best to set omitTermFreqAndPositions=true. These fields generally will not
support position-based (phrase/proximity) queries between TERMs, nor payloads (because payloads ride along with POS).
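To tie these *.fnm properties to the API, here is a minimal sketch of setting them per field with the Lucene 3.5 Field class (the field names and values are illustrative, not taken from the program below):

package cn.edu.hit.lx;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: how the *.fnm field properties map onto Field settings in Lucene 3.5.
public class FieldOptionsDemo {
    public static Document buildDoc() {
        Document doc = new Document();
        // isIndex=true, omitNorms=true: searchable, but no norm for this field
        // is written into the *.nrm file.
        doc.add(new Field("id", "1", Field.Store.YES,
                Field.Index.NOT_ANALYZED_NO_NORMS));
        // Analyzed field with term vectors (positions + offsets) kept for highlighting.
        doc.add(new Field("content", "welcome to visit the space", Field.Store.NO,
                Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        // A field that is only used to locate documents: drop freq/pos
        // (and with them payloads) for this field.
        Field tag = new Field("tag", "book", Field.Store.NO, Field.Index.NOT_ANALYZED);
        tag.setOmitTermFreqAndPositions(true);
        doc.add(tag);
        return doc;
    }
}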
b. *.tii & *.tis file
*.tis records all TERMs (similar to our termsort file). Its layout is:
[HEAD related][term bytes, docfreq, freq pointer, prox pointer][…]
*.tii is an index over the *.tis term dictionary (similar to our tindex* files). Its layout is:
[HEAD related][term bytes, docfreq, freq pointer, prox pointer, index pointer][…]
The index interval defaults to 128: as *.tis is written at index time, every 128th TERM is also written into *.tii, together with the current
offset in the *.tis file
At search time *.tii is loaded into memory (same idea as ours): when a TERM is not found in *.tii, its
index pointer is used to continue the lookup in *.tis
. docfreq - length of the doclist
. freq pointer - offset in *.frq of the TERM's doclist
. prox pointer - offset in *.prx of the TERM's POS information within each DOC of its doclist
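To see the term dictionary in action, here is a small sketch that walks the terms of one field through the Lucene 3.5 TermEnum API (it assumes the index directory d:/lucene/index02 built by the program below already exists):

package cn.edu.hit.lx;

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

// Sketch: enumerate the terms of the "content" field and print their docFreq,
// which is exactly the information kept in *.tis/*.tii.
public class TermDictDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        TermEnum te = reader.terms(new Term("content", "")); // seek to the first "content" term
        try {
            do {
                Term t = te.term();
                if (t == null || !"content".equals(t.field())) break; // past the field
                System.out.println(t.text() + "  docFreq=" + te.docFreq());
            } while (te.next());
        } finally {
            te.close();
            reader.close();
        }
    }
}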
c. *.frq & *.prx
*.frq holds the DOC stream of each TERM; *.prx holds the POS information of the TERM inside each DOC of its doclist
After reading a TERM's docfreq, freq pointer, and prox pointer from *.tii & *.tis:
1). Ordinary searches that do not need POS information to filter results, e.g. searching "德国欧洲杯" ("Germany European Cup"), which the analyzer splits into the TERMs "德国" and "欧洲杯",
only need the TERMDOC stream: look up docfreq and freq pointer for TERM "德国", seek to freq pointer in *.frq, then
read docfreq (DOCID, FREQ) entries (with omitTermFreqAndPositions=true, FREQ is fixed at 1); handle TERM "欧洲杯" the same way to get
a second doclist, then take the union or intersection of the two doclists (which one depends on the query settings; in practice the two lists are not read out and merged,
because TERMDOC streams support skipping - the two streams are advanced against each other to produce the result)
2). Some searches use POS to improve relevance, e.g. "德国欧洲杯 ~2", again analyzed into the TERMs "德国" and "欧洲杯".
These need the TERMPOSITION stream, which contains the TERMDOC stream plus, for each DOC in the doclist,
the POS information of the TERM inside that DOC.
TERM "德国": docfreq, freq pointer, prox pointer
TERM "欧洲杯": docfreq, freq pointer, prox pointer
freq pointer - *.frq -> doclist for term
prox pointer - *.prx -> positions of per doc
The two doclists are compared; whenever the same DOCID appears in both, the POS streams of that DOCID are scanned to count how many pairs of positions from the two streams differ by at most 2.
The DOCID is accepted only if this count is greater than zero, and the more such pairs, the higher the score; as before, this score also builds on the scores of the individual TERMs
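A minimal sketch of reading both streams through the Lucene 3.5 TermDocs/TermPositions APIs (the term content:"like" matches the sample data indexed by the program below):

package cn.edu.hit.lx;

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;

// Sketch: the TERMDOC stream comes from *.frq, the TERMPOSITION stream adds *.prx.
public class PostingDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        Term term = new Term("content", "like");

        // TERMDOC stream: (docID, freq) pairs for the term.
        TermDocs td = reader.termDocs(term);
        while (td.next()) {
            System.out.println("doc=" + td.doc() + " freq=" + td.freq());
        }
        td.close();

        // TERMPOSITION stream: same docs, plus every position of the term inside each doc.
        TermPositions tp = reader.termPositions(term);
        while (tp.next()) {
            StringBuilder pos = new StringBuilder();
            for (int i = 0; i < tp.freq(); i++) {
                pos.append(tp.nextPosition()).append(' ');
            }
            System.out.println("doc=" + tp.doc() + " positions=" + pos);
        }
        tp.close();
        reader.close();
    }
}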
d. *.del file
*.del records the DOCIDs that are marked as deleted. Apart from the two ints in the header, each bit of the remaining bytes stands for one DOC; a bit set to 1 means
the DOC has been deleted, and such DOCs are filtered out of the TERMDOC stream
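A quick sketch of how the *.del bitmap is exposed through IndexReader in Lucene 3.5:

package cn.edu.hit.lx;

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Sketch: maxDoc() counts all docs including deleted ones; isDeleted(i) reflects
// the bit for docID i in the *.del file.
public class DeletedDocsDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        System.out.println("hasDeletions: " + reader.hasDeletions());
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                System.out.println("doc " + i + " is marked deleted");
            }
        }
        reader.close();
    }
}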
e. *.fdx & *.fdt
*.fdx & *.fdt hold the STORE (stored field) data: *.fdx is the index, *.fdt holds the stored data itself
The layout of *.fdx is:
[HEAD][fdt file pointer(LONG)][…]
In the last step of a search some stored data is read back for the caller: the DOCID is used to look up, in *.fdx, the offset of its stored data in *.fdt, and then
*.fdt is read. The data can be binary or STR; binary data is always compressed, while compression is optional for STR.
At the offset in *.fdt a VINT is read first, giving the number of stored fields, then each field is read in turn: first its storage flags (binary, compressed),
then the data length, and then the data itself.
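And a sketch of reading stored fields back by DOCID, which is the *.fdx -> *.fdt lookup just described (the field names match the program below):

package cn.edu.hit.lx;

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

// Sketch: document(i) uses *.fdx to find the offset of doc i's stored data in *.fdt,
// then reads the stored fields from there.
public class StoredFieldsDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("d:/lucene/index02")));
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;
            Document doc = reader.document(i);
            System.out.println(i + " -> " + doc.get("name") + ", " + doc.get("email"));
        }
        reader.close();
    }
}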
The index file-format notes above are excerpted from http://www.cnblogs.com/mandela/archive/2012/06/11/2545254.html
The demo program uses lucene-core-3.5.0.jar, junit-4.7.jar, and commons-io-2.4.jar.
Program source code:
package cn.edu.hit.lx;
import java.io.File;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;
public class indexutil {
private String[] sds = { "1", "2", "3", "4", "5", "6" };
private String[] emails = { "aa@mtlab.org", "bb@hit.org", "cc@lx.org",
"dd@hit.org", "ee@hit.org", "ff@hit.org" };
private String[] content = { "welcome to visit the space,i like book",
"hello boy,i like book", "my name is cc,i like game",
"i like football", "i like football and i like basketball too",
"i like movie and swimming" };
private int[] attachs = { 1, 2, 3, 4, 5, 5 };
private Date[] dates = {};
private String[] names = { "zhangshan", "lisi", "jahan", "jetts",
"michael", "jack" };
private Directory directory = null;
private Map<String, Float> scores = new HashMap<String, Float>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
public indexutil() {
try {
setdates();
scores.put("mtlab.org", 5.5f);
scores.put("lx.org", 7.5f);
directory = FSDirectory.open(new File("d:/lucene/index02"));
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void setdates() {
dates = new Date[sds.length];
try {
dates[0] = sdf.parse("2012-12-15");
dates[1] = sdf.parse("2012-12-14");
dates[2] = sdf.parse("2012-12-13");
dates[3] = sdf.parse("2012-12-12");
dates[4] = sdf.parse("2012-12-11");
dates[5] = sdf.parse("2012-12-10");
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void query() {
try {
IndexReader reader = IndexReader.open(directory);
// the reader gives quick access to the document counts
System.out.println("numdocs:" + reader.numDocs());
System.out.println("maxdocs:" + reader.maxDoc());
System.out.println("deletedocs:" + reader.numDeletedDocs());
reader.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void delete() {
IndexWriter writer = null;
try {
writer = new IndexWriter(directory, new IndexWriterConfig(
Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
// the argument can be a Query or a Term; a Term means an exact match
// the documents are not physically removed yet, only marked deleted (kept in a kind of recycle bin), so they can still be recovered
writer.deleteDocuments(new Term("id", "1"));
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (LockObtainFailedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
if (writer != null)
writer.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void undelete() {
// recover documents that were only marked as deleted
// to undelete, the readOnly parameter of IndexReader.open must be set to false
try {
IndexReader reader = IndexReader.open(directory, false);
reader.undeleteAll();
reader.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void forcedelete() {
IndexWriter writer = null;
try {
writer = new IndexWriter(directory, new IndexWriterConfig(
Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
// forceMergeDeletes permanently purges documents that were marked deleted
// (it empties the "recycle bin"; after this they can no longer be recovered)
writer.forceMergeDeletes();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (LockObtainFailedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
if (writer != null)
writer.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void merge() {
IndexWriter writer = null;
try {
writer = new IndexWriter(directory, new IndexWriterConfig(
Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
writer.deleteAll();
// forceMerge(1) merges the index down to a single segment (the argument is the maximum number of segments); documents marked deleted are purged during the merge
// discouraged from 3.5 on because the merge is very expensive
writer.forceMerge(1);
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (LockObtainFailedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
if (writer != null)
writer.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void update() {
IndexWriter writer = null;
try {
writer = new IndexWriter(directory, new IndexWriterConfig(
Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
Document doc = new Document();
doc.add(new Field("id", sds[0], Field.Store.YES,
Field.Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("email", emails[0], Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content[0], Field.Store.NO,
Field.Index.ANALYZED));
doc.add(new Field("name", names[0], Field.Store.YES,
Field.Index.NOT_ANALYZED));
/*
* Lucene has no real update operation; an update is a combination of two operations: delete first, then add
*/
writer.updateDocument(new Term("id", "1"), doc);
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (LockObtainFailedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
if (writer != null)
writer.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void index() {
IndexWriter writer = null;
try {
writer = new IndexWriter(directory, new IndexWriterConfig(
Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
Document doc = null;
for (int i = 0; i < sds.length; i++) {
doc = new Document();
doc.add(new Field("id", sds[i], Field.Store.YES,
Field.Index.NOT_ANALYZED_NO_NORMS));
doc.add(new Field("email", emails[i], Field.Store.YES,
Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content[i], Field.Store.NO,
Field.Index.ANALYZED));
doc.add(new Field("name", names[i], Field.Store.YES,
Field.Index.NOT_ANALYZED));
// store numeric values (attach and date)
doc.add(new NumericField("attach", Field.Store.YES, true)
.setIntValue(attachs[i]));
doc.add(new NumericField("date", Field.Store.YES, true)
.setLongValue(dates[i].getTime()));
String at = emails[i].substring(emails[i].lastIndexOf("@") + 1);
System.out.println(at);
if (scores.containsKey(at)) {
doc.setBoost(scores.get(at));
} else {
doc.setBoost(0.5f);
}
writer.addDocument(doc);
}
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (LockObtainFailedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
if (writer != null)
writer.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public void search() {
try {
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
TermQuery query = new TermQuery(new Term("content", "like"));
TopDocs tds = searcher.search(query, 10);
for (ScoreDoc sd : tds.scoreDocs) {
Document document = searcher.doc(sd.doc);
Date dt=new Date(Long.parseLong(document.get("date")));
String dString=sdf.format(dt);
// getBoost() here is called on the Document rebuilt from stored fields (a different object), so it does not return the boost set at index time
System.out.println(document.getBoost() + document.get("name")
+ "{" + document.get("email") + "|-->"
+ document.get("id") + "|-->" + document.get("attach")
+ "|-->" + "}" +dString+"!-->"+ sd.score);
}
searcher.close();
reader.close();
} catch (CorruptIndexException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
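The delete() method above passes a Term; IndexWriter.deleteDocuments also accepts a Query. A minimal sketch of a deleteByQuery variant that could be added to the indexutil class (the method name and the example query are illustrative):

// Sketch only: relies on the directory field and the imports of the indexutil class above.
public void deleteByQuery() throws IOException {
    IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(
            Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
    try {
        // marks as deleted every document whose analyzed "content" field contains "football"
        writer.deleteDocuments(new TermQuery(new Term("content", "football")));
    } finally {
        writer.close();
    }
}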
Test code:
package cn.edu.hit.lx;
import org.junit.Test;
public class testIndex {
@Test
public void testIndex(){
indexutil ix=new indexutil();
ix.index();
}
@Test
public void testquery(){
indexutil ix=new indexutil();
ix.query();
}
@Test
public void testdelete(){
indexutil ix=new indexutil();
ix.delete();
}
@Test
public void testundel(){
indexutil ix=new indexutil();
ix.undelete();
}
@Test
public void testforcedel(){
indexutil ix=new indexutil();
ix.forcedelete();
}
@Test
public void testmerge(){
indexutil ix=new indexutil();
ix.merge();
}
@Test
public void testupdate(){
indexutil ix=new indexutil();
ix.update();
}
@Test
public void testsearch(){
indexutil ix=new indexutil();
ix.search();
}
}