This post covers adding, deleting, updating, and searching the index, and returning highlighted results for matched keywords. Highlighting simply wraps each matched keyword in a pair of tags, `<font color="red">` and `</font>`, so that the browser renders it in red.

The project layout is shown above. Create a com.lucene package containing an IndexDao class with the add/delete/update/search methods; an IndexDaoTest class that tests those methods; and a QueryResult class that holds a search result, with just two fields: the total hit count and the list of matched records.
The code of IndexDao is listed below. It is fairly long, but only because of search: before querying it sets up sorting and a filter, and after querying it post-processes each hit to produce the highlight and the summary.
- package com.lucene;
-
- import java.io.IOException;
- import java.util.ArrayList;
- import java.util.HashMap;
- import java.util.List;
- import java.util.Map;
-
- import jeasy.analysis.MMAnalyzer;
-
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.NumberTools;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriter.MaxFieldLength;
- import org.apache.lucene.index.Term;
- import org.apache.lucene.queryParser.MultiFieldQueryParser;
- import org.apache.lucene.queryParser.QueryParser;
- import org.apache.lucene.search.Filter;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.RangeFilter;
- import org.apache.lucene.search.ScoreDoc;
- import org.apache.lucene.search.Sort;
- import org.apache.lucene.search.SortField;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.search.highlight.Formatter;
- import org.apache.lucene.search.highlight.Fragmenter;
- import org.apache.lucene.search.highlight.Highlighter;
- import org.apache.lucene.search.highlight.QueryScorer;
- import org.apache.lucene.search.highlight.Scorer;
- import org.apache.lucene.search.highlight.SimpleFragmenter;
- import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
-
- public class IndexDao {
-
- /** Directory where the index files live */
- String indexPath = "F:\\Users\\liuyanling\\workspace\\LuceneDemo\\luceneIndex";
-
- /** MMAnalyzer is the JE Chinese analyzer (极易分词器) */
- Analyzer analyzer = new MMAnalyzer();
-
- /**
-  * Add a document to the index.
-  */
- public void save(Document doc) {
- IndexWriter indexWriter = null;
- try {
- indexWriter = new IndexWriter(indexPath, analyzer, MaxFieldLength.LIMITED);
- indexWriter.addDocument(doc);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- // Guard against the constructor having thrown, which leaves indexWriter null.
- if (indexWriter != null) {
- indexWriter.close();
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
-
- /**
-  * Delete every document that contains the given term.
-  */
- public void delete(Term term) {
- IndexWriter indexWriter = null;
- try {
- indexWriter = new IndexWriter(indexPath, analyzer, MaxFieldLength.LIMITED);
- indexWriter.deleteDocuments(term);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- if (indexWriter != null) {
- indexWriter.close();
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
-
- /**
-  * Update a document. Lucene implements updateDocument(term, doc) as
-  * "delete the documents matching term, then add doc".
-  */
- public void update(Term term, Document doc) {
- IndexWriter indexWriter = null;
- try {
- indexWriter = new IndexWriter(indexPath, analyzer, MaxFieldLength.LIMITED);
- indexWriter.updateDocument(term, doc);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- if (indexWriter != null) {
- indexWriter.close();
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
-
- /**
-  * Parse the query string over the name and content fields (name boosted 3x)
-  * and delegate to search(Query, int, int).
-  */
- public QueryResult search(String queryString, int firstResult, int maxResults) {
- try {
-
- // Search both fields; a hit in name counts three times as much as one in content.
- String[] fields = { "name", "content" };
- Map<String, Float> boosts = new HashMap<String, Float>();
- boosts.put("name", 3f);
- boosts.put("content", 1.0f);
-
- QueryParser queryParser = new MultiFieldQueryParser(fields, analyzer, boosts);
-
- Query query = queryParser.parse(queryString);
-
-
- return search(query, firstResult, maxResults);
- } catch (Exception e) {
- throw new RuntimeException(e);
- }
- }
-
- /**
-  * Run the query with a size filter and size-based sort, then highlight and
-  * summarize the content field of every returned hit.
-  */
- public QueryResult search(Query query, int firstResult, int maxResults) {
- IndexSearcher indexSearcher = null;
-
- try {
-
- indexSearcher = new IndexSearcher(indexPath);
-
- // Only keep documents whose size field lies in [200, 1000] (inclusive).
- // NumberTools.longToString pads the value into a fixed-width base-36 string,
- // so that string order matches numeric order.
- Filter filter = new RangeFilter("size", NumberTools.longToString(200), NumberTools.longToString(1000), true, true);
-
- // Sort the hits by the size field instead of by relevance score.
- Sort sort = new Sort();
- sort.setSort(new SortField("size"));
-
- // Fetch up to 10000 hits; paging happens below over the returned ScoreDocs.
- TopDocs topDocs = indexSearcher.search(query, filter, 10000, sort);
-
-
- int recordCount = topDocs.totalHits;
- List<Document> recordList = new ArrayList<Document>();
-
- // Highlighting: wrap every matched keyword in <font color='red'>...</font>.
- Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
- Scorer scorer = new QueryScorer(query);
- Highlighter highlighter = new Highlighter(formatter, scorer);
-
- // Summary: keep at most 50 characters of the best-matching fragment.
- Fragmenter fragmenter = new SimpleFragmenter(50);
- highlighter.setTextFragmenter(fragmenter);
-
- // Page through the hits [firstResult, firstResult + maxResults).
- int endResult = Math.min(firstResult + maxResults, topDocs.totalHits);
- for (int i = firstResult; i < endResult; i++) {
- ScoreDoc scoreDoc = topDocs.scoreDocs[i];
- int docSn = scoreDoc.doc;
- Document doc = indexSearcher.doc(docSn);
-
- // getBestFragment returns null when no query term occurs in this field;
- // fall back to the first 50 characters of the content as the summary.
- String highContent = highlighter.getBestFragment(analyzer, "content", doc.get("content"));
- if (highContent == null) {
- String content = doc.get("content");
- int endIndex = Math.min(50, content.length());
- highContent = content.substring(0, endIndex);
- }
-
- // Store the highlighted/summarized text back into the returned document.
- doc.getField("content").setValue(highContent);
- recordList.add(doc);
- }
-
-
- return new QueryResult(recordCount, recordList);
- } catch (Exception e) {
- throw new RuntimeException(e);
- } finally {
- try {
- if (indexSearcher != null) {
- indexSearcher.close();
- }
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- }
That is all of IndexDao's CRUD code; next come the methods that test it. Following the earlier approach, the test methods could have lived inside IndexDao itself, but here they are split into a separate class, which decouples the code and is a lot more flexible.
Below is the test code, IndexDaoTest:
- package com.lucene;
-
- import org.apache.lucene.document.Document;
- import org.apache.lucene.index.Term;
- import org.junit.Test;
-
- import com.lucene.units.File2DocumentUtils;
-
- public class IndexDaoTest {
-
- String filePath = "F:\\Users\\liuyanling\\workspace\\LuceneDemo\\datasource\\IndexWriter addDocument's a javadoc .txt";
- String filePath2 = "F:\\Users\\liuyanling\\workspace\\LuceneDemo\\datasource\\小笑话_总统的房间 Room .txt";
-
-
- IndexDao indexDao = new IndexDao();
-
- /** Add: index both sample files, boosting the first document by 3. */
- @Test
- public void testSave() {
- Document doc = File2DocumentUtils.file2Document(filePath);
-
- doc.setBoost(3f);
- indexDao.save(doc);
-
- Document doc2 = File2DocumentUtils.file2Document(filePath2);
-
- indexDao.save(doc2);
- }
-
- /** Delete: remove the documents whose path field matches filePath. */
- @Test
- public void testDelete() {
-
- Term term = new Term("path", filePath);
- indexDao.delete(term);
- }
-
- /** Update: replace the document whose path field matches filePath. */
- @Test
- public void testUpdate() {
-
- Term term = new Term("path", filePath);
-
- Document doc = File2DocumentUtils.file2Document(filePath);
- // Overwrite the content field ("这是更新后的文件内容" = "this is the updated file content").
- doc.getField("content").setValue("这是更新后的文件内容");
-
- indexDao.update(term, doc);
- }
-
- /** Search: an English keyword, a Chinese keyword, and a field-qualified query. */
- @Test
- public void testSearch() {
-
- String queryString1 = "IndexWriter";
- String queryString2 = "房间";
- String queryString3 = "content:绅士";
-
- printSearchResult(queryString1, 0, 10);
- printSearchResult(queryString2, 0, 10);
- printSearchResult(queryString3, 0, 10);
- }
-
- /** Run the query and print the hit count plus every matched document. */
- private void printSearchResult(String queryString, int firstResult, int maxResults) {
- QueryResult qr = indexDao.search(queryString, firstResult, maxResults);
-
- // "总共有【N】条匹配结果" = "N matching results in total".
- System.out.println("总共有【" + qr.getRecordCount() + "】条匹配结果");
- for (Document doc : qr.getRecordList()) {
-
- File2DocumentUtils.printDocumentInfo(doc);
- }
- }
-
- }
And finally the query result class, QueryResult:
- package com.lucene;
-
- import java.util.List;
-
- import org.apache.lucene.document.Document;
-
- /**
-  * A page of search results: the total hit count plus the matched documents.
-  */
- public class QueryResult {
-
- private int recordCount;
-
- private List<Document> recordList;
-
-
- public QueryResult(int recordCount, List<Document> recordList) {
- this.recordCount = recordCount;
- this.recordList = recordList;
- }
-
-
- public int getRecordCount() {
- return recordCount;
- }
-
- public void setRecordCount(int recordCount) {
- this.recordCount = recordCount;
- }
-
- public List<Document> getRecordList() {
- return recordList;
- }
-
- public void setRecordList(List<Document> recordList) {
- this.recordList = recordList;
- }
- }
That is all the code; now let's see it run, testing add, search, update, and delete in turn. First, delete the existing index directory.
1. Run the add test. The index files get created, and both files are indexed.

2. Check the search results. All three queries ought to match something, but the first one returns nothing, because of the filter.

Setting the filter configured in IndexDao's search to null brings the result back: the IndexWriter file's size is only 169 bytes, so the [200, 1000] range filter dropped it. Note also that since the content field does not contain the keyword IndexWriter, nothing is highlighted, and the summary is just the first 50 characters.
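To confirm this, you can temporarily change the two relevant lines in search(Query, int, int) to run without the filter:
- // Temporarily disable the size filter to verify it is what hides the first result.
- Filter filter = null;
- TopDocs topDocs = indexSearcher.search(query, filter, 10000, sort);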

3. Next run the update test, which modifies the content of the IndexWriter document. Afterwards there should be just one matching record, but the search actually returns two: the old record was not deleted, and a new, modified one was added alongside it.

4. Delete should remove all the IndexWriter index entries, yet 2 records still remain. Something is wrong.

Later I found the explanation: when deleting from a Lucene index, it is common for the code to execute successfully while the entries are never actually removed. In summary, watch out for the following:
1. When creating the Term, the key must be a single analyzed "word", otherwise the delete does nothing. For example, if the string "d:\doc\id.txt" was indexed, then using "d:\doc\id.txt" itself as the Term key has no effect; you have to use id.txt (which would delete every file named id.txt, so the official advice is to delete by a key that uniquely identifies the document, such as a product id or a news id). My own guess: give each document an id field as a unique key (the current system time could serve as its value); to delete the documents containing some keyword, first search them out, collect their id values into a List, and then delete by the id field, as sketched after this list. This is probably somewhat inefficient, since every document gets searched twice.
2. The "word" being deleted must have been tokenized when the index was created, otherwise the delete also fails.
3. IndexReader, IndexModifier, and IndexWriter all provide deleteDocuments methods, but IndexModifier was the recommended one because it does a lot of thread-safety work internally (note: IndexModifier is deprecated by now).
4. After deleting, be sure to call the corresponding close method; otherwise nothing is really removed from the index.
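Here is a minimal sketch of the two-pass, id-based delete guessed at in point 1. The deleteByKeyword helper and the id field are illustrative assumptions (the project above does not define them), and in practice the size filter inside search would also restrict which documents are found; search and delete are the IndexDao methods from earlier:
- // Hypothetical IndexDao helper: delete every document matching a keyword via a
- // unique, NOT_ANALYZED "id" field (assumed to be added to each document).
- public void deleteByKeyword(String keyword) {
- // Pass 1: search out the matching documents and collect their ids.
- List<String> ids = new ArrayList<String>();
- QueryResult qr = search(keyword, 0, Integer.MAX_VALUE);
- for (Document doc : qr.getRecordList()) {
- ids.add(doc.get("id"));
- }
- // Pass 2: delete each document by its unique id term; a single untokenized
- // word, so the delete actually takes effect.
- for (String id : ids) {
- delete(new Term("id", id));
- }
- }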
Those are the causes found online; working out experimentally why my index would not delete still took a while. Start with how the fields are set up in File2DocumentUtils:
- Document doc = new Document();
-
- // name and content are stored and analyzed (tokenized).
- doc.add(new Field("name", file.getName(), Store.YES, Index.ANALYZED));
- doc.add(new Field("content", readFileContent(file), Store.YES, Index.ANALYZED));
-
- // size is stored and indexed as a single untokenized term;
- // path is stored but not indexed at all.
- doc.add(new Field("size", NumberTools.longToString(file.length()), Store.YES, Index.NOT_ANALYZED));
- doc.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NO));
Here name and content are ANALYZED, while size and path are NOT_ANALYZED and NO respectively. The advice above says the field must have been "Tokened", but Tokenized is deprecated and has been replaced by ANALYZED. As for "d:\doc\id.txt", at first I could not tell what it meant; it turns out to refer to a path like mine, F:\datasource\IndexWriter addDocument's a javadoc .txt. Intuitively one writes the Term with the full path as the key, but that is not matched; I then expected the bare file name to work, yet it is not matched either. The only spelling that matches is the single token indexwriter. Note that IndexWriter became lowercase: the capitalized form is rejected too. This is exactly what "the key must be a single word" means.
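As code, the three attempts side by side:
- // Not matched: the whole path was never indexed as a single term.
- Term term1 = new Term("path", "F:\\datasource\\IndexWriter addDocument's a javadoc .txt");
- // Still not matched: the file name is not a single analyzed term either.
- Term term2 = new Term("path", "IndexWriter addDocument's a javadoc .txt");
- // Matched: one token, lowercased by the analyzer.
- Term term3 = new Term("path", "indexwriter");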
I was puzzled about which spelling a Term actually matches, and suspected it depends on the analyzer. Conveniently, I had already run the analyzer over this file name while testing tokenizers; the output is shown below. Since the sentence I had tested was "IndexWriter addDocument's a javadoc.txt", I renamed the file from "IndexWriter addDocument's a javadoc .txt" to "IndexWriter addDocument's a javadoc.txt".

Written as Term term = new Term("name", "s");, the test shows that MMAnalyzer accepts this spelling and finds results, while StandardAnalyzer does not. So with the JE analyzer (极易分词器), indexwriter, adddocument, s, and javadoc.txt are all valid keys.
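When in doubt about which terms an analyzer actually produces, you can print them. A minimal sketch, assuming the same Lucene 2.4-era TokenStream API the rest of this code uses (the printTokens helper is illustrative; it needs java.io.StringReader, org.apache.lucene.analysis.Token, and org.apache.lucene.analysis.TokenStream imported):
- // Print every token the analyzer extracts from a text, to see exactly which
- // term keys can be used for deleting and searching.
- public static void printTokens(Analyzer analyzer, String fieldName, String text) throws Exception {
- TokenStream tokenStream = analyzer.tokenStream(fieldName, new StringReader(text));
- for (Token token = tokenStream.next(); token != null; token = tokenStream.next()) {
- System.out.println(token.termText());
- }
- }
-
- // printTokens(new MMAnalyzer(), "name", "IndexWriter addDocument's a javadoc.txt")
- // should print tokens such as: indexwriter, adddocument, s, javadoc.txt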

Rewriting the delete and update methods with the term spelled like this:
- Term term = new Term("name", "indexwriter");
and testing again (delete the index, recreate it, then run the update), the effect is as follows: only one record remains.

And the delete now removes everything; not a single record is left.

Finally, this is not the end: there is a follow-up post, 《全文检索之lucene的优化篇--查询篇》, which walks through Lucene's various query types.