I only started learning Lucene today. It's something I had meant to learn for a long time but kept putting off; pressure really is a good thing, it keeps you moving forward.
Lucene is a full-text search library: it builds an index for every object that goes into the searcher, which sets it apart from querying a database directly with LIKE '%...%'. My current company has a feature for searching organizations, departments, and people that does exactly that, a SQL fuzzy query. Since it is an internal system the approach is good enough, but it cannot cope with tens of thousands of records; it would run extremely slowly, and on top of that every search goes through a remote interface call. This is where Lucene's advantages show: it is used in BBS forums and in Eclipse, and the file-search features on a desktop are built on the same idea.
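The contrast is easy to see in a toy model (plain Java of my own, not Lucene code): a LIKE '%keyword%' query has to scan every row, while an inverted index maps each term straight to the documents that contain it.

```java
import java.util.*;

public class InvertedIndexToy {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("lucene in action", "sql in a nutshell", "lucene for search");

        // LIKE '%lucene%' equivalent: scan every document, every time
        List<Integer> scanHits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (docs.get(i).contains("lucene")) scanHits.add(i);
        }

        // Inverted index: term -> doc ids, built once; afterwards a lookup is a single map get
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (String term : docs.get(i).split(" ")) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
            }
        }
        List<Integer> indexHits = index.getOrDefault("lucene", Collections.emptyList());

        System.out.println(scanHits);  // [0, 2]
        System.out.println(indexHits); // [0, 2]
    }
}
```

Both approaches find the same documents; the difference is that the scan repeats all the work on every query, while the index pays the cost once at indexing time.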
The first small example --- creating an index
public void testLuncence() throws Exception {
    // the Lucene version in use
    Version version = Version.LUCENE_44;
    // the analyzer (tokenizer)
    Analyzer analyzer = new StandardAnalyzer(version);
    // the directory where the index is stored
    Directory directory = FSDirectory.open(new File("luncenceDemo"));
    // configuration for writing to the index
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(version, analyzer);
    // the class that performs index operations
    IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
    Document doc = new Document();
    IndexableField field = new IntField("id", 1, Store.YES);
    IndexableField stringField = new StringField("name", "ligang", Store.YES);
    IndexableField textField = new TextField("addr", "吉林长春大连吐鲁番", Store.YES);
    doc.add(field);
    doc.add(stringField);
    doc.add(textField);
    indexWriter.addDocument(doc);
    indexWriter.close();
}
The search demo. At first I couldn't work out why my syso (System.out.println) produced no output:
public void testSearcher() throws Exception {
    Directory directory = FSDirectory.open(new File("luncenceDemo"));
    IndexReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    Query query = new TermQuery(new Term("addr", "吉林长春大连吐鲁番"));
    TopDocs topDocs = searcher.search(query, 10);
    int total = topDocs.totalHits;            // total number of hits
    ScoreDoc[] scoreDocs = topDocs.scoreDocs; // doc ids that Lucene maintains itself
    for (ScoreDoc scoreDoc : scoreDocs) {
        int docID = scoreDoc.doc;             // the document's id
        Document document = searcher.doc(docID);
        String addr = document.get("addr");
        String content = document.get("content"); // no "content" field was indexed above, so this is null
        System.out.println("addr=" + addr);
        System.out.println("content=" + content);
    }
}
If the query line is changed to Query query = new TermQuery(new Term("addr", "吉")); then results do appear.
The cause is the analyzer, which is the key component of Lucene; there are many different analyzer implementations, and here I used the standard StandardAnalyzer. A StringField is indexed as a single token, the whole value unanalyzed, while a TextField is run through the analyzer, and StandardAnalyzer splits Chinese text into single characters. A TermQuery only matches whole terms, so querying the analyzed "addr" field for the full sentence finds nothing, but the single character "吉" does.
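The two behaviors can be sketched without Lucene at all. This plain-Java stand-in (my own illustration, not Lucene code) mimics how StandardAnalyzer turns a TextField's Chinese text into single-character terms while a StringField keeps the whole value as one term:

```java
import java.util.*;

public class TokenizeSketch {
    // TextField + StandardAnalyzer: Chinese text becomes one term per character
    static List<String> analyzeLikeStandard(String text) {
        List<String> terms = new ArrayList<>();
        for (char c : text.toCharArray()) terms.add(String.valueOf(c));
        return terms;
    }

    // StringField: the whole value is indexed as a single unanalyzed term
    static List<String> indexAsStringField(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String addr = "吉林长春大连吐鲁番";
        System.out.println(analyzeLikeStandard(addr)); // [吉, 林, 长, 春, 大, 连, 吐, 鲁, 番]
        System.out.println(indexAsStringField(addr));  // [吉林长春大连吐鲁番]
        // A TermQuery matches whole terms only, which is why "吉" hits and the full sentence does not:
        System.out.println(analyzeLikeStandard(addr).contains("吉"));  // true
        System.out.println(analyzeLikeStandard(addr).contains(addr));  // false
    }
}
```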
The concrete CRUD operations for full-text search
The JavaBean:
public class Article {
    private int id;
    private String title;
    private String content;
    private String author;
    private String url;
    ....................
The DAO layer:
package com.itheima.com.luncence.demo2;

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

import com.itheima.com.luncence.bean.Article;
import com.itheima.com.luncence.demo2.util.ArticleUtil;
import com.itheima.com.luncence.demo2.util.LuceneUtil;

public class LuceneDao {

    public void addLucene(Article article) throws Exception {
        IndexWriter indexWriter = LuceneUtil.getIndexWriter();
        Document document = ArticleUtil.articleToDocument(article);
        indexWriter.addDocument(document);
        indexWriter.close();
    }

    public void deleteLucene(String fieldName, String fieldValue) throws Exception {
        IndexWriter indexWriter = LuceneUtil.getIndexWriter();
        Term term = new Term(fieldName, fieldValue);
        indexWriter.deleteDocuments(term);
        indexWriter.close();
    }

    public void updateLuncene() {
    }

    public List<Article> getLucene(String keyword, int firstResult, int maxResult) throws Exception {
        IndexSearcher indexSearcher = LuceneUtil.getIndexSercher();
        String[] fields = {"title", "content", "author", "url"};
        // parse the keyword against several fields at once
        QueryParser parser = new MultiFieldQueryParser(LuceneUtil.getVersion(), fields, LuceneUtil.getAnalyzer());
        Query query = parser.parse(keyword);
        // fetch enough hits to cover the requested page, then slice out that page
        TopDocs topDocs = indexSearcher.search(query, firstResult + maxResult);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Article> articleList = new ArrayList<Article>();
        int end = Math.min(firstResult + maxResult, scoreDocs.length);
        for (int i = firstResult; i < end; i++) {
            int docID = scoreDocs[i].doc; // the id Lucene maintains itself
            Document document = indexSearcher.doc(docID);
            articleList.add(ArticleUtil.documentToArticle(document));
        }
        return articleList;
    }
}
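ArticleUtil and LuceneUtil are not shown here; articleToDocument and documentToArticle are just field-by-field mappings between the bean and a Lucene Document. The sketch below illustrates that mapping idea with a plain Map standing in for Document (my own stand-in; the real version would call document.add(new TextField(...)) and document.get(...)):

```java
import java.util.*;

public class ArticleMappingSketch {
    // stand-in for ArticleUtil.articleToDocument: copy every bean field into the "document"
    static Map<String, String> articleToDocument(int id, String title, String content,
                                                 String author, String url) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", String.valueOf(id)); // field values end up as strings in the document
        doc.put("title", title);
        doc.put("content", content);
        doc.put("author", author);
        doc.put("url", url);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = articleToDocument(1, "大标题", "大标题才能被人民看见", "大熊", "www.itheima.com");
        // stand-in for documentToArticle: read each field back out
        System.out.println(doc.get("title"));  // 大标题
        System.out.println(doc.get("author")); // 大熊
    }
}
```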
The JUnit tests:
package com.itheima.com.luncence.junit;

import java.util.List;

import org.junit.Test;

import com.itheima.com.luncence.bean.Article;
import com.itheima.com.luncence.demo2.LuceneDao;

public class LuceneJunit {

    private LuceneDao dao = new LuceneDao();

    @Test
    public void testAddLucene() throws Exception {
        for (int i = 0; i < 20; i++) {
            Article article = new Article();
            article.setTitle("大标题");
            article.setContent("大标题才能被人民看见");
            article.setAuthor("大熊");
            article.setUrl("www.itheima.com");
            dao.addLucene(article);
        }
    }

    @Test
    public void testGetLucene() throws Throwable {
        String keyword = "大";
        List<Article> lucene = dao.getLucene(keyword, 0, 10);
        for (Article article : lucene) {
            System.out.println("title=" + article.getTitle());
            System.out.println("id=" + article.getId());
            System.out.println("content=" + article.getContent());
            System.out.println("url=" + article.getUrl());
            System.out.println("author=" + article.getAuthor());
        }
    }

    @Test
    public void testDeleteLucene() throws Exception {
        dao.deleteLucene("title", "大");
    }
}
The update method here isn't written yet; I'll get to it later.
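For reference, Lucene's own update operation is IndexWriter.updateDocument(Term, doc), which atomically deletes every document matching the term and then adds the new document. A plain-Java sketch of that delete-then-add semantics on a toy index (my own illustration; updateLuncene would just wrap the real call):

```java
import java.util.*;

public class UpdateSketch {
    public static void main(String[] args) {
        // toy index: one stored "document" (a map of fields) per id value
        Map<String, Map<String, String>> indexById = new HashMap<>();
        indexById.put("1", new HashMap<>(Collections.singletonMap("title", "old title")));

        // updateDocument(term, newDoc) semantics: delete all docs matching the term, then add
        Map<String, String> newDoc = new HashMap<>();
        newDoc.put("title", "new title");
        indexById.remove("1");       // delete step
        indexById.put("1", newDoc);  // add step

        System.out.println(indexById.get("1").get("title")); // new title
    }
}
```

The point of the combined operation in real Lucene is atomicity: a searcher never observes the state between the delete and the add.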
The tests all pass, but fields that were put together as StringField come back null from the keyword search, while TextField fields come back fine; deletion works without a problem.
Analyzers ***
I pulled in the third-party analyzer IKAnalyzer, which is more accurate: it splits text on actual words rather than single characters, and it also lets you configure your own extension dictionary and your own extension stop-word dictionary. A small example shows the tokenization result:
public static void main(String[] args) throws Exception {
    String text = "大标题是的吗嘛为什么的呢我这么只懂到的呢";
    Analyzer analyzer = new IKAnalyzer();
    testAnalyzer(analyzer, text);
}

public static void testAnalyzer(Analyzer analyzer, String text) throws IOException {
    TokenStream tokenStream = analyzer.tokenStream("title", new StringReader(text));
    // register the attribute once, then reuse it inside the loop
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.println(charTermAttribute.toString());
    }
    tokenStream.end();
    tokenStream.close();
}
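The extension dictionary and stop-word dictionary mentioned above are configured through an IKAnalyzer.cfg.xml file on the classpath; a typical layout looks like this (the .dic file names are examples of your own files, not fixed names):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- your own extension dictionary -->
    <entry key="ext_dict">ext.dic</entry>
    <!-- your own extension stop-word dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```

The dictionary files are plain text, one word per line, and must also sit on the classpath.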