Lucene Full-Text Search

Ashe_wyq

Article link: https://blog.youkuaiyun.com/wangyanqun2017/article/details/85462416

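What is Lucene

A high-performance, scalable full-text search engine toolkit written in Java. It can be embedded easily into all kinds of applications to provide full-text indexing and search.
Lucene's goal is to add full-text search to small and medium-sized applications.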

Development steps

Import the JAR packages
Lucene CRUD, Hello World edition

package com.wyq;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestCRUD {
	String content1 = "hello world java";
	String content2 = "hello world lucene";
	String content3 = "hello world你好世界";
	String content4 = "你好世界";
	String path = "D:/workspace/Eclipse/lucene2/lucene/crud";
	
	@Test
	// Create: add four documents to the index
	public void testCreate() throws Exception {
		Directory d = FSDirectory.open(new File(path));
		Analyzer analyzer = new IKAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
		IndexWriter writer = new IndexWriter(d, conf);
		Document doc1 = new Document();
		Document doc2 = new Document();
		Document doc3 = new Document();
		Document doc4 = new Document();
		FieldType type = new FieldType();
		type.setIndexed(true);
		type.setStored(true);
		type.setTokenized(true);
		doc1.add(new Field("title", "doc1", type));
		doc1.add(new Field("content", content1, type));
		doc2.add(new Field("title", "doc2", type));
		doc2.add(new Field("content", content2, type));
		doc3.add(new Field("title", "doc3", type));
		doc3.add(new Field("content", content3, type));
		doc4.add(new Field("title", "doc4", type));
		doc4.add(new Field("content", content4, type));
		writer.addDocument(doc1);
		writer.addDocument(doc2);
		writer.addDocument(doc3);
		writer.addDocument(doc4);
		writer.commit();
		writer.close();
	}
	
	@Test
	// Update
	public void testUpdate() throws Exception {
		Directory d = FSDirectory.open(new File(path));
		Analyzer analyzer = new IKAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
		IndexWriter writer = new IndexWriter(d, conf);
		Document doc = new Document();
		FieldType type = new FieldType();
		type.setIndexed(true);
		type.setStored(true);
		type.setTokenized(true);
		doc.add(new Field("title", "doc1", type));
		doc.add(new Field("content", "hello world python", type));
		writer.updateDocument(new Term("title", "doc1"), doc); // update the document whose title is doc1
		writer.commit();
		writer.close();
	}
	@Test
	// Delete by a custom Term
	public void testDelete() throws Exception {
		Directory d = FSDirectory.open(new File(path));
		Analyzer analyzer = new IKAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
		IndexWriter writer = new IndexWriter(d, conf);
		writer.deleteDocuments(new Term("content", "hello"));
		writer.commit();
		writer.close();
	}
	@Test
	// Delete by query result
	public void testDelete2() throws Exception {
		Directory d = FSDirectory.open(new File(path));
		Analyzer analyzer = new IKAnalyzer();
		IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_4, analyzer);
		IndexWriter writer = new IndexWriter(d, conf);
		QueryParser parser = new QueryParser("content", analyzer);
		Query query = parser.parse("hello");
		writer.deleteDocuments(query);
		writer.commit();
		writer.close();
	}
	
	@Test
	// Query (read)
	public void testQuery() throws Exception {
		Directory d = FSDirectory.open(new File(path));
		IndexReader r = DirectoryReader.open(d);
		IndexSearcher searcher = new IndexSearcher(r);
		Analyzer analyzer = new IKAnalyzer();
		QueryParser parser = new QueryParser("content", analyzer);
		Query query = parser.parse("你好");
		TopDocs tds = searcher.search(query, 1000);
		System.out.println("Total hits: " + tds.totalHits);
		for (ScoreDoc scoreDoc : tds.scoreDocs) {
			System.out.println("...........");
			System.out.println("Doc id: " + scoreDoc.doc);
			System.out.println("Score: " + scoreDoc.score);
			System.out.println("title:" + searcher.doc(scoreDoc.doc).get("title") );
			System.out.println("content:" + searcher.doc(scoreDoc.doc).get("content"));
		}
		r.close(); // release the index files held by the reader
	}
}

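Core API

IndexWriter: creates the index
- writer.addDocument(doc1); // create
- writer.deleteDocuments(query); // delete
- writer.updateDocument(new Term("title", "doc1"), doc); // update
- writer.commit(); // commit (adds, deletes, and updates take effect only after a commit, much like committing a database transaction)
- writer.close(); // close (similar to closing a database connection)
IndexSearcher: searches the index
- searcher.search(query, 1000); // search, returning at most 1000 hits
- searcher.doc(scoreDoc.doc); // fetch the Document object for a hit
FieldType
- type.setStored(true); // whether to store the original value
- type.setIndexed(true); // whether to index the field
- type.setTokenized(true); // whether to tokenize the value
Analyzer (abstract class)
- The tokenizer
SimpleAnalyzer
- Splits text at non-letter characters and lowercases it (only useful for English-like text)
StandardAnalyzer
- Splits by word, or by single character for Chinese (Chinese text is simply broken into individual characters, which is of little practical use)
PerFieldAnalyzerWrapper
- A wrapper class for analyzers (uses the decorator design pattern)
- Main purpose: lets you set a different analyzer for each field
CJKAnalyzer
- Splits Chinese text into two-character tokens (never use it; a terrible tokenizer)
SmartChineseAnalyzer
- Tokenizes based on a dictionary (words found in the dictionary are kept as words; everything else is split into single characters)
- Requires the lucene-analyzers-smartcn-4.10.4.jar package
- Looks smart, but is still of little practical use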
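
To make the complaint about CJKAnalyzer more concrete, here is a minimal plain-Java sketch of overlapping two-character (bigram) splitting, which is roughly what that analyzer does to a run of Chinese characters. The class and method names are hypothetical and this is not Lucene code; it only handles a single run of characters and ignores everything else a real analyzer does. With this scheme 你好世界 becomes 你好, 好世, 世界, so the meaningless pair 好世 ends up in the index as well. The string literal uses Unicode escapes only to keep the source file ASCII.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
final class BigramSketch {

    // Split a run of characters into overlapping two-character tokens.
    // A run of length 1 is returned as a single token.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 2 <= run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The four characters of content4 from the CRUD example, written as escapes.
        System.out.println(bigrams("\u4f60\u597d\u4e16\u754c"));
    }
}
```

A dictionary-based tokenizer such as IKAnalyzer would instead try to emit only real words, which is why the rest of this post uses it.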

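IKAnalyzer

A real Chinese tokenizer (a Chinese tokenizer written by Chinese developers; recommended)
Import IKAnalyzer2012FF_u1.jar (a third-party package)

Add the configuration file IKAnalyzer.cfg.xml to configure stop words or extension words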

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionaries here -->
	<entry key="ext_dict">ext.dic;</entry> 
	
	<!-- Configure your own extension stop-word dictionaries here -->
	<entry key="ext_stopwords">stopword.dic;</entry> 
</properties>

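Directory (abstract class)
- The directory where the index is stored
- FSDirectory.open(file); // how to obtain one
IndexWriterConfig
- Configuration for writing the index (appends to the existing index by default)
- conf.setOpenMode(OpenMode.CREATE); // rewrite the index from scratch every time
Document
- A document (similar to a row in a database table)
- document.add(field); // add a field
Field
- Represents one field (similar to a single cell in a database table)
- new Field("title", "doc1", type); // the commonly used constructor
- Takes the field name, the data to add, and the field type
IndexReader (abstract class)
- Wraps the index files and other information a search needs (its main job is to create the index searcher)
- DirectoryReader.open(d); // how to create one; d is the Directory object (the index directory)
QueryParser
- The query parser
- Wraps which field to search and which analyzer to use (note: parse queries with the same analyzer that was used when indexing)
- new QueryParser("content", analyzer); // field name, analyzer object
Query (abstract class)
- parser.parse("你好"); // builds the query
TopDocs
- The object that wraps the search results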

Advanced queries

Match all
- parser.parse("*:*"); // matches all documents
- parser.parse("content:java"); // matches documents whose content contains java
- searcher.search(new MatchAllDocsQuery(), 1000); // matches all documents
- searcher.search(new TermQuery(new Term("title", "doc1")), 1000); // matches documents whose title is doc1

Phrase query

parser.parse("\"hello java\""); // wrapping the words in double quotes treats "hello java" as a single unit
Use a PhraseQuery object to wrap the query conditions

PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "hello"));
query.add(new Term("content", "java"));
TopDocs tds = searcher.search(query, 1000);

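Wildcard query
- parser.parse("jav?"); // matches terms made of jav plus exactly one more character
- parser.parse("jav*"); // matches terms made of jav plus any number of characters (including none)
- Or wrap the pattern in a WildcardQuery object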
```
WildcardQuery query = new WildcardQuery(new Term("content", "jav?"));
TopDocs tds = searcher.search(query, 1000);
```
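
The difference between `?` and `*` can be checked without an index. The following is a plain-Java sketch under the assumption that a wildcard pattern is matched against a whole term; it translates the pattern into a regular expression. The class and method names are made up for illustration, and Lucene's WildcardQuery does not work this way internally.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Lucene.
final class WildcardSketch {

    // Translate a wildcard pattern into a regex:
    // '?' becomes "exactly one character", '*' becomes "zero or more characters".
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // True if the whole term matches the wildcard pattern.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("jav?", "java"));       // true
        System.out.println(matches("jav?", "javac"));      // false
        System.out.println(matches("jav*", "javascript")); // true
    }
}
```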
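Fuzzy query
- parser.parse("Xava~1"); // X is a placeholder marking the position of a wrong character; ~1 allows one wrong character (at most 2 are allowed)
- Or wrap it in a FuzzyQuery: new FuzzyQuery(new Term("content", "exceptioX"), 1); // the default is 2, which allows 2 wrong characters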
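
The "number of wrong characters" here is an edit distance. As a rough illustration, the sketch below computes the classic Levenshtein distance (insert, delete, or replace one character per edit) and accepts a term when the distance is within the limit. The names are hypothetical, and Lucene's own implementation differs in its details (for example, by default it can treat a swap of two adjacent characters as a single edit), so take this as a simplified model only.

```java
// Hypothetical illustration class, not part of Lucene.
final class EditDistanceSketch {

    // Classic Levenshtein distance using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if the indexed term is within maxEdits of the query term.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm, int maxEdits) {
        return distance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(distance("xava", "java"));             // 1
        System.out.println(fuzzyMatches("xava", "java", 1));      // true
        System.out.println(fuzzyMatches("xava", "lucene", 1));    // false
    }
}
```
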
Proximity query
- parser.parse("\"hello python\"~n"); // allows up to n other words between hello and python (n can be any number you choose)
- Or use a PhraseQuery object to wrap the query conditions
```
PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "hello"));
query.add(new Term("content", "python"));
query.setSlop(3);
```
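
What slop means can be modeled on a plain list of tokens. The sketch below only covers the simple in-order case for a two-word phrase: the second word must follow the first with at most `slop` other words in between. The names are hypothetical, and it deliberately ignores reordered matches, which Lucene's sloppy phrase matching can also accept at a higher slop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of Lucene.
final class SlopSketch {

    // True if `second` appears after `first` with at most `slop` other tokens in between.
    static boolean inOrderWithin(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens sitting between positions i and j.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("hello", "world", "python");
        System.out.println(inOrderWithin(tokens, "hello", "python", 0)); // false
        System.out.println(inOrderWithin(tokens, "hello", "python", 1)); // true
    }
}
```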
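Range query
- parser.parse("time:{20181230 TO 20181232}"); // matches documents whose time field lies between 20181230 and 20181232; {} is an exclusive range, [] is an inclusive range
- Or use a TermRangeQuery object to wrap the conditions; true means the bound is included, false means it is excluded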
```
TermRangeQuery query = new TermRangeQuery("inputtime", new BytesRef("20181229"), new BytesRef("20181231"), true, true);
```
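
One thing worth keeping in mind is that a term range compares terms as text, not as numbers. The sketch below models the inclusive and exclusive flags with `String.compareTo`, which gives the same ordering for plain ASCII digits. The names are hypothetical and this is not how Lucene evaluates the query internally. Date strings like the ones above work because they all have the same length; values of different lengths such as 9 and 10 do not sort numerically.

```java
// Hypothetical illustration class, not part of Lucene.
final class TermRangeSketch {

    // True if term lies between lower and upper, compared as text.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        boolean aboveLower = includeLower ? lo >= 0 : lo > 0;
        boolean belowUpper = includeUpper ? hi <= 0 : hi < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(inRange("20181231", "20181229", "20181231", true, true));  // true
        System.out.println(inRange("20181231", "20181229", "20181231", true, false)); // false
        System.out.println(inRange("9", "10", "20", true, true));                     // false
    }
}
```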
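Multi-condition query
- parser.parse("java AND python"); // contains both java and python (note: AND must be uppercase)
- parser.parse("java OR python"); // contains java or python
- parser.parse("java && python"); // equivalent to AND
- parser.parse("java || python"); // equivalent to OR
- parser.parse("*:* !content:python"); // excludes the results of the second clause from the results of the first
- parser.parse("+content:java -content:python"); // must contain java and must not contain python (+ means required, - means excluded)
- Or use a BooleanQuery object to wrap the conditions (omitted)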
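
Since the BooleanQuery version is omitted, here is a plain-Java sketch of what the operators mean in terms of sets of matching document ids. It is not Lucene code, and the class name, method names, and document ids are made up. In Lucene's BooleanQuery the same three ideas are expressed with Occur.MUST, Occur.SHOULD, and Occur.MUST_NOT clauses.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class BooleanSketch {

    // AND / &&: documents present in both result sets.
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // OR / ||: documents present in either result set.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // "a !b" or "+a -b": documents in a that are not in b.
    static Set<Integer> not(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    static Set<Integer> ids(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        Set<Integer> all = ids(1, 2, 3, 4);    // made-up ids: every document
        Set<Integer> withJava = ids(1, 3);     // documents containing java
        Set<Integer> withPython = ids(2, 3);   // documents containing python
        System.out.println(and(withJava, withPython)); // [3]
        System.out.println(or(withJava, withPython));  // [1, 2, 3]
        System.out.println(not(all, withPython));      // [1, 4]
        System.out.println(not(withJava, withPython)); // [1]
    }
}
```
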
Boosting
- parser.parse("java || python^10"); // boosts the weight of python by a factor of 10

Highlighting

Import lucene-highlighter-4.10.4.jar

    Formatter formatter = new SimpleHTMLFormatter("<font color='yellow'>", "</font>");
    Scorer scorer = new QueryScorer(query);
    Highlighter hl = new Highlighter(formatter, scorer);
    hl.setMaxDocCharsToAnalyze(1000); // only the first 1000 characters of the text are analyzed for highlighting
    
    String str = hl.getBestFragment(analyzer, "content", searcher.doc(scoreDoc.doc).get("content")); // the highlighted string
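
To see what kind of string comes back, the following plain-Java sketch wraps every occurrence of a term in a pair of tags, in the same spirit as the pre and post tags passed to SimpleHTMLFormatter. It is only a simplified model with made-up names: it matches raw substrings case-insensitively, whereas the real Highlighter works on analyzed tokens and also picks the best-scoring fragment of the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Lucene.
final class HighlightSketch {

    // Wrap each case-insensitive occurrence of term in text with pre and post.
    static String highlight(String text, String term, String pre, String post) {
        Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        // $0 refers to the matched text itself, so the original casing is kept.
        return m.replaceAll(Matcher.quoteReplacement(pre) + "$0" + Matcher.quoteReplacement(post));
    }

    public static void main(String[] args) {
        System.out.println(highlight("hello world java", "java", "<b>", "</b>"));
        // prints: hello world <b>java</b>
    }
}
```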