Using Lucene

License: CC 4.0 BY-SA

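This article covers the principles and application of Lucene full-text search: the notions of structured and unstructured data, and the full-text search workflow (creating the index, analyzing documents, querying the index). It pays particular attention to using the IK analyzer and to the implementation code.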

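I. Overview
Structured data: data with a fixed format and length, such as database tables.
Unstructured data: data with no fixed format or length, such as Word documents.
Full-text search: for unstructured data, build an index first, then run queries against that index.
Lucene: a Java toolkit (library) for full-text search.
Use cases: fuzzy queries or natural-language search over large volumes of data.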
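II. Creating the index
1 Obtain the original documents, for example with the Nutch crawler or the jsoup HTML parser.
2 Create document objects. A Document corresponds to one row of a table. A Field belongs to a Document, corresponds to a column of the table, and holds the content.
Rules for choosing a Field's attributes:
Tokenize? Most fields that must be searchable from the page should be tokenized. The exceptions are values with a specific business meaning that would be lost if they were split, such as order numbers and ID card numbers.
Index? Any field that must be searchable from the page has to be indexed.
Store? Any field that must be displayed on the page has to be stored.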
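The three questions above can be written down as a small decision table. The sketch below is plain Java, not Lucene API: `FieldPolicy` and the sample field names are hypothetical and exist only for illustration. The strings it returns name the Lucene 4.x field classes that usually fit each combination (`TextField` for tokenized and indexed values, `StringField` for indexed but untokenized values, `StoredField` for values that are only displayed).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: records the three per-field decisions described above. */
final class FieldPolicy {
    final boolean tokenized; // should the value be split into terms?
    final boolean indexed;   // should the value be searchable?
    final boolean stored;    // should the value be returned for display?

    FieldPolicy(boolean tokenized, boolean indexed, boolean stored) {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** Names the Lucene 4.x field class that usually fits this combination. */
    String suggestedFieldClass() {
        if (indexed && tokenized) {
            return stored ? "TextField(Store.YES)" : "TextField(Store.NO)";
        }
        if (indexed) {
            return stored ? "StringField(Store.YES)" : "StringField(Store.NO)";
        }
        return stored ? "StoredField" : "(do not add the field)";
    }

    /** Sample decisions; the field names are illustrative assumptions. */
    static Map<String, FieldPolicy> examples() {
        Map<String, FieldPolicy> m = new LinkedHashMap<>();
        m.put("name", new FieldPolicy(true, true, true));      // searched and displayed
        m.put("content", new FieldPolicy(true, true, false));  // searched, too large to display
        m.put("orderNo", new FieldPolicy(false, true, true));  // searched as a whole value
        m.put("path", new FieldPolicy(false, false, true));    // displayed only
        return m;
    }

    public static void main(String[] args) {
        examples().forEach((field, policy) ->
                System.out.println(field + " -> " + policy.suggestedFieldClass()));
    }
}
```

An order number ends up as an untokenized, indexed field, which matches the rule that values with a business meaning should not be split.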
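3 Analyze the documents
The content held in each field is analyzed to produce a list of terms.
Process: extract the words from the original document, convert letters to lowercase, remove punctuation, and remove stop words.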
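The four analysis steps can be sketched in a few lines of plain Java. This is a toy pipeline and not a Lucene `Analyzer`: the class name `SimpleAnalysis` and the tiny stop-word list are assumptions made for the example. It also splits on whitespace, which only works for languages that separate words with spaces, so Chinese text still needs a real tokenizer such as IK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis pipeline: words -> lowercase -> no punctuation -> no stop words. */
final class SimpleAnalysis {
    // Tiny illustrative stop-word list; real analyzers ship much larger ones.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // 1. Extract words by splitting on whitespace.
        for (String raw : text.split("\\s+")) {
            // 2. Convert to lowercase.
            String token = raw.toLowerCase(Locale.ROOT);
            // 3. Remove punctuation (anything that is not a letter or a digit).
            token = token.replaceAll("[^\\p{L}\\p{N}]", "");
            // 4. Drop empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [lucene, java, library, index, fast]
        System.out.println(analyze("Lucene is a Java library, and the index is FAST!"));
    }
}
```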
4 Create the index
   Index the term list: each term points to the list of IDs of the documents that contain it.
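A structure in which each term points to a list of document IDs is usually called an inverted index. The sketch below shows the idea with a sorted map; `InvertedIndex` is a hypothetical class for illustration, and Lucene's real on-disk index is far more elaborate.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Minimal in-memory inverted index: term -> sorted set of document IDs. */
final class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Registers every term of one document under that document's ID. */
    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the IDs of the documents containing the term (empty if none). */
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "lucene", "index");
        index.add(2, "spring", "index");
        System.out.println(index.docsFor("index"));  // prints [1, 2]
        System.out.println(index.docsFor("lucene")); // prints [1]
    }
}
```

Looking up a term is a single map access, which is why querying the index is much cheaper than scanning every document.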
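III. Querying the index
1 Query interface (where the user enters the search keywords)
2 Create the query
3 Execute the query
4 Render the results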
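The query steps can be walked through against a tiny in-memory index. This is only a sketch of the flow: `QueryFlow`, its sample data, and the `docId: name` output format are assumptions made for the example, and the `limit` parameter merely plays the role of the hit count passed to a search call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Walks through: create the query, execute it, render the hits. */
final class QueryFlow {
    // Stored values: document ID -> file name (this is what gets rendered).
    static final Map<Integer, String> STORED_NAMES = new HashMap<>();
    // Inverted index: term -> document IDs.
    static final Map<String, List<Integer>> INDEX = new HashMap<>();

    static {
        STORED_NAMES.put(0, "lucene intro.txt");
        STORED_NAMES.put(1, "spring notes.txt");
        STORED_NAMES.put(2, "lucene and spring.txt");
        INDEX.put("lucene", Arrays.asList(0, 2));
        INDEX.put("spring", Arrays.asList(1, 2));
    }

    /** Takes the raw user input (step 1) and returns at most `limit` rendered hits. */
    static List<String> search(String userInput, int limit) {
        // Step 2: create the query (here: a single normalized term).
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        // Step 3: execute the query against the index.
        List<Integer> hits = INDEX.getOrDefault(term, Collections.emptyList());
        // Step 4: render the results from the stored values.
        List<String> rendered = new ArrayList<>();
        for (int docId : hits) {
            if (rendered.size() == limit) {
                break;
            }
            rendered.add(docId + ": " + STORED_NAMES.get(docId));
        }
        return rendered;
    }

    public static void main(String[] args) {
        // prints [0: lucene intro.txt, 2: lucene and spring.txt]
        System.out.println(search(" Lucene ", 10));
    }
}
```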
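IV. IK Analyzer
The IK analyzer supports both Chinese and English.
How to use it:
Step 1: add the jar to the project.
Step 2: put the configuration file, the extension dictionary, and the stop-word dictionary on the classpath.
Note: mydict.dic and ext_stopword.dic must be encoded as UTF-8 without a BOM.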
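A quick way to verify the encoding requirement is to look at the first three bytes of each dictionary file: a UTF-8 BOM is the byte sequence EF BB BF. The check below is a small stand-alone sketch; `BomCheck` is a hypothetical helper, and the file paths are whatever you pass on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Detects a UTF-8 byte order mark (EF BB BF) at the start of a file. */
final class BomCheck {
    /** Returns true if the given bytes begin with the UTF-8 BOM. */
    static boolean startsWithUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Reads up to three bytes from the file and checks them for the BOM. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        byte[] head = new byte[3];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < 3) {
                int n = in.read(head, total, 3 - total);
                if (n < 0) {
                    break; // file is shorter than three bytes
                }
                total += n;
            }
        }
        return total == 3 && startsWithUtf8Bom(head);
    }

    public static void main(String[] args) throws IOException {
        // Usage example: java BomCheck mydict.dic ext_stopword.dic
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(file + (hasUtf8Bom(file) ? ": has a BOM" : ": no BOM"));
        }
    }
}
```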
When the analyzer is applied: when the index is created, and when the index is queried (the query content is tokenized first).
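Why the query text should go through the same analysis as the indexed text can be seen with a toy example. The sketch below is plain Java with a made-up `analyze` method (lowercase, then split on anything that is not a letter or digit); `AnalyzeBothSides` is a hypothetical name. In Lucene itself a `TermQuery` is not analyzed, so its term must already be in the form the analyzer produced at index time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Shows that a query only matches when it is analyzed like the indexed text. */
final class AnalyzeBothSides {
    /** Toy analyzer: lowercase, then split on non-letter, non-digit characters. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Index time: the document text is analyzed into a set of terms. */
    static Set<String> indexTerms(String docText) {
        return new HashSet<>(analyze(docText));
    }

    /** Query time WITHOUT analysis: the raw input is compared as-is. */
    static boolean matchesRaw(Set<String> indexed, String query) {
        return indexed.contains(query);
    }

    /** Query time WITH analysis: the input is analyzed first, then compared. */
    static boolean matchesAnalyzed(Set<String> indexed, String query) {
        for (String term : analyze(query)) {
            if (indexed.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Lucene in Action.pdf");
        System.out.println(matchesRaw(indexed, "Lucene"));      // prints false
        System.out.println(matchesAnalyzed(indexed, "Lucene")); // prints true
    }
}
```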
V. Code implementation

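// Imports needed by the methods in this section
// (the API used here is Lucene 4.10.x, plus Commons IO and IK Analyzer)
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

// Create the index
public void createIndex() throws Exception {
    IndexWriter indexWriter = getIndexWriter();
    // Directory containing the source files to index
    File sourceFile = new File("E:\\项目二\\lucene\\day06\\资料\\searchsource");
    for (File file : sourceFile.listFiles()) {
        String fileName = file.getName();
        String filePath = file.getPath();
        long size = FileUtils.sizeOf(file);
        // Reads the file using the platform default charset
        String fileContent = FileUtils.readFileToString(file);

        Document document = new Document();
        // name: tokenized, indexed, stored
        TextField fileNameField = new TextField("name", fileName, Store.YES);
        // path: stored only (not searchable)
        StoredField pathField = new StoredField("path", filePath);
        // size: indexed as a numeric value and stored
        LongField sizeField = new LongField("size", size, Store.YES);
        // content: tokenized and indexed, but not stored
        TextField contentField = new TextField("content", fileContent, Store.NO);
        document.add(fileNameField);
        document.add(pathField);
        document.add(sizeField);
        document.add(contentField);
        indexWriter.addDocument(document);
    }
    indexWriter.close();
}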

// Query the index
public void queryIndex() throws Exception {
    Directory directory = FSDirectory.open(new File("E:\\index"));
    IndexReader indexReader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);
    // A term query behaves like an equality match against an indexed term
    Query query = new TermQuery(new Term("name", "lucene"));
    // Return at most the top 10 hits
    TopDocs topDocs = indexSearcher.search(query, 10);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        int id = scoreDoc.doc;
        Document document = indexSearcher.doc(id);
        System.out.println(document.get("name"));
        System.out.println(document.get("path"));
        // content was added with Store.NO, so this prints null
        System.out.println(document.get("content"));
        System.out.println(document.get("size"));
    }
    indexReader.close();
}

// Add a document
public void addDocument() throws Exception {
    IndexWriter indexWriter = this.getIndexWriter();
    Document document = new Document();
    TextField fileNameField = new TextField("name", "测试文档.txt", Store.YES);
    StoredField pathField = new StoredField("path", "c:\\temp");
    LongField sizeField = new LongField("size", 1111L, Store.YES);
    document.add(fileNameField);
    document.add(pathField);
    document.add(sizeField);
    indexWriter.addDocument(document);
    indexWriter.close();
}

// Delete the documents that match a query
public void deleteByQuery() throws Exception {
    IndexWriter indexWriter = this.getIndexWriter();
    Query query = new TermQuery(new Term("name", "测试"));
    indexWriter.deleteDocuments(query);
    indexWriter.close();
}

// Delete all documents
public void deleteAll() throws Exception {
    IndexWriter indexWriter = this.getIndexWriter();
    indexWriter.deleteAll();
    indexWriter.close();
}

// Update a document:
// the documents containing the term are deleted first, then the new document is added
public void update() throws Exception {
    IndexWriter indexWriter = this.getIndexWriter();
    Term term = new Term("name", "spring");
    Document doc = new Document();
    TextField fileNameField = new TextField("name", "测试文档111.txt", Store.YES);
    StoredField pathField = new StoredField("path", "c:\\temp");
    LongField sizeField = new LongField("size", 1111L, Store.YES);
    doc.add(fileNameField);
    doc.add(pathField);
    doc.add(sizeField);
    indexWriter.updateDocument(term, doc);
    indexWriter.close();
}

// Create the index writer
private IndexWriter getIndexWriter() throws IOException {
    // Where the index is kept: it can live in memory or on disk; it is usually kept on disk.
    Directory directory = FSDirectory.open(new File("E:\\index"));
    // Create the IK analyzer, which supports both Chinese and English
    Analyzer analyzer = new IKAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    // Create the index writer
    IndexWriter indexWriter = new IndexWriter(directory, config);
    return indexWriter;
}
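
One point about the update() method is easy to miss: `IndexWriter.updateDocument(term, doc)` deletes every document that contains the term, not just one, and then adds the new document. The plain-Java sketch below imitates that delete-then-add behavior on an in-memory list; `UpdateSemantics` and its data are hypothetical and only illustrate the semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Imitates delete-then-add update semantics on an in-memory document list. */
final class UpdateSemantics {
    // Each document is represented only by the set of terms in its "name" field.
    private final List<Set<String>> docs = new ArrayList<>();

    void add(String... nameTerms) {
        docs.add(new HashSet<>(Arrays.asList(nameTerms)));
    }

    /** Removes ALL documents containing the term, adds the replacement, returns the delete count. */
    int update(String term, String... replacementTerms) {
        int before = docs.size();
        docs.removeIf(doc -> doc.contains(term));
        int deleted = before - docs.size();
        add(replacementTerms);
        return deleted;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSemantics store = new UpdateSemantics();
        store.add("spring", "mvc");
        store.add("spring", "boot");
        store.add("lucene");
        System.out.println(store.update("spring", "test", "doc")); // prints 2
        System.out.println(store.size());                           // prints 2
    }
}
```

If exactly one document should be replaced, the term has to identify it uniquely, for example an untokenized ID field.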