Building a Search Engine with Lucene 2.3

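This article shows how to implement full-text search with Lucene: creating an index, tokenizing Chinese text with the Paoding (庖丁解牛) Chinese analyzer, searching the index efficiently, and using the highlighter that ships with Lucene to improve the user experience.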

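Lucene is not a complete full-text search application. It is a full-text indexing engine toolkit written in Java that can be embedded easily into all kinds of applications to give them full-text indexing and search.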

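About the author: Lucene's contributor, Doug Cutting, is a veteran full-text indexing and search expert. He was a principal developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system), later worked as a senior system architect at Excite, and is currently doing research on low-level Internet infrastructure. His goal in contributing Lucene was to bring full-text search to all kinds of small and medium-sized applications.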

History: Lucene was first published on the author's own site, www.lucene.com, then on SourceForge, and at the end of 2001 it became a subproject of the Apache Software Foundation's Jakarta project: http://jakarta.apache.org/lucene/

Many Java projects already use Lucene as their back-end full-text indexing engine.

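I. Getting Started
First download the Lucene 2.3.0 package from Apache. It contains the core jar and the Lucene API documentation. After unpacking it, put lucene-core-2.3.0.jar on the classpath.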

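II. Creating the Index

To create an index you specify the directory where the index will be stored (later searches run against the index in this directory) and, if you are indexing files, the directory that holds the files. The code is as follows: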

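// Imports used by the indexing methods in this section
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;
import net.paoding.analysis.analyzer.PaodingAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.poi.hwpf.extractor.WordExtractor;

public void createIndex() throws Exception {
    // Directory where the index is stored
    File indexDir = new File("D://luceneIndex");
    // Directory containing the files to index
    File dataDir = new File("D://test");
    // PaodingAnalyzer is the analyzer of the Paoding Chinese word segmenter. It extends
    // Lucene's Analyzer class and helps a great deal when searching Chinese text.
    Analyzer luceneAnalyzer = new PaodingAnalyzer();
    File[] dataFiles = dataDir.listFiles();
    // Create a new index only when the index directory is missing or empty.
    // listFiles() returns null for a missing directory, so check for that first.
    File[] indexFiles = indexDir.listFiles();
    boolean createNew = (indexFiles == null || indexFiles.length == 0);
    // The third argument is a boolean: true creates a new index,
    // false operates on the existing index.
    IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, createNew);
    long startTime = new Date().getTime();
    this.doIndex(dataFiles, indexWriter);
    indexWriter.optimize(); // optimize the index
    indexWriter.close();    // close the index
    long endTime = new Date().getTime();
    System.out.println("It takes " + (endTime - startTime)
            + " milliseconds to create index for the files in directory " + dataDir.getPath());
}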

private void doIndex(File[] dataFiles, IndexWriter indexWriter) throws Exception {
    if (dataFiles == null) {
        return; // listFiles() returns null when a directory cannot be read
    }
    for (int i = 0; i < dataFiles.length; i++) {
        if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".html")) { // index all HTML files
            System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
            Reader txtReader = new FileReader(dataFiles[i]);
            Document document = new Document();
            // Field.Store.YES: stored; Field.Store.NO: not stored
            // Field.Index.TOKENIZED: tokenized; Field.Index.UN_TOKENIZED: not tokenized
            document.add(new Field("path", dataFiles[i].getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            document.add(new Field("filename", dataFiles[i].getName(), Field.Store.YES, Field.Index.TOKENIZED));
            // Another constructor takes a Reader. A Reader-based field is tokenized and
            // indexed but not stored, so get("contents") returns null for these documents.
            document.add(new Field("contents", txtReader));
            indexWriter.addDocument(document);
        } else if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".doc")) { // index all Word files
            FileInputStream in = new FileInputStream(dataFiles[i]); // open the file stream
            String str;
            try {
                WordExtractor extractor = new WordExtractor(in); // parse the Word file with POI
                str = extractor.getText(); // extracted text as a String
            } finally {
                in.close();
            }
            // Build a Document with three Fields: path, filename, contents
            Document document = new Document();
            document.add(new Field("path", dataFiles[i].getCanonicalPath(), Field.Store.YES,
                    Field.Index.UN_TOKENIZED));
            document.add(new Field("filename", dataFiles[i].getName(), Field.Store.YES, Field.Index.TOKENIZED));
            // Here contents is passed as a String: stored, tokenized, and indexed
            // with term vectors that record positions and offsets
            document.add(new Field("contents", str, Field.Store.YES, Field.Index.TOKENIZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            indexWriter.addDocument(document);
        } else if (dataFiles[i].isDirectory()) {
            doIndex(dataFiles[i].listFiles(), indexWriter); // recurse into subdirectories
        }
    }
}

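As the code above shows, creating an index for files (or, more generally, for data) is straightforward. First decide which folder to index, or which data in a database. Note that Lucene only accepts data and does not care where it comes from: anything you can convert to a String can be indexed. Then specify where the resulting index is stored. Finally, process the data yourself and build a Document object for each item; you decide how many Fields it contains and whether each Field is tokenized, and so on. With that, the index is created.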
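To make the tokenized versus un-tokenized choice concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class FieldTermsSketch, its method names, and the splitting rule (break on anything that is not a letter or digit, then lowercase) are assumptions made for illustration only; Lucene's analyzers, and Paoding in particular, segment text quite differently. The point is simply that an un-tokenized field yields one term equal to the whole value, while a tokenized field yields several terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustration only: a toy stand-in for how a field value becomes index terms.
class FieldTermsSketch {

    // Un-tokenized: the whole value is kept as a single term,
    // which suits exact values such as a file path.
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    // Tokenized: the value is split into several terms.
    // Hypothetical rule: split on non-letter/non-digit characters and lowercase.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<String>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String filename = "Project Test-Report.html";
        System.out.println(untokenizedTerms(filename)); // [Project Test-Report.html]
        System.out.println(tokenizedTerms(filename));   // [project, test, report, html]
    }
}
```

Under this toy rule, a search for "report" could match the tokenized filename, whereas the un-tokenized value would only match a query for the complete string.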

Note: to use the Paoding Chinese segmenter, put Paoding's dictionary (the dic folder) on the classpath, and put the paoding-analyzer.properties file on the classpath as well. The properties file contains:

paoding.imports = \
ifexists:classpath:paoding-analysis-default.properties;\
ifexists:classpath:paoding-analysis-user.properties;\
ifexists:classpath:paoding-knives-user.properties
paoding.dic.home = classpath:dic
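A properties value can span several lines only if every line but the last ends with a backslash. Assuming the import list above is written that way, as in a standard Java properties file, the sketch below loads the same text with java.util.Properties and splits the value on semicolons. The class name PaodingPropsSketch and the helpers load and importsOf are illustrative; how Paoding itself reads and interprets the file is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustration only: shows how backslash continuations join into one value.
class PaodingPropsSketch {

    // The same entries as above, assuming backslash line continuations.
    static final String TEXT =
            "paoding.imports = \\\n"
          + "ifexists:classpath:paoding-analysis-default.properties;\\\n"
          + "ifexists:classpath:paoding-analysis-user.properties;\\\n"
          + "ifexists:classpath:paoding-knives-user.properties\n"
          + "paoding.dic.home = classpath:dic\n";

    // Parse properties text held in memory.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Split the joined paoding.imports value into its individual entries.
    static String[] importsOf(Properties props) {
        return props.getProperty("paoding.imports", "").split(";");
    }

    public static void main(String[] args) {
        Properties props = load(TEXT);
        for (String entry : importsOf(props)) {
            System.out.println(entry);
        }
        System.out.println(props.getProperty("paoding.dic.home")); // classpath:dic
    }
}
```

If the backslashes are missing, each continuation line is parsed as a separate key instead of being appended to paoding.imports, which is an easy mistake to make when copying the file.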

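III. Searching

Now that index creation is covered, let's look at searching. When searching we do not need to care about the original data or files; we only care about the index that Lucene generated. The query must, however, be analyzed with the same analyzer that was used when the index was built.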
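Why the same analyzer matters can be shown with a toy inverted index in plain Java. This is not Lucene's implementation: ToyIndex, the Tokenizer interface, and both tokenizers below are hypothetical names invented for this sketch. Terms are stored in whatever form the index-time tokenizer produced, so a query term only matches if the query-time tokenizer produces the same form.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: a toy inverted index, not how Lucene stores its data.
class ToyIndex {

    interface Tokenizer {
        List<String> tokens(String text);
    }

    // Split on whitespace, optionally lowercasing each word.
    static List<String> split(String text, boolean lowercase) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(lowercase ? word.toLowerCase(Locale.ROOT) : word);
            }
        }
        return out;
    }

    static final Tokenizer LOWERCASE_WORDS = text -> split(text, true);
    static final Tokenizer RAW_WORDS = text -> split(text, false);

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Tokenizer indexTokenizer;

    ToyIndex(Tokenizer indexTokenizer) {
        this.indexTokenizer = indexTokenizer;
    }

    void add(int docId, String text) {
        for (String term : indexTokenizer.tokens(text)) {
            postings.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Returns the ids of documents containing every query term.
    // Only the postings map is consulted; the original texts are never read.
    Set<Integer> search(String query, Tokenizer queryTokenizer) {
        Set<Integer> result = null;
        for (String term : queryTokenizer.tokens(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(LOWERCASE_WORDS);
        index.add(1, "Lucene Project Notes");
        index.add(2, "Test Plan");
        System.out.println(index.search("Project", LOWERCASE_WORDS)); // [1]
        System.out.println(index.search("Project", RAW_WORDS));       // []
    }
}
```

The second query finds nothing because the index holds "project" while the mismatched tokenizer looks up "Project". With a Chinese segmenter the mismatch would be in word boundaries rather than letter case, but the effect is the same.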

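// Imports used by searchIndex(), in addition to Analyzer, PaodingAnalyzer, Document and File
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;

public void searchIndex() throws Exception {
    String contents = "项目"; // keyword to look for in the contents field
    String filename = "测试"; // keyword to look for in the filename field
    File indexDir = new File("D://luceneIndex"); // folder that holds the index
    FSDirectory directory = FSDirectory.getDirectory(indexDir);
    Searcher searcher = new IndexSearcher(directory);
    // Use the same analyzer that built the index
    Analyzer luceneAnalyzer = new PaodingAnalyzer();
    // Create two QueryParser objects that share luceneAnalyzer
    QueryParser parserContents = new QueryParser("contents", luceneAnalyzer);
    QueryParser parserFilename = new QueryParser("filename", luceneAnalyzer);
    Query query1 = parserContents.parse(contents);
    Query query2 = parserFilename.parse(filename);
    BooleanQuery query = new BooleanQuery();
    query.add(query1, BooleanClause.Occur.MUST);
    query.add(query2, BooleanClause.Occur.MUST);
    // The tag strings were lost from the original listing; a span that uses the
    // .highlight CSS class defined below is assumed here.
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span class=\"highlight\">", "</span>");
    Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(60));
    // The highlighter ships with Lucene: add lucene-highlighter-2.3.0.jar from
    // lucene-2.3.0\contrib\highlighter in the binary distribution to the classpath.
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        // pageContext presumably comes from the enclosing JSP tag handler.
        // ELFuncUtil.setStyle is not part of Lucene; it appears to be the author's
        // own helper for marking the keyword in the file name.
        this.pageContext.getOut().println("<a href='" + doc.get("path") + "'>"
                + ELFuncUtil.setStyle(doc.get("filename"), filename) + "</a> ");
        // contents is only stored for documents indexed from a String (the .doc files)
        String text = doc.get("contents");
        if (text != null) {
            TokenStream tokenStream = luceneAnalyzer.tokenStream("contents", new StringReader(text));
            String str = highlighter.getBestFragment(tokenStream, text);
            if (str != null) { // null when no query term occurs in the text
                this.pageContext.getOut().println(str + "...");
            }
        }
        this.pageContext.getOut().println(" <hr> ");
    }
    searcher.close();
}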

<style>
.highlight {
    background: yellow;
    color: #CC0033;
}
</style>
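The effect of highlighting can be imitated with plain string handling, which shows how the .highlight style ends up applied to matched text. This sketch is not Lucene's Highlighter: HighlightSketch and its highlight method are hypothetical, it matches a literal keyword rather than analyzed query terms, and it does no fragment selection or HTML escaping.

```java
// Illustration only: wraps literal keyword matches in a span using the .highlight class.
class HighlightSketch {

    static String highlight(String text, String keyword) {
        if (keyword.isEmpty()) {
            return text; // nothing to mark
        }
        String marked = "<span class=\"highlight\">" + keyword + "</span>";
        // String.replace treats both arguments literally (no regular expressions)
        return text.replace(keyword, marked);
    }

    public static void main(String[] args) {
        System.out.println(highlight("full text search", "text"));
        // full <span class="highlight">text</span> search
    }
}
```

In a real page the surrounding text should be HTML-escaped before the span tags are inserted; that step is left out here to keep the sketch short.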

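This way, outside requests search the index directly through Lucene and never touch the actual files, which greatly improves efficiency. Add a little styling to the page and a search engine built with Lucene is complete.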

This article is reposted from tony_action's 51CTO blog. Original link: http://blog.51cto.com/tonyaction/62451 (to repost it, please contact the original author).