Lucene Search Engine Configuration

Latest recommended article published on 2019-01-17 20:13:00

Liu-

Category: Big Data

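Lucene is a Java-based full-text indexing toolkit that can easily be embedded in an application to give it full-text indexing and search. To set it up, download the corresponding JAR from http://jakarta.apache.org/lucene/ (the project has since moved to lucene.apache.org) and add it to your project.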

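Lucene works in two phases. First, build an index over the files you will later search; the files can be of any kind, as long as they can be converted into Document objects. Then, at query time, turn the user's query string into a corresponding query object.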
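The two phases can be pictured with a toy inverted index in plain Java. This is only a sketch of the idea, not how Lucene is implemented: the class `TinyIndex`, its methods, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's actual data structure.
class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored "path" value, so a hit can be shown to the user
    private final List<String> paths = new ArrayList<>();

    // Phase 1: index a document made of a stored path and searchable contents.
    int add(String path, String contents) {
        int docId = paths.size();
        paths.add(path);
        // Hypothetical tokenization: lowercase, split on whitespace.
        for (String token : contents.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Phase 2: look up one term and return the paths of the matching documents.
    List<String> search(String term) {
        List<String> result = new ArrayList<>();
        Set<Integer> docIds =
                postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
        for (int docId : docIds) {
            result.add(paths.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("a.txt", "Lucene is a full-text indexing toolkit");
        index.add("b.txt", "A toolkit written in Java");
        System.out.println(index.search("toolkit")); // prints [a.txt, b.txt]
        System.out.println(index.search("lucene"));  // prints [a.txt]
    }
}
```

The point of the split is that the expensive work (reading and tokenizing files) happens once at indexing time, so a query only needs a map lookup.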

1. Building the index

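To build an index, first convert the content to be searched into Document objects. A Document is comparable to a row in a database table, and each Field in a Document is comparable to one column value of that row. Once the content has been converted, use an IndexWriter to write each Document into the index. Close the IndexWriter when you are done: the index is only fully saved once the writer has been closed. Sample code: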

 
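    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import jeasy.analysis.MMAnalyzer; // third-party analyzer; package name assumed (je-analysis)

    File indexDir = new File("D://luceneIndex"); // where the index is stored; use a RAMDirectory instead to keep it in memory
    File dataDir = new File("D://luceneData");   // directory containing the files to index
    Analyzer luceneAnalyzer = new MMAnalyzer();  // analyzer that supports Chinese word segmentation
    File[] dataFiles = dataDir.listFiles();
    IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true); // true = create a new index
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < dataFiles.length; i++) {
        if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) { // index text files only
            System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
            Document document = new Document();
            Reader txtReader = new FileReader(dataFiles[i]); // reads the file contents
            // Store the file path as a single untokenized value
            document.add(new Field("path", dataFiles[i].getCanonicalPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Store any other information the same way
            document.add(new Field("url", "http://www.baidu.com", Field.Store.YES, Field.Index.UN_TOKENIZED));
            document.add(new Field("contents", txtReader)); // tokenized and indexed, but not stored
            indexWriter.addDocument(document); // one Document is like one record, one Field like one column
        }
    }
    indexWriter.optimize();
    indexWriter.close(); // the index is fully saved only after close()
    long endTime = System.currentTimeMillis();
    System.out.println("It takes " + (endTime - startTime)
            + " milliseconds to create index for the files in directory "
            + dataDir.getPath());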
 

(1) If you do not want to store the index on disk, create an in-memory directory (RAMDirectory) and construct the IndexWriter on top of it. Sample code:

 
    IndexWriter indexWriter = new IndexWriter(new RAMDirectory(), new StandardAnalyzer(), true); // create the index in a new in-memory directory
 

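StandardAnalyzer is the default analyzer shipped with Lucene, but it does not support Chinese word segmentation. For Chinese you need a different analyzer, such as ChineseAnalyzer or MMAnalyzer. Analyzers work on different principles: some use bigrams, pairing each character with the one that follows it, while others look words up in a dictionary. For example, bigram segmentation turns "北京天安门" into "北京 / 京天 / 天安 / 安门", whereas dictionary-based segmentation yields "北京 / 天安门". Bigram segmentation is simple but less accurate; dictionary-based segmentation is more accurate, but the dictionary takes up more space.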
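The difference between the two approaches can be sketched in a few lines of plain Java. The class and method names below are hypothetical, and the dictionary strategy shown (forward maximum matching) is just one common choice; it is not claimed to be what ChineseAnalyzer or MMAnalyzer actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: two toy segmenters, unrelated to any real Lucene analyzer.
class SegmentationSketch {
    // Bigram segmentation: every pair of adjacent characters becomes a token.
    // Works on UTF-16 code units, which is fine for common Chinese characters.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Dictionary segmentation by forward maximum matching: at each position take
    // the longest dictionary word, falling back to a single character.
    static List<String> dictionarySegment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--; // shrink the candidate until it is a known word or one character
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("北京", "天安门"));
        System.out.println(bigrams("北京天安门"));                          // [北京, 京天, 天安, 安门]
        System.out.println(dictionarySegment("北京天安门", dictionary, 3)); // [北京, 天安门]
    }
}
```

The bigram version needs no data at all, which is why it is simple, but it emits tokens such as "京天" that are not words. The dictionary version avoids those at the cost of shipping and loading a word list.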

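(2) If you do not want to create a new directory for the index files and would rather add to an existing index, call the addIndexes method on the IndexWriter of the existing index and pass it the directories to be merged in. For example: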

 
    IndexWriter.addIndexes(Directory[] dirs) // dirs: the directories holding the indexes to be added
 

2. Querying

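Querying only requires Lucene's IndexSearcher. Its search method does not take a plain string, however; it takes a Query object built from the string. The results come back in a Hits object. Sample code: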

 
    String queryStr = "工具包";
    File indexDir = new File("D://luceneIndex");
    if (!indexDir.exists()) { // check before trying to open the index
        System.out.println("The Lucene index does not exist");
        return;
    }
    FSDirectory directory = FSDirectory.getDirectory(indexDir); // open the index directory
    IndexSearcher searcher = new IndexSearcher(directory);
    // TermQuery matches one exact indexed term: the "contents" field must contain queryStr as a token
    Term term = new Term("contents", queryStr.toLowerCase());
    TermQuery luceneQuery = new TermQuery(term); // build the query from the term
    Hits hits = searcher.search(luceneQuery);
    System.out.println(hits.length()); // number of matching documents
    for (int i = 0; i < hits.length(); i++) {
        Document document = hits.doc(i);
        System.out.println("File: " + document.get("path"));
        System.out.println("Relevance: " + hits.score(i));
    }
    searcher.close();
 

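By default Lucene sorts the results by relevance, with the most relevant hits first. You can also specify an analyzer for the query string, so that the query itself is tokenized. Sample code: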

 
    QueryParser queryParser = new QueryParser("contents", new MMAnalyzer()); // default field and query analyzer
    Query query = queryParser.parse(queryStr); // tokenizes queryStr with the analyzer
    Hits hits = searcher.search(query);
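
The relevance ordering mentioned above can be pictured with a small plain-Java sketch. The scoring function here (the share of a document's tokens that equal the query term) is a deliberately crude stand-in and is not Lucene's scoring formula; `RankingSketch`, `Hit`, and the sample documents are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: ranks documents by a toy score, highest first.
class RankingSketch {
    static final class Hit {
        final String path;
        final double score;

        Hit(String path, double score) {
            this.path = path;
            this.score = score;
        }
    }

    // Toy score: occurrences of the term divided by the number of tokens.
    static double score(String contents, String term) {
        String[] tokens = contents.toLowerCase().split("\\s+");
        int matches = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                matches++;
            }
        }
        return (double) matches / tokens.length;
    }

    // Collect every document with a non-zero score, then sort by descending score.
    static List<Hit> search(Map<String, String> docs, String term) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            double s = score(entry.getValue(), term.toLowerCase());
            if (s > 0) {
                hits.add(new Hit(entry.getKey(), s));
            }
        }
        hits.sort((a, b) -> Double.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("b.txt", "lucene is a java toolkit for search");
        docs.put("a.txt", "lucene index lucene search");
        docs.put("c.txt", "nothing relevant here");
        for (Hit hit : search(docs, "Lucene")) {
            System.out.println(hit.path); // prints a.txt, then b.txt
        }
    }
}
```

Even though b.txt was added first, a.txt is listed first because a larger share of its tokens match the query term.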
 

Queries can be combined in many ways. TermQuery is only one kind of query; all of them are subclasses of Query. Others include BooleanQuery, ConstantScoreQuery, ConstantScoreRangeQuery, DisjunctionMaxQuery, FilteredQuery, MatchAllDocsQuery, MultiPhraseQuery, MultiTermQuery, PhraseQuery, PrefixQuery, RangeQuery, and SpanQuery.
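As a rough mental model of how such queries combine, the sketch below treats each term's result as a set of document ids. Intersection loosely corresponds to requiring every clause of a BooleanQuery, union to accepting any clause, and the prefix lookup to the idea behind PrefixQuery. The class, its methods, and the sample postings are assumptions for illustration and do not use Lucene's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: query combination expressed as set operations on document ids.
class BooleanCombinationSketch {
    // Documents that match both clauses (every clause required).
    static Set<Integer> and(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    // Documents that match at least one clause.
    static Set<Integer> or(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.addAll(right);
        return result;
    }

    // Documents containing any term that starts with the given prefix.
    static Set<Integer> prefix(Map<String, Set<Integer>> postings, String prefix) {
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().startsWith(prefix)) {
                result.addAll(entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: term -> ids of the documents containing it.
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("lucene", new TreeSet<>(Arrays.asList(1, 2)));
        postings.put("java", new TreeSet<>(Arrays.asList(2, 3)));
        postings.put("javascript", new TreeSet<>(Arrays.asList(4)));

        System.out.println(and(postings.get("lucene"), postings.get("java"))); // [2]
        System.out.println(or(postings.get("lucene"), postings.get("java")));  // [1, 2, 3]
        System.out.println(prefix(postings, "java"));                          // [2, 3, 4]
    }
}
```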

Reposted from: http://blog.youkuaiyun.com/gongyongxing/article/details/2952349