Lucene

Latest recommended article published on 2024-09-12 05:00:00

大阔龙

License: CC 4.0 BY-SA

Category: hadoop学习笔记 (Hadoop study notes)

Article link: https://blog.youkuaiyun.com/wang11yangyang/article/details/73994863

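This article introduces the principles of the Lucene full-text search system and how to use it. It explains how Lucene uses an inverted index to make search efficient, and how compression algorithms and binary search speed up retrieval. It also provides concrete code examples for creating an index and searching it.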

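Introduction to Lucene

Lucene is a full-text search framework, not an application product. Unlike http://www.baidu.com/ or Google Desktop, it cannot be used out of the box; it only provides the tools with which you can build such products.
In other words, Lucene is an indexing and search library, not a complete program.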

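Inverted index

Why Lucene is fast:
compression algorithms
binary search

Inverted index: records are looked up by the value of an attribute. Each entry in the index table holds one attribute value together with the addresses of all records that have that value. Because the attribute value determines the location of the records, rather than the record determining the attribute value, this structure is called an inverted index. In full-text search, the attribute values are the terms and the records are the documents.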
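
To make the definition above concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy and does not reflect Lucene's internal file format: the class name `InvertedIndexSketch`, the splitting rule, and the sample sentences are all assumptions made for this illustration. Terms are kept sorted so that a lookup can use binary search, which is one of the speed-ups mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: maps each term to the IDs of the records that contain it.
// Illustration only; this is not how Lucene stores its index internally.
class InvertedIndexSketch {
    // TreeMap keeps the terms sorted, so the term dictionary can be binary-searched.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Index one record: lowercase it and split on anything that is not a letter or digit.
    // Assumes records are added in increasing docId order, so each ID list stays sorted.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) continue;
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each docId once per term, even if the term repeats in the text.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
        }
    }

    // Look a term up by binary search over the sorted term dictionary.
    // The array is rebuilt on every call for brevity; a real index would keep it.
    List<Integer> lookup(String term) {
        String[] terms = postings.keySet().toArray(new String[0]);
        int pos = Arrays.binarySearch(terms, term.toLowerCase(Locale.ROOT));
        return pos >= 0 ? postings.get(terms[pos]) : Collections.<Integer>emptyList();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search library");
        index.add(1, "An inverted index maps terms to records");
        index.add(2, "Lucene builds an inverted index");
        System.out.println(index.lookup("lucene"));   // [0, 2]
        System.out.println(index.lookup("inverted")); // [1, 2]
        System.out.println(index.lookup("hadoop"));   // []
    }
}
```

The point of the structure is that a query for a term never scans the records themselves: it finds the term, then reads off the list of record IDs.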

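How Lucene works

The service Lucene provides has two parts: one in, one out. "In" means writing: adding the source you provide (essentially a string) to the index, or deleting it from the index. "Out" means reading: offering full-text search so that users can locate the source by keyword.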

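Write flow

The source string is first processed by an analyzer, which tokenizes it into individual words and optionally removes stopwords.
The required information from the source is added to the Fields of a Document; Fields that need to be indexed are indexed, and Fields that need to be stored are stored.
The index is written to storage, which can be memory or disk.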
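
The first step of the write flow can be sketched in plain Java as follows. This is only a rough stand-in for what an analyzer does; a real Lucene `Analyzer` is considerably more sophisticated. The class name `AnalyzerSketch`, the splitting rule, and the stopword list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for the analysis step: tokenize, lowercase, optionally drop stopwords.
class AnalyzerSketch {
    // Hypothetical stopword list chosen for this example.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to"));

    static List<String> analyze(String source, boolean removeStopwords) {
        List<String> tokens = new ArrayList<>();
        // Split on anything that is not a letter or digit.
        for (String token : source.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            if (removeStopwords && STOPWORDS.contains(token)) continue;
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "Lucene is a full-text search library";
        System.out.println(analyze(source, false)); // [lucene, is, a, full, text, search, library]
        System.out.println(analyze(source, true));  // [lucene, full, text, search, library]
    }
}
```

The same processing has to be applied to the search keywords at query time; otherwise the terms in the query would not match the terms in the index.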

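Read flow

The user provides search keywords, which are processed by the analyzer.
The index is searched with the processed keywords to find the matching Documents.
The user extracts the Fields they need from the Documents found.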
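
Document
The sources a user provides are individual records: a text file, a string, a row of a database table, and so on. Once a record has been indexed, it is stored in the index files as a Document. Search results are also returned as a list of Documents.

Field
A Document can contain several pieces of information. An article, for example, may have a "title", a "body", and a "last modified time". Each of these is stored in the Document as a Field.

A Field has two optional attributes: stored and indexed. The stored attribute controls whether the Field's value is stored in the index; the indexed attribute controls whether the Field can be searched. This may sound trivial, but choosing the right combination of the two attributes matters.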
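
The following plain-Java sketch models what the two attributes mean in practice. It is a toy, not Lucene's API: the class `FieldOptionsSketch`, its methods, and the sample values (including the path) are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of the "stored" and "indexed" Field attributes; not Lucene's API.
class FieldOptionsSketch {
    // One map per document: field name -> value, kept only for stored fields.
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // "field:term" -> IDs of the documents whose indexed field contains that term.
    private final Map<String, List<Integer>> index = new HashMap<>();

    // Start a new document and return its ID.
    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String name, String value, boolean stored, boolean indexed) {
        if (stored) storedFields.get(docId).put(name, value);
        if (indexed) {
            for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (term.isEmpty()) continue;
                List<Integer> ids = index.computeIfAbsent(name + ":" + term, k -> new ArrayList<>());
                if (!ids.contains(docId)) ids.add(docId);
            }
        }
    }

    // Only indexed fields can be searched.
    List<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return index.getOrDefault(key, Collections.<Integer>emptyList());
    }

    // Only stored fields can be read back; returns null for anything else.
    String get(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        FieldOptionsSketch sketch = new FieldOptionsSketch();
        int doc = sketch.newDocument();
        sketch.addField(doc, "title", "Lucene notes", true, true);       // stored and indexed
        sketch.addField(doc, "body", "long article text", false, true);  // indexed only
        sketch.addField(doc, "path", "/data/notes.txt", true, false);    // stored only

        System.out.println(sketch.search("body", "article"));         // [0]
        System.out.println(sketch.get(doc, "body"));                  // null
        System.out.println(sketch.search("path", "/data/notes.txt")); // []
        System.out.println(sketch.get(doc, "path"));                  // /data/notes.txt
    }
}
```

A large body text is a typical candidate for "indexed but not stored": it must be searchable, but keeping a full copy in the index makes the index bigger. A value that is only displayed with the results is a candidate for "stored but not indexed".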

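Using Lucene

1. Add the Lucene jar packages to the project. The code below uses the Lucene 4.9 API (Version.LUCENE_4_9), so it needs lucene-core, lucene-analyzers-common and lucene-queryparser, plus Apache Commons IO for FileUtils.

2. The code

CreateIndex.java: writing the index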

package com.sxt.lucene;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.IOException;

/**
 * Created by kakaxi on 2017/6/29.
 */
public class CreateIndex {
    public static final String indexDir="E:\\index";
    public static final String dataDir="E:\\data";

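    public static void createIndex(){
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_9,analyzer);
        // Create a new index, or append to the one that already exists.
        conf.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

        // try-with-resources closes the writer and the directory even if indexing fails.
        try (Directory dir = FSDirectory.open(new File(indexDir));
             IndexWriter writer = new IndexWriter(dir,conf)) {
            File[] sourceFiles = new File(dataDir).listFiles();
            if (sourceFiles == null) {
                // listFiles() returns null when dataDir is missing or is not a directory.
                System.err.println("Cannot list files in " + dataDir);
                return;
            }
            for(File file : sourceFiles){
                if (!file.isFile()) {
                    continue; // skip subdirectories
                }
                Document fileDOC = new Document();
                // StringField: indexed as a single token, not analyzed.
                fileDOC.add(new StringField("filename",file.getName(), Field.Store.YES));
                // TextField: analyzed and indexed for full-text search.
                // Assumption: the source files are UTF-8; adjust the charset if they are not.
                fileDOC.add(new TextField("content",FileUtils.readFileToString(file, "UTF-8") ,Field.Store.YES));
                fileDOC.add(new LongField("lastModify",file.lastModified(),Field.Store.YES));
                writer.addDocument(fileDOC);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }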

    public static void main(String[] args) {
        createIndex();
    }
}

SearchIndex.java: reading the index

package com.sxt.lucene;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;

/**
 * Created by kakaxi on 2017/6/30.
 */
public class SearchIndex {
    public static final String indexDir="E:\\index";
    public static void search(){
        // try-with-resources closes the reader and the directory when the search is done.
        try (Directory dir = FSDirectory.open(new File(indexDir));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Parse the query with the same analyzer that was used at index time.
            QueryParser qp = new QueryParser(Version.LUCENE_4_9, "content", new StandardAnalyzer(Version.LUCENE_4_9));
            Query query = qp.parse("from");
            // Fetch the 10 highest-scoring matches.
            TopDocs topDocs = searcher.search(query, 10);
            for(ScoreDoc scoreDoc : topDocs.scoreDocs){
                // Load the stored fields of the matching document by its internal ID.
                Document document = reader.document(scoreDoc.doc);
                System.out.println(document.get("filename"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        search();
    }
}