Lucene Notes

流年，回不去的时光

Article link: https://blog.youkuaiyun.com/weixin_44683239/article/details/100119986

How Lucene Works

I. Learning Objectives

1. What Lucene is

2. Where Lucene is used

3. Indexing algorithms

4. How Lucene works

5. Notes

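II. What Is Lucene

1. Lucene is a full-text search toolkit from Apache, distributed as a set of JAR files. It can be used to build a search engine system similar to Google or Baidu.
2. Doug Cutting released the first version of Lucene around 2000 and later donated it to the Apache Software Foundation. He is also the founder of Hadoop (in the big data field) and other projects.
3. Lucene is the underlying engine of Solr and Elasticsearch, so understanding how Lucene works also helps in understanding them.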

III. Where Lucene Is Used

Internet search: Google, Baidu, Bing
Site search: Taobao, JD.com, on-site forums

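IV. Common Algorithms

Sequential scan
	Description: take the keyword and compare it against the records one by one, matching character by character, until a match is found
	Drawback: queries are slow, and performance drops noticeably as the data volume grows
	Advantage: high accuracy
	Example: a LIKE query in a database
Full-text search (inverted index)
	Description: read all the content from the database, split it into terms, and build an index (a table of contents) from those terms. The content itself goes into document objects, and the index together with the documents forms the index directory. At search time, look up the term in the index first; each index entry points to the documents that contain it, so the matching documents can be located and returned quickly. This is the inverted index algorithm.
	Drawback: trades space for time (the index takes extra storage)
	Advantage: queries are fast, and performance does not drop noticeably as the data volume grows
	Example: a dictionary. The radicals of all characters are extracted into a table of contents, and each entry points to the pages that follow, so the details of a character can be found quickly through the table of contents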
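The contrast between the two approaches can be sketched in plain Java, without Lucene. This is only an illustration under simplifying assumptions: the class name `InvertedIndexSketch`, its methods, and the three sample sentences are made up for this example, and terms are split on non-word characters, which works for English text but not for Chinese.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    /** Sequential scan: test every document in turn, like SQL LIKE '%keyword%'. */
    public static List<Integer> sequentialScan(List<String> docs, String keyword) {
        List<Integer> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(id);
            }
        }
        return hits;
    }

    /** Build the inverted index: term -> sorted set of ids of documents containing it. */
    public static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Search: a single lookup in the index, independent of the number of documents. */
    public static List<Integer> search(Map<String, Set<Integer>> index, String keyword) {
        Set<Integer> ids = index.getOrDefault(
                keyword.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Hypothetical sample data; document ids are the list positions 0, 1, 2.
        List<String> docs = Arrays.asList(
                "Java programming basics",
                "Lucene in Action covers search with Java",
                "Cooking for beginners");
        Map<String, Set<Integer>> index = buildIndex(docs);
        System.out.println(sequentialScan(docs, "java")); // prints [0, 1]
        System.out.println(search(index, "java"));        // prints [0, 1]
        System.out.println(search(index, "lucene"));      // prints [1]
    }
}
```

The index has to be built and stored up front, which is the space cost mentioned above; in return each search touches only the entry for the requested term. Note that the scan matches substrings while the index matches whole terms, so the two can differ on inputs such as "javascript".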

V. How Lucene Works

[Figure: image not preserved in the source]

VI. Code Implementation

1. Add the dependencies

<dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.9</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
        </dependency>
        <!-- IK Chinese analyzer -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.10.3</version>
        </dependency>
    </dependencies>

2. Prepare the data

package com.itheima.dao;

import com.itheima.domain.Book;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/**
 * @author 黑马程序员
 * @Company http://www.ithiema.com
 * @Version 1.0
 */
public class BookDao {

    /**
     * Query all books.
     * @return all rows of the book table; an empty list if the query fails
     */
    public List<Book> findAll() {
        List<Book> bookList = new ArrayList<>();

        // 1. Register the driver
        try {
            Class.forName("com.mysql.jdbc.Driver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        String url = "jdbc:mysql://localhost:3306/lucene_331";
        String username = "root";
        String password = "root";
        // 2. SQL statement
        String sql = "select * from book";
        // 3. Open the connection, prepare the statement, and run the query.
        //    try-with-resources closes rs, pst, and conn automatically, in reverse order.
        try (Connection conn = DriverManager.getConnection(url, username, password);
             PreparedStatement pst = conn.prepareStatement(sql);
             ResultSet rs = pst.executeQuery()) {
            // 4. Process the result set: each row becomes one Book
            while (rs.next()) {
                Book book = new Book();
                book.setId(rs.getInt("id"));
                book.setName(rs.getString("name"));
                book.setPic(rs.getString("pic"));
                book.setPrice(rs.getDouble("price"));
                book.setDescription(rs.getString("description"));
                // Add the book to the list
                bookList.add(book);
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
        // 5. Return the result (never null, so callers can iterate over it safely)
        return bookList;
    }
}

3. Create the index

package com.itheima;

import com.itheima.dao.BookDao;
import com.itheima.domain.Book;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/**
 * Create the index.
 *
 * @author 黑马程序员
 * @Company http://www.ithiema.com
 * @Version 1.0
 */
public class CreateIndex {

    @Test
    public void test() throws Exception {
        // 1. Create the analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // 2. Specify the location of the index directory
        FSDirectory directory = FSDirectory.open(new File("f:/dic"));
        // 3. Query all the content
        BookDao bookDao = new BookDao();
        List<Book> bookList = bookDao.findAll();
        // 4. Put the content into document objects
        List<Document> docList = new ArrayList<>();
        // One record maps to one Book, and one Book maps to one document
        for (Book book : bookList) {
            Document doc = new Document();
            // One column maps to one Book property, and one property maps to one field
            /*
             * Argument 1: field name
             * Argument 2: field value
             * Argument 3: whether to store the value (Store.YES for now)
             */
            TextField idField = new TextField("id", String.valueOf(book.getId()), Field.Store.YES);
            TextField nameField = new TextField("name", book.getName(), Field.Store.YES);
            TextField picField = new TextField("pic", book.getPic(), Field.Store.YES);
            TextField priceField = new TextField("price", String.valueOf(book.getPrice()), Field.Store.YES);
            TextField descriptionField = new TextField("description", book.getDescription(), Field.Store.YES);
            // Add all the fields to the document
            doc.add(idField);
            doc.add(nameField);
            doc.add(picField);
            doc.add(priceField);
            doc.add(descriptionField);
            // Add the document to the list
            docList.add(doc);
        }
        // 5. Create the index writer
        // Index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        /*
         * Argument 1: location of the index directory
         * Argument 2: index writer configuration
         */
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // 6. Write the documents to the index directory
        for (Document document : docList) {
            indexWriter.addDocument(document);
        }
        // 7. Commit
        indexWriter.commit();
        // 8. Close the writer
        indexWriter.close();
    }
}

4. Inspect the index with a GUI tool (Luke)

java -jar lukeall-4.10.3.jar

5. Search the index

package com.itheima;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

import java.io.File;

/**
 * Search the index.
 *
 * @author 黑马程序员
 * @Company http://www.ithiema.com
 * @Version 1.0
 */
public class SearchIndex {

    @Test
    public void test() throws Exception {
        // 1. Location of the index directory
        FSDirectory directory = FSDirectory.open(new File("f:/dic"));
        // 2. Create the analyzer. Indexing and searching must use the same kind of analyzer.
        Analyzer analyzer = new StandardAnalyzer();
        // 3. Create the searcher
        // Open a reader on the index directory (IndexReader.open is deprecated in 4.x)
        IndexReader reader = DirectoryReader.open(directory);
        // Argument: the index reader
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        // 4. Build the query
        /*
         * Argument 1: default field, used when the query string does not name a field;
         *             a field named in the query string takes precedence
         * Argument 2: analyzer
         */
        QueryParser queryParser = new QueryParser("description", analyzer);
        // Parse the query string into a Query object
        Query query = queryParser.parse("java");
        /*
         * Argument 1: the query
         * Argument 2: maximum number of hits to return
         * Return value: the top-scoring documents
         */
        TopDocs topDocs = indexSearcher.search(query, 2);
        // 5. Show the results
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Internal document number of this hit
            int docId = scoreDoc.doc;
            // Load the stored document by its number
            Document doc = indexSearcher.doc(docId);
            // Read the stored field values
            System.out.println("id: " + doc.get("id"));
            System.out.println("name: " + doc.get("name"));
            System.out.println("pic: " + doc.get("pic"));
            System.out.println("price: " + doc.get("price"));
            System.out.println("description: " + doc.get("description"));
        }
        // 6. Release resources
        reader.close();
    }
}

6. Delete from the index

package com.itheima;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;

/**
 * Delete from the index.
 *
 * @author 黑马程序员
 * @Company http://www.ithiema.com
 * @Version 1.0
 */
public class DeleteIndex {

    @Test
    public void test() throws Exception {
        // 1. Location of the index directory
        FSDirectory directory = FSDirectory.open(new File("f:/dic"));
        // 2. Analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // 3. The term that selects the documents to delete
        Term term = new Term("id", "1");
        // 4. Index writer
        // Index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // 5. Delete
        // Delete by term: removes the matching documents only; the index terms are kept
//        indexWriter.deleteDocuments(term);
        // Delete everything: removes both the index terms and the documents
        indexWriter.deleteAll();
        // 6. Commit
        indexWriter.commit();
        // 7. Release resources
        indexWriter.close();
    }
}

7. Update the index

package com.itheima;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;

/**
 * Update the index.
 *
 * @author 黑马程序员
 * @Company http://www.ithiema.com
 * @Version 1.0
 */
public class UpdateIndex {

    @Test
    public void test() throws Exception {
        // 1. Location of the index directory
        FSDirectory directory = FSDirectory.open(new File("f:/dic"));
        // 2. Analyzer
        Analyzer analyzer = new StandardAnalyzer();
        // 3. The term that selects the documents to replace
        Term term = new Term("id", "2");
        // 4. Index writer
        // Index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        // 5. Update
        // Create the replacement document
        Document document = new Document();
        // Create the fields
        TextField idField = new TextField("id", String.valueOf(6), Field.Store.YES);
        TextField nameField = new TextField("name", "水浒传", Field.Store.YES);
        TextField picField = new TextField("pic", "asdfasdfasdf.jpg", Field.Store.YES);
        TextField priceField = new TextField("price", String.valueOf(18.0), Field.Store.YES);
        TextField descriptionField = new TextField("description", "108好汉上梁山替天行道", Field.Store.YES);
        // Add the fields to the document
        document.add(idField);
        document.add(nameField);
        document.add(picField);
        document.add(priceField);
        document.add(descriptionField);
        // Update: deletes the documents matching the term (the index terms are kept),
        // then adds the new document and indexes it
        indexWriter.updateDocument(term, document);
        // 6. Commit
        indexWriter.commit();
        // 7. Release resources
        indexWriter.close();
    }
}
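
The delete-then-add behavior of updateDocument can be modeled with a toy in-memory store in plain Java. This is a rough sketch rather than Lucene itself: the class `UpdateSemanticsSketch`, its methods, and the sample book names are hypothetical, and a document matches when its field value equals the given value exactly, which stands in for matching a Term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpdateSemanticsSketch {

    // Each document is a map of field name -> value, kept in insertion order.
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    /** Remove every document whose field equals value; returns how many were removed. */
    public int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(doc -> value.equals(doc.get(field)));
        return before - docs.size();
    }

    /** Update = delete all documents matching (field, value), then add the new document. */
    public void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    /** Ids of the documents currently in the store, in insertion order. */
    public List<String> ids() {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            ids.add(doc.get("id"));
        }
        return ids;
    }

    /** Helper that builds a two-field document. */
    public static Map<String, String> book(String id, String name) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("name", name);
        return doc;
    }

    public static void main(String[] args) {
        UpdateSemanticsSketch index = new UpdateSemanticsSketch();
        index.addDocument(book("1", "Java Basics"));   // hypothetical sample data
        index.addDocument(book("2", "Lucene Guide"));  // hypothetical sample data
        // Same shape as the example above: select by id 2, replace with a document whose id is 6.
        index.updateDocument("id", "2", book("6", "水浒传"));
        System.out.println(index.ids()); // prints [1, 6]
    }
}
```

The selecting term and the new document are independent of each other: the document with id 2 disappears and a document with id 6 appears. If no document matches, the update simply adds the new document.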

VII. Field Types

[Figure: image not preserved in the source]

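VIII. Notes (Glossary)

1. Tokenization: splitting the content into terms, removing unimportant words such as a, an, the, is, 的, 地, 得, 啊, removing whitespace, and converting uppercase letters to lowercase.
2. Document: one record corresponds to one document.
3. Index directory: a folder on disk that stores the index and the documents.
4. Choosing field options
	Tokenize? Tokenization serves indexing. Tokenize a field if its individual terms are meaningful; otherwise do not.
		Yes: tokenize
		Examples: name, description, price
		No: do not tokenize
		Examples: id, pic
	Index? Whether the field needs to be searchable.
		Yes: the field is searched
		Examples: name, price, description, id
		No: the field is never searched
		Examples: pic
	Store? Whether the value is stored in the index directory. Store it if the results page needs to display it; otherwise do not.
		Yes: displayed
		Examples: name, pic, price, id
		No: not displayed
		Examples: description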
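5. Note: a field that needs range search must be tokenized and indexed; in Lucene 4.x the numeric field classes such as DoubleField do this. The course material also requires the field to be stored, although strictly storing is only needed when the value has to be displayed.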
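6. Note: the description is usually not stored because it is too large. If it is needed, take the id value from the hidden field and fetch the description from the database via JDBC.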
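
The tokenization step described in note 1 can be sketched in plain Java. This is a simplified stand-in for what an analyzer does, not Lucene's StandardAnalyzer: the class `AnalysisSketch`, the method `analyze`, and the four-word stop list are assumptions made for this example. It handles English only; segmenting Chinese text needs a dictionary-based analyzer such as IK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalysisSketch {

    // A tiny stop-word list taken from the examples in note 1 (English only).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is"));

    /** Split on whitespace and punctuation, lowercase each token, drop stop words. */
    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[\\s\\p{Punct}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Lucene library is a Java toolkit."));
        // prints [lucene, library, java, toolkit]
    }
}
```

Because the same processing is applied to the query string, a search for "JAVA" or "java" ends up looking for the same term, which is why indexing and searching should use the same kind of analyzer.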