Apache Lucene Personal Project

不确定性确定你我

Published on 2025-01-11 12:13:06

Tags: apache lucene springboot java

Article link: https://blog.youkuaiyun.com/qq_51149892/article/details/145075387

A simple tutorial on using Apache Lucene to search your personal documents on your own computer.

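Environment Setup

Install the JDK
Lucene is written in Java, so a JDK is required. Download and install either Oracle JDK or OpenJDK. The code below calls Files.readString, which needs Java 11 or later.
Download Lucene
Download the latest binary release from the Lucene website. This step is optional if you use Maven as described below, because Maven fetches the Lucene JARs automatically.
Set up the development environment
Use an IDE with Java support, such as IntelliJ IDEA or Eclipse. Make sure the JDK and Maven are configured (Lucene's dependencies are managed through Maven).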

Project Setup

Create a Java project
Create a new Maven project.
Add the Lucene dependencies
Add the following dependencies to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>LATEST_VERSION</version> <!-- placeholder: replace with the latest release number -->
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>LATEST_VERSION</version> <!-- placeholder: use the same version as lucene-core -->
    </dependency>
</dependencies>

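Writing the Code

The following simple example shows how to index and search personal documents with Lucene.

Step 1: Create the index

Write a class that indexes the content of each document: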

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

import java.io.IOException;
import java.nio.file.*;

public class Indexer {
    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    public void close() throws IOException {
        writer.close();
    }

    public void indexFile(String filePath) throws IOException {
        Path path = Paths.get(filePath);
        Document doc = new Document();
        doc.add(new TextField("content", Files.readString(path), Field.Store.YES));
        doc.add(new StringField("path", path.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", Files.getLastModifiedTime(path).toMillis()));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        String indexDir = "index";  // where the index is stored
        String docsDir = "docs";   // where the documents live

        Indexer indexer = new Indexer(indexDir);
        Files.walk(Paths.get(docsDir)).filter(Files::isRegularFile).forEach(file -> {
            try {
                indexer.indexFile(file.toString());
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        indexer.close();
        System.out.println("Indexing completed.");
    }
}
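
Indexer.main walks the docs folder with Files.walk and indexes every regular file it finds. As a side note, the stream returned by Files.walk keeps directory handles open until it is closed, and a plain-text indexer may want to skip files it cannot read as text. The sketch below shows one way to collect the candidate files first, using try-with-resources and an extension filter. DocLister, listDocs, and extensionOf are hypothetical names introduced for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DocLister {
    // Returns every regular file under root whose extension is in the allowed set, sorted by path.
    static List<Path> listDocs(Path root, Set<String> extensions) throws IOException {
        // try-with-resources closes the directory stream once the list has been built
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> extensions.contains(extensionOf(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Lower-cased text after the last dot of the file name, or "" when there is no dot.
    static String extensionOf(Path path) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) throws IOException {
        Path docs = Files.createTempDirectory("docs");   // stand-in for the docs folder
        Files.writeString(docs.resolve("a.txt"), "hello lucene");
        Files.writeString(docs.resolve("b.TXT"), "upper-case extension");
        Files.writeString(docs.resolve("c.bin"), "skipped by the filter");
        listDocs(docs, Set.of("txt")).forEach(System.out::println);
    }
}
```

In Indexer.main, the resulting list could be passed to indexFile one path at a time in place of the inline Files.walk call.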

Step 2: Run a search

Create another class that executes the user's query:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void search(String indexDir, String queryStr) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse(queryStr);

        TopDocs results = searcher.search(query, 10);
        System.out.println("Total hits: " + results.totalHits);
        for (ScoreDoc scoreDoc : results.scoreDocs) {
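            // On newer Lucene releases, where IndexSearcher.doc(int) is deprecated,
            // searcher.storedFields().document(scoreDoc.doc) is the replacement.
            Document doc = searcher.doc(scoreDoc.doc);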
            System.out.println("Path: " + doc.get("path"));
            System.out.println("Score: " + scoreDoc.score);
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        String indexDir = "index";
        String query = "search term";  // replace with the actual query
        search(indexDir, query);
    }
}

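How to Use

Prepare the documents
Put the documents to be indexed (for example .txt files) into the docs folder.
Run the indexer
Run the main method of Indexer to generate the index files.
Search the documents
Run the main method of Searcher with the query keyword (for example "Lucene") set in the query variable, then check the search results.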
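
Searcher.main hard-codes the query string, so changing the keyword means editing the source. A minimal sketch of reading the keyword from the command line instead is shown below. QueryArgs and fromArgs are hypothetical names used only for illustration.

```java
public class QueryArgs {
    // Joins the command-line words into one query string;
    // falls back to the given default when no words are supplied.
    static String fromArgs(String[] args, String fallback) {
        String joined = String.join(" ", args).trim();
        return joined.isEmpty() ? fallback : joined;
    }

    public static void main(String[] args) {
        System.out.println("Query: " + fromArgs(args, "Lucene"));
    }
}
```

Under this assumption, Searcher.main could call search(indexDir, QueryArgs.fromArgs(args, "Lucene")) so that the program arguments become the query.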

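Extensions

Multiple file formats: use a library such as Apache Tika to extract the content of .pdf, .docx, and similar files.
Highlighting: use Lucene's highlighter module to show where the query terms matched.
Web interface: integrate the search into a web application, built with a framework such as Spring Boot.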

1. Support multiple file formats

Use Apache Tika to extract the content of .pdf and .docx files.

Step: add the Tika dependencies

Add the following dependencies to pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>LATEST_VERSION</version> <!-- placeholder: replace with the latest release number -->
</dependency>
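<dependency>
    <groupId>org.apache.tika</groupId>
    <!-- In Tika 2.x and later the bundled parsers are published as tika-parsers-standard-package -->
    <artifactId>tika-parsers</artifactId>
    <version>LATEST_VERSION</version> <!-- placeholder: use the same version as tika-core -->
</dependency>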

Modify the Indexer code:

Use Tika to extract the file content:

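import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.IOException;
import java.nio.file.*;

public class Indexer {
    private final IndexWriter writer;
    private final Tika tika = new Tika();

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    public void close() throws IOException {
        writer.close();
    }

    public void indexFile(String filePath) throws IOException {
        Path path = Paths.get(filePath);
        String content;
        try {
            // Tika detects the file type and returns the extracted plain text
            content = tika.parseToString(path.toFile());
        } catch (TikaException e) {
            e.printStackTrace();
            return; // skip the file if it cannot be parsed
        }
        Document doc = new Document();
        doc.add(new TextField("content", content, Field.Store.YES));
        doc.add(new StringField("path", path.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", Files.getLastModifiedTime(path).toMillis()));
        writer.addDocument(doc);
    }

    // The main method is unchanged from step 1.
}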

2. Highlighting

Use Lucene's highlighter module to emphasize the terms that matched the query.

Step: add the highlighter dependency

Add to pom.xml:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>LATEST_VERSION</version> <!-- placeholder: use the same version as lucene-core -->
</dependency>

Modify the Searcher code:

Add highlighting:

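import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class Searcher {
    public static void search(String indexDir, String queryStr) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse(queryStr);

        TopDocs results = searcher.search(query, 10);
        System.out.println("Total hits: " + results.totalHits);

        // Configure highlighting: matched terms are wrapped in <b>...</b>
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);

        for (ScoreDoc scoreDoc : results.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            String content = doc.get("content");
            // Re-analyze the stored text so the highlighter can locate the matching tokens
            TokenStream tokenStream = analyzer.tokenStream("content", content);
            // getBestFragment returns null when no fragment contains a match
            String highlighted = highlighter.getBestFragment(tokenStream, content);
            System.out.println("Path: " + doc.get("path"));
            System.out.println("Highlighted: " + highlighted);
            System.out.println("Score: " + scoreDoc.score);
        }
        reader.close();
    }

    // The main method is unchanged from step 2.
}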

3. Web interface

Use a static HTML page as a simple search interface.

HTML file: search.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document Search</title>
    <style>
        body { font-family: Arial, sans-serif; }
        .result { margin-bottom: 20px; }
        .highlight { background-color: yellow; }
    </style>
</head>
<body>
    <h1>Document Search</h1>
    <form id="searchForm">
        <input type="text" id="query" placeholder="Enter search term" required>
        <button type="submit">Search</button>
    </form>
    <div id="results"></div>

    <script>
        document.getElementById('searchForm').addEventListener('submit', async (event) => {
            event.preventDefault();
            const query = document.getElementById('query').value;
            const response = await fetch(`/search?query=${encodeURIComponent(query)}`);
            const results = await response.json();
            const resultsDiv = document.getElementById('results');
            resultsDiv.innerHTML = '';
            results.forEach(result => {
                const div = document.createElement('div');
                div.classList.add('result');
                div.innerHTML = `
                    <p><strong>Path:</strong> ${result.path}</p>
                    <p><strong>Highlighted:</strong> ${result.highlighted}</p>
                `;
                resultsDiv.appendChild(div);
            });
        });
    </script>
</body>
</html>
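
The page above writes result.path and result.highlighted into innerHTML, so any markup contained in a file name would be interpreted by the browser. One possible mitigation, sketched here as an assumption and not as part of the original design, is to escape the path on the server before it is serialized. HtmlEscaper and escape are hypothetical names. The highlighted fragment cannot be escaped this way after highlighting because it must keep its <b> tags.

```java
public class HtmlEscaper {
    // Replaces the characters that have special meaning in HTML with entities.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: docs/&lt;notes&gt; &amp; &quot;drafts&quot;.txt
        System.out.println(escape("docs/<notes> & \"drafts\".txt"));
    }
}
```

A server-side search method could then store HtmlEscaper.escape(doc.get("path")) in the result map in place of the raw path.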

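4. Simple HTTP service

Use the JDK's built-in Java HTTP server to expose the search as an endpoint. The code serializes results with Gson, so the com.google.code.gson:gson dependency also needs to be added to pom.xml.

Step: create an HTTP endpoint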

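import com.google.gson.Gson;
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class WebServer {
    private static final int PORT = 8080;

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
        // Serve the static page; assumes search.html sits in the working directory.
        server.createContext("/search.html", exchange ->
                send(exchange, 200, "text/html; charset=utf-8",
                        Files.readAllBytes(Paths.get("search.html"))));
        server.createContext("/search", exchange -> {
            if (!"GET".equals(exchange.getRequestMethod())) {
                send(exchange, 405, "text/plain; charset=utf-8", bytes("Method not allowed"));
                return;
            }
            String queryStr = queryParam(exchange.getRequestURI().getRawQuery());
            if (queryStr == null || queryStr.isBlank()) {
                send(exchange, 400, "text/plain; charset=utf-8", bytes("Missing 'query' parameter"));
                return;
            }
            try {
                String json = new Gson().toJson(search("index", queryStr));
                send(exchange, 200, "application/json; charset=utf-8", bytes(json));
            } catch (Exception e) {
                // search() throws checked exceptions that HttpHandler cannot propagate
                e.printStackTrace();
                send(exchange, 500, "text/plain; charset=utf-8", bytes("Search failed"));
            }
        });
        server.setExecutor(null);
        server.start();
        System.out.println("Server started on port " + PORT);
    }

    private static byte[] bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Extracts and URL-decodes the value of the "query" parameter, or returns null if absent.
    private static String queryParam(String rawQuery) {
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("query")) {
                return URLDecoder.decode(kv[1], StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    // Writes the response; the length is the byte count, not the character count.
    private static void send(HttpExchange exchange, int status, String contentType, byte[] body)
            throws IOException {
        exchange.getResponseHeaders().set("Content-Type", contentType);
        exchange.sendResponseHeaders(status, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    public static List<Map<String, String>> search(String indexDir, String queryStr) throws Exception {
        List<Map<String, String>> resultsList = new ArrayList<>();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse(queryStr);

        TopDocs results = searcher.search(query, 10);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(formatter, scorer);

        for (ScoreDoc scoreDoc : results.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            String content = doc.get("content");
            TokenStream tokenStream = analyzer.tokenStream("content", content);
            String highlighted = highlighter.getBestFragment(tokenStream, content);

            Map<String, String> result = new HashMap<>();
            result.put("path", doc.get("path"));
            result.put("highlighted", highlighted != null ? highlighted : "(No match found)");
            resultsList.add(result);
        }
        reader.close();
        return resultsList;
    }
}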

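5. Startup sequence

Run the Indexer program to build the index for your documents.
Start WebServer.
Open a browser and go to http://localhost:8080/search.html. This requires WebServer to serve search.html itself, since the page fetches /search from the same origin.
Enter a query term and view the highlighted search results.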
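
The steps above depend on the server answering both /search.html and /search. With the JDK's built-in HttpServer, a request is routed to the context whose path is the longest matching prefix of the request path, so a request for /search.html lands in the /search handler unless the page has a context of its own. The sketch below demonstrates that routing with two stub handlers on a throwaway port. RoutingDemo, reply, and handlerFor are hypothetical names, and the handlers return fixed labels where a real server would return search results or the page.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class RoutingDemo {
    // Replies with a fixed label so the caller can see which context handled the request.
    static void reply(HttpExchange exchange, String label) throws IOException {
        byte[] body = label.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    // Starts a server on an ephemeral port, requests the given path, and returns the response body.
    static String handlerFor(String path) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/search", ex -> reply(ex, "api"));
        server.createContext("/search.html", ex -> reply(ex, "page"));
        server.start();
        try {
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + path);
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(HttpRequest.newBuilder(uri).build(), HttpResponse.BodyHandlers.ofString());
            return response.body();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(handlerFor("/search?query=lucene")); // prints "api"
        System.out.println(handlerFor("/search.html"));         // prints "page"
    }
}
```

If the /search.html context were removed from this sketch, the second request would also be answered by the /search handler, which is why the page needs to be registered separately.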

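Summary

This setup supports searching .pdf, .docx, and other document formats on a personal computer, with highlighting and a simple static HTML interface. It is lightweight yet covers the essentials, which makes it a good fit for personal or small projects.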