手把手教你打造简易搜索引擎

小蝴蝶001

已于 2025-05-08 13:11:46 修改

阅读量783

点赞数 19

CC 4.0 BY-SA版权

文章标签：搜索引擎

于 2025-05-08 13:07:38 首次发布

本文链接：https://blog.youkuaiyun.com/wvixnkodb/article/details/147785669

想不想知道搜索引擎是怎么工作的？其实，我们也能自己动手做一个简易的搜索引擎！接下来，就一步一步带你实现它。

实现思路

文件解析：使用Parser类递归遍历目标目录下的 HTML 文件，通过正则表达式等方法提取文件中的标题、网址和正文内容，生成Document对象。
索引构建：利用Index类创建正排索引（文档 ID 到文档的映射）和倒排索引（关键词到包含该词的文档列表的映射），并将索引持久化存储到文件，便于后续加载使用。
搜索实现：Searcher类接收用户搜索词，先分词并过滤停用词，再在倒排索引中查找相关文档列表，合并后按权重排序，最终将相关文档的关键信息封装成SearchResult返回给用户。
多线程优化：在文件解析阶段，通过线程池和CountDownLatch实现多线程处理，加快 HTML 文件的解析速度，减少整体索引构建时间。

1. 搜索引擎的核心逻辑

搜索引擎的第一步，就是把一堆静态网页里有用的信息挑出来，比如网页标题、网址、正文内容等，然后把这些信息打包成一个个 “文档卡片”（Document对象）。

为了能快速找到这些 “文档卡片”，我们要建立两个 “超级目录”：

正排索引：就像一本书的目录，按照页码顺序排列，通过文档编号就能快速找到对应的文档，用ArrayList实现，文档编号就是列表的下标。
倒排索引：类似字典，通过关键词能找到所有包含这个词的文档列表，用HashMap实现，每个关键词对应一个包含文档编号和相关度分数（权重）的列表。

当用户输入搜索词后，搜索引擎先把搜索词拆分成一个个关键词，然后在倒排索引里找到所有相关的文档，再按照相关度从高到低排序，最后把结果展示给用户，包括文档标题、网址和正文摘要，点击标题就能跳转到对应的网页。

2. 给文字 “分家”：分词操作

要实现搜索引擎，首先得让计算机能理解用户输入的文字。我们引入 Ansj 库，它就像一个超级翻译官，能把用户输入的句子拆分成一个个词语。

在代码里，只要调用ToAnalysis.parse(字符串).getTerms()方法，就能得到一个包含所有分词结果的List<Term>。Term就是每个分词，通过getName()方法可以获取具体的词语内容。比如，输入 “这是一个测试”，它能拆分成 “这是”“一个”“测试”。

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.List;

public class TokenizationExample {
    public static void main(String[] args) {
        String text = "这是一个测试句子";
        List<Term> terms = ToAnalysis.parse(text).getTerms();
        for (Term term : terms) {
            System.out.println(term.getName());
        }
    }
}

3. 处理网页文件

我们需要创建一个Parser类，专门负责处理网页文件，把它们变成搜索引擎能识别的 “文档卡片”，并添加到索引中。

3.1 枚举文件夹中所有文件

在run方法里，通过enumFile方法递归遍历整个文件夹。enumFile方法会判断每个文件是文件夹还是 HTML 文件，如果是文件夹就继续深入查找，如果是 HTML 文件就把它添加到allFiles列表里。

private void enumFile(File file, List<File> allFiles) {
    // 列出当前文件的所有文件 / 文件夹
    File[] files = file.listFiles();
    for (File curFile : files) {
        if (curFile.isDirectory()) {  // 如果是文件夹
            enumFile(curFile, allFiles);
        } else if (curFile.getName().endsWith(".html")) { // 是 HTML 文件
            allFiles.add(curFile);
        }
    }
}

这里定义了目标文件夹路径常量ROOT_PATH，在run方法中使用该路径获取目标文件夹对象：

private static final String ROOT_PATH = "D:\\MyJavaCode\\documentSearcher\\jdk-8u361-docs-all\\docs\\api\\";

public void run() {
    long beg = System.currentTimeMillis();
    // 1. 获取 api 这个文件对象
    File root = new File(ROOT_PATH);
    // ...
}

3.2 提取网页信息

parseHtml方法负责处理每个 HTML 文件：

提取标题：parseTitle方法直接把文件名的.html后缀去掉，当作标题。

private String parseTitle(File file) {
    String rm = ".html";
    return file.getName().substring(0, file.getName().length() - rm.length());
}

获取 URL：parseUrl方法用固定的前缀加上文件在本地的相对路径，生成完整的网址。

private String parseUrl(File file) {
    // 线上文档的前缀
    String prefix = "https://docs.oracle.com/javase/8/docs/api/";
    // 本地文档的后缀
    String suffix = file.getAbsolutePath().substring(ROOT_PATH.length());
    return prefix + suffix;
}

获取正文：parseContent方法先读取文件内容，再用正则表达式去除 HTML 标签、JavaScript 代码，合并多余的空格，得到干净的正文内容

private String readFile(File file) {
    final int BUFFERED_CAPACITY = 1024 * 1024;
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file), BUFFERED_CAPACITY)) {
        StringBuilder builder = new StringBuilder();
        while (true) {
            int ret = bufferedReader.read();
            if (ret == -1) {
                return builder.toString();
            }
            char ch = (char) ret;
            if (ch == '\r' || ch == '\n') { // 如果是换行符则用空格替换
                ch = ' ';
            }
            builder.append(ch);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return "";
}

public String parseContentByRegex(File file) {
    // 使用正则表达式，需要先将文章内容转化成字符串
    String content = readFile(file);
    // replaceAll 返回的是 一个新的 String 对象
    content = content.replaceAll("<script.*?>(.*?)</script>", " "); // 先去除 js 中的文本
    content = content.replaceAll("<.*?>", " "); // 去除 标签对
    content = content.replaceAll("\\s+", " ");  // 合并多余的空格
    return content;
}

4. 搭建索引系统

Index类就像搜索引擎的 “大脑中枢”，负责管理正排索引和倒排索引。

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

import java.io.File;
import java.io.IOException;
import java.util.*;

public class Index {
    // 正排索引：文档 ID -> 文档，文档 ID 就是 ArrayList 中的下标
    ArrayList<Document> forwardIndex = new ArrayList<>();

    // 倒排索引：词 -> 文档 ID 结合, 考虑到根据词所找到的 文档集合 应该具备优先级，所以此处存储的除了 Id 还 应该有优先值
    HashMap<String, ArrayList<Weight>> invertedIndex = new HashMap<>();

    // 索引存放目录
    private static final String INDEX_PATH = "D:\\MyJavaCode\\documentSearcher\\jdk-8u361-docs-all\\";

    // 正排索引和倒排索引的文件名，任意一个和 INDEX_PATH 拼接就可以得到完整路径
    private static final String FORWARD_PATH = "forward.txt";
    private static final String INVERTED_PATH = "inverted.txt";

    private ObjectMapper objectMapper = new ObjectMapper();

    // 构建正排索引
    public void buildForward(Document document) {
        // 待加入文档的ID
        document.setDocumentId(forwardIndex.size());
        forwardIndex.add(document);
    }

    // 构建倒排索引
    public void buildInverted(Document document) {
        class Counter {
            int titleCnt;
            int contentCnt;
            public Counter(int titleCnt, int contentCnt) {
                this.titleCnt = titleCnt;
                this.contentCnt = contentCnt;
            }
        }

        // 1. 记录关键词 出现的 次数
        HashMap<String, Counter> counterMap = new HashMap<>();

        // 2. 对标题进行分词，统计出现次数
        List<Term> terms = ToAnalysis.parse(document.getTitle()).getTerms();
        for (Term term : terms) {
            // 获取分词字符串
            String word = term.getName();
            Counter counter = counterMap.get(word);
            // 如果为空，说明还没有出现过这个关键词
            if (counter == null) {
                counter = new Counter(1, 0); // 标题出现次数赋值为 1
                counterMap.put(word, counter);
            } else { // 出现过这个分词
                counter.titleCnt ++;
            }
        }

        // 3. 对正文进行分词，统计出现次数
        terms = ToAnalysis.parse(document.getContent()).getTerms();
        for (Term term : terms) {
            String word = term.getName();
            Counter counter = counterMap.get(word);
            if (counter == null) {
                counter = new Counter(0, 1);
                counterMap.put(word, counter);
            } else {
                counter.contentCnt ++;
            }
        }

        // 4. 将分词结果整合到倒排索引中
        //    遍历文档的所有分词结果
        for (Map.Entry<String, Counter> entry : counterMap.entrySet()) {
            String word = entry.getKey();
            Counter counter = entry.getValue();

            // 将文档 ID 和 文档权值 包装起来
            Weight newWeight = new Weight(document.getDocumentId()); // 存入文档 ID
            newWeight.calWeight(counter.titleCnt, counter.contentCnt);

            // 取出该关键词相关联的 文档列表
            List<Weight> take = invertedIndex.get(word);
            // 倒排索引 中没有这个关键词
            if (take == null) {
                ArrayList<Weight> newList = new ArrayList<>();  // 新建列表
                newList.add(newWeight);
                invertedIndex.put(word, newList);
            } else { // 出现过
                take.add(newWeight); // 关联文档数增加
            }
        }
    }

    // 往索引中添加文档
    public void add(String title, String url, String content) {
        Document document = new Document(title, url, content);
        // 自此，Document还没设置ID，ID进入 buildForward() 会设置
        buildForward(document);
        buildInverted(document);
    }

    // 传入文档 Id，获取文档对象
    public Document getDocument(int documentId) {
        return forwardIndex.get(documentId);
    }

    // 获取和 关键词 相关的文档
    public List<Weight> getRelatedDocument(String word) {
        return invertedIndex.get(word);
    }

    // 持久化存储索引
    public void save() {
        System.out.println("开始持久化索引");
        long beg = System.currentTimeMillis();
        File indexFile = new File(INDEX_PATH);
        if (!indexFile.exists()) { // 如果不存在
            indexFile.mkdirs();
        }
        // 打开这两个文件
        File forwardFile = new File(INDEX_PATH + FORWARD_PATH);
        File invertedFile = new File(INDEX_PATH + INVERTED_PATH);

        try {
            objectMapper.writeValue(forwardFile, forwardIndex);
            objectMapper.writeValue(invertedFile, invertedIndex);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

        long end = System.currentTimeMillis();
        System.out.println("索引持久化完成，总共耗时 " + (end - beg) + " ms");
    }

    // 将索引读取到内存中
    public void load() {
        File forwardFile = new File(INDEX_PATH + FORWARD_PATH);
        File invertedFile = new File(INDEX_PATH + INVERTED_PATH);

        try {
            forwardIndex = objectMapper.readValue(forwardFile, new TypeReference<ArrayList<Document>>() {});
            invertedIndex = objectMapper.readValue(invertedFile, new TypeReference<HashMap<String, List<Weight>>>() {});
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

    }
}

4.1 索引数据结构

正排索引：用ArrayList<Document>存储所有文档，每个文档都有唯一的编号。

import lombok.Data;

@Data  // lombok 中提供的注解，提供 toString, get 和 set 等常用方法
public class Document {
    private int documentId;
    private String title;
    private String url;
    private String content;

    public Document() {};

    public Document(String title, String url, String content) {
        this.title = title;
        this.url = url;
        this.content = content;
    }
}

倒排索引：用HashMap<String, ArrayList<Weight>>，键是关键词，值是包含文档编号和权重的列表。Weight类记录文档编号和关键词在该文档中的相关度分数。

import lombok.Data;

@Data
public class Weight {
    private int documentId;
    private int weight;     // 文档权重

    public Weight() {}

    public Weight(int documentId) {
        this.documentId = documentId;
    }

    // 不过这里还需要一个方法，用来后续计算权重，这里先不做实现
    public int calWeight() {
        return 0;
    }

    /**
     * 计算权重，并赋值给 weight 属性
     * @param titleCnt   关键词在标题中出现的次数
     * @param contentCnt    关键词在正文中出现的次数
     */
    public void calWeight(int titleCnt, int contentCnt) {
        int w =  titleCnt * 10 + contentCnt; // 计算权重
        this.setWeight(w); // 赋值
    }
}

4.2 添加文档到索引

add方法会调用buildForward和buildInverted方法，分别把文档添加到正排索引和倒排索引。添加到正排索引时，给文档分配一个编号；添加到倒排索引时，先统计关键词在标题和正文中的出现次数，再计算权重，最后把文档信息添加到对应关键词的列表里。

public void buildForward(Document document) {
    // 待加入文档的ID
    document.setDocumentId(forwardIndex.size());
    forwardIndex.add(document);
}

4.3 往倒排索引中添加元素

统计文档中关键词在标题和正文的出现次数，计算权重（标题出现次数 ×10 + 正文出现次数）。将关键词及其相关文档 ID 和权重存入倒排索引，相同关键词的文档列表合并。代码如下：

public void buildInverted(Document document) {
    class Counter {
        int titleCnt;
        int contentCnt;
        public Counter(int titleCnt, int contentCnt) {
            this.titleCnt = titleCnt;
            this.contentCnt = contentCnt;
        }
    }

    // 1. 记录关键词 出现的 次数
    HashMap<String, Counter> counterMap = new HashMap<>();

    // 2. 对标题进行分词，统计出现次数
    List<Term> terms = ToAnalysis.parse(document.getTitle()).getTerms();
    for (Term term : terms) {
        // 获取分词字符串
        String word = term.getName();
        Counter counter = counterMap.get(word);
        // 如果为空，说明还没有出现过这个关键词
        if (counter == null) {
            counter = new Counter(1, 0); // 标题出现次数赋值为 1
            counterMap.put(word, counter);
        } else { // 出现过这个分词
            counter.titleCnt ++;
        }
    }

    // 3. 对正文进行分词，统计出现次数
    terms = ToAnalysis.parse(document.getContent()).getTerms();
    for (Term term : terms) {
        String word = term.getName();
        Counter counter = counterMap.get(word);
        if (counter == null) {
            counter = new Counter(0, 1);
            counterMap.put(word, counter);
        } else {
            counter.contentCnt ++;
        }
    }

    // 4. 将分词结果整合到倒排索引中
    //    遍历文档的所有分词结果
    for (Map.Entry<String, Counter> entry : counterMap.entrySet()) {
        String word = entry.getKey();
        Counter counter = entry.getValue();

        // 将文档 ID 和 文档权值 包装起来
        Weight newWeight = new Weight(document.getDocumentId()); // 存入文档 ID
        newWeight.calWeight(counter.titleCnt, counter.contentCnt);

        // 取出该关键词相关联的 文档列表
        List<Weight> take = invertedIndex.get(word);
        // 倒排索引 中没有这个关键词
        if (take == null) {
            ArrayList<Weight> newList = new ArrayList<>();  // 新建列表
            newList.add(newWeight);
            invertedIndex.put(word, newList);
        } else { // 出现过
            take.add(newWeight); // 关联文档数增加
        }
    }
}

4.4 往索引中添加元素

add()方法同时构建正排和倒排索引。代码如下：

public void add(String title, String url, String content) {
    Document document = new Document(title, url, content);
    // 自此，Document还没设置ID，ID进入 buildForward() 会设置
    buildForward(document);
    buildInverted(document);
}

4.5 补充 parseHtml () 方法

在Parser类的parseHtml()方法中调用index.add()将文档加入索引。代码如下：

private void parseHtml(File file) {
    String title = parseTitle(file);
    String url = parseUrl(file);
    String content = parseContent(file);
    // 将这三个变量包装成Document，并添加到索引结构中
    index.add(title, url, content);
}

4.6 获取文档

提供根据文档 ID 获取文档和根据关键词获取相关文档列表的方法。代码如下：

// 传入文档 Id，获取文档对象
public Document getDocument(int documentId) {
    return forwardIndex.get(documentId);
}

// 获取和 关键词 相关的文档
public List<Weight> getRelatedDocument(String word) {
    return invertedIndex.get(word);
}

4.7 测试

在Parser类的run()方法中测试索引构建结果。代码如下：

public void run() {
    long beg = System.currentTimeMillis();
    // 1. 获取 api 这个文件对象
    File root = new File(ROOT_PATH);
    // 2. 用来接收 api 文件夹中的所有 HTML 文件, 是一个输入型参数
    List<File> allFiles = new ArrayList<>();
    enumFile(root, allFiles);
    System.out.println("总共 " + allFiles.size() + " 个文件");

    // 3. 解析这里面的所有文件，然后包装 Document 对象，添加到索引中
    for (File file : allFiles) {
        System.out.println("开始解析文件：" + file.getAbsolutePath());
        parseHtml(file);
    }

    index.save();
    long end = System.currentTimeMillis();
    System.out.println("索引制作完成，总共耗时 " + (end - beg) + " ms");
}

4.8 持久化保存索引结构

将索引保存到文件，避免重启后重新构建。代码如下：

// 持久化存储索引
public void save() {
    System.out.println("开始持久化索引");
    long beg = System.currentTimeMillis();
    File indexFile = new File(INDEX_PATH);
    if (!indexFile.exists()) { // 如果不存在
        indexFile.mkdirs();
    }
    // 打开这两个文件
    File forwardFile = new File(INDEX_PATH + FORWARD_PATH);
    File invertedFile = new File(INDEX_PATH + INVERTED_PATH);

    try {
        objectMapper.writeValue(forwardFile, forwardIndex);
        objectMapper.writeValue(invertedFile, invertedIndex);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }

    long end = System.currentTimeMillis();
    System.out.println("索引持久化完成，总共耗时 " + (end - beg) + " ms");
}

4.9 将索引结构从文件中加载到内存中

从文件读取索引到内存，方便后续使用。代码如下：

// 将索引读取到内存中
public void load() {
    File forwardFile = new File(INDEX_PATH + FORWARD_PATH);
    File invertedFile = new File(INDEX_PATH + INVERTED_PATH);

    try {
        forwardIndex = objectMapper.readValue(forwardFile, new TypeReference<ArrayList<Document>>() {});
        invertedIndex = objectMapper.readValue(invertedFile, new TypeReference<HashMap<String, List<Weight>>>() {});
    } catch (IOException e) {
        throw new RuntimeException(e);
    }

}

5. 多线程加速解析

解析大量网页文件很耗时，我们可以用多线程来提速。在Parser类里稍作修改：

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;

public class Parser {
    private static final String ROOT_PATH = "D:\\MyJavaCode\\documentSearcher\\jdk-8u361-docs-all\\docs\\api\\";
    private Index index = new Index();
    private final Object forwardLock = new Object();
    private final Object invertedLock = new Object();

    public void run() {
        long startTime = System.currentTimeMillis();
        File root = new File(ROOT_PATH);
        List<File> allFiles = new ArrayList<>();
        enumFile(root, allFiles);
        System.out.println("一共找到 " + allFiles.size() + " 个文件");

        long parseStartTime = System.currentTimeMillis();

        // 创建线程池，最多同时运行5个线程
        ExecutorService threadPool = Executors.newFixedThreadPool(5);
        // 用来等待所有任务完成
        CountDownLatch latch = new CountDownLatch(allFiles.size());

        for (File file : allFiles) {
            System.out.println("开始解析文件：" + file.getAbsolutePath());
            threadPool.submit(() -> {
                parseHtml(file);
                latch.countDown();
            });
        }

        try {
            latch.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        threadPool.shutdown();

        long parseEndTime = System.currentTimeMillis();
        System.out.println("文件解析完成，耗时 " + (parseEndTime - parseStartTime) + " 毫秒");

        index.save();
        long endTime = System.currentTimeMillis();
        System.out.println("索引创建完成，总共耗时 " + (endTime - startTime) + " 毫秒");
    }

    // 省略其他相同方法...
}

6. 实现搜索功能

Searcher类负责处理用户的搜索请求，找到相关文档并返回结果。

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

@Service
public class Searcher {
    private Index index;
    // 停用词文件路径
    private static final String STOP_WORD_PATH = "D:\\MyJavaCode\\documentSearcher\\jdk-8u361-docs-all\\stopWords.txt";
    // 存放停用词的集合
    private HashSet<String> stopDict = new HashSet<>();

    @Autowired
    public Searcher(Index index) {
        this.index = index;
        index.load();
        loadStopWords();
    }

    // 加载停用词
    private void loadStopWords() {
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader(STOP_WORD_PATH), 1024 * 1024)) {
            while (true) {
                String line = bufferedReader.readLine();
                if (line == null) {
                    break;
                }
                stopDict.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public List<SearchResult> search(String query) {
        // 对搜索词进行分词
        List<Term> terms = ToAnalysis.parse(query).getTerms();
        // 过滤掉停用词
        List<Term> tempTerms = new ArrayList<>();
        for (Term term : terms) {
            String word = term.getName();
            if (!stopDict.contains(word)) {
                tempTerms.add(term);
            }
        }
        terms = tempTerms;

        // 获取每个关键词对应的文档列表
        List<List<Weight>> waitToMerge = new ArrayList<>();
        for (Term term : terms) {
            String word = term.getName();
            List<Weight> weights = index.getRelatedDocument(word);
            if (weights != null) {
                waitToMerge.add(weights);
            }
        }

        // 合并文档列表
        List<Weight> relatedDocuments = merge(waitTo