I built a search engine in fewer than 150 lines of code

Article link: https://blog.youkuaiyun.com/chenchen5152/article/details/116013410

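This article shows how to build a search engine in fewer than 150 lines of Python, covering the key steps: preparing the data, building an inverted index, searching, and computing relevance. The example uses abstracts from the English Wikipedia and relies on simple text parsing and filtering to search a large document collection quickly and rank the results by relevance.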


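Full-text search is everywhere. Whether you look for a book on Scribd (a document-sharing platform), a movie on Netflix, toilet paper on Amazon, or anything at all through Google, you are searching large amounts of unstructured data. What is more remarkable is that even when you search millions (or billions) of records, you still get a response in milliseconds. In this article we explore the basic components of a full-text search engine and use them to build one that can search millions of documents and rank them by relevance, all in fewer than 150 lines of Python.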

Data

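All of the code in this article is available on GitHub (https://github.com/bartdegoede/python-searchengine/). I provide code snippets and links throughout, so you can try running them yourself. Install the requirements (pip install -r requirements.txt) and then run python run.py (https://github.com/bartdegoede/python-searchengine/blob/master/run.py). This downloads all the data and runs the example searches both with and without ranking.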

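Before we can build a search engine, we need some unstructured data. We will search the abstracts of articles from the English Wikipedia. They are distributed as a single gzipped XML file of about 785MB that contains roughly 6.27 million abstracts. I wrote a simple function (https://github.com/bartdegoede/python-searchengine/blob/master/download.py) to download the gzipped XML, but you can also download the file manually.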
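
As a rough illustration of what such a download helper might look like, here is a minimal sketch using only the standard library. It is not the code from download.py; the name download_file, its parameters, and the chunk size are assumptions made for this example, and the dump URL is deliberately left as an argument rather than guessed.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `destination`, skipping the download if the file exists.

    Hypothetical helper for illustration; not the function in download.py.
    """
    target = Path(destination)
    if target.exists():
        # Avoid fetching a large archive again on every run.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Copy in fixed-size chunks so the archive is never held in memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return target


# Usage (substitute the real dump URL for the placeholder):
# download_file("<url of the abstracts dump>", "data/enwiki.latest-abstract.xml.gz")
```

Copying in chunks matters here because the archive is several hundred megabytes; reading the whole response into memory first would be wasteful.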

Data preparation

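The file is one large XML file that contains all the abstracts. Each abstract is wrapped in a <doc> element and looks roughly like this (I have omitted the elements we are not interested in):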

<doc>
    <title>Wikipedia: London Beer Flood</title>
    <url>https://en.wikipedia.org/wiki/London_Beer_Flood</url>
    <abstract>The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. It took place when one of the  wooden vats of fermenting porter burst.</abstract>
    ...
</doc>

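We are interested in the title, url, and abstract elements. To make the data easy to access, we represent each document with a Python dataclass (https://realpython.com/python-data-classes/). We add a property that concatenates the title and the abstract; the code is here (https://github.com/bartdegoede/python-searchengine/blob/master/search/documents.py).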

from dataclasses import dataclass


@dataclass
class Abstract:
    """Wikipedia abstract"""
    ID: int
    title: str
    abstract: str
    url: str

    @property
    def fulltext(self):
        # title and abstract joined into one searchable string
        return ' '.join([self.title, self.abstract])
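
To see what the fulltext property yields, here is a small usage sketch. It repeats the class so that it runs on its own; the shortened abstract text is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Abstract:
    """Wikipedia abstract (repeated here so the example is self-contained)"""
    ID: int
    title: str
    abstract: str
    url: str

    @property
    def fulltext(self):
        return ' '.join([self.title, self.abstract])


doc = Abstract(
    ID=1,
    title="Wikipedia: London Beer Flood",
    abstract="The London Beer Flood was an accident.",  # shortened for the example
    url="https://en.wikipedia.org/wiki/London_Beer_Flood",
)
print(doc.fulltext)
# Wikipedia: London Beer Flood The London Beer Flood was an accident.
```

Because fulltext is a property, it is computed on access rather than stored, so the dataclass fields stay limited to what comes from the XML.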

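Next, we extract the abstracts from the XML, parse them, and create Abstract instances. We read the gzipped XML as a stream, so the whole file is never loaded into memory. Each document gets an ID in loading order (the first document has ID=1, the second ID=2, and so on). The code is here (https://github.com/bartdegoede/python-searchengine/blob/master/load.py).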

import gzip

from lxml import etree

from search.documents import Abstract


def load_documents():
    # open a filehandle to the gzipped Wikipedia dump
    with gzip.open('data/enwiki.latest-abstract.xml.gz', 'rb') as f:
        doc_id = 1
        # iterparse will yield the entire `doc` element once it finds the
        # closing `</doc>` tag
        for _, element in etree.iterparse(f, events=('end',), tag='doc'):
            title = element.findtext('./title')
            url = element.findtext('./url')
            abstract = element.findtext('./abstract')

            yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract)

            doc_id += 1
            # the `element.clear()` call will explicitly free up the memory
            # used to store the element
            element.clear()
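
The streaming pattern can be tried without downloading the dump. The sketch below is an assumption-laden stand-in rather than the article's loader: it swaps lxml for the standard library's xml.etree.ElementTree (whose iterparse has no tag argument, so the filter is done by hand), parses a tiny made-up XML snippet from memory, and yields plain tuples instead of Abstract instances. The names SAMPLE and iter_docs are invented for this example.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the Wikipedia dump; the content is made up.
SAMPLE = b"""<feed>
<doc><title>Wikipedia: A</title><url>https://example.org/A</url><abstract>First abstract.</abstract></doc>
<doc><title>Wikipedia: B</title><url>https://example.org/B</url><abstract>Second abstract.</abstract></doc>
</feed>"""


def iter_docs(stream):
    """Yield (doc_id, title, url, abstract) tuples, one per <doc> element."""
    doc_id = 1
    for _, element in ET.iterparse(stream, events=("end",)):
        # "end" fires for every closing tag; only complete <doc> elements matter.
        if element.tag != "doc":
            continue
        yield (
            doc_id,
            element.findtext("./title"),
            element.findtext("./url"),
            element.findtext("./abstract"),
        )
        doc_id += 1
        # Drop the element's children and text once it has been consumed.
        element.clear()


docs = list(iter_docs(io.BytesIO(SAMPLE)))
print(docs[0])
# (1, 'Wikipedia: A', 'https://example.org/A', 'First abstract.')
```

Because iter_docs is a generator, a caller can index documents one at a time as they are parsed instead of building a full list first; the list() call here is only for the demonstration.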