Python Retrieval Methods for Document-Matching Search on Document-Library Websites

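This article looks at how Elasticsearch can handle the very large document collections of document-library websites. It covers indexing by title, keywords, and abstract to support real-time, high-precision document lookup, gives example Python code, and touches on the TF-IDF relevance algorithm.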


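Document-library websites hold a large number of documents. Matching on the document title and keywords alone does not reflect how well a document actually matches a query.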

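For a typical document site holding 100 to 200 million documents, Elasticsearch can be used directly. Elasticsearch is a search server built on Lucene. It provides a distributed, multi-tenant full-text search engine exposed through a RESTful web interface.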

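Elasticsearch is written in Java and was originally released as open source under the Apache License. It is a popular enterprise search engine, widely used in cloud environments, and offers near-real-time search while being stable, reliable, fast, and easy to install and use. Official clients are available for Java, .NET (C#), PHP, Python, Apache Groovy, Ruby, and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.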

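Taking the Sanjiu Wenku (三九文库) site as an example, document-library sites today commonly index every document by its title, keywords, body abstract, and selected body paragraphs.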

Part of the Python code is shown below:

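# Document index mapping
# DocType is the older class name; newer elasticsearch_dsl releases call it Document.
from elasticsearch_dsl import Completion, Date, DocType, Integer, Text, analyzer

# ik_analyzer is not defined in the original snippet. It is assumed here to
# refer to the ik_max_word analyzer provided by the IK analysis plugin.
ik_analyzer = analyzer("ik_max_word")


class SiteDocinfo(DocType):
    suggest = Completion(analyzer=ik_analyzer)
    title = Text(analyzer="ik_max_word")  # title
    create_date = Date()  # upload time
    dtype = Integer()  # document type
    kwds = Text(analyzer="ik_max_word")  # keywords
    tags = Text(analyzer="ik_max_word")  # extra tags, currently unused
    price = Integer()  # document price, unit: fen
    catalog1 = Text()  # category 1
    catalog2 = Text()  # category 2
    authid = Integer()  # author ID
    lensize = Integer()  # file size
    ext_int = Integer()  # extension data
    ext_txt = Text()
    ext_date = Date()

    class Meta:
        index = "t39"
        doc_type = "site_docinfo"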
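
To illustrate how one record lines up with the fields above, here is a minimal sketch that builds the JSON source of a single document as a plain dict. The helper name `build_doc_source`, its parameters, and the sample values are hypothetical, and it assumes prices are entered in yuan and stored in fen as the mapping comment says.

```python
from datetime import date
from typing import Any, Dict, List


def build_doc_source(
    title: str,
    keywords: List[str],
    price_yuan: float,
    uploaded: date,
    author_id: int,
) -> Dict[str, Any]:
    """Build the source body for one document (hypothetical helper)."""
    return {
        "title": title,
        # Keywords are joined into one text value; the analyzer splits it again.
        "kwds": " ".join(keywords),
        # Dates are sent as ISO 8601 strings.
        "create_date": uploaded.isoformat(),
        # Assumption: prices arrive in yuan and are stored as integer fen.
        "price": round(price_yuan * 100),
        "authid": author_id,
    }


# Sample values are invented for illustration.
source = build_doc_source(
    "Intro to search", ["search", "python"], 2.5, date(2020, 1, 2), 7
)
print(source["price"])  # 250
```

Storing the price as an integer number of fen avoids floating-point rounding in later arithmetic, which is presumably why the mapping declares it as `Integer()`.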

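Like most NoSQL databases, Elasticsearch is flat: an index is a collection of independent documents. Whether a document matches a search request depends on whether the document itself contains all the required information.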

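In Elasticsearch, a change to a single document is ACID-compliant, but a transaction that spans several documents is not. When such a transaction partly fails, the index cannot be rolled back to its previous state.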

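The flat model has the following advantages:

Indexing is fast and lock-free.
Searching is fast and lock-free.
Because every document is independent of the others, large data sets can be distributed across multiple nodes.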
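
Because multi-document writes are not transactional, a bulk request can succeed for some documents and fail for others, so the caller has to find the failures and retry them. The sketch below scans a bulk API response body for failed items. The helper name `failed_bulk_items` is hypothetical and the sample response is hand-written for illustration, not real server output.

```python
from typing import Any, Dict, List


def failed_bulk_items(bulk_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one entry per failed item in a bulk API response body."""
    failures: List[Dict[str, Any]] = []
    # The top-level "errors" flag is true when at least one item failed.
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action: index, create, update or delete.
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    {
                        "action": action,
                        "id": result.get("_id"),
                        "status": result.get("status"),
                    }
                )
    return failures


# Hand-written sample: the first item succeeded, the second one failed.
sample_response = {
    "took": 3,
    "errors": True,
    "items": [
        {"index": {"_index": "t39", "_id": "1", "status": 201}},
        {
            "index": {
                "_index": "t39",
                "_id": "2",
                "status": 400,
                "error": {"type": "mapper_parsing_exception", "reason": "..."},
            }
        },
    ],
}

print(failed_bulk_items(sample_response))
# [{'action': 'index', 'id': '2', 'status': 400}]
```

Document 1 stays indexed even though document 2 failed, which is the partial-failure behavior described above.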

The basic helper code for connecting to Elasticsearch is as follows:

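from elasticsearch import Elasticsearch

client = Elasticsearch(hosts=["127.0.0.1"])

# The original snippet relies on names defined elsewhere in the application.
# The values below are placeholders so that the fragment can run on its own.
queryStr = "python elasticsearch"  # search text
query_rel_doc_nums = 20  # total number of related documents wanted
selfAppendDocs = []  # documents already selected by other means
view_detail_nums = 5  # how many hits go into the first list
hit_dict1, hit_dict2 = [], []

# Match the query text against the title and keyword fields
querydicts = {"bool": {"should": [{"multi_match": {"query": queryStr, "fields": ["title", "kwds"]}}]}}

rows = query_rel_doc_nums - len(selfAppendDocs)
response = client.search(
    index="t39",
    body={
        "query": querydicts,
        "from": 1,  # zero-based hit offset, so 1 skips the top-scoring hit
        "size": rows,
    },
)
rel_ids = []  # related IDs fetched from ES
for index1, hit in enumerate(response["hits"]["hits"]):
    if index1 < view_detail_nums:
        hit_dict1.append(hit["_id"])
    else:
        hit_dict2.append(hit["_id"])

    rel_ids.append(str(hit["_id"]))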
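
The `from` and `size` parameters and the splitting of hits into two lists can be tried without a running cluster. The sketch below assumes a 1-based page number on the caller's side and converts it to the zero-based hit offset that `from` expects. Both helper names and the fake response are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def build_search_body(query: Dict[str, Any], page: int, size: int) -> Dict[str, Any]:
    """Build a search body from a 1-based page number."""
    if page < 1 or size < 0:
        raise ValueError("page must be >= 1 and size must be >= 0")
    # "from" counts hits, not pages, and starts at 0.
    return {"query": query, "from": (page - 1) * size, "size": size}


def split_hit_ids(
    response: Dict[str, Any], view_detail_nums: int
) -> Tuple[List[str], List[str]]:
    """Split hit IDs into the first view_detail_nums and the remainder."""
    ids = [str(hit["_id"]) for hit in response["hits"]["hits"]]
    return ids[:view_detail_nums], ids[view_detail_nums:]


query = {"multi_match": {"query": "python", "fields": ["title", "kwds"]}}
print(build_search_body(query, page=3, size=10)["from"])  # 20

# Fake response with the same shape as a search result, for illustration only.
fake_response = {"hits": {"hits": [{"_id": "11"}, {"_id": "12"}, {"_id": "13"}]}}
print(split_hit_ids(fake_response, 2))  # (['11', '12'], ['13'])
```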

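Search application example: keyword matching yields the document IDs, which are then used in an IN condition against the Sanjiu Wenku main database to fetch the details of the matching documents.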
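
The database step could look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the main database, and the table name `docinfo`, its columns, and the helper name are assumptions made for illustration. It also assumes the ES `_id` values are the numeric primary keys. Because an IN query returns rows in no particular order, the rows are re-sorted to follow the relevance order returned by ES.

```python
import sqlite3
from typing import List, Sequence, Tuple


def fetch_docs_in_rank_order(
    conn: sqlite3.Connection, rel_ids: Sequence[str]
) -> List[Tuple[int, str]]:
    """Fetch rows for the given IDs and keep the order of rel_ids."""
    if not rel_ids:
        return []  # an empty IN () list is not valid SQL
    int_ids = [int(doc_id) for doc_id in rel_ids]
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in int_ids)
    sql = f"SELECT id, title FROM docinfo WHERE id IN ({placeholders})"
    rows = conn.execute(sql, int_ids).fetchall()
    by_id = {row[0]: row for row in rows}
    # IDs missing from the database are skipped.
    return [by_id[i] for i in int_ids if i in by_id]


# Hypothetical table and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docinfo (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docinfo VALUES (?, ?)",
    [(1, "Doc A"), (2, "Doc B"), (3, "Doc C")],
)

print(fetch_docs_in_rank_order(conn, ["3", "1", "99"]))
# [(3, 'Doc C'), (1, 'Doc A')]
```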

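Related-document retrieval can be implemented the same way: after obtaining the keywords of the current document, run a fuzzy match with them. The required match threshold can be set with the following query parameters: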

import random

# Query related articles by keywords.
# `client` is the Elasticsearch client created in the previous snippet.
def getrelinkbykwds(key_words, docid, leftParamsLen, needfilter):
    # Pick a random minimum_should_match percentage between 3 and 20
    matchPercents = random.randint(3, 20)
    # Pick a random starting offset between 1 and 30 ("from" counts hits, not pages)
    startPage = random.randint(1, 30)
    print("es got :matchpercent=" + str(matchPercents) + "@" + str(needfilter) + "@startpage=" + str(startPage))
    must = [{"multi_match": {"query": key_words, "fields": ["title"],
                             "minimum_should_match": str(matchPercents) + "%"}}]
    if needfilter:
        # Only keep documents whose useflag is 0
        must.append({"match": {"useflag": 0}})
    # Exclude the current document from its own related list
    querydicts = {"bool": {"must": must, "must_not": [{"term": {"docid": str(docid)}}]}}

    response = client.search(
        index="t31",
        body={
            "query": querydicts,
            "from": startPage,
            "size": leftParamsLen,  # total number of hits to fetch
            "highlight": {
                "pre_tags": ['<font color=red>'],
                "post_tags": ['</font>'],
                "fields": {
                    "title": {},
                    "kwds": {},
                },
            },
        },
    )
    return response["hits"]["hits"]
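
To get a feel for what a percentage value of `minimum_should_match` means in practice, the sketch below computes how many query terms a document would need to contain. It assumes the rule documented for positive percentages (the computed number is rounded down) and further assumes that at least one term must always match. The helper name is hypothetical, and the term count refers to the terms produced after analysis, which for `ik_max_word` can be more than the visible words.

```python
def required_term_matches(term_count: int, percent: int) -> int:
    """Number of query terms that must match for a positive percentage."""
    if term_count < 1:
        raise ValueError("term_count must be at least 1")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Rounded down, as documented for positive percentages.
    floored = term_count * percent // 100
    # Assumption: a match query still needs at least one matching term.
    return max(1, floored)


for percent in (3, 20, 50, 75):
    print(percent, required_term_matches(8, percent))
# 3 1
# 20 1
# 50 4
# 75 6
```

Under these assumptions, the 3 to 20 percent range used above rarely demands more than one or two matching terms, so the query casts a wide net.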

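The ES relevance algorithm (relevance score) computes how closely the text in an index matches the search text.
ES uses the term frequency/inverse document frequency algorithm, TF/IDF for short. Versions before 5.0 score with TF/IDF by default, while 5.0 and later default to BM25, which builds on the same term-frequency and inverse-document-frequency ideas. In the code above, matchpercent is passed as minimum_should_match, the minimum share of the analyzed query terms that a document must contain. It is a match threshold rather than a relevance score: the lower the value, the looser the match and the more documents qualify, and the higher the value, the closer a document must be to the query.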
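
The idea behind TF/IDF can be shown with a small worked example. The sketch below uses the textbook form (term frequency times the logarithm of inverse document frequency), not the exact formula used by Lucene or ES, and the tokenized sample documents are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def tf_idf_scores(query_terms: Sequence[str], docs: Sequence[Sequence[str]]) -> List[float]:
    """Score each tokenized document against the query with textbook TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if counts[term] == 0:
                continue  # the term contributes nothing to this document
            tf = counts[term] / len(tokens)  # frequent in this document
            idf = math.log(n_docs / doc_freq[term])  # rare across documents
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    ["python", "search", "tutorial"],
    ["python", "cooking", "recipes"],
    ["search", "engine", "search", "index"],
]
scores = tf_idf_scores(["search", "engine"], docs)
# The third document ranks highest: it repeats "search" and is the only
# one that contains the rarer term "engine".
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # 2
```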