Frontier Tools (前沿重器) [46]: Reading the source of the open-source RAG project QAnything, part 2: offline file processing

Latest recommended article published on 2024-12-19 22:36:08

机智的叉烧

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/baidu_25854831/article/details/138879319

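Frontier Tools (前沿重器)

This column shares papers and talks from major tech companies and top conferences, distilling the key points so that we can keep up with frontier technology together. For a fuller introduction, see: 仓颉专项：飞机大炮我都会，利器心法我还有. (Come to think of it, the series started back in 2020!)

The 2023 article collection is out! Find it here: 又添十万字-CS的陋室2023年文章合集来袭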

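Picking up from the previous post: I recently chose an open-source RAG project to study in more depth, https://github.com/netease-youdao/QAnything. Over the next few posts I will introduce the project from my own perspective. The planned outline is:

Overview + services: project design, module breakdown, and deployment details.
File parsing and processing: how files are uploaded and processed. (this post)
Online inference flow: the full pipeline from an incoming query to the returned answer.

This post covers offline file processing, that is, how the various file types are handled in detail. I have written a similar article before (心法利器[110] | 知识文档处理和使用流程), but it did not go down to the code level. Reading the source gives me a chance to cover:

File upload.
File reading and chunking.
Index construction.

A note up front: I skip most of the business code and focus on file processing and the related algorithms. Things like creating users, creating knowledge bases, and deleting files are selectively left out; if you need them, you can follow the approach in this post to find the corresponding places in the code.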

File upload

File upload is the process of sending files from the front end to the back end. It is described in docs\API.md. First, the request fields:

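Parameter	Example value	Required	Type	Description
files	file binary	Yes	File	Files to upload; multiple files may be selected. Currently only [md,txt,pdf,jpg,png,jpeg,docx,xlsx,pptx,eml,csv] are supported
user_id	zzp	Yes	String	User id
kb_id	KBb1dd58e8485443ce81166d24f6febda7	Yes	String	Knowledge base id
mode	soft	No	String	Upload mode. soft: if a file with the same name already exists in the knowledge base, the current file is not uploaded. strong: files with duplicate names are uploaded anyway. Defaults to soft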
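
To make the `mode` semantics concrete, here is a minimal sketch of the filtering they imply. The helper `select_files_to_upload` and its arguments are hypothetical and not part of QAnything; only the soft/strong behaviour is taken from the table above.

```python
def select_files_to_upload(incoming, existing, mode="soft"):
    """Return the file names that should actually be uploaded.

    Hypothetical helper: it only illustrates the documented `mode` semantics.
    """
    if mode not in ("soft", "strong"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "strong":
        # strong: upload everything, even if the name already exists
        return list(incoming)
    # soft: skip names that the knowledge base already contains
    existing = set(existing)
    return [name for name in incoming if name not in existing]


existing_names = ["a.md", "b.pdf"]
incoming_names = ["a.md", "c.docx"]
print(select_files_to_upload(incoming_names, existing_names, "soft"))    # ['c.docx']
print(select_files_to_upload(incoming_names, existing_names, "strong"))  # ['a.md', 'c.docx']
```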

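As for uploading itself, the authors provide two ways to do it: synchronous and asynchronous.

Client

The client only has to call the service. I will use this as an opportunity to cover synchronous and asynchronous requests as well as the details of uploading files; the source code can be used as a direct reference. First, the synchronous request: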

import os
import requests

url = "http://{your_host}:8777/api/local_doc_qa/upload_files"
folder_path = "./docx_data"  # folder containing the files; note that this must be a folder, not a file
data = {
    "user_id": "zzp",
    "kb_id": "KB6dae785cdd5d47a997e890521acbe1c9",
    "mode": "soft"
}

files = []
for root, dirs, file_names in os.walk(folder_path):
    for file_name in file_names:
        if file_name.endswith(".md"):  # only .md files are uploaded here; change as needed (supported types are listed in the API table above)
            file_path = os.path.join(root, file_name)
            files.append(("files", open(file_path, "rb")))

response = requests.post(url, files=files, data=data)
print(response.text)

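The request is sent with the general-purpose requests package.
Since this is a local test, it simply uses local files opened with open; the files field holds the open file objects. Note that the files are opened in rb mode.

The asynchronous version is somewhat more complex.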
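
The script above never closes the files it opens. A minimal sketch of one way to tidy that up, assuming a hypothetical helper `open_upload_files` (not part of QAnything) and using `contextlib.ExitStack` from the standard library. The actual `requests.post` call is left as a comment so the sketch runs without a server.

```python
import os
import tempfile
from contextlib import ExitStack


def open_upload_files(folder_path, suffix, stack):
    """Open every file under folder_path ending in suffix, in binary mode.

    Handles are registered on `stack`, so they are all closed when the
    enclosing `with ExitStack()` block exits.
    """
    files = []
    for root, _dirs, file_names in os.walk(folder_path):
        for file_name in sorted(file_names):
            if file_name.endswith(suffix):
                handle = stack.enter_context(open(os.path.join(root, file_name), "rb"))
                # same ("files", handle) tuple shape that the script above builds
                files.append(("files", handle))
    return files


# Demo on a throwaway folder so the sketch runs without a server.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.md", "b.md", "c.txt"):
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write("demo")
    with ExitStack() as stack:
        files = open_upload_files(demo_dir, ".md", stack)
        opened = [os.path.basename(h.name) for _, h in files]
        # here one would call: requests.post(url, files=files, data=data)
    all_closed = all(h.closed for _, h in files)

print(opened, all_closed)
```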

import argparse
import os
import sys
import json
import aiohttp
import asyncio
import time
import random
import string

files = []
for root, dirs, file_names in os.walk("./docx_data"):  # folder to scan
    for file_name in file_names:
        if file_name.endswith(".docx"):  # only upload docx files
            file_path = os.path.join(root, file_name)
            files.append(file_path)
print(len(files))
response_times = []

async def send_request(round_, files):
    print(len(files))
    url = 'http://{your_host}:8777/api/local_doc_qa/upload_files'
    data = aiohttp.FormData()
    data.add_field('user_id', 'zzp')
    data.add_field('kb_id', 'KBf1dafefdb08742f89530acb7e9ed66dd')
    data.add_field('mode', 'soft')

    total_size = 0
    for file_path in files:
        file_size = os.path.getsize(file_path)
        total_size += file_size
        data.add_field('files', open(file_path, 'rb'))
    print('size:', total_size / (1024 * 1024))
    try:
        start_time = time.time()
        async with aiohttp.ClientSession() as session:
            async with session.post(url, data=data) as response:
                end_time = time.time()
                response_times.append(end_time - start_time)
                print(f"round_:{round_}, status code: {response.status}, response time: {end_time - start_time}s")
    except Exception as e:
        print(f"Request failed: {e}")

async def main():
    start_time = time.time()
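    num = int(sys.argv[1])  # number of files per upload; a single request body must stay under 100 MB (the original note calls this an HTTP limit, it is more likely a server-side request-size setting), so choose the count accordingly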
    round_ = 0
    r_files = files[:num]
    tasks = []
    task = asyncio.create_task(send_request(round_, r_files))
    tasks.append(task)
    await asyncio.gather(*tasks)

    print("Request finished")
    end_time = time.time()
    total_requests = len(response_times)
    total_time = end_time - start_time
    qps = total_requests / total_time
    print(f"total_time:{total_time}")

if __name__ == '__main__':
    asyncio.run(main())

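The request uses aiohttp together with Python coroutines, that is, the asyncio stack. For details, see this blog post: https://blog.youkuaiyun.com/m0_68949064/article/details/132805165. Under a high volume of HTTP requests, coroutines make better use of the CPU and improve overall throughput: while one request is waiting for its response, the event loop can run other work instead of leaving the CPU idle.

Server

The server side is more involved: an uploaded file goes through a lot of validation, and the server has to return the final processing result.

The upload endpoint is /api/local_doc_qa/upload_files, which can be found in handlers.py. Leaving out the validation code, the core of the handler (inside the upload_files function) is: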
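
The script above sends a single request. As a sketch of how the same idea extends to several requests, the example below groups files into batches under a size limit and sends the batches concurrently with `asyncio.gather`. The helpers `batch_by_size`, `fake_upload` and `upload_all` are hypothetical, the sizes are made-up numbers, and `fake_upload` only sleeps in place of a real `session.post(...)` call.

```python
import asyncio
import time

MAX_BATCH_BYTES = 100 * 1024 * 1024  # the 100 MB per-request limit mentioned in the script


def batch_by_size(sized_files, max_bytes=MAX_BATCH_BYTES):
    """Greedily group (path, size) pairs into batches of at most max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in sized_files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


async def fake_upload(batch_id, batch, delay=0.1):
    """Stand-in for session.post(...): just waits, like a slow network call."""
    await asyncio.sleep(delay)
    return batch_id, len(batch)


async def upload_all(batches):
    # gather() runs all the coroutines concurrently on one event loop
    return await asyncio.gather(*(fake_upload(i, b) for i, b in enumerate(batches)))


sized = [("a.docx", 60), ("b.docx", 50), ("c.docx", 30), ("d.docx", 90)]
batches = batch_by_size(sized, max_bytes=100)
start = time.perf_counter()
results = asyncio.run(upload_all(batches))
elapsed = time.perf_counter() - start
print(batches)
print(results, f"{elapsed:.2f}s")
```

Because the three simulated uploads wait at the same time, the total elapsed time should be close to one delay rather than the sum of three.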
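
One check that the API table implies is on the file type. Below is a minimal sketch, assuming hypothetical helpers `is_supported` and `split_by_support`; it does not reproduce the actual checks in handlers.py, and matching extensions case-insensitively is my own assumption.

```python
import os

# Extensions listed in docs/API.md for the upload_files endpoint
SUPPORTED_EXTENSIONS = {"md", "txt", "pdf", "jpg", "png", "jpeg",
                        "docx", "xlsx", "pptx", "eml", "csv"}


def is_supported(file_name):
    """Hypothetical check: True if the file extension is in the supported list."""
    # lower() makes the match case-insensitive (an assumption, see above)
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    return ext in SUPPORTED_EXTENSIONS


def split_by_support(file_names):
    """Partition names into (accepted, rejected) while preserving order."""
    accepted = [n for n in file_names if is_supported(n)]
    rejected = [n for n in file_names if not is_supported(n)]
    return accepted, rejected


accepted, rejected = split_by_support(["notes.md", "Report.PDF", "video.mp4", "README"])
print(accepted)
print(rejected)
```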

for file, file_name in zip(files, file_names):
    if file_name in exist_file_names:
        continue
    file_id, msg = local_doc_qa.milvus_summary.add_file(user_id, kb_id, file_name, timestamp)
    debug_logger.info(f"{file_name}, {file_id}, {msg}")
    local_file = LocalFile(user_id, kb_id, file, file_id, file_name, local_doc_qa.embeddings)
    local_files.append(local_file)
    local_doc_qa.milvus_summary.update_file_size(file_id, len(local_file.file_content))
    data.append(
        {"file_id": file_id, "file_name": file_name, "status": "gray", "bytes": len(local_file.file_content),
            "timestamp": timestamp})
asyncio.create_task(local_doc_qa.insert_files_to_milvus(user_id, kb_id, local_files))
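
The last line of the excerpt schedules the indexing with `asyncio.create_task` and does not await it, so the handler can respond right away while the files are processed in the background. Here is a minimal, self-contained sketch of that pattern. `insert_files`, `upload_handler` and the `status` dict are hypothetical stand-ins, and "green" as the finished status is an assumption made for illustration.

```python
import asyncio

status = {}  # file_id -> status; stand-in for the status kept by milvus_summary


async def insert_files(file_ids):
    """Stand-in for insert_files_to_milvus: slow work done in the background."""
    for file_id in file_ids:
        await asyncio.sleep(0.05)  # pretend to parse, split and index the file
        status[file_id] = "green"  # assumed "done" marker, for illustration only


async def upload_handler(file_ids):
    """Record each file as 'gray', schedule the slow work, return at once."""
    for file_id in file_ids:
        status[file_id] = "gray"
    task = asyncio.create_task(insert_files(file_ids))  # scheduled, not awaited
    return dict(status), task


async def main():
    response, task = await upload_handler(["f1", "f2"])
    await task  # only the demo waits; a real server would keep serving requests
    return response, dict(status)


response, final_status = asyncio.run(main())
print(response)
print(final_status)
```

The handler's response still shows every file as "gray", and the status only changes once the background task has run. The asyncio documentation also recommends keeping a reference to the task returned by `create_task`, so that it is not garbage-collected before it finishes.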

The key functions in the upload_files excerpt are: