[Paper Tips] Note 9: Batch Literature Summarization via an LLM API

SI_Gamer

Last modified: 2025-06-12 16:09:38


License: CC 4.0 BY-SA


First published: 2025-06-10 17:21:35

Article link: https://blog.youkuaiyun.com/lxlyhwl/article/details/148559021


Skills and tips I find useful, jotted down while writing papers.

Contents

1. Problem
- 1.1 Problem description
- 1.2 Solutions
2. Methods in detail
3. Open questions
4. Summary

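1. Problem

1.1 Problem description

When there are many papers to read and summarize, letting an LLM do the analysis gives a quick overview of each paper's key points. Uploading the papers one at a time by hand and downloading each summary, however, is slow and tedious.

1.2 Solutions

(1) Method 1: upload and analyze files through the LLM API (recommended)
(2) Method 2: drive the web interface with WebDriver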

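2. Methods in detail

2.1 Preliminary notes

None for now.

2.2 Method 1: upload and analyze files through the LLM API (recommended)

Summary:
Pros: fast responses; the model can be chosen freely (variants for short, long, and very long text are available).
Cons: rate limits apply; large-text jobs cost real money.

2.2.1 Preparation

(1) Register: taking Kimi as the example, register on the developer platform at https://platform.moonshot.cn/console/account
then create a new key under "API Key Management" in the menu.

(2) Read the manual: it documents the file-upload and chat endpoints, see https://platform.moonshot.cn/docs/api/files#%E4%B8%8A%E4%BC%A0%E6%96%87%E4%BB%B6

(3) Install the dependency: conda install openai

2.2.2 Example call

In the code below, replace $MOONSHOT_API_KEY with the key you created,
and xlnet.pdf with the file you want to analyze.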

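from pathlib import Path
from openai import OpenAI

client = OpenAI(
    api_key="$MOONSHOT_API_KEY",
    base_url="https://api.moonshot.cn/v1",
)

# xlnet.pdf is a sample file. The platform supports pdf, doc, and image formats,
# and provides OCR for images and pdf files
file_object = client.files.create(file=Path("xlnet.pdf"), purpose="file-extract")

# Fetch the extracted content
# file_content = client.files.retrieve_content(file_id=file_object.id)
# Note: retrieve_content is flagged with a warning in recent SDK versions; the line below replaces it
# On older versions, retrieve_content can still be used
file_content = client.files.content(file_id=file_object.id).text

# Put it into the request
messages = [
    {
        "role": "system",
        "content": "You are Kimi, an AI assistant provided by Moonshot AI. You are best at conversing in Chinese and English. You give users safe, helpful, and accurate answers, and you refuse to answer anything involving terrorism, racial discrimination, or explicit or violent content. Moonshot AI is a proper noun and must not be translated into other languages.",
    },
    {
        "role": "system",
        "content": file_content,
    },
    {"role": "user", "content": "Please briefly describe what xlnet.pdf is about"},
]

# Then call chat-completion to get Kimi's answer
completion = client.chat.completions.create(
    model="moonshot-v1-32k",
    messages=messages,
    temperature=0.3,
)

# Print only the reply text rather than the whole message object
print(completion.choices[0].message.content)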
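
The example above handles a single file, while the goal here is batch processing. The same calls can be wrapped in a loop over a folder. What follows is a minimal sketch, not a tested pipeline: the names `summarize_folder` and `make_kimi_summarizer`, the output file naming, and reading the key from a `MOONSHOT_API_KEY` environment variable are my own assumptions. Only the `client.files.create`, `client.files.content`, and `client.chat.completions.create` calls are taken from the example above.

```python
import os
from pathlib import Path


def summarize_folder(papers_dir, out_dir, summarize):
    """Summarize every PDF in papers_dir, writing one .txt per paper.

    `summarize` is any callable that takes a Path and returns the summary text.
    PDFs that already have an output file are skipped, so the job can be rerun.
    Returns the list of output paths written in this run.
    """
    papers_dir, out_dir = Path(papers_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(papers_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".txt")
        if target.exists():
            continue  # already summarized in an earlier run
        try:
            text = summarize(pdf)
        except Exception as exc:  # keep going if one paper fails
            print(f"Failed on {pdf.name}: {exc}")
            continue
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


def make_kimi_summarizer(prompt, model="moonshot-v1-32k"):
    """Build a summarize(pdf) callable from the API calls shown above."""
    from openai import OpenAI  # imported lazily so the loop itself has no dependency

    client = OpenAI(
        api_key=os.environ["MOONSHOT_API_KEY"],  # assumed environment variable
        base_url="https://api.moonshot.cn/v1",
    )

    def summarize(pdf):
        file_object = client.files.create(file=pdf, purpose="file-extract")
        file_content = client.files.content(file_id=file_object.id).text
        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": file_content},
                {"role": "user", "content": prompt},
            ],
            temperature=0.3,
        )
        return completion.choices[0].message.content

    return summarize
```

A call would then look like `summarize_folder("./paper(Selected)", "./paper(AI)", make_kimi_summarizer("Summarize this paper."))`. Passing the summarizer in as a callable keeps the loop independent of the API, so it can be tried out with a dummy function before spending any credit.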

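2.2.3 Notes

Part 1: request counts and token volumes are rate-limited.
The platform documentation states: "To keep overall resource allocation fair and to guard against malicious attacks, we currently apply rate limits based on each account's cumulative top-up amount, as shown in the table below. Please contact customer service if you need higher limits."

[Image: rate-limit table, not reproduced]

Rate-limit terms

Concurrency: the maximum number of your requests we process at the same time

RPM: requests per minute, the maximum number of requests you can send us within one minute

TPM: tokens per minute, the maximum number of tokens you can exchange with us within one minute

TPD: tokens per day, the maximum number of tokens you can exchange with us within one day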
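
Given an RPM cap, a batch job can simply space out its requests on the client side. The sketch below is one way to do that. The class name `MinIntervalLimiter` is my own, and any `rpm` value passed to it has to come from the platform's table for your account tier. It only addresses RPM, not concurrency, TPM, or TPD.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least 60 / rpm seconds apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Call right before each API request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a batch loop, create one limiter up front and call `limiter.wait()` immediately before each request. The `clock` and `sleep` parameters exist only so the behavior can be checked without actually waiting.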

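Part 2: each model is billed per token.
Newly registered users get 15 yuan of free credit.
[Image: pricing table, not reproduced]
In my test, summarizing and analyzing one paper PDF cost about 0.11 yuan. How many tokens that corresponds to is hard to work out, but according to the documentation the file upload itself should not count toward tokens.

The documentation says: "The file-related endpoints (file content extraction / file storage) are free for a limited time, i.e., if you only upload a document and extract its content, this API itself incurs no charge."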
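
Once a per-token price is known, converting between cost and tokens is a simple proportion. The price used below (5 yuan per million tokens) is a purely hypothetical placeholder, as is the function name. Substitute the figure from the platform's current price list for the model you use.

```python
def tokens_for_cost(cost_yuan, price_per_million_tokens):
    """Rough number of tokens that a given spend corresponds to."""
    return round(cost_yuan / price_per_million_tokens * 1_000_000)


# 0.11 yuan is the per-paper cost reported above; 5.0 is a HYPOTHETICAL price.
print(tokens_for_cost(0.11, 5.0))
```

For an exact count rather than an estimate, the response object in the openai SDK exposes a `usage` field with `prompt_tokens`, `completion_tokens`, and `total_tokens`. I am assuming the service populates it.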

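2.3 Method 2: drive the web interface with WebDriver (fallback)

Summary:
Pros: so far it does not seem to hit any rate limits; free.
Cons: generation is slow and time-consuming; the model version cannot be chosen.

2.3.1 Preparation

(1) Install dependencies: the Firefox driver geckodriver and the selenium package.

(2) Put all the paper PDFs in a single folder.

2.3.2 Full code

What the code does:

(1) read the paper PDFs in the folder one by one;
(2) open the web version and let the user log in manually;
(3) type the prompt into the input box (the prompt can be changed);
(4) upload the corresponding PDF;
(5) save the reply to a local txt file;
(6) reload the web version to start a new chat.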

import os
import time
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options


def configure_firefox_driver():
    """Create a Firefox WebDriver that starts from a local profile directory."""
    firefox_options = Options()
    # Point Firefox at a local profile directory (adjust the path to your machine)
    firefox_options.add_argument('-profile')
    firefox_options.add_argument(
        '/home/XXX/snap/firefox/common/.mozilla/firefox')

    # Download preferences: custom folder mode, save PDFs without asking,
    # and disable the built-in PDF viewer
    firefox_options.set_preference("browser.download.folderList", 2)
    # firefox_options.set_preference("browser.download.dir", download_dir)
    firefox_options.set_preference(
        "browser.helperApps.neverAsk.saveToDisk", "application/pdf")
    firefox_options.set_preference("pdfjs.disabled", True)

    geckodriver_path = "/usr/local/bin/geckodriver"
    service = Service(geckodriver_path)

    driver = webdriver.Firefox(service=service, options=firefox_options)
    return driver


def main():
    # Folder holding the paper PDFs
    papers_path = "XXX/paper(Selected)"
    pdf_files = [file for file in os.listdir(papers_path) if file.endswith('.pdf')]
    num_selectedPapers = len(pdf_files)
    print(f"Found {num_selectedPapers} papers to summarize in the folder!")
    num_analysePapers = 0

    # Create the folder for the AI summaries
    ai_summaries_path = "./paper(AI)"
    os.makedirs(ai_summaries_path, exist_ok=True)

    driver = configure_firefox_driver()
    driver.get("https://www.kimi.com")
    input("Please log in to Kimi yourself; once logged in, type 'y' and press Enter to continue: ")

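    for pdf_file in pdf_files:
        # (1) Type the prompt into the input box
        try:
            input_text_area = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.chat-input-editor'))
            )
            # Clear any existing content
            input_text_area.clear()
            # Prompt text to send
            text_to_send = (
                "Acting as a domain expert with extensive academic experience, "
                "summarize this paper for me. Cover seven aspects: title translation, "
                "research background, problem to be solved, state of the art, "
                "proposed method, experimental setup, and result analysis. "
                "Keep the wording as concise and professional as possible."
            )
            # Send one character at a time
            for char in text_to_send:
                input_text_area.send_keys(char)
                time.sleep(0.05)  # simulate a typing delay
            print("Typed the prompt into the input box")
            time.sleep(2)
        except Exception as e:
            print(f"Failed to type into the input box: {e}")
            continue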

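        # (2) Upload the paper PDF
        try:
            # Wait for the file-upload input to appear
            file_input = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'input[type="file"]'))
            )
            # File path; made absolute because a relative path may not be
            # resolved correctly by the browser's file input
            paper_file = os.path.abspath(os.path.join(papers_path, pdf_file))

            # Upload the file
            file_input.send_keys(paper_file)
            print("Uploading file...")
            time.sleep(30)
        except Exception as e:
            print(f"File upload failed: {e}")
            continue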

        # (3) Click the "send" button
        try:
            send_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.send-button'))
            )
            send_button.click()
            print("Clicked the 'send' button, waiting for the reply...")
            time.sleep(90)

        except Exception as e:
            print(f"Failed to click the 'send' button: {e}")
            continue

        # (4) Write the reply to a local txt file
        try:
            # Wait for the element containing the summary
            markdown_element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.markdown'))
            )

            summary_text = markdown_element.get_attribute("outerHTML")
            summary_text = summary_text.replace("><", ">\n<")

            # Only write the file if summary_text is non-empty
            if summary_text.strip():  # strip() drops surrounding whitespace
                # The "{AI总结}" ("AI summary") prefix is kept unchanged because
                # the folder-comparison script relies on it
                txt_filename = f"{{AI总结}}{os.path.splitext(pdf_file)[0]}.txt"
                txt_file_path = os.path.join(ai_summaries_path, txt_filename)
                with open(txt_file_path, 'w', encoding='utf-8') as txt_file:
                    txt_file.write(summary_text)
                print(f"Wrote local txt file: {txt_file_path}")
                num_analysePapers += 1
            else:
                print("No valid summary content found; skipping the txt file.")
        except Exception as e:
            print(f"Failed to write the local txt file: {e}")

        # (5) Reload the Kimi page to start a new chat
        driver.get("https://www.kimi.com")
        time.sleep(1)

    driver.quit()
    print(f"Finished analyzing {num_analysePapers} papers; browser closed")


if __name__ == "__main__":
    main()
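
The fixed `time.sleep(90)` after clicking send either wastes time on short replies or cuts off long ones. One alternative is to poll the reply element until its text stops changing. Below is a minimal sketch under that assumption. `wait_until_stable` and its parameters are hypothetical names of mine, and I have not verified how the page actually streams its reply, so treat the default intervals as placeholders.

```python
import time


def wait_until_stable(read_text, interval=2.0, stable_rounds=3,
                      timeout=180.0, sleep=time.sleep):
    """Poll read_text() until it returns the same non-empty value
    stable_rounds times in a row, or until timeout seconds have been slept.

    Returns the last value read (possibly empty or partial on timeout).
    """
    last, same, waited = "", 0, 0.0
    while True:
        current = read_text()
        if current and current == last:
            same += 1
        else:
            same = 1 if current else 0
        last = current
        if same >= stable_rounds or waited >= timeout:
            return last
        sleep(interval)
        waited += interval
```

In the script above it could be called once the `div.markdown` element is present, for example `wait_until_stable(lambda: driver.find_element(By.CSS_SELECTOR, 'div.markdown').text)`, in place of the 90-second sleep.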

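2.3.3 Extra feature 1 (list the papers not yet analyzed)

(Added 2025-06-10)
If not every paper got summarized, the following code prints the names of the papers that are still missing: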

import os

PREFIX = "{AI总结}"  # file-name prefix ("AI summary") written by the main script


def compare_folders(folder1_path, folder2_path):
    """Print the PDFs in folder 1 that have no matching summary in folder 2."""
    # Collect all file names in folder 1, without the .pdf extension
    folder1_files = []
    for file in os.listdir(folder1_path):
        if file.endswith(".pdf"):
            folder1_files.append(file[:-len(".pdf")])

    # Collect all file names in folder 2, without the prefix and the .txt extension
    folder2_files = []
    for file in os.listdir(folder2_path):
        if file.endswith(".txt") and file.startswith(PREFIX):
            folder2_files.append(file[len(PREFIX):-len(".txt")])

    # Files present in folder 1 but missing from folder 2
    unique_files = set(folder1_files) - set(folder2_files)

    print("The following files exist in folder 1 but not in folder 2:")
    for file in sorted(unique_files):
        print(file)


# Example usage
folder1_path = "./paper(Selected)"  # replace with the actual path
folder2_path = "./paper(AI)"  # replace with the actual path

compare_folders(folder1_path, folder2_path)

2.3.4 Extra feature 2 (strip the HTML markup)

(Added 2025-06-12)
Because the saved files contain HTML markup, the following code can be used to strip it in one pass.

import os
import re
from bs4 import BeautifulSoup


def extract_text_from_html(html_content):
    """Extract the plain text with BeautifulSoup."""
    soup = BeautifulSoup(html_content, 'html.parser')
    text = soup.get_text(separator='\n', strip=True)
    # Normalize special whitespace (non-breaking and full-width spaces)
    text = re.sub(r'[\xa0\u3000]+', ' ', text)
    # Collapse runs of blank lines
    text = re.sub(r'\n\s*\n', '\n\n', text)
    return text.strip()


def process_txt_files(input_dir, output_dir):
    """Process every txt file in input_dir and save the result to output_dir."""
    # Make sure the output folder exists
    os.makedirs(output_dir, exist_ok=True)

    processed_count = 0

    for filename in os.listdir(input_dir):
        if filename.endswith('.txt'):
            input_path = os.path.join(input_dir, filename)
            print(f"Processing file: {filename}")

            try:
                with open(input_path, 'r', encoding='utf-8') as f:
                    content = f.read()

                # Extract the plain text
                clean_text = extract_text_from_html(content)

                # Build the output file path
                output_filename = os.path.splitext(filename)[0] + '_clean.txt'
                output_path = os.path.join(output_dir, output_filename)

                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(clean_text)

                print(f"Saved clean version to: {output_path}")
                processed_count += 1

            except Exception as e:
                print(f"Error while processing {filename}: {e}")
    print(f"Processed {processed_count} papers in total!")

    return processed_count


if __name__ == "__main__":
    input_directory = "./paper(AI_ieeeXplore)"
    output_directory = "./paper(AI_ieeeXplore)/CleanFormat"

    print("\nStarting to process files...")
    count = process_txt_files(input_directory, output_directory)

    print(f"\nDone! Processed {count} files")
    print(f"Clean files saved to: {output_directory}")
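
If installing BeautifulSoup is not an option, a rough equivalent can be written with the standard library's `html.parser`. This is a sketch with class and function names of my own, and it only approximates what `get_text(separator='\n', strip=True)` does: each text node becomes one line and all tags are dropped.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def html_to_text(html_content):
    """Return the text of html_content, one text node per line."""
    collector = _TextCollector()
    collector.feed(html_content)
    collector.close()
    return "\n".join(collector.chunks)
```

`html_to_text` could stand in for the BeautifulSoup call inside `extract_text_from_html`, leaving the whitespace clean-up with `re.sub` unchanged.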