Original article:
towardsdatascience.com/document-parsing-using-large-language-models-with-code-9229fda09cdf
Motivation
For years, regular expressions were my go-to tool for parsing documents, and I suspect the same is true for many other technical folks and industries.
Although regex is powerful and works well in certain scenarios, it often struggles with the complexity and variability of real-world documents.
Large language models, on the other hand, offer a more powerful and flexible way to handle many kinds of document structures and content types.
General Workflow of the System
It is always important to understand the main components of the system being built. To keep things simple, let's focus on a research-paper-processing scenario.
Document parsing workflow using LLMs (Image by Zoumana Keita)
- Overall, the workflow consists of three main components: input, processing, and output.
- First, the document, in this case a scientific research paper in PDF format, is submitted for processing.
- The first module of the processing component extracts the raw data from each PDF and combines it with a prompt containing the instructions for the large language model to extract the data effectively.
- The large language model then uses that prompt to extract all the metadata.
- For each PDF file, the final result is saved in JSON format and can be used for further analysis.
But why bother with LLMs (large language models) instead of just using regex?
Regular expressions (regex) have significant limitations when dealing with the structural complexity of research papers. Here are a few examples:
1. Flexibility with document structure
- Regex requires a specific pattern for each document structure and fails as soon as a given document deviates from the expected format.
- LLMs automatically understand and adapt to a wide range of document structures, and they can recognize relevant information wherever it appears in a document.
2. Contextual understanding
- Regex matches patterns without any understanding of context or meaning.
- LLMs have a nuanced understanding of each document's meaning, which lets them extract relevant information more accurately.
3. Maintenance and scalability
- Regex requires constant updates as document formats change, and supporting a new type of information means writing entirely new expressions.
- LLMs adapt easily to new document types with only minimal changes to the initial prompt, which makes them far more scalable. The short sketch below illustrates the contrast.
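To make this concrete, here is a minimal sketch with two hypothetical header layouts for the same paper: the regex works for one layout and silently fails on the other, whereas an LLM prompt would not need to change.

import re

# Two hypothetical header layouts for the same paper
header_v1 = "Attention Is All You Need\nby Ashish Vaswani, Noam Shazeer"
header_v2 = "AUTHORS: A. Vaswani, N. Shazeer\nTITLE: Attention Is All You Need"

# A regex written for the first layout: title on line 1, authors after "by"
pattern = re.compile(r"^(?P<title>.+)\nby (?P<authors>.+)$")

print(bool(pattern.match(header_v1)))  # True: the expected layout
print(bool(pattern.match(header_v2)))  # False: the layout shifted, so the regex breaks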
Building the Document Parsing Workflow
The reasons above are more than enough to adopt LLMs for parsing complex documents such as research papers.
The documents we use for illustration are the following:
- The Attention Is All You Need paper, from the Arxiv website
- The Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets paper, also from the Arxiv website
This section provides all the steps for building a real-world document parsing system with large language models, and I believe it has the potential to change the way you think about AI and its capabilities.
If you are more of a video person, I'll meet you on the other side.
Code Structure
The code is structured as follows:
project
|
|---- Extract_Metadata_With_Large_Language_Models.ipynb
|
|---- data
      |
      |---- extracted_metadata/
      |---- 1706.03762v7.pdf
      |---- 2301.09056v1.pdf
      |---- prompts/
            |
            |---- scientific_papers_prompt.txt
- The project folder is the root folder; it contains the data folder and the notebook.
- The data folder contains two folders, extracted_metadata and prompts, along with the two papers mentioned above.
- extracted_metadata is empty for now and will hold the generated JSON files.
- The prompts folder contains the prompt in text format.
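If you are reproducing this layout locally, a quick sketch for creating the folders (the paths are the ones used throughout this tutorial):

import os

# Recreate the folder layout used in this tutorial
os.makedirs("./data/extracted_metadata", exist_ok=True)
os.makedirs("./data/prompts", exist_ok=True)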
Metadata to Extract
We first need to set a clear target for the attributes to extract. To keep things simple, let's focus on six attributes in our scenario:
- Paper title
- Publication year
- Authors
- Author contact
- Abstract
- Summary abstract
These attributes are then used to define the prompt. Successful parsing of the document relies on a clear prompt that explains exactly what each attribute means and the format in which the final result should be extracted.
Scientific research paper:
---
{document}
---
You are an expert in analyzing scientific research papers. Please carefully read the provided research paper above and extract the following key information:
Extract these six (6) properties from the research paper:
- Paper Title: The full title of the research paper
- Publication Year: The year the paper was published
- Authors: The full names of all authors of the paper
- Author Contact: A list of dictionaries, where each dictionary contains the following keys for each author:
- Name: The full name of the author
- Institution: The institutional affiliation of the author
- Email: The email address of the author (if provided)
- Abstract: The full text of the paper's abstract
- Summary Abstract: A concise summary of the abstract in 2-3 sentences, highlighting the key points
Guidelines:
- The extracted information should be factual and accurate to the document.
- Be extremely concise, except for the Abstract which should be copied in full.
- The extracted entities should be self-contained and easily understood without the rest of the paper.
- If any property is missing from the paper, please leave the field empty rather than guessing.
- For the Summary Abstract, focus on the main objectives, methods, and key findings of the research.
- For Author Contact, create an entry for each author, even if some information is missing. If an email or institution is not provided for an author, leave that field empty in the dictionary.
Answer in JSON format. The JSON should contain 6 keys: "PaperTitle", "PublicationYear", "Authors", "AuthorContact", "Abstract", and "SummaryAbstract". The "AuthorContact" should be a list of dictionaries as described above.
Six main things happen in this prompt; let's walk through them one by one.
1. Document placeholder
Scientific research paper:
---
{document}
---
Defined with the {} notation, it indicates that the full text of the document will be included in the analysis.
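As a side note, if you wanted to inline the paper text directly into the template, a minimal sketch with Python's str.format would look like this; the pipeline below instead sends the paper text as a separate user message, so the placeholder simply marks where the document conceptually sits.

# Hypothetical sketch: inline the document into the prompt template with str.format
with open("./data/prompts/scientific_papers_prompt.txt") as f:
    prompt_template = f.read()

paper_text = "...full text extracted from the PDF..."  # stand-in for the real extracted text
full_prompt = prompt_template.format(document=paper_text)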
2. Role assignment
The model is assigned a role so it can perform the task better. The role is defined on the next line; it sets the context and instructs the AI to act as an expert in scientific paper analysis.
You are an expert in analyzing scientific research papers.
3. Extraction instructions
This section specifies which pieces of information should be extracted from the document.
Extract these six (6) properties from the research paper:
4. Property definitions
Here, the specifics of each of the properties above are defined, including what information to include and how to format it. For example, Author Contact is a list of dictionaries with additional details.
5. Guidelines
The guidelines tell the AI which rules to follow during extraction, such as staying accurate and how to handle missing information.
6. Expected output format
This is the final step; it specifies the exact format the answer must follow, namely JSON.
Answer in JSON format. The JSON should contain 6 keys: ...
Libraries
Great, now let's install the necessary libraries.
Our document parsing system is built with several libraries; the main ones for each component are described below:
- PDF processing: pdfminer.six, PyPDF2, and poppler-utils to handle various PDF formats and structures.
- Text extraction: unstructured and its dependencies (unstructured-inference, unstructured-pytesseract) for intelligent content extraction from documents.
- OCR capabilities: tesseract-ocr to recognize text in images or scanned documents.
- Image processing: pillow-heif for image handling tasks.
- AI integration: the openai library to leverage GPT models in our information extraction process.
%%bash
pip -qqq install pdfminer.six
pip -qqq install pillow-heif==0.3.2
pip -qqq install matplotlib
pip -qqq install unstructured-inference
pip -qqq install unstructured-pytesseract
pip -qqq install unstructured
pip -qqq install openai
pip -qqq install PyPDF2
pip -qqq install tenacity
# The Tesseract OCR engine itself is a system package, not a pip package:
# update the package index first, then install it along with the PDF utilities
apt-get update
apt-get install -y tesseract-ocr
apt-get install -y libtesseract-dev
apt-get install -y poppler-utils
Once the installation succeeds, the imports look as follows:
import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from google.colab import userdata
from unstructured.partition.pdf import partition_pdf
from tenacity import retry, wait_random_exponential, stop_after_attempt
Setting Up Credentials
Before diving into the core features, we need to set up our environment, including the necessary API credentials.
OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
model_ID = userdata.get('GPT_MODEL')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
client = OpenAI(api_key = OPENAI_API_KEY)
- Here, we use the userdata.get() function to securely access credentials in Google Colab.
- We retrieve the ID of the specific GPT model we want to use, which in our use case is gpt-4o.
Setting our credentials through environment variables like this ensures secure access to the model credentials while keeping flexibility in our choice of model.
It is also a better way to manage API keys and models, especially when working across different environments or on multiple projects.
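For reference, outside of Colab the same pattern works with plain environment variables; a minimal sketch (the GPT_MODEL fallback below is an assumption):

import os
from openai import OpenAI

# Outside Colab: read the credentials from environment variables instead of userdata
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]      # set beforehand in your shell or .env file
model_ID = os.environ.get("GPT_MODEL", "gpt-4o")   # falls back to gpt-4o if the variable is unset
client = OpenAI(api_key=OPENAI_API_KEY)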
Workflow Implementation
We now have all the resources to build an efficient end-to-end workflow. It's time to implement the technical details of each workflow component, starting with the data processing helper functions.
Data Processing
The first step of our workflow is to preprocess the PDF files and extract their text content, which is done with the extract_text_from_pdf function.
It takes a PDF file as input and returns its raw text data.
def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the unstructured library.
    """
    elements = partition_pdf(pdf_path, strategy="hi_res")
    return "\n".join([str(element) for element in elements])
Prompt Reader
The prompt is stored in a separate .txt file and loaded with the following function.
def read_prompt(prompt_path: str):
    """
    Read the prompt for research paper parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()
Metadata Extraction
This function is the real core of our workflow. It leverages the OpenAI API to process the content of a given PDF file.
Without the @retry decorator, we could run into the Error Code 429 - Rate limit reached for requests issue, which happens mainly when the rate limit is reached during processing. Instead of failing, we want the function to keep trying until it succeeds.
@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)
By using completion_with_backoff inside the extract_metadata function:
- It waits between 1 and 120 seconds before re-running a failed API call.
- The wait time grows with each retry but always stays within the 1 to 120 second range.
- This process is known as exponential backoff, and it is very useful for handling API rate limits, including transient issues.
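As a rough illustration of those delays (purely conceptual, not how tenacity is implemented internally):

import random

# Conceptual sketch of exponential backoff: each retry draws a random wait
# whose upper bound doubles, capped at 120 seconds
def sketch_backoff_delays(attempts: int = 6, low: float = 1.0, cap: float = 120.0):
    for n in range(attempts):
        yield random.uniform(low, min(cap, low * 2 ** n))

print([round(d, 1) for d in sketch_backoff_delays()])
# e.g. [1.0, 1.8, 3.4, 6.2, 11.7, 25.0] -- values are random on each run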
def extract_metadata(content: str, prompt_path: str, model_id: str):
    """
    Use GPT model to extract metadata from the research paper content based on the given prompt.
    """
    prompt_data = read_prompt(prompt_path)
    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content}
            ],
            temperature=0.2,
        )
        response_content = response.choices[0].message.content
        # Parse the JSON answer; the model may wrap it in Markdown code fences,
        # so strip those before loading (a minimal completion of the elided step)
        cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", response_content.strip())
        return json.loads(cleaned)
    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}
By sending the paper content together with the prompt, the gpt-4o model extracts the structured information specified in the prompt.
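As a quick check before wiring everything together, the two helpers can be chained directly (a sketch using the paths from the project layout):

content = extract_text_from_pdf("./data/1706.03762v7.pdf")
metadata = extract_metadata(content, "./data/prompts/scientific_papers_prompt.txt", model_ID)
print(metadata.get("PaperTitle"))  # expected: "Attention Is All You Need"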
Putting It All Together
Combining all the logic, we can use the process_research_paper function to run the end-to-end pipeline on a single PDF file, from extracting the expected metadata to saving the final result in .json format.
def process_research_paper(pdf_path: str, prompt_path: str,
                           output_folder: str, model_id: str):
    """
    Process a single research paper through the entire pipeline.
    """
    print(f"Processing research paper: {pdf_path}")
    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)

        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt_path, model_id)

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)
        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
Below is an example applying the logic to a single document:
# Example for a single document
pdf_path = "./data/1706.03762v7.pdf"
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
output_folder = "./data/extracted_metadata"
process_research_paper(pdf_path, prompt_path, output_folder, model_ID)
Processing steps for a PDF document (Image by author)
From the image above, we can see that the resulting .json file has been saved under the ./data/extracted_metadata/ folder as 1706.03762v7.json, which matches the PDF file name exactly, only with a different extension.
The content of the JSON file is given below, together with the research paper on which the target attributes to extract are highlighted:
The original paper with the target attributes to extract (Image by author)
From the JSON data we can see that all the attributes were extracted successfully. Pleasingly, Illia Polosukhin's institution is not provided in the paper, and the model left the field empty instead of guessing.
{
  "PaperTitle": "Attention Is All You Need",
  "PublicationYear": "2017",
  "Authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "AuthorContact": [
    {
      "Name": "Ashish Vaswani",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Noam Shazeer",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Niki Parmar",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Jakob Uszkoreit",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Llion Jones",
      "Institution": "Google Research",
      "Email": "[email protected]"
    },
    {
      "Name": "Aidan N. Gomez",
      "Institution": "University of Toronto",
      "Email": "[email protected]"
    },
    {
      "Name": "Lukasz Kaiser",
      "Institution": "Google Brain",
      "Email": "[email protected]"
    },
    {
      "Name": "Illia Polosukhin",
      "Institution": "",
      "Email": "[email protected]"
    }
  ],
  "Abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "SummaryAbstract": "The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing."
}
In addition, the value of the Summary Abstract attribute is shown below; it summarizes the initial abstract perfectly while respecting the two-to-three-sentence constraint given in the prompt.
The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time.
Additionally, it generalizes well to other tasks such as English constituency parsing.
Now that the pipeline works for a single document, we can implement the logic to run it over all the documents in a given folder, which is done with the process_directory function.
It processes every file and saves the results to the same extracted_metadata folder.
# Parse documents from a folder
def process_directory(prompt_path: str, directory_path: str, output_folder: str, model_id: str):
    """
    Process all PDF files in the given directory.
    """
    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory_path, filename)
            process_research_paper(pdf_path, prompt_path, output_folder, model_id)
Here is how to call the function with the right parameters.
# Define paths
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
directory_path = "./data"
output_folder = "./data/extracted_metadata"
process_directory(prompt_path, directory_path, output_folder, model_ID)
A successful run prints the following messages, where we can see that each research paper has been processed.
Processing steps for the research papers (Image by author)
The content of the generated JSON file for the YOLOv5 paper, similar to the paper above, is given below.
{
  "PaperTitle": "Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets",
  "PublicationYear": "2022",
  "Authors": [
    "Trupti Mahendrakar",
    "Andrew Ekblad",
    "Nathan Fischer",
    "Ryan T. White",
    "Markus Wilde",
    "Brian Kish",
    "Isaac Silver"
  ],
  "AuthorContact": [
    {
      "Name": "Trupti Mahendrakar",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Andrew Ekblad",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Nathan Fischer",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Ryan T. White",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Markus Wilde",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Brian Kish",
      "Institution": "Florida Institute of Technology",
      "Email": "[email protected]"
    },
    {
      "Name": "Isaac Silver",
      "Institution": "Energy Management Aerospace",
      "Email": "[email protected]"
    }
  ],
  "Abstract": "Autonomous navigation and path-planning around non-cooperative space objects is an enabling technology for on-orbit servicing and space debris removal systems. The navigation task includes the determination of target object motion, the identification of target object features suitable for grasping, and the identification of collision hazards and other keep-out zones. Given this knowledge, chaser spacecraft can be guided towards capture locations without damaging the target object or without unduly the operations of a servicing target by covering up solar arrays or communication antennas. One way to autonomously achieve target identification, characterization and feature recognition is by use of artificial intelligence algorithms. This paper discusses how the combination of cameras and machine learning algorithms can achieve the relative navigation task. The performance of two deep learning-based object detection algorithms, Faster Region-based Convolutional Neural Networks (R-CNN) and You Only Look Once (YOLOv5), is tested using experimental data obtained in formation flight simulations in the ORION Lab at Florida Institute of Technology. The simulation scenarios vary the yaw motion of the target object, the chaser approach trajectory, and the lighting conditions in order to test the algorithms in a wide range of realistic and performance limiting situations. The data analyzed include the mean average precision metrics in order to compare the performance of the object detectors. The paper discusses the path to implementing the feature recognition algorithms and towards integrating them into the spacecraft Guidance Navigation and Control system.",
  "SummaryAbstract": "This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects. Experimental data from formation flight simulations were used to test the algorithms under various conditions. The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications."
}
The AI created the following summary of the initial abstract, and once again, it looks great!
This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects.
Experimental data from formation flight simulations were used to test the algorithms under various conditions.
The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications.
Conclusion
This article gave a brief overview of applying LLMs to metadata extraction from complex documents; the extracted JSON data can be stored in a non-relational database for further analysis.
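As a starting point for that analysis, a minimal sketch for loading all the generated files back into Python:

import json
from pathlib import Path

# Load every JSON file produced by the pipeline for downstream analysis
records = [json.loads(p.read_text()) for p in Path("./data/extracted_metadata").glob("*.json")]
print(f"Loaded {len(records)} papers; first title: {records[0]['PaperTitle']}")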
Both LLMs and regular expressions have their strengths and weaknesses for content extraction and should be applied wisely depending on the use case. The full code is available on my GitHub, and you can subscribe to my YouTube channel for more content.
I hope this short tutorial helped you pick up a new set of skills.
Also, if you enjoy reading my stories and want to support my writing, consider becoming a Medium member. For a $5-a-month commitment, you unlock unlimited access to stories on Medium.
Want to buy me a coffee ☕️? → Click here!