How to Handle Information Extraction from Long Text: A Practical Guide

Latest recommended article published on 2025-09-24 11:01:40

Licensed under CC BY-SA 4.0

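When working with files such as PDFs, you may run into text that exceeds your language model's context window. To process such text, the following strategies are worth considering:

Change the LLM: choose a different language model that supports a larger context window.
Brute force: split the document into chunks and extract content from every chunk.
RAG: split the document into chunks, index the chunks, and extract content only from the subset of chunks that look "relevant".

Keep in mind that these strategies have different trade-offs, and the best one will likely depend on the application you are designing.

This guide shows how to implement the second and third strategies: brute force and RAG.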
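Before picking a strategy, it can help to check roughly whether a document fits in the context window at all. The following is a minimal sketch, not part of the original guide: `estimate_tokens` and `choose_strategy` are hypothetical helpers, the four-characters-per-token ratio is only an assumed average for English text, and the 1,000-token output reserve is an arbitrary illustrative value. A real tokenizer would give an exact count.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. chars_per_token is an assumed average, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def choose_strategy(text: str, context_window: int, reserved_for_output: int = 1000) -> str:
    """Return a coarse recommendation based on the estimated token count."""
    # Leave room for the prompt scaffolding and the model's answer.
    budget = context_window - reserved_for_output
    if estimate_tokens(text) <= budget:
        return "single call"
    return "chunk (brute force or RAG)"


short_doc = "word " * 1_000    # 5,000 characters
long_doc = "word " * 200_000   # 1,000,000 characters

print(choose_strategy(short_doc, context_window=8_000))
print(choose_strategy(long_doc, context_window=8_000))
```

If the estimate lands close to the budget, it is safer to count tokens with the model's actual tokenizer before deciding.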

Setup

We need some example data! Let's download the Wikipedia article on cars and load it as a LangChain document.

import re
import requests
from langchain_community.document_loaders import BSHTMLLoader

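# Download the page
response = requests.get("https://en.wikipedia.org/wiki/Car")
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
# Write the HTML to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load the file with an HTML parser, reading it with the same encoding it was written in
loader = BSHTMLLoader("car.html", open_encoding="utf-8")
document = loader.load()[0]
# Clean up the text: collapse runs of blank lines into a single newline
document.page_content = re.sub("\n\n+", "\n", document.page_content)

print(len(document.page_content))  # length of the article in characters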
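To see what the cleanup regex in the setup code does, here is a small standalone sketch. The `raw` string is made-up sample text, not actual output from the loader.

```python
import re

# Made-up text with runs of blank lines, similar in shape to text extracted from HTML.
raw = "Car\n\n\n\nA car is a motor vehicle.\n\nHistory\nEarly cars"

# Any run of two or more newlines becomes a single newline.
cleaned = re.sub(r"\n\n+", "\n", raw)

print(cleaned.split("\n"))
```

Single newlines are left alone, so line structure is preserved while the empty lines disappear.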

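Define the extraction schema

We will use Pydantic to define the schema of the information we want to extract: in this example, a list of "key developments", each with a year, a description, and the supporting evidence.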

from typing import List
from langchain_core.prompts import ChatPromptTemplate
# Note: on langchain-core 0.3 and later, import BaseModel and Field from `pydantic` directly.
from langchain_core.pydantic_v1 import BaseModel, Field

class KeyDevelopment(BaseModel):
    year: int = Field(..., description="The year of the development.")
    description: str = Field(..., description="What happened in this year?")
    evidence: str = Field(..., description="The sentence(s) from which the information was extracted.")

class ExtractionData(BaseModel):
    key_developments: List[KeyDevelopment]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at identifying key historic developments in text."),
    ("human", "{text}"),
])
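To make the shape of the extracted data concrete, the sketch below mirrors the schema with standard-library dataclasses and parses a JSON payload into it. This is only an illustration: `KeyDevelopmentSketch` and `parse_extraction` are hypothetical names, the payload is placeholder data rather than real model output, and in the actual pipeline the Pydantic models do this validation.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class KeyDevelopmentSketch:
    """Standard-library stand-in for the KeyDevelopment Pydantic model."""
    year: int
    description: str
    evidence: str


def parse_extraction(payload: str) -> List[KeyDevelopmentSketch]:
    """Parse a JSON string shaped like ExtractionData and coerce the field types."""
    data = json.loads(payload)
    return [
        KeyDevelopmentSketch(
            year=int(item["year"]),  # coerce "1999" -> 1999
            description=str(item["description"]),
            evidence=str(item["evidence"]),
        )
        for item in data["key_developments"]
    ]


# Placeholder payload, not real model output.
sample_payload = json.dumps({
    "key_developments": [
        {"year": "1999", "description": "Placeholder event", "evidence": "Placeholder sentence."}
    ]
})

parsed = parse_extraction(sample_payload)
print(parsed[0])
```

The `evidence` field is useful in practice because it lets you check each extracted item against the sentence it supposedly came from.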

Create an extractor

Choose an LLM that supports tool calling. The following is an example setup using OpenAI:

from langchain_openai import ChatOpenAI

# ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable by default.
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

extractor = prompt | llm.with_structured_output(schema=ExtractionData, include_raw=False)

Brute force approach

Split the document into chunks that each fit within the LLM's context window.

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=2000,   # size of each chunk, in tokens
    chunk_overlap=20,  # tokens shared between consecutive chunks
)
texts = text_splitter.split_text(document.page_content)
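The idea behind `chunk_size` and `chunk_overlap` can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: `split_with_overlap` is a hypothetical helper, and the "tokens" here are just list items rather than real tokenizer output.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of some repeated text.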

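Extract content from the chunks in parallel:

# Limit the run to the first 3 chunks so the example stays quick to re-run
first_few = texts[:3]

extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # run at most 5 requests at a time
)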
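Conceptually, a batch call with a concurrency limit behaves like mapping a function over the inputs with a bounded pool of workers. The sketch below shows that idea with the standard library. `fake_extract` is a hypothetical stand-in for a single extractor call and `batch_extract` is a hypothetical helper; neither is LangChain code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def fake_extract(inputs: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical stand-in for one extractor call: reports the chunk length."""
    return {"n_chars": len(inputs["text"])}


def batch_extract(inputs: List[Dict[str, str]], max_concurrency: int = 5) -> List[Dict[str, int]]:
    """Run fake_extract over all inputs with at most max_concurrency workers."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in input order, regardless of completion order.
        return list(pool.map(fake_extract, inputs))


chunk_inputs = [{"text": "a" * n} for n in (3, 1, 2)]
results = batch_extract(chunk_inputs)
print(results)
```

Threads suit this workload because each call mostly waits on a network response. The limit keeps the number of simultaneous requests under the provider's rate limits.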

Merge the results:

key_developments = []
for extraction in extractions:
    key_developments.extend(extraction.key_developments)

print(key_developments[:10])
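Because chunks overlap and the same fact can appear in several places, a plain merge may contain duplicates. One possible post-processing step, sketched here under assumptions, is to deduplicate on year plus normalized description and then sort by year. `merge_developments` is a hypothetical helper, plain dicts stand in for the `KeyDevelopment` objects, and the sample items are placeholders.

```python
from typing import Dict, List


def merge_developments(batches: List[List[Dict]]) -> List[Dict]:
    """Flatten per-chunk results, drop near-exact duplicates, and sort by year."""
    seen = set()
    merged: List[Dict] = []
    for batch in batches:
        for item in batch:
            # Treat items as duplicates if year and normalized description match.
            key = (item["year"], item["description"].strip().lower())
            if key in seen:
                continue
            seen.add(key)
            merged.append(item)
    return sorted(merged, key=lambda item: item["year"])


# Placeholder results from two chunks; the second repeats an item from the first.
batches = [
    [
        {"year": 2001, "description": "Placeholder event B", "evidence": "Sentence B."},
        {"year": 1990, "description": "Placeholder event A", "evidence": "Sentence A."},
    ],
    [
        {"year": 1990, "description": "placeholder event a ", "evidence": "Sentence A."},
    ],
]

merged = merge_developments(batches)
print([(d["year"], d["description"]) for d in merged])
```

Exact-match deduplication will not catch the same event described in different words. Handling that would need fuzzy matching or another LLM pass.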

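RAG-based approach

Instead of extracting information from every chunk, this approach indexes the chunks and extracts only from the ones most relevant to a query.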

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())

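# Retrieve only the single most relevant chunk (k=1)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

# Feed the retrieved chunk's text into the extractor as {text}
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)
} | extractor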

results = rag_extractor.invoke("Key developments associated with cars")

for key_development in results.key_developments:
    print(key_development)
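The retrieval step can be illustrated without a vector store. The sketch below ranks chunks by cosine similarity over simple word counts, which is a deliberately crude stand-in for embeddings and FAISS. `bag_of_words`, `cosine`, and `top_k` are hypothetical helpers and the chunks are placeholder text.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "Placeholder chunk about road safety rules.",
    "Placeholder chunk listing key developments of cars over time.",
    "Placeholder chunk about fuel prices.",
]

best = top_k("Key developments associated with cars", toy_chunks, k=1)
print(best)
```

With `k=1`, extraction only ever sees one chunk, so developments mentioned elsewhere in the document are missed. Raising `k` trades extra cost for better coverage.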