Improving Data Extraction Accuracy: A Practical Guide to Using Reference Examples

Latest recommended article published 2025-12-17 15:36:40

License: CC 4.0 BY-SA

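When extracting data with a large language model (LLM), supplying reference examples often noticeably improves the quality of the extraction. This guide shows how to build few-shot examples of tool calls to steer the behavior of extraction and similar applications.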

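Technical Background

Data extraction aims to produce a structured representation of information found in text and other unstructured or semi-structured formats. LLM tool-calling features are commonly used for this purpose. Reference examples are an effective strategy here, and they apply not only to tool-calling models but also to JSON mode and prompt-based techniques.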
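To illustrate the prompt-based variant mentioned above, here is a minimal sketch that renders reference examples as plain text followed by their expected JSON. It uses only the standard library and does not call any model. The helper name build_few_shot_prompt, the EXAMPLES list, and the "Text:" / "JSON:" layout are assumptions made for illustration, not part of LangChain.

```python
import json

# Hypothetical reference examples: (input text, expected structured output).
EXAMPLES = [
    ("The ocean is vast and blue. It's more than 20,000 feet deep.", {"people": []}),
    (
        "Fiona traveled far from France to Spain.",
        {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
    ),
]


def build_few_shot_prompt(text: str) -> str:
    """Render the reference examples and the new input as one plain-text prompt."""
    parts = ["Extract information about people as JSON. Use null for unknown values."]
    for example_text, expected in EXAMPLES:
        parts.append(f"Text: {example_text}")
        # json.dumps turns Python None into JSON null
        parts.append(f"JSON: {json.dumps(expected)}")
    # The new input goes last, leaving the JSON answer for the model to complete.
    parts.append(f"Text: {text}")
    parts.append("JSON:")
    return "\n".join(parts)


prompt_text = build_few_shot_prompt("Alan is 1.8 meters tall.")
print(prompt_text)
```

The same examples can later be reused in tool-calling form; only the rendering changes.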

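Core Principles

LangChain exposes a tool_calls attribute on LLM messages that contain tool calls. To create reference examples for data extraction, we build a chat history that contains the following sequence for each example:

HumanMessage: contains the example input
AIMessage: contains the example tool call
ToolMessage: contains the example tool output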
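The three-message pattern above can be sketched with plain dictionaries standing in for the message classes. This is a simplified illustration of the structure only: the function name example_to_history and the "role" labels are assumptions, while the id, name, args, and tool_call_id keys mirror the fields used in the LangChain code later in this article.

```python
import uuid


def example_to_history(text: str, tool_name: str, args: dict) -> list:
    """Build the human / AI / tool message triple for one reference example."""
    # One shared id links the AI tool call to the tool message that answers it.
    call_id = str(uuid.uuid4())
    return [
        {"role": "human", "content": text},
        {
            "role": "ai",
            "content": "",
            "tool_calls": [{"id": call_id, "name": tool_name, "args": args}],
        },
        {
            "role": "tool",
            "content": "You have correctly called this tool.",
            "tool_call_id": call_id,
        },
    ]


history = example_to_history(
    "Fiona traveled far from France to Spain.",
    "Data",
    {"people": [{"name": "Fiona", "hair_color": None, "height_in_meters": None}]},
)
for message in history:
    print(message["role"], message)
```

The key detail is the shared identifier: every tool call in the AI message needs a tool message that refers back to it by the same id.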

Code Walkthrough

The following code shows how to create reference examples with LangChain and use them for data extraction:

import uuid
from typing import Dict, List, Optional

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
# Recent LangChain releases work with pydantic directly;
# the langchain_core.pydantic_v1 shim is deprecated.
from pydantic import BaseModel, Field

# Define the Person data model. Every field must be supplied but may be None.
class Person(BaseModel):
    """Information about a person."""
    name: Optional[str] = Field(..., description="The name of the person")
    hair_color: Optional[str] = Field(..., description="The color of the person's hair if known")
    height_in_meters: Optional[str] = Field(..., description="Height in METERs")

class Data(BaseModel):
    """Extracted data about people."""
    people: List[Person]

# Build a custom prompt template with a placeholder for the example messages
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an expert extraction algorithm. "
         "Only extract relevant information from the text. "
         "If you do not know the value of an attribute asked "
         "to extract, return null for the attribute's value."),
        MessagesPlaceholder("examples"),
        ("human", "{text}"),
    ]
)

# Define reference examples as (input text, expected output) pairs
examples = [
    (
        "The ocean is vast and blue. It's more than 20,000 feet deep. There are many fish in it.",
        Data(people=[]),
    ),
    (
        "Fiona traveled far from France to Spain.",
        Data(people=[Person(name="Fiona", height_in_meters=None, hair_color=None)]),
    ),
]

# Convert one example into the human / AI / tool message sequence
def tool_example_to_messages(example: Dict) -> List:
    messages = [HumanMessage(content=example["input"])]
    # The tool name is the class name of the example object (here "Data")
    tool_calls = [
        {"id": str(uuid.uuid4()), "args": tool_call.dict(), "name": tool_call.__class__.__name__}
        for tool_call in example["tool_calls"]
    ]
    messages.append(AIMessage(content="", tool_calls=tool_calls))
    # Each tool call gets a matching ToolMessage that refers back to its id
    messages.extend(
        ToolMessage(content="You have correctly called this tool.", tool_call_id=call["id"])
        for call in tool_calls
    )
    return messages

messages = []
for text, tool_call in examples:
    messages.extend(tool_example_to_messages({"input": text, "tool_calls": [tool_call]}))

# Inspect the rendered prompt
example_prompt = prompt.invoke({"text": "this is some text", "examples": messages})

for message in example_prompt.messages:
    print(f"{message.type}: {message}")

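# Create the extractor. with_structured_output is a LangChain chat model method,
# so use ChatOpenAI (from the langchain-openai package) rather than the raw openai client.
from langchain_openai import ChatOpenAI

client = ChatOpenAI(
    model='your-model-name',  # placeholder: any tool-calling model served by the endpoint
    base_url='https://yunwu.ai/v1',  # OpenAI-compatible endpoint (stable access from mainland China)
    api_key='your-api-key',
)

# Chain the prompt into the model, constrained to the Data schema
runnable = prompt | client.with_structured_output(
    schema=Data,
    method="function_calling",
    include_raw=False,
)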

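# Run the extraction several times with the reference examples supplied.
# Passing "examples": [] instead gives the no-example baseline for comparison.
for _ in range(5):
    text = "The solar system is large, but earth has only 1 moon."
    print(runnable.invoke({"text": text, "examples": messages}))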