Claude and Tool Use: Structured JSON Data Extraction in Practice

Latest recommended article published on 2025-07-15 17:57:47

License: CC 4.0 BY-SA

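Abstract

This article shows how to use the tool use capability of the Claude model to extract structured JSON data from text. Five worked examples take AI developers through the whole path from raw text to structured data, with the aim of making AI application development faster.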

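Why Structured Data Extraction Matters and Where It Applies

Data-driven AI: structured JSON data is the foundation for AI automation, knowledge management, and data analysis.
Typical scenarios: public-opinion monitoring, intelligent customer service, automatic summarization, knowledge graphs, information extraction, data labeling, and similar tasks.
Claude's strengths: it handles complex text understanding and structured output, which simplifies the development workflow considerably.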

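A Brief Introduction to Claude's Tool Use Mechanism

Claude supports structured output through custom tools (Tool Use). The developer defines an input schema for a tool, and when Claude calls the tool it fills in the tool's input as JSON that follows that schema. The tool itself never has to run: the tool input is the extracted data.

Core workflow:
1. Define the tool and its input schema
2. Write a prompt that directs Claude to call the tool
3. Read the structured JSON result from the tool call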
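
Step 3 is the same in all five examples that follow: loop over `response.content` and pick out the `tool_use` block. As a minimal sketch, that loop could be factored into a helper. The name `extract_tool_input` is hypothetical (it is not part of the Anthropic SDK), and the `SimpleNamespace` objects only stand in for a real API response so the helper can be run offline.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_tool_input(response: Any, tool_name: str) -> Optional[dict]:
    """Return the input of the first tool_use block named tool_name, or None."""
    for block in response.content:
        # Text blocks have no `name`, so read both attributes defensively.
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return block.input
    return None


# Hypothetical stand-in for an API response (not real model output).
fake_response = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="Calling the tool now."),
        SimpleNamespace(type="tool_use", name="print_summary", input={"author": "Jane Doe"}),
    ]
)

summary = extract_tool_input(fake_response, "print_summary")
missing = extract_tool_input(fake_response, "print_entities")
```

With a real response, the call would be `extract_tool_input(response, "print_summary")`, and a `None` result signals that Claude answered without calling the tool.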

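Five Worked Examples

1. Structured Article Summary

Scenario: automatically extract structured information about an article, such as its author, topics, summary, coherence, and persuasiveness.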

# Install dependencies
# %pip install anthropic requests beautifulsoup4
from anthropic import Anthropic
import requests
from bs4 import BeautifulSoup
import json

client = Anthropic()
MODEL_NAME = "claude-3-haiku-20240307"

# Define the tool schema
tools = [
    {
        "name": "print_summary",
        "description": "Prints a summary of the article.",
        "input_schema": {
            "type": "object",
            "properties": {
                "author": {"type": "string", "description": "Author of the article"},
                "topics": {"type": "array", "items": {"type": "string"}, "description": "List of topics"},
                "summary": {"type": "string", "description": "Summary of the article"},
                "coherence": {"type": "integer", "description": "Coherence score, 0-100"},
                "persuasion": {"type": "number", "description": "Persuasiveness score, 0.0-1.0"},
                "counterpoint": {"type": "string", "description": "Counterpoint to the article's argument"}
            },
            "required": ["author", "topics", "summary", "coherence", "persuasion", "counterpoint"]
        }
    }
]

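# Fetch the article content
url = "https://www.anthropic.com/news/third-party-testing"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of summarizing an error page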
soup = BeautifulSoup(response.text, "html.parser")
article = " ".join([p.text for p in soup.find_all("p")])

query = f"""
<article>
{article}
</article>

Use the `print_summary` tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)
json_summary = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_summary":
        json_summary = content.input
        break

if json_summary:
    print("JSON Summary:")
    print(json.dumps(json_summary, indent=2, ensure_ascii=False))
else:
    print("No structured summary was extracted.")
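
The score ranges in this schema (0-100 and 0.0-1.0) appear only in the `description` strings, which guide the model but are not validation rules, so it may be worth checking them after extraction. The sketch below is one way to do that; `check_summary_ranges` is a hypothetical helper, and the sample dictionaries are hand-written rather than real model output.

```python
def check_summary_ranges(summary: dict) -> list:
    """Return a list of human-readable problems; an empty list means the scores look valid."""
    problems = []

    coherence = summary.get("coherence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(coherence, int) or isinstance(coherence, bool) or not 0 <= coherence <= 100:
        problems.append(f"coherence should be an integer in 0-100, got {coherence!r}")

    persuasion = summary.get("persuasion")
    if (
        not isinstance(persuasion, (int, float))
        or isinstance(persuasion, bool)
        or not 0.0 <= persuasion <= 1.0
    ):
        problems.append(f"persuasion should be a number in 0.0-1.0, got {persuasion!r}")

    return problems


# Hand-written samples for illustration only.
good = check_summary_ranges({"coherence": 85, "persuasion": 0.7})
bad = check_summary_ranges({"coherence": 120, "persuasion": "high"})
```

In the example above, the check would be applied as `check_summary_ranges(json_summary)` before the result is stored or displayed.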

2. Named Entity Recognition (NER)

Scenario: automatically identify entities in the text, such as people, organizations, and locations, and output them in structured form.

# Define the tool schema
tools = [
    {
        "name": "print_entities",
        "description": "Prints extracted named entities.",
        "input_schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The entity name"},
                            "type": {"type": "string", "description": "The entity type (e.g. PERSON, ORGANIZATION, LOCATION)"},
                            "context": {"type": "string", "description": "The context in which the entity appears"}
                        },
                        "required": ["name", "type", "context"]
                    }
                }
            },
            "required": ["entities"]
        }
    }
]

text = "John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco."

query = f"""
<document>
{text}
</document>

Use the print_entities tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

json_entities = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_entities":
        json_entities = content.input
        break

if json_entities:
    print("Extracted Entities (JSON):")
    print(json.dumps(json_entities, indent=2, ensure_ascii=False))
else:
    print("No entities were extracted.")
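
Downstream code often wants the entities grouped by type rather than as a flat list. The following is a small sketch of that post-processing step; `group_entities_by_type` is a hypothetical helper, the sample payload is hand-written to match the schema above (it is not real model output), and upper-casing the type is an assumption made here to absorb inconsistent casing.

```python
from collections import defaultdict


def group_entities_by_type(payload: dict) -> dict:
    """Map each entity type (upper-cased) to the list of entity names of that type."""
    grouped = defaultdict(list)
    for entity in payload.get("entities", []):
        grouped[entity["type"].upper()].append(entity["name"])
    return dict(grouped)


# Hand-written sample in the shape defined by the print_entities schema.
sample_entities = {
    "entities": [
        {"name": "John", "type": "PERSON", "context": "John works at Google in New York."},
        {"name": "Google", "type": "ORGANIZATION", "context": "John works at Google in New York."},
        {"name": "Sarah", "type": "person", "context": "He met with Sarah, the CEO of Acme Inc."},
    ]
}

by_type = group_entities_by_type(sample_entities)
```

With the example above, the same function would be called on `json_entities` once it is known to be non-empty.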

3. Sentiment Analysis

Scenario: automatically analyze the sentiment of a text and output positive, negative, and neutral scores.

# Define the tool schema
tools = [
    {
        "name": "print_sentiment_scores",
        "description": "Prints the sentiment scores of a given text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "positive_score": {"type": "number", "description": "Positive sentiment score, 0.0-1.0"},
                "negative_score": {"type": "number", "description": "Negative sentiment score, 0.0-1.0"},
                "neutral_score": {"type": "number", "description": "Neutral sentiment score, 0.0-1.0"}
            },
            "required": ["positive_score", "negative_score", "neutral_score"]
        }
    }
]

text = "The product was okay, but the customer service was terrible. I probably won't buy from them again."

query = f"""
<text>
{text}
</text>

Use the print_sentiment_scores tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

json_sentiment = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_sentiment_scores":
        json_sentiment = content.input
        break

if json_sentiment:
    print("Sentiment Analysis (JSON):")
    print(json.dumps(json_sentiment, indent=2, ensure_ascii=False))
else:
    print("No sentiment analysis result was extracted.")
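
Nothing in the schema forces the three scores to sum to 1, and many applications only need a single label. The sketch below shows one possible post-processing step under those assumptions; `dominant_sentiment` and `normalize_scores` are hypothetical helpers, and the sample scores are hand-written rather than real model output.

```python
LABELS = {
    "positive_score": "positive",
    "negative_score": "negative",
    "neutral_score": "neutral",
}


def dominant_sentiment(scores: dict) -> str:
    """Return the label whose score is highest; missing scores count as 0.0."""
    best_key = max(LABELS, key=lambda key: scores.get(key, 0.0))
    return LABELS[best_key]


def normalize_scores(scores: dict) -> dict:
    """Rescale the three scores so they sum to 1; leave them unchanged if all are 0."""
    total = sum(scores.get(key, 0.0) for key in LABELS)
    if total == 0:
        return {key: scores.get(key, 0.0) for key in LABELS}
    return {key: scores.get(key, 0.0) / total for key in LABELS}


# Hand-written sample for illustration only.
sample_scores = {"positive_score": 1.0, "negative_score": 2.0, "neutral_score": 1.0}
label = dominant_sentiment(sample_scores)
normalized = normalize_scores(sample_scores)
```

In the example above, both helpers would take `json_sentiment` as their argument.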

4. Text Classification

Scenario: automatically assign a text to one or more predefined categories and output a confidence score for each.

# Define the tool schema
tools = [
    {
        "name": "print_classification",
        "description": "Prints the classification results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "categories": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The category name"},
                            "score": {"type": "number", "description": "Confidence score, 0.0-1.0"}
                        },
                        "required": ["name", "score"]
                    }
                }
            },
            "required": ["categories"]
        }
    }
]

text = "The new quantum computing breakthrough could revolutionize the tech industry."

query = f"""
<document>
{text}
</document>

Use the print_classification tool. The categories can be Politics, Sports, Technology, Entertainment, Business.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

json_classification = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_classification":
        json_classification = content.input
        break

if json_classification:
    print("Text Classification (JSON):")
    print(json.dumps(json_classification, indent=2, ensure_ascii=False))
else:
    print("No classification result was extracted.")
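
In this example the allowed categories are named only in the prompt text. When the category list is fixed, one alternative is to put it into the schema itself with the JSON Schema `enum` keyword. The sketch below builds such a tool definition and picks the best-scoring category; `build_classification_tool` and `top_category` are hypothetical helpers, and the sample result is hand-written rather than real model output.

```python
def build_classification_tool(categories: list) -> dict:
    """Build a print_classification tool whose category names are restricted by an enum."""
    return {
        "name": "print_classification",
        "description": "Prints the classification results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "categories": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            # enum lists the only category names the schema allows
                            "name": {"type": "string", "enum": list(categories), "description": "The category name"},
                            "score": {"type": "number", "description": "Confidence score, 0.0-1.0"},
                        },
                        "required": ["name", "score"],
                    },
                }
            },
            "required": ["categories"],
        },
    }


def top_category(result: dict) -> str:
    """Return the name of the category with the highest score."""
    return max(result["categories"], key=lambda item: item["score"])["name"]


allowed = ["Politics", "Sports", "Technology", "Entertainment", "Business"]
tool = build_classification_tool(allowed)

# Hand-written sample in the shape defined by the schema.
sample_result = {"categories": [{"name": "Technology", "score": 0.9}, {"name": "Business", "score": 0.4}]}
best = top_category(sample_result)
```

The resulting dictionary would be passed as `tools=[tool]` in place of the hand-written definition; keeping the category list in the prompt as well does no harm.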

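5. Open-Ended Attribute Extraction

Scenario: given a descriptive text, automatically extract every identifiable attribute without fixing the keys in advance, using an open schema.

# Define a tool with an open schema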
tools = [
    {
        "name": "print_all_characteristics",
        "description": "Prints all characteristics which are provided.",
        "input_schema": {
            "type": "object",
            "additionalProperties": True
        }
    }
]

query = f"""Given a description of a character, your task is to extract all the characteristics of the character and print them using the print_all_characteristics tool.

The print_all_characteristics tool takes an arbitrary number of inputs where the key is the characteristic name and the value is the characteristic value (age: 28 or eye_color: green).

<description>
The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.
</description>

Now use the print_all_characteristics tool."""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    tool_choice={"type": "tool", "name": "print_all_characteristics"},
    messages=[{"role": "user", "content": query}]
)

tool_output = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_all_characteristics":
        tool_output = content.input
        break

if tool_output:
    print("Characteristics (JSON):")
    print(json.dumps(tool_output, indent=2, ensure_ascii=False))
else:
    print("No characteristics were extracted.")
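
Because the schema is open, the key names are chosen by the model and may vary in casing or separators from one call to the next. A possible clean-up step, sketched below, normalizes the keys to snake_case before the data is stored; `normalize_keys` is a hypothetical helper and the sample dictionary is hand-written rather than real model output.

```python
import re


def normalize_keys(characteristics: dict) -> dict:
    """Lower-case each key and replace runs of non-alphanumeric characters with '_'."""
    normalized = {}
    for key, value in characteristics.items():
        clean = re.sub(r"[^0-9a-zA-Z]+", "_", key.strip().lower()).strip("_")
        normalized[clean] = value
    return normalized


# Hand-written sample for illustration only.
sample_characteristics = {"Eye Color": "green", "facial-hair": "beard", " height ": "tall"}
cleaned = normalize_keys(sample_characteristics)
```

In the example above, the function would be applied to `tool_output`. Note that two keys that normalize to the same name would overwrite each other, so a stricter version might want to detect that case.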