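Abstract
This article explains how to use the tool-use capability of Claude models to extract structured JSON data flexibly and efficiently. Five hands-on examples walk AI developers in China through the full path from raw text to structured data, with the aim of speeding up AI application development.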
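Why Structured Data Extraction Matters, and Where It Applies
- Data-driven AI: structured JSON data underpins AI automation, knowledge management, and data analysis.
- Typical scenarios: public-opinion monitoring, intelligent customer service, automatic summarization, knowledge graphs, information extraction, and data labeling.
- Claude's strengths: it understands complex text and emits structured output, which greatly simplifies development.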
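An Introduction to Claude's Tool-Use Mechanism
Claude can produce structured output through custom tools (Tool Use). The developer defines a tool together with an input schema (the `input_schema` field), and Claude generates a JSON object that follows this schema as the input to the tool call.
- Core flow:
  - Define the tool and its input schema
  - Write a prompt that directs Claude to call the tool
  - Read the structured JSON result from the tool call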
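The three steps above can be folded into one small helper. The following is a minimal sketch, not an official pattern: the function name `extract_structured`, the stub client, and the `print_title` tool are hypothetical names introduced for illustration. The helper assumes a client object that exposes `messages.create(...)` in the same way as the `Anthropic` client used in the examples below, and it forces the tool with the same `tool_choice` form that example 5 uses.

```python
from types import SimpleNamespace
from typing import Any, Optional


def extract_structured(client: Any, model: str, tool: dict, prompt: str,
                       max_tokens: int = 4096) -> Optional[dict]:
    """Send one tool-use request and return the tool's input dict, or None."""
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        tools=[tool],
        # Force the named tool so the reply should contain a tool_use block.
        tool_choice={"type": "tool", "name": tool["name"]},
        messages=[{"role": "user", "content": prompt}],
    )
    # Scan the content blocks for the matching tool call.
    for block in response.content:
        if block.type == "tool_use" and block.name == tool["name"]:
            return block.input
    return None


class _StubMessages:
    """Hypothetical stand-in for client.messages, so the sketch runs offline."""

    def create(self, **kwargs):
        tool_name = kwargs["tools"][0]["name"]
        return SimpleNamespace(content=[
            SimpleNamespace(type="text", text="Calling the tool."),
            SimpleNamespace(type="tool_use", name=tool_name,
                            input={"title": "demo"}),
        ])


stub_client = SimpleNamespace(messages=_StubMessages())
demo_tool = {
    "name": "print_title",  # hypothetical tool, for illustration only
    "description": "Prints the title of a document.",
    "input_schema": {
        "type": "object",
        "properties": {"title": {"type": "string"}},
        "required": ["title"],
    },
}
result = extract_structured(stub_client, "stub-model", demo_tool, "Some text")
print(result)
```

With a real client, the same call would look like `extract_structured(Anthropic(), MODEL_NAME, tools[0], query)`. Each of the five examples below repeats this request-then-scan loop inline.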
Five Worked Examples
1. Structured Article Summaries
Scenario: automatically extract structured information about an article, such as its author, topics, summary, coherence, and persuasiveness.
# Install dependencies (notebook syntax; uncomment to run)
# %pip install anthropic requests beautifulsoup4
import json

import requests
from anthropic import Anthropic
from bs4 import BeautifulSoup

client = Anthropic()  # reads the API key from the ANTHROPIC_API_KEY environment variable
MODEL_NAME = "claude-3-haiku-20240307"

# Define the tool schema
tools = [
    {
        "name": "print_summary",
        "description": "Prints a summary of the article.",
        "input_schema": {
            "type": "object",
            "properties": {
                "author": {"type": "string", "description": "Author of the article"},
                "topics": {"type": "array", "items": {"type": "string"}, "description": "List of topics"},
                "summary": {"type": "string", "description": "Summary of the article"},
                "coherence": {"type": "integer", "description": "Coherence score, 0-100"},
                "persuasion": {"type": "number", "description": "Persuasion score, 0.0-1.0"},
                "counterpoint": {"type": "string", "description": "A counterpoint to the article's argument"}
            },
            "required": ["author", "topics", "summary", "coherence", "persuasion", "counterpoint"]
        }
    }
]

# Fetch the article and keep only the paragraph text
url = "https://www.anthropic.com/news/third-party-testing"
page = requests.get(url)
page.raise_for_status()  # stop early if the download failed
soup = BeautifulSoup(page.text, "html.parser")
article = " ".join([p.text for p in soup.find_all("p")])

query = f"""
<article>
{article}
</article>
Use the `print_summary` tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

# The structured result is the input of the tool_use content block
json_summary = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_summary":
        json_summary = content.input
        break

if json_summary:
    print("JSON Summary:")
    print(json.dumps(json_summary, indent=2, ensure_ascii=False))
else:
    print("No structured summary was extracted.")
2. Named Entity Recognition (NER)
Scenario: automatically identify entities in text, such as people, organizations, and locations, and return them in structured form.
# Define the tool schema
tools = [
    {
        "name": "print_entities",
        "description": "Prints extracted named entities.",
        "input_schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "Entity name"},
                            "type": {"type": "string", "description": "Entity type (e.g. PERSON, ORGANIZATION, LOCATION)"},
                            "context": {"type": "string", "description": "Context in which the entity appears"}
                        },
                        "required": ["name", "type", "context"]
                    }
                }
            },
            "required": ["entities"]
        }
    }
]

text = "John works at Google in New York. He met with Sarah, the CEO of Acme Inc., last week in San Francisco."

query = f"""
<document>
{text}
</document>
Use the print_entities tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

# The structured result is the input of the tool_use content block
json_entities = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_entities":
        json_entities = content.input
        break

if json_entities:
    print("Extracted Entities (JSON):")
    print(json.dumps(json_entities, indent=2, ensure_ascii=False))
else:
    print("No entities were extracted.")
3. Sentiment Analysis
Scenario: automatically analyze the sentiment of a text and return positive, negative, and neutral scores.
# Define the tool schema
tools = [
    {
        "name": "print_sentiment_scores",
        "description": "Prints the sentiment scores of a given text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "positive_score": {"type": "number", "description": "Positive score, 0.0-1.0"},
                "negative_score": {"type": "number", "description": "Negative score, 0.0-1.0"},
                "neutral_score": {"type": "number", "description": "Neutral score, 0.0-1.0"}
            },
            "required": ["positive_score", "negative_score", "neutral_score"]
        }
    }
]

text = "The product was okay, but the customer service was terrible. I probably won't buy from them again."

query = f"""
<text>
{text}
</text>
Use the print_sentiment_scores tool.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

# The structured result is the input of the tool_use content block
json_sentiment = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_sentiment_scores":
        json_sentiment = content.input
        break

if json_sentiment:
    print("Sentiment Analysis (JSON):")
    print(json.dumps(json_sentiment, indent=2, ensure_ascii=False))
else:
    print("No sentiment scores were extracted.")
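The schema above states the 0.0-1.0 range only in the field descriptions, so the returned numbers are not guaranteed to stay in range or to sum to 1. If downstream code treats the three scores as a distribution, a post-processing step such as the following may help. This is a sketch under that assumption: `normalize_sentiment` is a hypothetical helper, the clamp-then-rescale policy is one possible choice and not something the original workflow prescribes, and the sample values are made up.

```python
def normalize_sentiment(scores: dict) -> dict:
    """Clamp each score to [0, 1], then rescale so the three sum to 1."""
    keys = ("positive_score", "negative_score", "neutral_score")
    clamped = {k: min(1.0, max(0.0, float(scores[k]))) for k in keys}
    total = sum(clamped.values())
    if total == 0:
        # All scores were zero or negative: fall back to a uniform split.
        return {k: 1.0 / len(keys) for k in keys}
    return {k: v / total for k, v in clamped.items()}


# Made-up tool output whose scores sum to 1.2
example_scores = {"positive_score": 0.2, "negative_score": 0.7, "neutral_score": 0.3}
normalized = normalize_sentiment(example_scores)
print(normalized)
```

In the example above, the call would be `normalize_sentiment(json_sentiment)` once a tool call has been found.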
4. Text Classification
Scenario: automatically assign a text to one or more predefined categories and return a confidence score for each.
# Define the tool schema
tools = [
    {
        "name": "print_classification",
        "description": "Prints the classification results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "categories": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "Category name"},
                            "score": {"type": "number", "description": "Confidence score, 0.0-1.0"}
                        },
                        "required": ["name", "score"]
                    }
                }
            },
            "required": ["categories"]
        }
    }
]

text = "The new quantum computing breakthrough could revolutionize the tech industry."

query = f"""
<document>
{text}
</document>
Use the print_classification tool. The categories can be Politics, Sports, Technology, Entertainment, Business.
"""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    messages=[{"role": "user", "content": query}]
)

# The structured result is the input of the tool_use content block
json_classification = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_classification":
        json_classification = content.input
        break

if json_classification:
    print("Text Classification (JSON):")
    print(json.dumps(json_classification, indent=2, ensure_ascii=False))
else:
    print("No classification result was extracted.")
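Because the allowed categories are listed only in the prompt text, the tool output may include a name outside that list. The sketch below shows one way to pick the best-scoring allowed category from a result shaped like the `print_classification` schema. The helper name `top_category` and the sample result are hypothetical.

```python
from typing import Optional

ALLOWED_CATEGORIES = {"Politics", "Sports", "Technology", "Entertainment", "Business"}


def top_category(result: dict, allowed: set) -> Optional[dict]:
    """Return the highest-scoring category whose name is in `allowed`, or None."""
    candidates = [
        c for c in result.get("categories", [])
        if c.get("name") in allowed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["score"])


# Made-up tool output, including one category outside the allowed list
sample_result = {
    "categories": [
        {"name": "Technology", "score": 0.9},
        {"name": "Business", "score": 0.6},
        {"name": "Gossip", "score": 0.95},
    ]
}
best = top_category(sample_result, ALLOWED_CATEGORIES)
print(best)
```

A stricter alternative is to constrain the `name` field with a JSON Schema `enum` listing the five categories, which moves the constraint from the prompt into the schema.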
5. Open-Ended Attribute Extraction
Scenario: extract every identifiable attribute from a descriptive text, using an open schema that does not fix the field names in advance.
# Define a tool with an open schema
tools = [
    {
        "name": "print_all_characteristics",
        "description": "Prints all characteristics which are provided.",
        "input_schema": {
            "type": "object",
            "additionalProperties": True
        }
    }
]

query = """Given a description of a character, your task is to extract all the characteristics of the character and print them using the print_all_characteristics tool.
The print_all_characteristics tool takes an arbitrary number of inputs where the key is the characteristic name and the value is the characteristic value (age: 28 or eye_color: green).
<description>
The man is tall, with a beard and a scar on his left cheek. He has a deep voice and wears a black leather jacket.
</description>
Now use the print_all_characteristics tool."""

response = client.messages.create(
    model=MODEL_NAME,
    max_tokens=4096,
    tools=tools,
    # Force Claude to call this specific tool
    tool_choice={"type": "tool", "name": "print_all_characteristics"},
    messages=[{"role": "user", "content": query}]
)

# The structured result is the input of the tool_use content block
tool_output = None
for content in response.content:
    if content.type == "tool_use" and content.name == "print_all_characteristics":
        tool_output = content.input
        break

if tool_output:
    print("Characteristics (JSON):")
    print(json.dumps(tool_output, indent=2, ensure_ascii=False))
else:
    print("No characteristics were extracted.")
Overall Flow and Architecture Diagrams
Figure 1: Flowchart of structured JSON extraction with Claude
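Notes and Best Practices
- Write detailed schemas: the clearer each field description is, the more accurate Claude's output tends to be.
- Error handling: check the type of each returned content block before reading it, so that a reply without a tool call does not break parsing.
- Token limits: split long texts into segments so that each request stays within the model's limits.
- Open schemas need tighter prompt constraints: when `additionalProperties` is enabled, spell out the expected key names and value formats in the prompt, as example 5 does.
- Debugging: print intermediate results liberally to make problems easier to trace.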
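The advice on token limits can be sketched as a simple splitter that produces overlapping segments, each of which is then sent in its own request. This is an illustration under stated assumptions: `chunk_text` is a hypothetical helper, it counts characters and not tokens, and the default sizes are arbitrary placeholders that would need tuning against the actual model limit.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into segments of at most max_chars, overlapping by `overlap`."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    if not text:
        return []
    chunks = []
    start = 0
    step = max_chars - overlap
    while True:
        chunks.append(text[start:start + max_chars])
        # Stop once this segment reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxy"  # 25 characters
chunks = chunk_text(sample, max_chars=10, overlap=2)
print(chunks)
```

The overlap lowers the chance that an entity or sentence is cut in half at a boundary, at the cost of possible duplicate results that then need merging.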
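FAQ and Further Reading
Frequently asked questions:
- Claude returned no structured content? Check the prompt and the schema definition, and consider forcing the tool with `tool_choice` as in example 5.
- Fields missing or of the wrong type? Refine the schema descriptions and add examples to the prompt.
- Errors on very long text? Split the input into segments. Note that `max_tokens` caps the length of the output, so raise it if the tool call itself is cut off (the response then has `stop_reason` set to `"max_tokens"`).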
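Missing or mistyped fields can be caught before the data reaches downstream code. The sketch below performs a shallow check of a tool input against the `required` and `properties` entries of its `input_schema`. It is an illustration only: `check_tool_input` is a hypothetical helper, it inspects top-level fields and ignores nested objects and array items, and it is not a full JSON Schema validator (a dedicated validation library would be the more thorough choice).

```python
# Mapping from JSON Schema type names to Python types
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_tool_input(data: dict, schema: dict) -> list:
    """Return a list of problems found in `data`; an empty list means it passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in data:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in data:
            continue
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name)
        if expected is None:
            continue  # unknown or unspecified type: skip the check
        value = data[field]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and type_name in ("integer", "number"):
            ok = False
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    return problems


# Made-up schema and tool output for illustration
sample_schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "coherence": {"type": "integer"},
        "summary": {"type": "string"},
    },
    "required": ["author", "coherence", "summary"],
}
sample_data = {"author": "A. Writer", "coherence": "85"}
problems = check_tool_input(sample_data, sample_schema)
print(problems)
```

When the returned list is not empty, one option is to retry the request with the problems appended to the prompt.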
Further reading:
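Summary and References
This article has covered the full workflow for extracting structured JSON data with Claude's tool use, illustrated by five worked examples. Developers are encouraged to design schemas and prompts around their own business scenarios and to keep refining the extraction results.
References: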