使用Anthropic Claude工具提取结构化JSON数据的技术实践

娄朋虎Imogene

于 2025-06-03 09:04:46 发布

阅读量385

点赞数 4

CC 4.0 BY-SA版权

本文链接：https://blog.youkuaiyun.com/gitblog_01199/article/details/148392872

使用Anthropic Claude工具提取结构化JSON数据的技术实践

anthropic-cookbook A collection of notebooks/recipes showcasing some fun and effective ways of using Claude. 项目地址: https://gitcode.com/gh_mirrors/an/anthropic-cookbook

概述

在现代自然语言处理应用中，将非结构化文本转换为结构化数据是一项关键任务。Anthropic的Claude模型通过其工具使用功能，提供了一种高效的方式来实现这一转换。本文将详细介绍如何利用Claude模型从各种文本输入中提取结构化JSON数据。

环境准备

在开始之前，我们需要设置开发环境：

%pip install anthropic requests beautifulsoup4

from anthropic import Anthropic
import requests
from bs4 import BeautifulSoup
import json

client = Anthropic()
MODEL_NAME = "claude-3-haiku-20240307"

这些依赖项包括Anthropic的Python客户端库、用于HTTP请求的requests库，以及用于HTML解析的BeautifulSoup库。

文章摘要提取

应用场景

文章摘要提取功能特别适用于内容聚合平台、新闻分析系统或研究辅助工具，能够快速获取文章的核心信息。

实现方法

我们定义一个print_summary工具，指定输出JSON的结构：

tools = [
    {
        "name": "print_summary",
        "description": "Prints a summary of the article.",
        "input_schema": {
            "type": "object",
            "properties": {
                "author": {"type": "string", "description": "Name of the article author"},
                "topics": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": 'Array of topics, e.g. ["tech", "current affairs"]. Should be as specific as possible, and can overlap.'
                },
                "summary": {"type": "string", "description": "Summary of the article. One or two paragraphs max."},
                "coherence": {"type": "integer", "description": "Coherence of the article's key points, 0-100 (inclusive)"},
                "persuasion": {"type": "number", "description": "Article's persuasion score, 0.0-1.0 (inclusive)"}
            },
            "required": ['author', 'topics', 'summary', 'coherence', 'persuasion', 'counterpoint']
        }
    }
]

技术要点

结构化输出设计：明确指定了作者、主题、摘要等字段及其数据类型
评分系统：包含连贯性和说服力评分，为内容分析提供量化指标
必填字段：通过required属性确保关键信息不会被遗漏

命名实体识别

应用场景

命名实体识别在信息提取、知识图谱构建、智能搜索等领域有广泛应用。

实现方法

tools = [
    {
        "name": "print_entities",
        "description": "Prints extract named entities.",
        "input_schema": {
            "type": "object",
            "properties": {
                "entities": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The extracted entity name."},
                            "type": {"type": "string", "description": "The entity type (e.g., PERSON, ORGANIZATION, LOCATION)."},
                            "context": {"type": "string", "description": "The context in which the entity appears in the text."}
                        },
                        "required": ["name", "type", "context"]
                    }
                }
            },
            "required": ["entities"]
        }
    }
]

技术要点

实体类型分类：支持人物、组织、地点等多种实体类型
上下文保留：记录实体出现的原始上下文，便于后续分析
数组结构：使用数组存储多个实体，适应文本中可能出现的多个实体

情感分析

应用场景

情感分析广泛应用于客户反馈分析、社交媒体监控、市场研究等领域。

实现方法

tools = [
    {
        "name": "print_sentiment_scores",
        "description": "Prints the sentiment scores of a given text.",
        "input_schema": {
            "type": "object",
            "properties": {
                "positive_score": {"type": "number", "description": "The positive sentiment score, ranging from 0.0 to 1.0."},
                "negative_score": {"type": "number", "description": "The negative sentiment score, ranging from 0.0 to 1.0."},
                "neutral_score": {"type": "number", "description": "The neutral sentiment score, ranging from 0.0 to 1.0."}
            },
            "required": ["positive_score", "negative_score", "neutral_score"]
        }
    }
]

技术要点

多维度评分：提供正面、负面和中性三个维度的情感评分
归一化数值：使用0-1范围的标准化分数，便于比较和分析
精确数值：采用浮点数而非整数，提高评分精度

文本分类

应用场景

文本分类在内容管理、信息过滤、主题建模等场景中非常有用。

实现方法

tools = [
    {
        "name": "print_classification",
        "description": "Prints the classification results.",
        "input_schema": {
            "type": "object",
            "properties": {
                "categories": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "The category name."},
                            "score": {"type": "number", "description": "The classification score for the category, ranging from 0.0 to 1.0."}
                        },
                        "required": ["name", "score"]
                    }
                }
            },
            "required": ["categories"]
        }
    }
]

技术要点

多标签分类：支持一个文本属于多个类别的情况
置信度评分：为每个分类结果提供置信度分数
灵活扩展：可以轻松添加或修改分类类别

处理未知键名的JSON结构

应用场景

当需要提取的字段不固定或无法预先确定时，这种灵活的结构特别有用。

实现方法

tools = [
    {
        "name": "print_all_characteristics",
        "description": "Prints all characteristics which are provided.",
        "input_schema": {
            "type": "object",
            "additionalProperties": True
        }
    }
]