Text Classification with LangChain

Original article, last modified 2025-11-18 22:20:39

License: CC 4.0 BY-SA

Tags: #langchain #classification

First published 2025-11-18 22:14:56

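Background

LangChain's structured output support lets us wrap an LLM (here a Qwen-series model served through DashScope's OpenAI-compatible endpoint) as a controllable "information extractor".
Given an input text, it returns a structured sentiment label, an aggressiveness score, and a language category, ready for downstream analysis or automated processing.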

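Goals

Connect to a Qwen model and confirm that it returns results in a structured format.
Define the classification schema with Pydantic to constrain the type and allowed values of each returned field.
Build a Prompt → LLM → structured output chain in LangChain.
Demonstrate the classification flow with two schemas: one free-form, one restricted by enums.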

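Key dependencies

langchain_core: provides ChatPromptTemplate for building the prompt template.
langchain_openai: provides ChatOpenAI, an OpenAI-compatible client that is pointed at DashScope's compatible-mode endpoint to talk to the Qwen model.
pydantic.v1: BaseModel and Field describe the structured schema that the model output is parsed into.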
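
Before running the example it can help to confirm that the three import names above resolve in the current environment. The helper below is a minimal sketch using only the standard library; the name missing_modules is made up for illustration and is not part of LangChain.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
REQUIRED_MODULES = ("langchain_core", "langchain_openai", "pydantic.v1")


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises for dotted names whose parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


absent = missing_modules()
print("missing:", absent if absent else "none")
```

If anything is reported missing, the corresponding PyPI packages are, to my knowledge, langchain-core, langchain-openai, and pydantic.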

Core logic

1. Initialize the LLM

ChatOpenAI connects to the Qwen model hosted on DashScope; set model, api_key, and base_url.

model = ChatOpenAI(
    model='qwen-turbo',
    api_key=qwen_api_key,
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)
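
The snippet above takes qwen_api_key as given. One way to supply it without hard-coding the key is to read it from an environment variable. This is only a sketch: the helper load_api_key and the variable name DASHSCOPE_API_KEY are assumptions for illustration, not part of the original code.

```python
import os


def load_api_key(env=os.environ, var_name="DASHSCOPE_API_KEY"):
    """Return the API key stored under `var_name`, or raise if it is unset.

    `var_name` is an assumed convention; use whatever name your setup exports.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the model")
    return key


# Intended use with the model above (not executed here):
# model = ChatOpenAI(
#     model='qwen-turbo',
#     api_key=load_api_key(),
#     base_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
# )
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first request.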

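2. Define the structured output schema

Declare the fields the classification task needs as a Pydantic BaseModel, and use Field to attach a description or an enum restriction to each one. LangChain then returns an object that matches the schema.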

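class Classification(BaseModel):
    """
    A Pydantic data model describing the fields to extract when classifying text.
    """
    # Sentiment of the text, expected to be a string
    sentiment: str = Field(description="文本的情感")
    # Aggressiveness of the text, intended to be an integer from 1 to 10
    # (the range is not enforced here; the description only says larger means more aggressive)
    aggressiveness: int = Field(
        description="描述文本的攻击性，数字越大表示越攻击性"
    )
    # Language the text is written in, expected to be a string
    language: str = Field(description="文本使用的语言")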

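class Classification1(BaseModel):
    """
    A Pydantic data model describing the fields to extract when classifying text.
    """
    # Sentiment of the text, expected to be a string.
    # `...` marks the field as required (no default value).
    # `enum` lists the values the field is allowed to take.
    # `description` tells the model what the field is for.
    sentiment: str = Field(..., enum=["happy", "neutral", "sad"], description="文本的情感")

    # Aggressiveness of the text, expected to be an integer from 1 to 5
    aggressiveness: int = Field(..., enum=[1, 2, 3, 4, 5], description="描述文本的攻击性，数字越大表示越攻击性")

    # Language the text is written in, expected to be a string ("中文" means Chinese)
    language: str = Field(..., enum=["spanish", "english", "french", "中文", "italian"], description="文本使用的语言")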
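
As far as I know, pydantic.v1 treats extra keyword arguments to Field such as enum as schema metadata: they appear in the JSON schema the model sees, but Pydantic does not check them when it builds the object. If out-of-range values matter, a defensive check after the call is cheap. The sketch below works on a plain dict; ALLOWED and enum_violations are hypothetical names introduced here.

```python
# Allowed values copied from Classification1.
ALLOWED = {
    "sentiment": {"happy", "neutral", "sad"},
    "aggressiveness": {1, 2, 3, 4, 5},
    "language": {"spanish", "english", "french", "中文", "italian"},
}


def enum_violations(values, allowed=ALLOWED):
    """Return {field: value} for every field whose value is outside its enum."""
    return {
        field: values.get(field)
        for field, choices in allowed.items()
        if values.get(field) not in choices
    }


# A result that respects the enums produces no violations.
print(enum_violations({"sentiment": "sad", "aggressiveness": 5, "language": "中文"}))
# A free-form answer, like the one the first schema allows, is flagged.
print(enum_violations({"sentiment": "愤怒", "aggressiveness": 8, "language": "中文"}))
```

With a pydantic.v1 object, the dict can be obtained with its .dict() method before passing it to the check.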

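3. Build the prompt and the chain

ChatPromptTemplate inserts the input text into the prompt; model.with_structured_output has the LLM's reply parsed into the given schema; the | operator then composes the two into a chain.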

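# The prompt text is in Chinese. In English it reads: "Extract the required
# information from the following passage. Only extract the attributes mentioned
# in the 'Classification' class. Passage: {input}"
tagging_prompt = ChatPromptTemplate.from_template(
    """
    从以下段落中提取所需信息。
    只提取'Classification'类中提到的属性。
    段落：
    {input}
    """
)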

chain = tagging_prompt | model.with_structured_output(Classification)
# tagging_prompt1 has the same text as tagging_prompt; it is defined in the full code at the end
chain1 = tagging_prompt1 | model.with_structured_output(Classification1)
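
To make the | composition less magical, here is a toy illustration of the idea: each step exposes invoke, and a | b builds a new step that feeds the output of a into b. This is not LangChain's implementation; Step, fill_prompt, and fake_model are stand-ins invented for this sketch, and the fake model just measures the prompt instead of calling an LLM.

```python
class Step:
    """A toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt template and the structured-output model.
fill_prompt = Step(lambda d: "Passage:\n{input}".format(**d))
fake_model = Step(lambda prompt: {"prompt_chars": len(prompt)})

toy_chain = fill_prompt | fake_model
print(toy_chain.invoke({"input": "hello"}))  # {'prompt_chars': 14}
```

The real chain has the same shape: a dict goes in, the prompt step turns it into a message, and the model step turns that into a schema object.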

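4. Invoke the chain and interpret the result

Prepare an input text and call chain.invoke({'input': input_text}). The call returns a Pydantic object whose fields can be read directly or printed.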

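# A Chinese complaint about a professor, shortened here; the full code has the complete sentence
input_text = "中国人民大学的王教授：我靠，真的是师德败坏..."
result: Classification = chain.invoke({'input': input_text})
# The enum-restricted schema goes through chain1 and returns a Classification1
result1: Classification1 = chain1.invoke({'input': input_text})
print(result)
# sentiment='愤怒' aggressiveness=8 language='中文'
# ('愤怒' means angry, '中文' means Chinese)
print(result1)
# sentiment='sad' aggressiveness=5 language='中文'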
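
Once the result is an object with typed fields, downstream logic becomes ordinary attribute access. As a small example for a moderation-style use, the sketch below flags a text for human review when its aggressiveness reaches a threshold. The function needs_review and the threshold of 7 are arbitrary choices for illustration, and SimpleNamespace stands in for the Pydantic result so the snippet runs without calling the model.

```python
from types import SimpleNamespace


def needs_review(result, threshold=7):
    """Flag a text for human review when its score reaches the threshold."""
    return result.aggressiveness >= threshold


# Stand-in with the same three fields as the printed Classification result.
sample = SimpleNamespace(sentiment="愤怒", aggressiveness=8, language="中文")
print(needs_review(sample))  # True, since 8 >= 7
```

Keep in mind that the two schemas use different scales (1 to 10 versus 1 to 5), so a threshold chosen for one does not carry over to the other.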

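Flow diagram

Input text
↓
ChatPromptTemplate fills the text into the prompt
↓
ChatOpenAI (Qwen) generates a structured response
↓
with_structured_output validates it and converts it into a Pydantic model
↓
Print or consume the classification result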
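
The "validate and convert" step in the flow above can be pictured with a standard-library stand-in: take a JSON reply, check field names and types, and build a typed object. This is a conceptual sketch only. LangChain's own parsing goes through the Pydantic model and is not reproduced here; Tagged, parse_reply, and the sample reply are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Tagged:
    """Stand-in for the schema class, with the same three fields."""
    sentiment: str
    aggressiveness: int
    language: str


def parse_reply(raw):
    """Parse a JSON reply and check field names and types before building Tagged."""
    data = json.loads(raw)
    expected = {"sentiment": str, "aggressiveness": int, "language": str}
    if not isinstance(data, dict) or set(data) != set(expected):
        raise ValueError("reply does not contain exactly the expected fields")
    for name, typ in expected.items():
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(data[name], typ) or isinstance(data[name], bool):
            raise ValueError(f"{name} should be {typ.__name__}")
    return Tagged(**data)


# A hypothetical raw reply, shaped like the results printed earlier.
reply = '{"sentiment": "sad", "aggressiveness": 5, "language": "中文"}'
print(parse_reply(reply))
```

A reply with a wrong type, such as "aggressiveness": "high", raises ValueError instead of silently producing a malformed object.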

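Summary

With LangChain's structured output support, an LLM can return a data object parsed against a schema instead of free text.
The Pydantic schema fixes the field names and types, and the enum lists tell the model which values are allowed, which makes the output more controllable (in pydantic.v1, an enum passed to Field is schema metadata rather than a validated constraint, so it guides the model more than it guarantees the value).
The example contrasts a free-form schema with a strictly enumerated one and can serve as a template for sentiment analysis, content moderation, and similar scenarios.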

Full code

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic.v1 import BaseModel, Field

# Qwen (Tongyi Qianwen) API key
qwen_api_key = 'sk-TripleH'  # replace with your own DashScope API key

# Create the Qwen LLM
# Available models: qwen-turbo, qwen-plus, qwen-max, qwen-max-longcontext
model = ChatOpenAI(
    model='qwen-turbo',  # switch to qwen-plus or qwen-max as needed
    api_key=qwen_api_key,
    base_url='https://dashscope.aliyuncs.com/compatible-mode/v1'
)

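class Classification(BaseModel):
    """
    A Pydantic data model describing the fields to extract when classifying text.
    """
    # Sentiment of the text, expected to be a string
    sentiment: str = Field(description="文本的情感")
    # Aggressiveness of the text, intended to be an integer from 1 to 10
    # (the range is not enforced here; the description only says larger means more aggressive)
    aggressiveness: int = Field(
        description="描述文本的攻击性，数字越大表示越攻击性"
    )
    # Language the text is written in, expected to be a string
    language: str = Field(description="文本使用的语言")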

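# Create a prompt template for extracting the information.
# The prompt text is in Chinese. In English it reads: "Extract the required
# information from the following passage. Only extract the attributes mentioned
# in the 'Classification' class. Passage: {input}"
tagging_prompt = ChatPromptTemplate.from_template(
    """
    从以下段落中提取所需信息。
    只提取'Classification'类中提到的属性。
    段落：
    {input}
    """
)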


chain = tagging_prompt | model.with_structured_output(Classification)

# A Chinese complaint about a professor; the commented-out line is a Spanish alternative
input_text = "中国人民大学的王教授：我靠，真的是师德败坏，做出的事情实在让我生气！"
# input_text = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"

result: Classification = chain.invoke({'input': input_text})

print(result)
# sentiment='愤怒' aggressiveness=8 language='中文'
# ('愤怒' means angry, '中文' means Chinese)



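class Classification1(BaseModel):
    """
    A Pydantic data model describing the fields to extract when classifying text.
    """
    # Sentiment of the text, expected to be a string.
    # `...` marks the field as required (no default value).
    # `enum` lists the values the field is allowed to take.
    # `description` tells the model what the field is for.
    sentiment: str = Field(..., enum=["happy", "neutral", "sad"], description="文本的情感")

    # Aggressiveness of the text, expected to be an integer from 1 to 5
    aggressiveness: int = Field(..., enum=[1, 2, 3, 4, 5], description="描述文本的攻击性，数字越大表示越攻击性")

    # Language the text is written in, expected to be a string ("中文" means Chinese)
    language: str = Field(..., enum=["spanish", "english", "french", "中文", "italian"], description="文本使用的语言")
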
# Create a prompt template for extracting the information (same text as tagging_prompt)
tagging_prompt1 = ChatPromptTemplate.from_template(
    """
    从以下段落中提取所需信息。
    只提取'Classification'类中提到的属性。
    段落：
    {input}
    """
)

# Named chain1 so it does not overwrite the first chain, matching the walkthrough above
chain1 = tagging_prompt1 | model.with_structured_output(Classification1)

input_text1 = "中国人民大学的王教授：我靠，真的是师德败坏，做出的事情实在让我生气！"
# input_text1 = "Estoy increiblemente contento de haberte conocido! Creo que seremos muy buenos amigos!"

result1: Classification1 = chain1.invoke({'input': input_text1})

print(result1)
# sentiment='sad' aggressiveness=5 language='中文'