The Message Mechanism in PyTorch TorchTune Explained

torchtune: A Native-PyTorch Library for LLM Fine-tuning. Project repository: https://gitcode.com/gh_mirrors/to/torchtune

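What Is the Message Mechanism?

In TorchTune, a Message is a core component that defines how text and multimodal content is tokenized. It gives every tokenizer and dataset API one common interface to operate on, which makes it the basic unit for the model's inputs and outputs.

A message carries the following key information:

The text content (content)
The role that sends the content (role)
Other information relevant to the special tokens of the model tokenizer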

Two Ways to Create a Message

TorchTune offers two ways to create a message; pick whichever fits your situation.

1. Using the constructor directly

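from torchtune.data import Message

msg = Message(
    role="user",             # role of the message sender
    content="Hello world!",  # message content
    masked=True,             # whether to mask this message out of the training loss
    eot=True,                # whether this message ends a turn
    ipython=False,           # whether this message is a tool call
)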

2. Creating from a dictionary

msg = Message.from_dict(
    {
        "role": "user",
        "content": "Hello world!",
        "masked": True,
        "eot": True,
        "ipython": False,
    }
)

The two approaches are equivalent; the second is especially convenient when the data arrives in a structured form such as JSON.
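To make the equivalence concrete without depending on the library, here is a minimal stand-in. `SketchMessage` is a hypothetical class written for this illustration, not torchtune's `Message`; it only mirrors the five fields shown above, and both its default values and the way `from_dict` forwards dictionary keys as keyword arguments are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class SketchMessage:
    """Hypothetical stand-in for illustration; not torchtune's Message class."""

    role: str
    content: str
    masked: bool = False   # assumed default
    eot: bool = True       # assumed default
    ipython: bool = False  # assumed default

    @classmethod
    def from_dict(cls, d: dict) -> "SketchMessage":
        # Assumption: dictionary keys map one-to-one onto constructor arguments.
        return cls(**d)


# A record as it might appear in a JSON Lines dataset file.
raw = '{"role": "user", "content": "Hello world!", "masked": true, "eot": true, "ipython": false}'

from_constructor = SketchMessage(
    role="user", content="Hello world!", masked=True, eot=True, ipython=False
)
from_json = SketchMessage.from_dict(json.loads(raw))
```

Because the dataclass generates `__eq__`, comparing `from_constructor == from_json` checks field by field that both construction paths produce the same message.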

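Multimodal Messages

TorchTune messages are not limited to plain text; they can also carry multimodal content that includes images. The message content (content) is actually a list of dictionaries, each holding a type (type) and the corresponding payload (content).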
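As a rough sketch of that structure, the helper below wraps a plain string in a single text entry and passes an existing list through after a basic shape check. `normalize_content` is a hypothetical name used only here, not a torchtune function, and it assumes that a string `content` (as in the constructor examples earlier) is shorthand for one `{"type": "text", ...}` entry.

```python
from typing import Any, Dict, List, Union

ContentList = List[Dict[str, Any]]


def normalize_content(content: Union[str, ContentList]) -> ContentList:
    """Return content as a list of {"type": ..., "content": ...} dictionaries."""
    if isinstance(content, str):
        # Assumption: a bare string is treated as one text part.
        return [{"type": "text", "content": content}]
    for part in content:
        if "type" not in part or "content" not in part:
            raise ValueError(f"malformed content part: {part!r}")
    return list(content)


text_only = normalize_content("Hello world!")
mixed = normalize_content(
    [
        # A string stands in for the image object so the sketch has no dependencies.
        {"type": "image", "content": "<image object goes here>"},
        {"type": "text", "content": "What's in this image?"},
    ]
)
```

Keeping one list shape for both cases means downstream code can iterate over parts without first checking whether a message is text-only or multimodal.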

Adding an Image to a Message

from PIL import Image
from torchtune.data import Message

img_msg = Message(
    role="user",
    content=[
        {
            "type": "image",
            "content": Image.new(mode="RGB", size=(4, 4)),  # image object
        },
        {"type": "text", "content": "What's in this image?"},  # text content
    ],
)

Loading an Image from a File

In practice, images are usually loaded from a file path:

from torchtune.data import Message, load_image

image_path = "path/to/image.jpg"
img_msg = Message(
    role="user",
    content=[
        {"type": "image", "content": load_image(image_path)},
        {"type": "text", "content": "What's in this image?"},
    ],
)

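Handling Text with Image Tags

If the raw data contains image placeholder tags, the format_content_with_images function can build the content list for you:

from PIL import Image
from torchtune.data import format_content_with_images

image1 = image2 = Image.new(mode="RGB", size=(4, 4))  # placeholder images
content = format_content_with_images(
    "<|image|>hello <|image|>world",  # text containing image tags
    image_tag="<|image|>",            # the image tag to look for
    images=[image1, image2],          # list of image objects
)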
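To illustrate the idea behind this kind of helper, the sketch below splits the text on the tag and substitutes one image per tag, in order. `interleave_images` is a hypothetical function written for this post, not the library's implementation, and the strict count check is an assumption about reasonable behavior.

```python
import re
from typing import Any, Dict, List


def interleave_images(text: str, image_tag: str, images: List[Any]) -> List[Dict[str, Any]]:
    """Split text on image_tag and substitute one image per tag, in order."""
    # The capture group keeps the tags in the split result; empty pieces are dropped.
    pieces = [p for p in re.split(f"({re.escape(image_tag)})", text) if p]
    if pieces.count(image_tag) != len(images):
        raise ValueError("number of image tags and images must match")
    remaining = iter(images)
    content = []
    for piece in pieces:
        if piece == image_tag:
            content.append({"type": "image", "content": next(remaining)})
        else:
            content.append({"type": "text", "content": piece})
    return content


# Strings stand in for image objects so the sketch runs without an imaging library.
result = interleave_images("<|image|>hello <|image|>world", "<|image|>", ["img-1", "img-2"])
```

The result alternates image and text parts in the order they appear in the source string, which is the same list-of-dictionaries shape a multimodal message expects as its content.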

Message Transforms

TorchTune provides convenient message transforms that turn raw data into a list of messages:

from torchtune.data import InputOutputToMessages

sample = {
    "input": "What is your name?",
    "output": "I am an AI assistant."
}
transform = InputOutputToMessages()
output = transform(sample)  # dict whose "messages" key holds the conversation as Message objects
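Conceptually, a transform like this maps the input field to a user turn and the output field to an assistant turn. The sketch below shows that mapping with plain dictionaries; `input_output_to_messages` and its `train_on_input` flag are hypothetical names for this illustration, and the masking choice (prompt masked, response unmasked) is an assumption about the usual supervised fine-tuning setup rather than a statement of the library's exact behavior.

```python
from typing import Dict, List


def input_output_to_messages(
    sample: Dict[str, str], train_on_input: bool = False
) -> Dict[str, List[dict]]:
    """Map an {"input", "output"} sample to a user turn and an assistant turn."""
    return {
        "messages": [
            # The prompt is masked out of the loss unless train_on_input is set.
            {"role": "user", "content": sample["input"], "masked": not train_on_input, "eot": True},
            # The response is what the model learns to produce, so it stays unmasked.
            {"role": "assistant", "content": sample["output"], "masked": False, "eot": True},
        ]
    }


sample = {"input": "What is your name?", "output": "I am an AI assistant."}
converted = input_output_to_messages(sample)
roles = [m["role"] for m in converted["messages"]]
```

Returning a dictionary with a single "messages" key keeps the sample in the same container shape before and after the transform, so later pipeline stages can add their own keys alongside it.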

Formatting Messages with a Prompt Template

A prompt template structures messages in the format a particular model expects:

from torchtune.data import Message
from torchtune.models.mistral import MistralChatTemplate

msg = Message(role="user", content="Hello world!")
template = MistralChatTemplate()
templated_msg = template([msg])  # apply the Mistral chat template
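One way to picture a prompt template is as a lookup from role to a prefix and suffix that get wrapped around the content. The sketch below works on plain dictionaries; `TEMPLATE` and `apply_template` are hypothetical, and the `[INST]` tag strings are illustrative values modeled on the commonly documented Mistral instruct format, not read from the library.

```python
from typing import Dict, List, Tuple

# Assumed (prefix, suffix) pair per role; the tag strings are illustrative.
TEMPLATE: Dict[str, Tuple[str, str]] = {
    "user": ("[INST] ", " [/INST] "),
    "assistant": ("", ""),
}


def apply_template(messages: List[dict]) -> List[dict]:
    """Return copies of messages with role-specific text wrapped around content."""
    templated = []
    for message in messages:
        # Roles without an entry are passed through unchanged.
        prefix, suffix = TEMPLATE.get(message["role"], ("", ""))
        templated.append({**message, "content": f"{prefix}{message['content']}{suffix}"})
    return templated


templated = apply_template([{"role": "user", "content": "Hello world!"}])
```

The function returns new dictionaries instead of editing its input, so the original messages remain available for comparison or for applying a different template.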

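Accessing Message Content

Getting the text content

print(msg.text_content)  # the plain-text portion of the message

Getting the media content

if msg.contains_media:
    print(msg.get_media())  # images and other media in the message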
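These accessors can be understood as simple filters over the content list. The three functions below are a sketch under that assumption, operating on plain dictionaries; they are hypothetical helpers named after the attributes above, not the library's code.

```python
from typing import Any, Dict, List

Parts = List[Dict[str, Any]]


def text_content(parts: Parts) -> str:
    """Concatenate the text parts in order, skipping media."""
    return "".join(p["content"] for p in parts if p["type"] == "text")


def contains_media(parts: Parts) -> bool:
    """Report whether any part is an image."""
    return any(p["type"] == "image" for p in parts)


def get_media(parts: Parts) -> List[Any]:
    """Collect the image payloads in order."""
    return [p["content"] for p in parts if p["type"] == "image"]


parts = [
    # A string stands in for the image object so the sketch has no dependencies.
    {"type": "image", "content": "img-1"},
    {"type": "text", "content": "What's in "},
    {"type": "text", "content": "this image?"},
]
joined = text_content(parts)
has_media = contains_media(parts)
media = get_media(parts)
```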

Tokenizing Messages

Every model tokenizer provides a tokenize_messages method that converts a list of messages into token IDs and a loss mask:

from torchtune.models.mistral import mistral_tokenizer

m_tokenizer = mistral_tokenizer(...)
msgs = [...]  # list of messages
tokens, mask = m_tokenizer.tokenize_messages(msgs)  # token IDs and loss mask
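To show how a per-message `masked` flag could turn into a per-token loss mask, here is a toy version. `toy_tokenize_messages` is hypothetical: it assigns one ID per whitespace-separated word, whereas a real tokenizer uses a subword vocabulary and also inserts special tokens. The sketch assumes the convention that `True` in the mask marks a token excluded from the loss.

```python
from typing import Dict, List, Tuple


def toy_tokenize_messages(messages: List[dict]) -> Tuple[List[int], List[bool]]:
    """Toy tokenizer: one ID per whitespace-separated word, plus a parallel mask."""
    vocab: Dict[str, int] = {}
    tokens: List[int] = []
    mask: List[bool] = []
    for message in messages:
        for word in message["content"].split():
            # Assign IDs in order of first appearance.
            token_id = vocab.setdefault(word, len(vocab))
            tokens.append(token_id)
            # Every token inherits the masked flag of the message it came from.
            mask.append(message["masked"])
    return tokens, mask


msgs = [
    {"role": "user", "content": "What is your name?", "masked": True},
    {"role": "assistant", "content": "I am an AI assistant.", "masked": False},
]
tokens, mask = toy_tokenize_messages(msgs)
```

The two returned lists always have the same length, so a training loop can pair them position by position and skip the loss for every token whose mask entry is `True`.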