A Guide to Multimodal Datasets in PyTorch TorchTune

torchtune: A Native-PyTorch Library for LLM Fine-tuning. Project repository: https://gitcode.com/gh_mirrors/to/torchtune

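Overview of Multimodal Datasets

In deep learning, a multimodal dataset is one that combines more than one form of data, such as text with images or text with audio. TorchTune currently focuses on training vision-language models (VLMs), and in particular on chat datasets that pair text with images.

The value of such datasets is that they let a model learn to relate information across modalities. For example, the model should not only understand the text "this is a cat" but also connect that description to an actual picture of a cat.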

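Multimodal Dataset Format

TorchTune expects multimodal datasets to follow a specific format:

Conversation format: a ShareGPT-like structure made up of alternating user and assistant turns
Image association: each conversation sample references one image path
Image placeholder: a special tag in the text (such as <image>) marks where the image is inserted

A typical sample looks like this: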

[
    {
        "dialogue": [
            {"from": "human", "value": "<image>What animal is in the picture?"},
            {"from": "gpt", "value": "This is a golden retriever."}
        ],
        "image_path": "images/golden_retriever.jpg"
    }
]
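
The format rules above can be checked before training starts. The following is a minimal sketch, not part of TorchTune: `validate_sample`, `IMAGE_TAG`, and `KNOWN_SPEAKERS` are hypothetical names, and the "exactly one placeholder per sample" check is an assumption derived from the one-image-per-sample rule described above.

```python
IMAGE_TAG = "<image>"  # assumed placeholder, as in the sample above
KNOWN_SPEAKERS = {"system", "human", "gpt"}  # assumed ShareGPT-style speaker names


def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one sample (empty if none)."""
    turns = sample.get("dialogue")
    if not isinstance(turns, list) or not turns:
        return ["'dialogue' must be a non-empty list"]
    problems = []
    tag_count = 0
    for i, turn in enumerate(turns):
        if turn.get("from") not in KNOWN_SPEAKERS:
            problems.append(f"turn {i}: unexpected speaker {turn.get('from')!r}")
        value = turn.get("value")
        if isinstance(value, str):
            tag_count += value.count(IMAGE_TAG)
        else:
            problems.append(f"turn {i}: 'value' must be a string")
    if not isinstance(sample.get("image_path"), str):
        problems.append("'image_path' must be a string")
    # Assumption: one image per sample means exactly one placeholder overall.
    if tag_count != 1:
        problems.append(f"expected 1 {IMAGE_TAG} tag, found {tag_count}")
    return problems
```

Running this over every entry of the JSON file and printing any non-empty result is usually enough to catch malformed samples early.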

Loading Datasets

TorchTune can load datasets from several sources:

1. Loading from a local JSON file

from torchtune.models.llama3_2_vision import llama3_2_vision_transform
from torchtune.datasets.multimodal import multimodal_chat_dataset

# Initialize the model transform
model_transform = llama3_2_vision_transform(
    path="/path/to/tokenizer.model",
    prompt_template="torchtune.data.QuestionAnswerTemplate",
    max_seq_len=8192,
    image_size=560,
)

# Load the dataset
ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="json",
    data_files="data/my_data.json",
    # Keys are the column names torchtune expects; values are the names used in the file
    column_map={
        "conversations": "dialogue",
        "image": "image_path",
    },
    image_dir="/path/to/images/",
    image_tag="<image>",
    split="train",
)
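
Because each sample stores only a relative image path, a missing file typically surfaces late, when that sample is first read. The sketch below checks the paths up front using only the standard library. `find_missing_images` is a hypothetical helper, and it assumes the relative path in the file is joined onto `image_dir`.

```python
import json
from pathlib import Path


def find_missing_images(data_file, image_dir, image_column="image_path"):
    """List image paths referenced in data_file that do not exist under image_dir."""
    samples = json.loads(Path(data_file).read_text(encoding="utf-8"))
    missing = []
    for sample in samples:
        # Assumption: the relative path in the file is joined onto image_dir.
        full_path = Path(image_dir) / sample[image_column]
        if not full_path.is_file():
            missing.append(full_path)
    return missing
```

For the call above, `find_missing_images("data/my_data.json", "/path/to/images/")` would return an empty list when every referenced image is present.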

2. Loading from Hugging Face

ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="Lin-Chen/ShareGPT4V",
    split="train",
    name="ShareGPT4V",
    image_dir="/path/to/images/",
    image_tag="<image>",
)

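Image Processing

When a sample is loaded, TorchTune handles its image in three steps:

Image loading: the image is loaded from its path automatically
Image transformation: the image is converted into the format the model accepts
Image embedding: the image is inserted into the text sequence

With multimodal_chat_dataset, all of this happens automatically, but you can also load images by hand: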

from torchtune.data import load_image
from pathlib import Path

# Load an image manually
image_path = Path("/path/to/images/clock.jpg")
pil_image = load_image(image_path)

Interleaving Multiple Images

TorchTune supports inserting multiple images at arbitrary positions in the text:

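from PIL import Image
from torchtune.data import Message, format_content_with_images

# Prepare several images (blank placeholders here)
image1 = Image.new(mode="RGB", size=(4, 4))
image2 = Image.new(mode="RGB", size=(4, 4))

# Create a message that contains multiple images
text = "[img]First image [img]Second image"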
user_message = Message(
    role="user",
    content=format_content_with_images(
        content=text,
        image_tag="[img]",
        images=[image1, image2],
    ),
)
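
To make the interleaving idea concrete, here is a minimal standard-library sketch of splitting text on a tag and placing one image at each tag position. It illustrates the technique only: `interleave` is a hypothetical function, and the `type`/`content` dictionary layout is an assumption rather than a statement of what `format_content_with_images` returns.

```python
def interleave(text, image_tag, images):
    """Split text on image_tag and place one image at each tag position."""
    pieces = text.split(image_tag)
    if len(pieces) - 1 != len(images):
        raise ValueError(f"found {len(pieces) - 1} tags but {len(images)} images")
    content = []
    for i, piece in enumerate(pieces):
        if i > 0:
            # Every piece after the first is preceded by a tag, hence an image.
            content.append({"type": "image", "content": images[i - 1]})
        if piece:
            # Skip empty strings produced by a leading or trailing tag.
            content.append({"type": "text", "content": piece})
    return content


# Strings stand in for real image objects in this sketch.
parts = interleave("[img]First image [img]Second image", "[img]", ["IMG_1", "IMG_2"])
```

Here `parts` alternates image and text entries in the order they appear in the original string, and a mismatch between the number of tags and the number of images raises a `ValueError` instead of silently dropping an image.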