Fine-Tuning a Llama3 Chat Model with PyTorch torchtune: A Tutorial

torchtune: A Native-PyTorch Library for LLM Fine-tuning. Project repository: https://gitcode.com/gh_mirrors/to/torchtune

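Introduction

Fine-tuning chat models is an important and practical technique in natural language processing. This article walks through fine-tuning the Llama3 Instruct model on chat data with PyTorch torchtune, covering the full workflow from data preparation to fine-tuning.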

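Prerequisites

Before you begin, make sure that you:

Are familiar with the basics of configuring datasets
Know how to download the Llama3 Instruct model weights
Have your custom chat dataset ready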

Prompt Template Differences Between Llama2 and Llama3

The Llama2 prompt template

Llama2 chat models use a specific prompt template format. For example:

<s>[INST] <<SYS>>
You are a helpful, respectful, and honest assistant.
<</SYS>>

Hi! I am a human. [/INST] Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant </s>

This template uses dedicated marker strings to separate the system prompt, the user input, and the assistant reply.
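
To make the layout concrete, here is a minimal pure-Python sketch that assembles the single-turn example as a string. The function name render_llama2_prompt is hypothetical and exists only for illustration; it is not part of torchtune, and it handles just one system/user/assistant turn.

```python
def render_llama2_prompt(system: str, user: str, assistant: str) -> str:
    """Assemble one Llama2-style chat turn as plain text (illustrative only)."""
    return (
        # System prompt wrapped in <<SYS>> markers inside the first [INST] block
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        # User text closes the [INST] block; the assistant reply follows it
        f"{user} [/INST] {assistant} </s>"
    )


prompt = render_llama2_prompt(
    system="You are a helpful, respectful, and honest assistant.",
    user="Hi! I am a human.",
    assistant="Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant",
)
print(prompt)
```

In practice the template class shipped with the library should be preferred, since it also deals with multi-turn conversations.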

Changes in Llama3

Llama3 Instruct overhauled the template to better support multi-turn conversations. The same content is represented in Llama3 as follows:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful, respectful, and honest assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi! I am a human.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant<|eot_id|>

Llama3 uses an entirely different set of markers, and these markers are encoded as special tokens rather than as ordinary text.
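
The per-message structure of the Llama3 format can be sketched in a few lines of plain Python. This is only a text-level illustration under the assumption that every message follows the same header/body/end pattern shown in the example; render_llama3_prompt is a hypothetical helper, and a real tokenizer inserts the markers as token IDs instead of concatenating strings.

```python
from typing import Dict, List


def render_llama3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render role/content messages in the Llama3 chat layout (text-level sketch)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each message: role header, blank line, content, end-of-turn marker
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful, respectful, and honest assistant."},
    {"role": "user", "content": "Hi! I am a human."},
    {
        "role": "assistant",
        "content": "Hello there! Nice to meet you! I'm Meta AI, your friendly AI assistant",
    },
]
rendered = render_llama3_prompt(conversation)
print(rendered)
```

Because every turn uses the same pattern, adding more turns only means appending more messages to the list.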

Tokenization and Special Tokens

Sample data

Consider the following conversation sample:

sample = [
    {
        "role": "system",
        "content": "You are a helpful, respectful, and honest assistant.",
    },
    {
        "role": "user",
        "content": "Who are the most influential hip-hop artists of all time?",
    },
    {
        "role": "assistant",
        "content": "Here is a list of some of the most influential hip-hop "
        "artists of all time: 2Pac, Rakim, N.W.A., Run-D.M.C., and Nas.",
    },
]

Tokenizing for Llama2

For Llama2, we use the Llama2ChatTemplate class to format the prompt:

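# Depending on the torchtune version, this class may live in torchtune.models.llama2 instead.
from torchtune.data import Llama2ChatTemplate, Message

# Convert each dict in `sample` into a torchtune Message
messages = [Message.from_dict(msg) for msg in sample]
# Wrap the messages in the Llama2 [INST] / <<SYS>> template
formatted_messages = Llama2ChatTemplate.format(messages)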

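Llama2 uses <s> and </s> as the beginning-of-sequence (BOS) and end-of-sequence (EOS) special tokens, and each has its own ID in the tokenizer. The other template markers, such as [INST] and <<SYS>>, are tokenized as ordinary text.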
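
The difference between a special token and ordinary text can be seen in a toy encoder. The sketch below is not the real Llama2 tokenizer: the IDs are illustrative, ordinary characters are encoded one by one as a stand-in for subword tokenization, and toy_encode is a hypothetical name.

```python
import re

# Special tokens map to a single reserved ID each (illustrative values).
SPECIAL_TOKENS = {"<s>": 1, "</s>": 2}
SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")
TEXT_OFFSET = 3  # keep ordinary-text IDs clear of the reserved IDs


def toy_encode(text):
    """Encode text so that each special token becomes exactly one ID."""
    ids = []
    for piece in SPECIAL_RE.split(text):
        if not piece:
            continue  # re.split yields empty strings around adjacent matches
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        else:
            # Stand-in for subword tokenization: one ID per character
            ids.extend(ord(ch) + TEXT_OFFSET for ch in piece)
    return ids


bos_eos_ids = toy_encode("<s>Hi</s>")
inst_ids = toy_encode("[INST]")
print(bos_eos_ids)
print(len(inst_ids))
```

Here <s> and </s> each collapse to one reserved ID, while [INST] is broken into several ordinary-text IDs.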

Tokenizing for Llama3

Llama3 handles this more simply:

from torchtune.data import Message
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")
messages = [Message.from_dict(msg) for msg in sample]
# Returns the token IDs together with a per-token boolean mask
tokens, mask = tokenizer.tokenize_messages(messages)

All of Llama3's markers (such as <|begin_of_text|> and <|eot_id|>) are handled as special tokens, and the tokenizer takes care of all the formatting automatically.
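
The mask returned alongside the tokens is what lets training skip the prompt portion of a conversation. The sketch below assumes the mask is True at positions that should not contribute to the loss, and it uses -100 as the ignore value because that is the default ignore_index of PyTorch's cross-entropy loss. The token IDs are made up and build_labels is a hypothetical helper.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(tokens, mask):
    """Replace masked positions with IGNORE_INDEX so the loss skips them."""
    if len(tokens) != len(mask):
        raise ValueError("tokens and mask must have the same length")
    return [IGNORE_INDEX if masked else tok for tok, masked in zip(tokens, mask)]


# Made-up IDs: the first three belong to the prompt, the last three to the reply.
example_tokens = [11, 12, 13, 21, 22, 23]
example_mask = [True, True, True, False, False, False]
labels = build_labels(example_tokens, example_mask)
print(labels)  # [-100, -100, -100, 21, 22, 23]
```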

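When to Use a Prompt Template

Whether to use a prompt template depends on your needs:

Base model inference: if the base model was pre-trained with a specific prompt template, use the same template at inference time
Task-specific fine-tuning: when fine-tuning a model for a particular task (such as summarization), you can use a dedicated prompt template
Chat models: Llama3 Instruct already has chat formatting built in, so an extra template is usually unnecessary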
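
As an illustration of the task-specific case, a prompt template can be as simple as a function that wraps the raw input in fixed instructions. The wording of the template and the name apply_summary_template are assumptions made for this sketch, not a template defined by the source.

```python
SUMMARY_TEMPLATE = "Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"


def apply_summary_template(dialogue: str) -> str:
    """Wrap raw dialogue text in a fixed summarization prompt."""
    return SUMMARY_TEMPLATE.format(dialogue=dialogue)


summary_prompt = apply_summary_template("A: Lunch at noon?\nB: Sure, see you then.")
print(summary_prompt)
```

Using the same wrapper during fine-tuning and at inference keeps the prompt structure the model sees consistent.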

Fine-Tuning on a Custom Chat Dataset

Data preparation

Suppose we have a local chat dataset in JSON format:

[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is your name?"
            },
            {
                "from": "gpt",
                "value": "I am an AI assistant, I don't have a name."
            }
        ]
    }
]
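
Before this data can be tokenized, each "from"/"value" turn has to be mapped to a role and its content. The sketch below shows that mapping in plain Python, assuming the common ShareGPT convention that "human" corresponds to the user role and "gpt" to the assistant role. The helper sharegpt_to_messages is hypothetical and only mirrors the idea; it is not the library's own converter.

```python
import json

# Assumed mapping from ShareGPT speaker names to chat roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record, column="dialogue"):
    """Convert one ShareGPT-style record into a list of role/content dicts."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record[column]
    ]


raw = json.loads(
    """
    [
        {
            "dialogue": [
                {"from": "human", "value": "What is your name?"},
                {"from": "gpt", "value": "I am an AI assistant, I don't have a name."}
            ]
        }
    ]
    """
)
converted = [sharegpt_to_messages(record) for record in raw]
print(converted[0])
```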

Building the dataset

We can use chat_dataset to load and prepare the data:

from torchtune.datasets import chat_dataset
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer("/path/to/tokenizer.model")
ds = chat_dataset(
    tokenizer=tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="dialogue",
    conversation_style="sharegpt",
)

The equivalent YAML configuration is:

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /path/to/tokenizer.model

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: dialogue
  conversation_style: sharegpt
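
The YAML and the Python call describe the same thing: _component_ names the callable, and the remaining keys become its keyword arguments. The following is a simplified stand-in for that idea, not torchtune's own loader; instantiate here is a hypothetical helper, and fractions.Fraction from the standard library is used as the component so the sketch runs without any model files.

```python
import importlib


def instantiate(cfg):
    """Import the callable named by _component_ and call it with the other keys."""
    kwargs = dict(cfg)  # copy so the caller's config is left untouched
    module_path, _, name = kwargs.pop("_component_").rpartition(".")
    component = getattr(importlib.import_module(module_path), name)
    return component(**kwargs)


# Stand-in config: same shape as the dataset entry, but with a stdlib component.
demo_cfg = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
demo_obj = instantiate(demo_cfg)
print(demo_obj)  # 3/4
```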

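Running the fine-tune

With the dataset ready, we can fine-tune using the built-in LoRA single-device recipe. Use the tune cp command to copy the 8B_lora_single_device.yaml config file, then update the copy with your dataset configuration.

Run the fine-tuning command: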

tune run lora_finetune_single_device --config custom_8B_lora_single_device.yaml epochs=15
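
The trailing epochs=15 overrides the value from the config file for this run only. Conceptually, key=value overrides are merged on top of the loaded config. The sketch below shows one way such a merge could work, with dotted keys addressing nested entries. It is an assumption-laden illustration, not the CLI's actual implementation, and apply_overrides is a hypothetical helper.

```python
import copy


def parse_value(raw):
    """Interpret a raw override string as int or float when possible."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config, overrides):
    """Return a copy of config with key=value overrides applied (dotted keys nest)."""
    merged = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = merged
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = parse_value(raw)
    return merged


base_config = {"epochs": 1, "dataset": {"source": "json", "split": "train"}}
run_config = apply_overrides(base_config, ["epochs=15", "dataset.data_files=data/my_data.json"])
print(run_config)
```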

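Summary

In this tutorial, we covered:

The key differences between the Llama3 Instruct and Llama2 prompt templates
How special tokens are handled during tokenization
How to prepare a custom chat dataset for fine-tuning
The end-to-end workflow for fine-tuning the Llama3-8B model

Because the Llama3 tokenizer applies the chat format itself, chat data needs no separate prompt template, which makes the fine-tuning workflow more straightforward. We hope this tutorial helps you fine-tune Llama3 successfully for chat tasks.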

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.