Supervised Fine-Tuning (SFT) of Llama3-8B with peft + trl (Python Code Mode)

zhujiahui622

Last modified: 2024-04-30 17:58:55


Category: LLM. Tags: LLM fine-tuning, Llama3, PEFT, trl

First published: 2024-04-30 17:47:26

Article link: https://blog.youkuaiyun.com/zhujiahui622/article/details/138196101



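On April 19, Meta finally released Llama3 in two sizes, 8B and 70B. In this post we try fine-tuning the 8B model.

For SFT in command-line mode, see: Supervised Fine-Tuning (SFT) of Llama3-8B with trl (Command-Line Mode) - 优快云 Blog

Environment:

GPU: NVIDIA A100 80G

CUDA: 12.3

Python 3.11 + PyTorch 2.1.2 + transformers 4.40.0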

I. Environment Setup

Install the Python dependencies. The key ones are:

pip install trl
pip install peft
pip install bitsandbytes

Install anything else that turns out to be missing as you go.
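Before moving on, it can help to confirm what is actually installed. The snippet below is a minimal sketch using only the standard library; the package list is my own assumption based on the imports used later in this post, so adjust it to your setup.

```python
from importlib import metadata

# Packages this walkthrough relies on (names as published on PyPI; assumed list).
REQUIRED = ["torch", "transformers", "trl", "peft", "bitsandbytes", "datasets"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with pip in the same way as the three above.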

Download the Llama3-8B model. Users in mainland China can get it from ModelScope:

git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B.git

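After the download completes, the directory contains the following files:

Download the fine-tuning data:

This time we use the ruozhiba_qa data.

HuggingFace download link: https://huggingface.co/datasets/LooksJuicy/ruozhiba

The key file is ruozhiba_qa.json. Data format: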

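II. Dataset Processing

As shown above, the question and the answer in the original ruozhiba_qa.json are stored in separate fields, "instruction" and "output". The SFTTrainer used later takes a dataset_text_field argument that names a single column, and that column must hold plain strings, not list<string>. For a question-answering fine-tuning task the question and the answer clearly have to be fed to the trainer together, so the raw data needs to be merged: instruction and output are combined into a new "text" field. The merging code is as follows: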

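# -*- coding: utf-8 -*-

"""
Dataset preprocessing
"""

import json


def process():
    input_filename = "/Users/zhujiahui/Downloads/ruozhiba_qa.json"
    output_filename = "/Users/zhujiahui/Local/model/Llama3-Chinese-Dataset/ruozhiba_qa.json"

    # Read the raw data first, so a failed read never leaves a truncated output file
    with open(input_filename, "r", encoding="utf-8") as read_file:
        data_json = json.load(read_file)

    # Merge each question/answer pair into a single "text" field.
    # Note: <s>, </s> and [INST] follow the Llama 2 style template; the Llama 3
    # tokenizer treats them as ordinary text rather than as special tokens.
    result_json_list = []
    for each_json in data_json:
        each_result_json = {"text": "<s>[INST]" + each_json["instruction"]
                                    + "[/INST] " + each_json["output"] + "</s>"}
        result_json_list.append(each_result_json)

    with open(output_filename, "w", encoding="utf-8") as write_file:
        write_file.write(json.dumps(result_json_list, ensure_ascii=False, indent=4))


if __name__ == '__main__':
    process()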

After processing, the new dataset has the following format:
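To make the target format concrete, here is a small sketch that builds one record with the same template as the processing script. The helper name build_text is mine, and the answer string is a placeholder rather than real dataset content.

```python
import json


def build_text(instruction, output):
    """Wrap one question/answer pair in the template used by the processing script."""
    return "<s>[INST]" + instruction + "[/INST] " + output + "</s>"


# Hypothetical sample: the answer is a placeholder, not taken from the dataset.
sample = {"instruction": "只剩一个心脏了还能活吗？", "output": "(placeholder answer)"}
record = {"text": build_text(sample["instruction"], sample["output"])}

# Each element of the processed JSON list is a dict with a single string field "text".
print(json.dumps(record, ensure_ascii=False, indent=4))
```

The point is simply that every record ends up with exactly one string-valued "text" field, which is what dataset_text_field="text" expects.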

Once processing is done, upload the data to the server.

Reference: python - TRL SFTTrainer - llama2 finetuning on Alpaca - datasettext field - Stack Overflow

III. Direct Inference

First, run inference directly on the original Llama3-8B with the following code:

import torch

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)

# Pick the device name according to the environment
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"


def origin_main():
    origin_model = "/Users/zhujiahui/Local/model/Meta-Llama-3-8B"
    tokenizer = AutoTokenizer.from_pretrained(origin_model, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(origin_model).to(DEVICE)

    prompt = "只剩一个心脏了还能活吗？"
    # return_tensors="pt" yields PyTorch tensors directly, including the attention mask
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    print(inputs["input_ids"])

    outputs = model.generate(**inputs, max_length=128)
    final_result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(final_result)


if __name__ == '__main__':
    origin_main()

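The author's Mac can run this locally on mps. The result is as follows:

Llama's answer to the Chinese question is painfully bad.

Something seems off, possibly in how the tokenizer maps text to token IDs, so let's try another inference path based on transformers.pipeline: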

import torch

from transformers import (
    AutoTokenizer,
    pipeline
)

# Pick the device name according to the environment
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"


def pipeline_main():
    origin_model = "/Users/zhujiahui/Local/model/Meta-Llama-3-8B"
    tokenizer = AutoTokenizer.from_pretrained(origin_model, trust_remote_code=True)
    llama_pipeline = pipeline("text-generation", model=origin_model, tokenizer=tokenizer, device=DEVICE)
    # do_sample=True with top_k=10 samples each next token from the 10 most likely candidates
    sentences = llama_pipeline("只剩一个心脏了还能活吗？", do_sample=True, top_k=10,
                               eos_token_id=tokenizer.eos_token_id, max_length=128)
    for seq in sentences:
        print(seq["generated_text"])


if __name__ == '__main__':
    pipeline_main()

The result is as follows:

The result looks much more reasonable.
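One thing worth noting is that the pipeline call explicitly passes do_sample=True and top_k=10, which the first script does not. Whether the first script decodes greedily depends on the model's generation config, so treat the following only as a simplified illustration of what top-k sampling does. It is plain Python, not the transformers implementation, and the logits are made-up numbers.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample an index from the k highest logits, weighted by softmax probability."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def greedy(logits):
    """Greedy decoding always picks the single highest-scoring index."""
    return max(range(len(logits)), key=lambda i: logits[i])


toy_logits = [2.0, 1.5, 0.3, -1.0, -4.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)
draws = [top_k_sample(toy_logits, k=3, rng=rng) for _ in range(1000)]
print("greedy pick:", greedy(toy_logits))
print("distinct top-k picks:", sorted(set(draws)))
```

Greedy decoding always returns the same token for the same context, which makes repetitive loops more likely, while top-k sampling spreads choices over the few most likely candidates.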

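IV. LoRA Instruction Fine-Tuning with peft + trl

trl (Transformer Reinforcement Learning) provides implementations for the individual steps of training and fine-tuning LLMs, including supervised fine-tuning (SFT), reward modeling (RM), and proximal policy optimization (PPO).

peft (Parameter-Efficient Fine-Tuning) is a parameter-efficient fine-tuning library from HuggingFace. See: GitHub - huggingface/peft: 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.

Fine-tuning uses LoRA (Low-Rank Adaptation), a method proposed by a Microsoft research team. It freezes the pretrained model parameters and adds two trainable low-rank bypass matrices, A and B, to each Transformer layer (one projects down to the low rank, the other projects back up), which greatly reduces the number of parameters that need to be fine-tuned. See the paper: LoRA: Low-Rank Adaptation of Large Language Models (2021)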
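To get a feel for the savings, here is a minimal pure-Python sketch of the LoRA idea: the frozen weight W is left alone and the output becomes W x + (alpha / r) * B (A x). The 4096 x 4096 layer size is an assumed figure chosen for illustration, and the helper names are mine; this is not how peft implements it internally.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable parameter counts for one d_out x d_in weight."""
    full = d_out * d_in
    lora = r * d_in + d_out * r  # A is r x d_in (down-projection), B is d_out x r (up-projection)
    return full, lora


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Sizes are assumptions for illustration: a 4096 x 4096 projection with r=64.
full, lora = lora_param_counts(4096, 4096, 64)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")

# Tiny example: B starts at zero, so the adapted layer initially equals the base layer.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]      # r=1, shape 1 x 2
B = [[0.0], [0.0]]     # shape 2 x 1, zero-initialized
x = [1.0, 1.0]
print(lora_forward(W, A, B, x, alpha=16, r=1))
```

With r=64 and lora_alpha=16, the values used in this post, the scaling factor alpha / r works out to 0.25.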

The fine-tuning code is as follows (note: the code below does not use quantization):

# -*- coding: utf-8 -*-

"""
Llama3 Lora PEFT
"""

# Required when running on the server: select the GPU before torch is imported
import time
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'

import torch

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer


def peft_fine_tune():
    # Base model path
    base_model_path = "/home/work/xxx/llm/Meta-Llama-3-8B"
    # Dataset path
    ruozhiba_dataset = "/home/work/xxx/dataset/Llama3-Chinese-Dataset/ruozhiba_qa.json"
    # Load the dataset with the json loader
    dataset = load_dataset("json", data_files=ruozhiba_dataset, split="train")
    print(dataset)
    # Data type used for computation in the linear layers
    compute_dtype = getattr(torch, "float16")
    # Quantization config (kept for reference; it is not passed to from_pretrained below,
    # so the model is loaded without quantization)
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,  # enable 4-bit loading
        bnb_4bit_quant_type="nf4",  # quantization data type: fp4 (4-bit float) or nf4 (4-bit NormalFloat)
        bnb_4bit_compute_dtype=compute_dtype,  # data type used for linear-layer computation
        bnb_4bit_use_double_quant=False  # whether to use nested quantization for extra memory savings
    )

    # Load the base model
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
    # use_cache speeds up decoding by caching the attention key/value states of earlier steps.
    # Gradient checkpointing is enabled below and does not keep intermediate activations, so use_cache=False
    base_model.config.use_cache = False
    # Tensor-parallel setting
    base_model.config.pretraining_tp = 1

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
    # Use the end-of-sequence token (eos_token) as the padding token (pad_token).
    # The Llama3 tokenizer does not define a pad_token of its own, and batching needs one.
    tokenizer.pad_token = tokenizer.eos_token
    # Set padding_side to "right" to avoid issues with fp16
    # Training needs padding on the right with eos at the end of each sample; otherwise the model never learns when to stop
    # Inference needs padding on the left; otherwise the generated output may consist entirely of eos
    tokenizer.padding_side = "right"
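    # LoRA fine-tuning parameters
    peft_params = LoraConfig(
        lora_alpha=16,  # LoRA hyperparameter that scales the low-rank update
        lora_dropout=0.1,  # dropout rate of the LoRA layers
        r=64,  # rank of the LoRA matrices
        bias="none",
        task_type="CAUSAL_LM"  # Llama is a causal language model
    )

    # Trainer arguments
    training_params = TrainingArguments(
        output_dir="./Llama-3-8B-ruozhiba",  # output path
        num_train_epochs=1000,  # number of training epochs (overridden by max_steps below)
        per_device_train_batch_size=2,  # training batch size per GPU
        gradient_accumulation_steps=1,  # accumulate gradients over this many steps to enlarge the effective batch size
        gradient_checkpointing=True,  # enable gradient checkpointing
        gradient_checkpointing_kwargs={"use_reentrant": False},  # silences the use_reentrant warning
        optim="paged_adamw_32bit",  # optimizer
        save_steps=200,  # number of steps between checkpoints
        logging_steps=100,  # number of steps between log outputs
        learning_rate=2e-4,  # initial learning rate
        weight_decay=0.001,  # weight decay
        fp16=False,  # fp16 mixed-precision training disabled
        bf16=False,  # bf16 disabled
        max_grad_norm=0.3,  # gradient clipping threshold
        max_steps=1000,  # maximum number of training steps
        warmup_ratio=0.03,  # fraction of training steps used for learning-rate warmup
        group_by_length=True,  # batch samples of similar length together to reduce padding
        lr_scheduler_type="constant",  # constant learning rate, no decay (this scheduler applies no warmup)
        report_to=["tensorboard"]  # log metrics to TensorBoard
    )

    # Trainer
    trainer = SFTTrainer(
        model=base_model,
        train_dataset=dataset,
        peft_config=peft_params,
        dataset_text_field="text",  # dataset column that holds the training text
        max_seq_length=1024,  # maximum sequence length
        tokenizer=tokenizer,
        args=training_params,
        packing=False,  # do not pack several short samples into one max_seq_length sequence
    )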

    print("Training started")
    start_time = time.time()
    trainer.train()
    trainer.save_model()
    end_time = time.time()
    print("Training finished")
    print("Elapsed time (s):", end_time - start_time)


if __name__ == '__main__':
    peft_fine_tune()

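Save the training script as llama3_lora.py, put it in some directory on the server (for example XXXX/peft), and start it with the following command:

nohup python llama3_lora.py > llama3_lora.log 2>&1 &

llama3_lora.log is the log file that records the training process; 2>&1 sends warnings and errors to the same file.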
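As a quick sanity check on the training budget, note that when max_steps is positive it takes precedence over num_train_epochs, so the run stops after 1000 optimizer steps. The sketch below works out how many examples that covers. The function names are mine, and the dataset_size value is a hypothetical placeholder, so substitute len(dataset) from the training script.

```python
def samples_seen(max_steps, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of training examples consumed after max_steps optimizer steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return max_steps * effective_batch


def epochs_covered(total_samples, dataset_size):
    """Approximate number of passes over a dataset of the given size."""
    return total_samples / dataset_size


# Values from the TrainingArguments in the training script, on a single visible GPU.
total = samples_seen(max_steps=1000, per_device_batch_size=2, grad_accum_steps=1)
print("examples processed:", total)

# dataset_size is a hypothetical figure; substitute len(dataset) from the script.
print("approx. epochs:", epochs_covered(total, dataset_size=1000))
```

If the resulting epoch count looks too low or too high for your data, max_steps is the setting to adjust, since num_train_epochs has no effect while max_steps is set.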

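GPU memory usage is roughly 35 GB:

Before fine-tuning	After fine-tuning

After fine-tuning finishes, a Llama-3-8B-ruozhiba folder is generated in the directory where the code lives, containing the following files:

Judging by the size, only the LoRA weights are saved after fine-tuning, not the original model.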
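If you want to verify the size yourself, a small standard-library sketch like the following lists the files in the output folder with their sizes. The helper names are mine; the only thing taken from the training script is the output_dir path.

```python
import os


def dir_size_mb(path):
    """Sum the sizes of all regular files under path, in MiB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / (1024 * 1024)


def list_files_with_size(path):
    """Return (filename, size in MiB) pairs for the files directly inside path."""
    entries = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            entries.append((name, os.path.getsize(full) / (1024 * 1024)))
    return entries


if __name__ == "__main__":
    output_dir = "./Llama-3-8B-ruozhiba"  # output_dir from the training script
    if os.path.isdir(output_dir):
        for name, size in list_files_with_size(output_dir):
            print(f"{name:<40} {size:10.2f} MiB")
        print(f"total: {dir_size_mb(output_dir):.2f} MiB")
```

A total far below the size of the base model directory is consistent with only the adapter weights having been written.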

V. Inference with the Fine-Tuned Model

Inference code:

# -*- coding: utf-8 -*-

"""
Inference with the Llama3 fine-tuning result
"""

# Required when running on the server: select the GPU before CUDA is initialized
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    pipeline
)
from peft import PeftModel


def infer_only_lora():
    """
    Inference using only the LoRA part
    """
    new_model = "/home/work/zhujiahui1/peft/Llama-3-8B-ruozhiba/checkpoint-1000"
    tokenizer = AutoTokenizer.from_pretrained(new_model, trust_remote_code=True)
    # Reuse eos as the padding token, as in training
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    llama_pipeline = pipeline("text-generation", model=new_model, tokenizer=tokenizer)
    sentences = llama_pipeline("<s>[INST]只剩一个心脏了还能活吗？[/INST]",
                               eos_token_id=tokenizer.eos_token_id, max_new_tokens=256)
    for seq in sentences:
        print(seq["generated_text"])


def infer_merge_llama_lora():
    """
    Inference after merging the original model with the LoRA weights
    """
    # Base model path
    base_model_path = "/home/work/zhujiahui1/llm/Meta-Llama-3-8B"
    # Load the base model
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path)
    # Fine-tuned (LoRA) checkpoint path
    new_model = "/home/work/zhujiahui1/peft/Llama-3-8B-ruozhiba/checkpoint-1000"
    # Attach the LoRA adapter to the base model
    merge_model = PeftModel.from_pretrained(base_model, new_model)
    # Physically merge the LoRA weights into the base weights
    merge_model = merge_model.merge_and_unload()
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
    # Reuse eos as the padding token, as in training
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    llama_pipeline = pipeline("text-generation", model=merge_model, tokenizer=tokenizer)
    sentences = llama_pipeline("<s>[INST]只剩一个心脏了还能活吗？[/INST]",
                               eos_token_id=tokenizer.eos_token_id, max_new_tokens=256)
    for seq in sentences:
        print(seq["generated_text"])


if __name__ == '__main__':
    infer_only_lora()
    infer_merge_llama_lora()

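The results are as follows:

Inference with only the LoRA weights (top): repetition is quite noticeable and the answer is far off.

Inference with the original model + LoRA weights (bottom): relatively more reasonable, although still some distance from the reference answer. After all, the fine-tuning dataset is small and the number of training steps is limited.

VI. Common Issues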

1. torch._C._cuda_setDevice(device) RuntimeError: CUDA error: out of memory

    training_params = TrainingArguments(
                      ^^^^^^^^^^^^^^^^^^
  File "<string>", line 121, in __init__
  File "/home/work/miniconda3/lib/python3.11/site-packages/transformers/training_args.py", line 1483, in __post_init__
    and (self.device.type != "cuda")
         ^^^^^^^^^^^
  File "/home/work/miniconda3/lib/python3.11/site-packages/transformers/training_args.py", line 1921, in device
    return self._setup_devices
           ^^^^^^^^^^^^^^^^^^^
  File "/home/work/miniconda3/lib/python3.11/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
             ^^^^^^^^^^^^^^
  File "/home/work/miniconda3/lib/python3.11/site-packages/transformers/training_args.py", line 1912, in _setup_devices
    torch.cuda.set_device(device)
  File "/home/work/miniconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

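Cause: at first glance this looks like insufficient GPU memory, but the real problem is that the job was not placed on a suitable GPU. The author was experimenting on an 8-GPU machine and initially tried to select GPU 5 with the following code:

base_model = AutoModelForCausalLM.from_pretrained(base_model_path).to("cuda:5")
# or
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, device_map={"": 5})

Neither approach worked in practice: GPU 0 was still selected, and GPU 0 really was short on memory. The traceback shows the error being raised inside TrainingArguments, which sets up its own default CUDA device regardless of where the model was placed.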

Solution: add the following at the very top of the script, before torch is imported:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'

This is how the GPU index is selected.
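A side effect worth knowing is that CUDA_VISIBLE_DEVICES renumbers the devices inside the process: the first listed GPU becomes cuda:0. The sketch below illustrates that mapping with a small helper of my own; it assumes the variable holds comma-separated integer ids and does not handle GPU UUIDs.

```python
import os


def logical_to_physical(visible_devices):
    """Map logical CUDA indices (cuda:0, cuda:1, ...) to physical GPU ids.

    Assumes the variable holds comma-separated integer ids; GPU UUIDs are not handled.
    """
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


# Set this before CUDA is initialized; doing it before importing torch is the safe choice.
os.environ["CUDA_VISIBLE_DEVICES"] = "5"

mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
print(mapping)  # physical GPU 5 is addressed as cuda:0 inside this process
```

So with this setting in place, any code that defaults to device 0, including TrainingArguments, ends up on physical GPU 5.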

2. The use_reentrant warning

/home/work/miniconda3/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(

Solution: add gradient_checkpointing_kwargs={"use_reentrant": False} to TrainingArguments:

    # Trainer arguments
    training_params = TrainingArguments(
        ...
        gradient_checkpointing_kwargs={"use_reentrant": False},  # silences the use_reentrant warning
        ...
    )