I. Introduction
Post-training of large language models refers to additional training performed on top of a pretrained model for a specific task or dataset. This stage typically covers methods such as fine-tuning and RLHF, which adjust the pretrained model's parameters to fit the new task.
The mainstream post-training frameworks today include LLaMA-Factory, trl, and verl. Of the three, the transformers-based trl library is fairly closed and leaves little room for customization, whereas LLaMA-Factory and verl are much more flexible and let you customize many components.
verl and trl are designed primarily for reinforcement learning but can also be used for SFT, and both currently implement the mainstream RL algorithms. verl additionally integrates smoothly with existing LLM infrastructure: whether you use PyTorch FSDP, Megatron-LM, or vLLM, verl can hook into it for efficient resource utilization and scaling.
Back to this article: it walks through, from a beginner's perspective, how to implement SFT and RL (using GRPO as the example) with verl, including a few simple modifications to the official verl code.
II. verl
2.1 Setup
The official documentation lists several ways to install verl; see verl Installation for details. I personally recommend cloning the verl repo locally and installing it in editable mode:
git clone https://github.com/volcengine/verl && cd verl && pip3 install -e .
verl's dependencies are fairly light. Here are the versions of a few key packages from my environment:
flash-attn==2.5.9.post1
numpy==1.26.4
pandas==2.2.3
peft==0.14.0
ray==2.42.1
torch==2.4.0+cu124
transformers==4.47.1
vllm==0.5.4
2.2 SFT
The main SFT entry point is verl/trainer/fsdp_sft_trainer.py, and the official run scripts live under examples/sft/gsm8k. Let's look at one of them:
set -x

if [ "$#" -lt 2 ]; then
    echo "Usage: run_qwen_05_peft.sh <nproc_per_node> <save_path> [other_configs...]"
    exit 1
fi

nproc_per_node=$1
save_path=$2

# Shift the arguments so $@ refers to the rest
shift 2

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.prompt_key=extra_info \
    data.response_key=extra_info \
    optim.lr=1e-4 \
    +data.prompt_dict_keys=['question'] \
    +data.response_dict_keys=['answer'] \
    data.micro_batch_size_per_gpu=4 \
    model.partial_pretrain=Qwen/Qwen2.5-0.5B-Instruct \
    trainer.default_local_dir=$save_path \
    trainer.project_name=gsm8k-sft \
    trainer.experiment_name=gsm8k-sft-qwen-2.5-0.5b-instruct \
    trainer.logger=['console'] \
    trainer.total_epochs=1 \
    trainer.default_hdfs_dir=null $@ \
    model.lora_rank=32 \
    model.lora_alpha=16 \
    model.target_modules=all-linear

# Or you can do this:
# model.target_modules=[q_proj,v_proj] \
As you can see, the script passes in quite a few parameters, which mainly fall into the data, optim, trainer and model groups. Now look at the main file, verl/trainer/fsdp_sft_trainer.py:
@hydra.main(config_path='config', config_name='sft_trainer', version_base=None)
def main(config):
    local_rank, rank, world_size = initialize_global_process_group()

    device_mesh = init_device_mesh(device_type='cuda', mesh_shape=(world_size,), mesh_dim_names=('fsdp',))
    dp_size = world_size // config.ulysses_sequence_parallel_size
    ulysses_device_mesh = init_device_mesh(device_type='cuda',
                                           mesh_shape=(dp_size, config.ulysses_sequence_parallel_size),
                                           mesh_dim_names=('dp', 'sp'))
    trainer = FSDPSFTTrainer(config=config, device_mesh=device_mesh, ulysses_device_mesh=ulysses_device_mesh)
    trainer.fit()


if __name__ == '__main__':
    main()
The main function uses hydra to load a config file dynamically; according to the decorator, that file is verl/trainer/config/sft_trainer.yaml:
data:
  train_batch_size: 256
  micro_batch_size: null  # will be deprecated, use micro_batch_size_per_gpu
  micro_batch_size_per_gpu: 4  # batch size per GPU
  train_files: ~/data/gsm8k/train.parquet  # path to the training set
  val_files: ~/data/gsm8k/test.parquet  # path to the validation set
  prompt_key: question  # the prompt field in your json-format data
  response_key: answer  # the response field
  max_length: 1024  # max length used when tokenizing
  truncation: error  # left, right or error; the first two are truncation strategies
  balance_dp_token: False
  chat_template: null
model:
  partial_pretrain: ~/models/gemma-1.1-7b-it  # model path
  fsdp_config:  # FSDP training configuration
    wrap_policy:
      min_num_params: 0
    cpu_offload: False  # whether to offload to CPU
    offload_params: False  # whether to offload parameters to CPU
  external_lib: null
  enable_gradient_checkpointing: False  # whether to enable gradient checkpointing
  trust_remote_code: False  # passed when loading the model
  lora_rank: 0  # LoRA rank; 0 disables LoRA, i.e. full-parameter training
  lora_alpha: 16  # LoRA scaling factor
  target_modules: all-linear  # LoRA target modules
  use_liger: False
optim:
  lr: 1e-5
  betas: [0.9, 0.95]
  weight_decay: 0.01
  warmup_steps_ratio: 0.1
  clip_grad: 1.0
ulysses_sequence_parallel_size: 1
use_remove_padding: False
trainer:
  default_local_dir: /tmp/sft_model  # where the trained model is saved
  default_hdfs_dir: hdfs://tmp/experiments/gsm8k/gemma-1.1-7b-it/  # change the hdfs path here
  resume_path: null  # checkpoint to load when resuming an interrupted run
  project_name: gsm8k-sft
  experiment_name: test
  total_epochs: 4  # number of training epochs
  total_training_steps: null  # usually left unset; computed automatically in the code
  logger: ['console']
  seed: 1
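As the config shows, the SFT trainer reads parquet files and looks up the columns named by data.prompt_key and data.response_key. As a quick illustration of a compatible training file (the example rows and the ~/data/my_sft path here are made up, not part of the official example), you can build one with pandas:
import os
import pandas as pd

# Each row needs the columns referenced by prompt_key / response_key in the yaml.
records = [
    {"question": "Natalia sold clips to 48 of her friends. How many clips did she sell?", "answer": "She sold 48 clips."},
    {"question": "What is 15 + 27?", "answer": "15 + 27 = 42. The answer is 42."},
]

df = pd.DataFrame(records)
os.makedirs(os.path.expanduser("~/data/my_sft"), exist_ok=True)
df.to_parquet(os.path.expanduser("~/data/my_sft/train.parquet"))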
2.2.1 Configuration file
I usually prefer to keep all parameters in a single yaml file rather than passing everything on the command line. If you want to customize a run, you can write your own yaml file, say sft.yaml, and make a small change to the source code so that it takes a yaml path directly:
import argparse

# @hydra.main(config_path='config', config_name='sft_trainer', version_base=None)
def main(args):
    config = load_config(args.config_path)

    local_rank, rank, world_size = initialize_global_process_group()

    device_mesh = init_device_mesh(device_type='cuda', mesh_shape=(world_size,), mesh_dim_names=('fsdp',))
    dp_size = world_size // config.ulysses_sequence_parallel_size
    ulysses_device_mesh = init_device_mesh(device_type='cuda',
                                           mesh_shape=(dp_size, config.ulysses_sequence_parallel_size),
                                           mesh_dim_names=('dp', 'sp'))
    trainer = FSDPSFTTrainer(config=config, device_mesh=device_mesh, ulysses_device_mesh=ulysses_device_mesh)
    trainer.fit()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Load and process YAML configuration.")
    parser.add_argument("--config_path", type=str, required=True, help="Path to the YAML configuration file.")
    args = parser.parse_args()
    main(args)
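The load_config helper used above is not part of the original file; a minimal sketch based on OmegaConf (the same approach appears again for main_ppo.py in section 2.3.1) could be:
from omegaconf import OmegaConf

def load_config(config_path):
    """Load a YAML config file into an OmegaConf object, so config.xxx attribute access keeps working."""
    return OmegaConf.load(config_path)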
With this change, the run script simplifies to:
set -x

nproc_per_node=8
CONFIG_PATH="/.../sft.yaml"

torchrun --standalone --nnodes=1 --nproc_per_node=$nproc_per_node --master_port=29500 \
    -m verl.trainer.fsdp_sft_trainer \
    --config_path=$CONFIG_PATH
2.2.2 Removing validation
Often we only have a training set and don't need to run validation during training. In that case you can modify the validation-related code in fsdp_sft_trainer.py, as sketched below.
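The exact code differs between verl versions, so take the following only as an assumption-laden sketch of the idea: inside FSDPSFTTrainer.fit(), comment out the per-epoch loop over the validation dataloader (names such as self.val_dataloader, self.validation_step and self.save_checkpoint come from the version I used and may not match yours):
# Inside FSDPSFTTrainer.fit(), at the end of each epoch -- sketch only,
# attribute/method names are assumptions based on the verl version I used.

# val_losses = []
# for data in self.val_dataloader:
#     val_loss = self.validation_step(data)
#     val_losses.append(val_loss)
# if rank == 0:
#     avg_val_loss = torch.mean(torch.stack(val_losses))
#     tracking.log(data={'val/loss': avg_val_loss.detach().item()}, step=global_step)
# torch.distributed.barrier()

# keep only the checkpoint saving
self.save_checkpoint(step=global_step)
Depending on the verl version, the validation dataloader is also built in the trainer's initialization, so you may either skip building it there as well or simply point data.val_files at the training file.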
2.3 RL
For RL we use GRPO as the example. The main entry point is verl/trainer/main_ppo.py, and we take examples/grpo_trainer/run_qwen2-7b.sh as the reference run script:
set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$HOME/data/gsm8k/train.parquet \
    data.val_files=$HOME/data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    data.val_batch_size=1312 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    ...
2.3.1 Configuration file
Just like in the SFT case, main_ppo.py dynamically loads a yaml file:
@hydra.main(config_path='config', config_name='ppo_trainer', version_base=None)
def main(config):
    run_ppo(config)
namely verl/trainer/config/ppo_trainer.yaml. Below is a brief walkthrough of all its parameters:
data:
  tokenizer: null  # no need to set a path here; it is derived from the model path
  train_files: ~/data/rlhf/gsm8k/train.parquet  # training set path
  val_files: ~/data/rlhf/gsm8k/test.parquet  # validation set path
  prompt_key: prompt  # the prompt field in your json-format data; responses are generated during GRPO, so no response field is needed
  max_prompt_length: 512  # max prompt length
  max_response_length: 512  # max length of the generated response
  train_batch_size: 1024  # training batch size
  val_batch_size: 1312
  return_raw_input_ids: False  # This should be set to true when the tokenizer between policy and rm differs
  return_raw_chat: False
  shuffle: True  # whether to shuffle the dataset
actor_rollout_ref:
  hybrid_engine: True
  model:
    path: ~/models/deepseek-llm-7b-chat  # actor model path
    external_lib: null
    override_config: { }  # custom config overrides applied when loading the model
    enable_gradient_checkpointing: True  # whether to enable gradient checkpointing
    use_remove_padding: False
  actor:
    strategy: fsdp  # parallel strategy
    ppo_mini_batch_size: 256  # mini-batch size for actor updates
    ppo_micro_batch_size: null  # will be deprecated, use ppo_micro_batch_size_per_gpu
    ppo_micro_batch_size_per_gpu: null
    use_dynamic_bsz: False
    ppo_max_token_len_per_gpu: 16384  # n * ${data.max_prompt_length} + ${data.max_response_length}
    grad_clip: 1.0  # gradient clipping
    clip_ratio: 0.2
    entropy_coeff: 0.001
    use_kl_loss: False  # True for GRPO
    kl_loss_coef: 0.001  # KL coefficient for GRPO
    kl_loss_type: low_var_kl  # for grpo
    ppo_epochs: 1
    shuffle: False
    ulysses_sequence_parallel_size: 1  # sp size
    optim:
      lr: 1e-6
      lr_warmup_steps_ratio: 0.  # the total steps will be injected during runtime
      min_lr_ratio: null  # only useful for warmup with cosine
      warmup_style: constant  # select from constant/cosine; cosine is commonly used
      total_training_steps: -1  # must be overridden by the program; no need to set
    fsdp_config:
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
      param_offload: False  # whether to offload model parameters to CPU
      optimizer_offload: False  # whether to offload optimizer state to CPU
      fsdp_size: -1
  ref:
    fsdp_config:
      param_offload: False
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
    log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
    log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
    log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
    ulysses_sequence_parallel_size: ${actor_rollout_ref.actor.ulysses_sequence_parallel_size}  # sp size
  rollout:  # this part matters a lot
    name: vllm
    temperature: 1.0
    top_k: -1  # 0 for hf rollout, -1 for vllm rollout; vllm is the usual choice for inference
    top_p: 1  # top_p for vllm
    prompt_length: ${data.max_prompt_length}  # not use for opensource
    response_length: ${data.max_response_length}
    # for vllm rollout
    dtype: bfloat16  # should align with FSDP
    gpu_memory_utilization: 0.5  # vllm GPU memory fraction; raise it if you have spare memory
    ignore_eos: False
    enforce_eager: True
    free_cache_engine: True
    load_format: dummy_dtensor
    tensor_model_parallel_size: 2  # number of GPUs used for vllm inference (tensor parallelism)
    max_num_batched_tokens: 8192
    max_num_seqs: 1024
    log_prob_micro_batch_size: null  # will be deprecated, use log_prob_micro_batch_size_per_gpu
    log_prob_micro_batch_size_per_gpu: null
    log_prob_use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
    log_prob_max_token_len_per_gpu: ${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
    disable_log_stats: True
    enable_chunked_prefill: True  # could get higher throughput
    # for hf rollout
    do_sample: True
    # number of responses (i.e. num sample times)
    n: 8  # > 1 for grpo; number of samples per prompt during GRPO
critic:  # GRPO does not need a critic model
  strategy: fsdp
  optim:
    lr: 1e-5
    lr_warmup_steps_ratio: 0.  # the total steps will be injected during runtime
    min_lr_ratio: null  # only useful for warmup with cosine
    warmup_style: constant  # select from constant/cosine
    total_training_steps: -1  # must be overridden by the program
  model:
    path: ~/models/deepseek-llm-7b-chat
    tokenizer_path: ${actor_rollout_ref.model.path}
    override_config: { }
    external_lib: ${actor_rollout_ref.model.external_lib}
    enable_gradient_checkpointing: True
    use_remove_padding: False
    fsdp_config:
      param_offload: False
      optimizer_offload: False
      wrap_policy:
        # transformer_layer_cls_to_wrap: None
        min_num_params: 0
      fsdp_size: -1
  ppo_mini_batch_size: ${actor_rollout_ref.actor.ppo_mini_batch_size}
  ppo_micro_batch_size: null  # will be deprecated, use ppo_micro_batch_size_per_gpu
  ppo_micro_batch_size_per_gpu: null
  forward_micro_batch_size: ${critic.ppo_micro_batch_size}
  forward_micro_batch_size_per_gpu: ${critic.ppo_micro_batch_size_per_gpu}
  use_dynamic_bsz: ${actor_rollout_ref.actor.use_dynamic_bsz}
  ppo_max_token_len_per_gpu: 32768  # (${actor_rollout_ref.actor.ppo_max_token_len_per_gpu}) * 2
  forward_max_token_len_per_gpu: ${critic.ppo_max_token_len_per_gpu}
  ulysses_sequence_parallel_size: 1  # sp size
  ppo_epochs: ${actor_rollout_ref.actor.ppo_epochs}
  shuffle: ${actor_rollout_ref.actor.shuffle}
  grad_clip: 1.0
  cliprange_value: 0.5
reward_model:
  enable: False  # if True, a reward model is used; if False, you can define a custom reward function
  strategy: fsdp
  model:
    input_tokenizer: ${actor_rollout_ref.model.path}  # set this to null if the chat template is identical
    path: ~/models/FsfairX-LLaMA3-RM-v0.1  # reward model path
    external_lib: ${actor_rollout_ref.model.external_lib}
    use_remove_padding: False
    fsdp_config:
      min_num_params: 0
      param_offload: False
      fsdp_size: -1
  micro_batch_size: null  # will be deprecated, use micro_batch_size_per_gpu
  micro_batch_size_per_gpu: null  # set a number
  max_length: null
  ulysses_sequence_parallel_size: 1  # sp size
  use_dynamic_bsz: ${critic.use_dynamic_bsz}
  forward_max_token_len_per_gpu: ${critic.forward_max_token_len_per_gpu}
  reward_manager: naive  # you can define a custom one here; covered below
algorithm:
  gamma: 1.0  # for these two, see the note on the GRPO advantage below
  lam: 1.0
  adv_estimator: gae  # change this to grpo
  kl_penalty: kl  # how to estimate kl divergence
  kl_ctrl:
    type: fixed
    kl_coef: 0.001
trainer:
  total_epochs: 3  # important: total number of training epochs
  total_training_steps: null  # computed automatically by the program
  project_name: verl_examples
  experiment_name: gsm8k
  logger: [ 'console', 'wandb' ]  # I usually log to the console only
  val_generations_to_log_to_wandb: 0
  nnodes: 1  # number of nodes (single-node vs multi-node)
  n_gpus_per_node: 8  # number of GPUs per node
  save_freq: -1  # save a checkpoint every this many steps (-1 disables periodic saving)
  # auto: find the last ckpt to resume. If can't find, start from scratch
  resume_mode: auto  # or disable, or resume_path if resume_from_path is set
  resume_from_path: False  # whether to resume training from a given path
  test_freq: -1  # validation frequency, in steps
  critic_warmup: 0
  default_hdfs_dir: null
  remove_previous_ckpt_in_save: False
  del_local_ckpt_after_load: False
  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}  # where checkpoints are saved
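For orientation, here is the standard GRPO formulation from the DeepSeekMath paper (stated here for reference, not read off the verl source): with algorithm.adv_estimator=grpo the critic is dropped, rollout.n responses are sampled per prompt, and every token of response i receives a group-normalized advantage, while the KL term enters the loss through use_kl_loss / kl_loss_coef instead of being folded into the reward:

$$\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_n\})}{\operatorname{std}(\{r_1, \dots, r_n\})}$$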
As before, main_ppo.py can be modified slightly so that a custom yaml file can be passed in:
def load_config(config_path):
    """Load a YAML configuration file."""
    from omegaconf import OmegaConf
    with open(config_path, 'r', encoding='utf-8') as file:
        config = OmegaConf.load(file)
    return config
# @hydra.main(config_path='config', config_name='ppo_trainer', version_base=None)
def main(args):
    config = load_config(args.config_path)
    run_ppo(config)
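And, mirroring the SFT modification above, the entry point parses --config_path with argparse (a sketch of the same pattern):
if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description="Load and process YAML configuration.")
    parser.add_argument("--config_path", type=str, required=True, help="Path to the YAML configuration file.")
    args = parser.parse_args()
    main(args)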
Then the run script becomes:
set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

nproc_per_node=8
CONFIG_PATH="/.../grpo_trainer.yaml"

python3 -m verl.trainer.main_ppo \
    --config_path=$CONFIG_PATH
2.3.2 Custom reward
In main_ppo.py, the reward function is constructed by a reward manager that is selected according to the reward_model.reward_manager field of the config. We can therefore write a CustomRewardManager.py file under the verl/workers/reward_manager folder (remember to import it in that package's __init__ file):
from verl import DataProto
from verl.utils.reward_score import _default_compute_score
import torch


class CustomRewardManager:
    """The custom reward manager."""

    def __init__(self, tokenizer, num_examine, compute_score=None) -> None:
        self.tokenizer = tokenizer
        self.num_examine = num_examine  # the number of batches of decoded responses to print to the console
        self.compute_score = compute_score or _default_compute_score

    def __call__(self, data: DataProto):
        """We will expand this function gradually based on the available datasets"""

        # If there is rm score, we directly return rm score. Otherwise, we compute via rm_score_fn
        if 'rm_scores' in data.batch.keys():
            return data.batch['rm_scores']

        reward_tensor = torch.zeros_like(data.batch['responses'], dtype=torch.float32)

        # data.batch keys:
        #   1. responses: response tokens
        #   2. prompts: prompt tokens
        already_print_data_sources = {}

        for i in range(len(data)):
            data_item = data[i]  # DataProtoItem

            prompt_ids = data_item.batch['prompts']
            prompt_length = prompt_ids.shape[-1]

            valid_prompt_length = data_item.batch['attention_mask'][:prompt_length].sum()
            valid_prompt_ids = prompt_ids[-valid_prompt_length:]

            response_ids = data_item.batch['responses']
            valid_response_length = data_item.batch['attention_mask'][prompt_length:].sum()
            valid_response_ids = response_ids[:valid_response_length]

            # decode
            sequences = torch.cat((valid_prompt_ids, valid_response_ids))
            sequences_str = self.tokenizer.decode(sequences)

            # custom score
            prompt = self.tokenizer.decode(valid_prompt_ids)
            response = self.tokenizer.decode(valid_response_ids)

            # call your custom reward function here
            score = reward_func(prompt=prompt, response=response)
            reward_tensor[i, valid_response_length - 1] = score

        return reward_tensor
In this custom class, a for loop computes the reward for each response. Since the original code only provides prompt ids and response ids, a custom reward function that expects strings needs them decoded first; the decoded prompt and response are then passed to reward_func to obtain the reward. For example, you can use the response length directly as the reward to encourage the model to produce longer outputs:
def reward_func(prompt, response):
    """Reward function that gives higher scores to longer completions."""
    return float(len(response))
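To make the trainer actually use this class, it still has to be wired in. The exact selection code in main_ppo.py varies across verl versions, so the following is only a sketch under that assumption: export the class from verl/workers/reward_manager/__init__.py, set reward_model.reward_manager to a new name (say custom) in your yaml, and add a matching branch where main_ppo.py instantiates the reward manager, roughly:
# Sketch only: the surrounding selection logic in main_ppo.py depends on your verl version.
from verl.workers.reward_manager import CustomRewardManager  # the class defined above

reward_manager_name = config.reward_model.get('reward_manager', 'naive')
if reward_manager_name == 'custom':
    reward_fn = CustomRewardManager(tokenizer=tokenizer, num_examine=0)
    val_reward_fn = CustomRewardManager(tokenizer=tokenizer, num_examine=1)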
2.3.3 Saving and restoring the model
After GRPO training finishes, the checkpoints verl saves are not in the usual Hugging Face format; it stores FSDP-sharded weights (together with optimizer state). To restore a Hugging Face style checkpoint, you can use the following code:
#!/usr/bin/env python
# encoding: utf-8
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch
from glob import glob
from collections import defaultdict


def main():
    step = "50"
    fsdp_checkpoint_path = f"/.../global_step_{step}/actor"
    huggingface_model_path = f"/.../global_step_{step}/actor/huggingface"
    output_path = f"/.../huggingface_checkpoint/checkpoint_global_step_{step}"

    state_dict = defaultdict(list)
    world_size = 8  # 8 GPUs

    # each rank saved its own shard; load them all and concatenate along dim 0
    for rank in range(world_size):
        filepath = f"{fsdp_checkpoint_path}/model_world_size_{world_size}_rank_{rank}.pt"
        print('loading', filepath)
        this_state_dict = torch.load(filepath)
        for key, value in this_state_dict.items():
            state_dict[key].append(value.to_local())

    for key in state_dict:
        state_dict[key] = torch.cat(state_dict[key], dim=0)

    config = AutoConfig.from_pretrained(huggingface_model_path)
    model = AutoModelForCausalLM.from_config(config)
    model.load_state_dict(state_dict)

    model.save_pretrained(output_path, max_shard_size="10GB")

    tokenizer = AutoTokenizer.from_pretrained(huggingface_model_path)
    tokenizer.save_pretrained(output_path)


if __name__ == "__main__":
    main()
By changing step, you can convert the actor checkpoints saved at different steps into the Hugging Face format for easy loading.
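Once converted, the checkpoint can be loaded like any regular Hugging Face model (a quick sanity check; the path below is a placeholder for the output_path produced above):
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "/.../huggingface_checkpoint/checkpoint_global_step_50"  # path produced by the script above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto")

inputs = tokenizer("Natalia sold clips to 48 of her friends ...", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))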
III. Summary
This article explored, from a beginner's perspective, how to use the verl framework for post-training of large models, covering supervised fine-tuning (SFT) and reinforcement learning (with GRPO as the example). Through step-by-step instructions and code examples, it walked through the whole process from environment setup to training configuration.