This article covers:
* Loading, running inference with, and LoRA reinforcement fine-tuning a multimodal Qwen 2.5 VL model in a Jupyter notebook.
* Setting up GRPO reinforcement fine-tuning, including the reward functions and related configuration.
Why do reinforcement fine-tuning on a VLM?
* When a model needs to be adapted to a private dataset, reinforcement fine-tuning offers a viable alternative to SFT.
* In many datasets the image-text pairs contain only short answers, and the reasoning has to be filled in by the model itself. Plain SFT forces the model to reproduce the short answer format found in the dataset, whereas GRPO training helps unlock the model's reasoning potential. See [2503.01785] Visual-RFT: Visual Reinforcement Fine-Tuning.
Try it for free on Kaggle, with both training and inference: Finetune_QWen2.5VL_GRPO | Kaggle
1. Prepare the environment
!pip install unsloth --q
Note: the original plan was to speed up training with unsloth, but unsloth does not currently seem to support GRPO training for VLMs, so only the full environment that ships with unsloth is used here.
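The later cells also rely on a couple of packages that the line above does not install. A minimal sketch of the extra installs, assuming the Kaggle/unsloth image already ships transformers, trl, peft and datasets (qwen-vl-utils is only needed for the official-trainer variant at the end of section 6; levenshtein is installed again in section 5):
!pip install qwen-vl-utils --q
!pip install levenshtein --q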
2. Load the model
from datasets import load_dataset,Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import GRPOConfig, GRPOTrainer
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
Qwen2_5_VLForConditionalGeneration,
AutoProcessor,
BitsAndBytesConfig
)
import torch
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=False,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=compute_dtype,
)
# the AutoProcessor bundles the tokenizer and the image processor; the variable is named `tokenizer` for convenience throughout this notebook
tokenizer = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct",
                                          trust_remote_code=True)
# use cuda device
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct",
device_map="auto",
trust_remote_code=True,
torch_dtype=compute_dtype,
quantization_config=bnb_config).eval()
4-bit quantization is used here, i.e. Q-LoRA. If you would rather not lose any model quality to quantization, comment out quantization_config=bnb_config; training then requires more than 48 GB of VRAM. Loading the model straight from Hugging Face can be slow; if you have a local mirror of the model, replace "Qwen/Qwen2.5-VL-7B-Instruct" with the path where it is saved.
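For reference, a minimal sketch of the full-precision load described above (bfloat16 is used here on the assumption of an Ampere-or-newer GPU; otherwise keep compute_dtype, i.e. fp16):
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",      # or a local model path
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,         # assumption: bf16-capable GPU; use compute_dtype otherwise
    # quantization_config=bnb_config,   # commented out: no 4-bit quantization, training needs 48 GB+ VRAM
).eval()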
3. Load and process the dataset
from datasets import load_dataset,Dataset
from PIL import Image
import base64
from io import BytesIO
import pandas as pd
from tqdm import tqdm
ds = load_dataset("BUAADreamer/llava-med-zh-instruct-60k",split = "train[0:2000]", trust_remote_code=True)
print(ds[0])# show content
def get_prompt_rft(example):
    '''
    Input: one dataset record (dict) containing a PIL image and a multi-turn dialogue.
    Output: a list of samples, one per question-answer pair, each as a dict.
    '''
    SYSTEM_PROMPT = r'''
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of
thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question based on the input image. Output the thinking process in <think> </think> and final answer in <answer> </answer> tags. The output format should be as follows:
<think> ... </think> <answer>...</answer>
除了特殊符号,请用中文回答
'''  # the last line asks the model to answer in Chinese; change it for a different target language
    dialogue_num = len(example['messages'])
    i = 0
    results = []
    while i < dialogue_num:
        assert example['messages'][i]['role'] == 'user' and example['messages'][i + 1]['role'] == 'assistant'
        question_sample = example['messages'][i]['content']
        answer_sample = example['messages'][i + 1]['content']
        img_pil = example['images'][0].resize((128, 128))  # reduce vRAM burden
        results.append({
            'prompt': [
                {'role': 'system', 'content': [{"type": "text", "text": SYSTEM_PROMPT}]},
                {'role': 'user', 'content': [
                    {"type": "image"},
                    {"type": "text", "text": question_sample},
                ]}
            ],
            'image': img_pil,
            'solution': answer_sample,
        })
        i += 2
    return results
def dataset_gen():
    for items in ds:
        multiple_out = get_prompt_rft(items)
        for single_out in multiple_out:
            yield single_out

dataset_train = Dataset.from_generator(dataset_gen)
print(dataset_train[-1])
A single-image, multi-turn dialogue dataset is used as the example here. Each record in the dataset looks like this:
{'messages': [{'role': 'user', 'content': '图中的组织类型是什么?'}, {'role': 'assistant', 'content': '图中的组织切片显示了一个鼻内肿块的病理学样本,这是从鼻腔内部取样的。'}, {'role': 'user', 'content': '放大倍率是多少?'}, {'role': 'assistant', 'content': '图像的放大倍数为200倍。'}, {'role': 'user', 'content': '根据病理学特征,诊断是什么?'}, {'role': 'assistant', 'content': '诊断为B细胞淋巴瘤。B细胞淋巴瘤起源于B淋巴细胞,这是一种白血细胞。影像中显示,大量圆形大细胞弥漫性地遮盖了呼吸道上皮下的基底膜。结合其他临床和实验室资料,这一诊断得以确立。'}], 'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=720x494 at 0x7D67981547C0>]}
The data handed to the model must have the format {'prompt': [{'role': 'system', 'content': ...}, {'role': 'user', 'content': ...}], 'image': img_pil, 'solution': answer_sample}, so each question-answer pair in the original record is split out into its own sample.
Constructing SYSTEM_PROMPT carefully also matters, and it needs to be adjusted per task. Note: a prompt that is too short tends to lower answer accuracy, while one that is too long drives up VRAM usage and can make the model fall into repetition.
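As a rough sanity check on prompt length (a sketch; max_prompt_length is set to 2048 in the training config below, and the image placeholder tokens are not counted here), you can render one prompt and count its text tokens with the processor's underlying tokenizer:
sample_text = tokenizer.apply_chat_template(dataset_train[0]['prompt'], add_generation_prompt=True)
prompt_token_count = len(tokenizer.tokenizer(sample_text)["input_ids"])  # text tokens only
print(prompt_token_count)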
4. Try inference
image = ds[0]['images'][0]
instruction = "You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \
Please answer the following medical question based on the input image. 请用中文回答"
# for a different language, please change the last few words.
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": instruction}
]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
use_cache = True, temperature = 1.5, min_p = 0.1)
The data fed to the model must first go through the processor: it takes the image and the rendered input_text, and returns the image and text tokens (the inputs object above).
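The TextStreamer above only prints tokens as they are generated. If you also want the answer back as a plain string, a small sketch using the usual transformers decode pattern (the variable names here are illustrative):
generated_ids = model.generate(**inputs, max_new_tokens=512, use_cache=True)
# keep only the newly generated part, then decode it
trimmed_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
answer_text = tokenizer.batch_decode(trimmed_ids, skip_special_tokens=True)[0]
print(answer_text)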
5. Prepare the reward functions
import re
!pip install levenshtein
from Levenshtein import ratio as levenshtein_ratio
def format_reward_func(completions, **kwargs):
    """Reward function that checks whether the completion follows the required <think>/<answer> format."""
    # print(completions)  # debug
    pattern = r"^<think>.*?</think>.*?<answer>.*?</answer>$"
    matches = [re.match(pattern, content[0]['content'], re.DOTALL) for content in completions]
    return [1.0 if match else 0.0 for match in matches]

def levenshtein_reward_func(completions, solution, **kwargs):
    """Reward function that scores how close the answer part of the completion is to the reference solution."""
    res = []
    for completion, sol in zip(completions, solution):
        completion = completion[0]['content']
        if '</think>' in completion:
            t = completion.split('</think>')[-1]  # keep only the part after the reasoning block
            res.append(levenshtein_ratio(t, sol))
        else:
            res.append(0.0)
    return res
The reward functions are called inside GRPOTrainer. Their inputs are the model completions together with every column of the processed dataset, passed by keyword: completions holds the multiple answers the model generated for one prompt, solution corresponds to the 'solution': answer_sample field built above, and any other column name is passed the same way. Typically you constrain both the output format (the <think> tags) and the answer accuracy (here via the string distance levenshtein_ratio against the reference answer).

To borrow the figure from https://huggingface.co/docs/trl/main/en/grpo_trainer : the reward functions here play the role of the RM (Reward Model) in that figure, computing rewards from completions. The biggest difference between GRPO and SFT is that GRPO does not compute a loss between every output token and a ground-truth token; instead, it hands the scoring of answers over to the user, who designs a way to judge a whole answer, and the resulting score drives a policy-gradient objective combined with a KL penalty for the gradient update. For some reasoning tasks the dataset does not even need to contain reference answers, as long as the outputs can be judged against given criteria; see, for example, "GRPO多模态奖励函数:利用大模型API接入" (CSDN blog).
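Before launching training, it is worth sanity-checking the reward functions on fabricated completions (a small sketch; the nested list-of-dicts layout mirrors what the GRPO trainer passes for conversational outputs):
fake_completions = [
    [{'role': 'assistant', 'content': '<think>细胞形态提示淋巴瘤</think> <answer>诊断为B细胞淋巴瘤。</answer>'}],
    [{'role': 'assistant', 'content': '诊断为B细胞淋巴瘤。'}],  # missing <think>/<answer> tags
]
fake_solution = ['诊断为B细胞淋巴瘤。', '诊断为B细胞淋巴瘤。']
print(format_reward_func(fake_completions))                      # expected: [1.0, 0.0]
print(levenshtein_reward_func(fake_completions, fake_solution))  # second entry should be 0.0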
6. Training
Before running this in practice, check this PR: if the trl trainer already supports the Qwen VL series, using the official trl trainer is recommended.
output_dir="./outputs/Qwevl-Instruct-GRPO"
run_name="Qwen-vl-GRPO-medical"
# from unsloth import is_bfloat16_supported
from trl import GRPOConfig
!git clone https://github.com/auto-Dog/vlm_rft_trainer.git
!cd /kaggle/working/vlm_rft_trainer && git pull  # replace with your own path
!cp /kaggle/working/vlm_rft_trainer/grpo_trainer.py grpo_trainer.py  # replace with your own path
from grpo_trainer import Qwen2VLGRPOTrainer # third-party trainer from open-R1
model.train()
peft_config = LoraConfig(
r=32, #Rank
lora_alpha=16,
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
# "gate_proj",
# "up_proj",
# "down_proj"
],
bias="none",
lora_dropout=0.05, # Conventional
)
training_args = GRPOConfig(
# use_vllm = True, # use vLLM for fast inference!
learning_rate = 5e-6,
adam_beta1 = 0.9,
adam_beta2 = 0.99,
weight_decay = 0.1,
warmup_ratio = 0.1,
lr_scheduler_type = "cosine",
optim = "adamw_8bit",
logging_steps = 1,
bf16 = False,
fp16 = True,
per_device_train_batch_size = 1,# keep same with num_generations
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
num_generations = 4, # Decrease if out of memory
max_prompt_length = 2048,
max_completion_length = 2048,
num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 100,
save_steps = 5,
max_grad_norm = 0.1,
report_to = "none", # Can use Weights & Biases
output_dir = "outputs",
)
trainer = Qwen2VLGRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
format_reward_func, # all reward functions
levenshtein_reward_func],
args=training_args,
train_dataset=dataset_train,
peft_config = peft_config,
)
trainer.train()
trainer.save_model(output_dir)
Because GRPO training for VLMs is not yet properly supported in the transformers/trl stack, the trainer from the open-R1 project is used here, modified to accept both image-text and text-only inputs. If you need vLLM-accelerated generation or multi-GPU training, you can use the trainer from https://github.com/Liuziyu77/Visual-RFT/tree/main/src/virft/src/open_r1/trainer instead, and simply point the
cp /kaggle/working/vlm_rft_trainer/grpo_trainer.py grpo_trainer.py
line at that trainer's location.
Alternatively, if the official trl GRPOTrainer already supports Qwen2.5-VL (see the PR mentioned at the start of this section), the variant below works with the stock trainer: the dataset gets an on-the-fly transform that injects the PIL image into each prompt, and the same LoRA and GRPO settings are reused.
from trl import GRPOConfig, GRPOTrainer
from qwen_vl_utils import process_vision_info
import copy

def preprocess_vision_info(examples):
    '''Inject each sample's PIL image into its prompt messages (applied on the fly, batch by batch).'''
    examples_copy = copy.deepcopy(examples)
    batch_size = len(examples["prompt"])
    examples["image"] = []
    for i in range(batch_size):
        prompt_data = examples_copy["prompt"][i]
        image_data = examples_copy["image"][i]
        for message in prompt_data:
            for content in message["content"]:
                if isinstance(content, dict) and content.get("type") == "image":
                    content["image"] = image_data
        processed_images, _ = process_vision_info(prompt_data)
        examples["image"].extend(processed_images)
    return examples
# important: post-process the dataset on the fly before it reaches the trainer
dataset_train = dataset_train.with_transform(preprocess_vision_info)
model.train()
peft_config = LoraConfig(
r=32, #Rank
lora_alpha=16,
target_modules=[
"q_proj",
"k_proj",
"v_proj",
"o_proj",
# "gate_proj",
# "up_proj",
# "down_proj"
],
bias="none",
lora_dropout=0.05, # Conventional
)
training_args = GRPOConfig(
# use_vllm = True, # use vLLM for fast inference!
learning_rate = 5e-6,
adam_beta1 = 0.9,
adam_beta2 = 0.99,
weight_decay = 0.1,
warmup_ratio = 0.1,
lr_scheduler_type = "cosine",
optim = "adamw_8bit",
logging_steps = 1,
bf16 = False,
fp16 = True,
per_device_train_batch_size = 1,# keep same with num_generations
gradient_accumulation_steps = 2, # Increase to 4 for smoother training
num_generations = 4, # Decrease if out of memory
max_prompt_length = 2048,
max_completion_length = 2048,
num_train_epochs = 1, # Set to 1 for a full training run
max_steps = 100,
save_steps = 5,
max_grad_norm = 0.1,
report_to = "none", # Can use Weights & Biases
output_dir = "outputs",
)
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
format_reward_func, # all reward functions
levenshtein_reward_func],
args=training_args,
train_dataset=dataset_train,
peft_config = peft_config,
)
trainer.train()
trainer.save_model(output_dir)
7. Check the results
Same procedure as the inference section:
model.eval()
message = dataset_train[0]['prompt']
image = dataset_train[0]['image']
input_text = tokenizer.apply_chat_template(message, add_generation_prompt = True)
inputs = tokenizer(
image,
input_text,
add_special_tokens = False,
return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
use_cache = True, temperature = 1.5, min_p = 0.1)
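If the notebook is restarted, the saved LoRA adapter can be attached back onto a freshly loaded base model before evaluation (a sketch; it assumes trainer.save_model(output_dir) wrote the adapter to output_dir, and uses the PeftModel class already imported in section 2):
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, output_dir)  # attach the trained LoRA weights
model.eval()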
Issues observed during training:
1. The loss stays at 0 at the beginning, but this does not mean the model is not being updated; see GRPO Loss初期为0的原因与改进方法 - 张胜东的博客.
2. The larger num_generations is, the more VRAM is consumed, but it helps RFT explore different answers.
3. Doing RFT directly does not always work well. DeepSeek R1 runs an SFT cold-start stage before RFT, i.e. it first teaches the model how to reason; that is a worthwhile approach to try.
4. Answer accuracy can also be judged with a language model: a scoring model that takes the completion text and the ground-truth text and outputs a reward score is recommended, as in the sketch after this list.
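A minimal sketch of such a model-based reward, here using a sentence-transformers cross-encoder as the scoring model (the sentence-transformers dependency and the specific checkpoint are assumptions, not part of the original setup):
!pip install sentence-transformers --q
from sentence_transformers import CrossEncoder

# assumption: an STS cross-encoder that scores (completion, reference) pairs roughly in [0, 1]
judge_model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

def model_reward_func(completions, solution, **kwargs):
    """Score the answer part of each completion against the reference with a small scoring model."""
    answers = [c[0]['content'].split('</think>')[-1] for c in completions]
    scores = judge_model.predict(list(zip(answers, solution)))
    return [float(s) for s in scores]
The function can simply be appended to the reward_funcs list passed to the trainer.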