Fine-Tuning a Multimodal VLM (Qwen 2.5 VL) with GRPO


This article covers:

* Loading, running inference with, and LoRA reinforcement fine-tuning a multimodal Qwen 2.5 VL model in a Jupyter notebook.

* Reinforcement fine-tuning with GRPO, including the reward function setup.

Why reinforcement fine-tuning for a VLM?

* When adapting to a private dataset, reinforcement fine-tuning offers a viable alternative to SFT.

* In many datasets the image-text pairs contain only short answers, and the reasoning behind them has to be filled in by the model itself. Ordinary SFT forces the model to reproduce the dataset's short answer format, whereas GRPO training helps elicit the model's reasoning ability. [2503.01785] Visual-RFT: Visual Reinforcement Fine-Tuning


Try it for free on Kaggle, with both training and inference: Finetune_QWen2.5VL_GRPO | Kaggle


1. Prepare the environment

!pip install unsloth -q

Note: the original plan was to use unsloth to speed up training, but unsloth does not currently seem to support GRPO training for VLMs, so it is only used here for the complete environment it bundles.
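If you are not relying on unsloth's bundled environment, a rough, unpinned install of the packages imported later in this post might look like the following (the package list is my assumption; adjust to your setup):

# Rough alternative install (assumed package set, unpinned); skip if unsloth's environment already provides these.
!pip install -q transformers accelerate datasets peft trl bitsandbytes qwen-vl-utils levenshtein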

2. Load the model

from datasets import load_dataset,Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import GRPOConfig, GRPOTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
import torch
compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)
tokenizer = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct",
                                         trust_remote_code=True)
# use cuda device
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", 
                                             device_map="auto", 
                                             trust_remote_code=True,
                                            torch_dtype=compute_dtype,
                                            quantization_config=bnb_config).eval()

This loads the model with 4-bit quantization, so training will use Q-LoRA. If you do not want to trade model quality for quantization, comment out quantization_config=bnb_config; training then needs more than 48 GB of VRAM. Loading directly from Hugging Face can be slow; if you have a local mirror of the model, replace "Qwen/Qwen2.5-VL-7B-Instruct" with the path where it is saved, as in the sketch below.
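A minimal sketch of that non-quantized, local-path variant (the path below is a placeholder, not from the original post):

# Non-quantized load from a local mirror (placeholder path); expect to need 48 GB+ of VRAM for training.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/path/to/Qwen2.5-VL-7B-Instruct",   # replace with your local model directory
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=compute_dtype,           # quantization_config omitted: weights stay in fp16
).eval()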

3. Load and process the dataset

from datasets import load_dataset,Dataset
from PIL import Image
import base64
from io import BytesIO
import pandas as pd
from tqdm import tqdm

ds = load_dataset("BUAADreamer/llava-med-zh-instruct-60k",split = "train[0:2000]", trust_remote_code=True)
print(ds[0])# show content


def get_prompt_rft(example):
    '''
    Input: one dataset example (dict) containing a PIL image and a multi-turn dialogue.
    Output: a list of samples, one dict per question-answer pair.
    '''
    dialogue_num = len(example['messages'])
    i = 0
    results=[]
    while i<dialogue_num:
        assert example['messages'][i]['role']=='user' and example['messages'][i+1]['role']=='assistant'
        question_sample = example['messages'][i]['content']
        answer_sample = example['messages'][i+1]['content']
        img_pil = example['images'][0].resize((128,128))  # reduce vRAM burden
        SYSTEM_PROMPT = r'''
        Below is an instruction that describes a task, paired with an input that provides further context.
        Write a response that appropriately completes the request.
        Before answering, think carefully about the question and create a step-by-step chain of 
        thoughts to ensure a logical and accurate response.
        
        ### Instruction:
        You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
        Please answer the following medical question based on the input image. Output the thinking process in <think> </think> and final answer in <answer> </answer> tags.The output format should be as follows:
<think> ... </think> <answer>...</answer>
除了特殊符号,请用中文回答
        '''   # for a different language, please change the last few words.
        results.append({
                'prompt': [
                    {'role': 'system', 'content': [{"type": "text", "text": SYSTEM_PROMPT}]},
                    {'role': 'user', 'content': [
                        {"type": "image", },  
                        {"type": "text", "text": question_sample},    
                    ]}
                ],
                'image':img_pil,
                'solution':answer_sample,
            })
        i+=2
    return results

def dataset_gen():
    for items in ds:
        multiple_out = get_prompt_rft(items)
        for single_out in multiple_out:
            yield single_out

dataset_train = Dataset.from_generator(dataset_gen)
print(dataset_train[-1])

The example here is a dataset in which each record has a single image and a multi-turn dialogue. Each record looks like this:

{'messages': [{'role': 'user', 'content': '图中的组织类型是什么?'}, {'role': 'assistant', 'content': '图中的组织切片显示了一个鼻内肿块的病理学样本,这是从鼻腔内部取样的。'}, {'role': 'user', 'content': '放大倍率是多少?'}, {'role': 'assistant', 'content': '图像的放大倍数为200倍。'}, {'role': 'user', 'content': '根据病理学特征,诊断是什么?'}, {'role': 'assistant', 'content': '诊断为B细胞淋巴瘤。B细胞淋巴瘤起源于B淋巴细胞,这是一种白血细胞。影像中显示,大量圆形大细胞弥漫性地遮盖了呼吸道上皮下的基底膜。结合其他临床和实验室资料,这一诊断得以确立。'}], 'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=720x494 at 0x7D67981547C0>]}

The data handed to the model must have the format {'prompt': [{'role': 'system', 'content': ...}, {'role': 'user', 'content': ...}], 'image': img_pil, 'solution': answer_sample}. Each question-answer pair in the original record is therefore split out into its own training sample.

The construction of SYSTEM_PROMPT also matters and needs to be adjusted for each task. Note that a prompt that is too short lowers answer accuracy, while one that is too long drives up VRAM usage and can make the model lapse into repetition.
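As an optional sanity check (my own sketch, not in the original post), you can estimate how many text tokens one templated prompt occupies, so it stays well below the max_prompt_length used in the GRPO config later:

# Estimate the text-token length of one chat-templated prompt (image tokens not counted here).
example_prompt = get_prompt_rft(ds[0])[0]['prompt']
templated_text = tokenizer.apply_chat_template(example_prompt, add_generation_prompt=True)
print(len(tokenizer.tokenizer(templated_text)["input_ids"]), "text tokens")  # processor's underlying text tokenizer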

4. Try inference

image = ds[0]['images'][0]
instruction = "You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. \
Please answer the following medical question based on the input image. 请用中文回答"
# for a different language, please change the last few words.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

The input passed to the model here must be processor output: the processor takes the image together with the chat-templated input_text and produces the image and text tokens (the inputs dict) that model.generate consumes.
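If you would rather capture the answer as a string instead of streaming it, here is a small sketch (the prompt-trimming pattern is the usual one for Qwen-VL models, not something specific to this post):

# Generate without a streamer and decode only the newly generated tokens.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, use_cache=True)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]   # drop the prompt tokens
answer_text = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(answer_text)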

5. Prepare the reward functions

!pip install levenshtein
import re
from Levenshtein import ratio as levenshtein_ratio
def format_reward_func(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    # print(completions) #debug
    pattern = r"^<think>.*?</think>.*?<answer>.*?</answer>$"
    matches = [re.match(pattern, content[0]['content'], re.DOTALL) for content in completions]
    return [1.0 if match else 0.0 for match in matches]
def levenshtein_reward_func(completions, solution, **kwargs):
    """Reward function that checks if the completion get solutions correctly."""
    res = []
    for completion, sol in zip(completions, solution):
        completion = completion[0]['content']
        if '</think>' in completion:
            t = completion.split('</think>')[-1]    # calculate result distance
            res.append(levenshtein_ratio(t, sol))
        else:
            res.append(0.0)
    return res

The reward functions are called inside GRPOTrainer. Their inputs are the model completions plus every column of the processed dataset, passed by keyword (completions holds the group of responses the model generates for one prompt; solution corresponds to the 'solution': answer_sample field built above; any other column name is passed the same way). You generally want at least a <think>-format constraint and an answer-accuracy constraint on the output (here a string-similarity score, levenshtein_ratio, against the reference answer).

Borrowing the figure from https://huggingface.co/docs/trl/main/en/grpo_trainer: the reward functions here play the role of the RM (Reward Model) in that figure, computing rewards from the completions. The biggest difference between GRPO and SFT is that GRPO does not compute a per-token loss against ground-truth tokens; instead it hands control of the answer score to the user, who designs a scheme for scoring a whole response, and the trainer then optimizes a policy-gradient objective regularized by a KL term. For some reasoning tasks the dataset does not even need to provide reference answers: responses can be judged against given criteria, e.g. a multimodal GRPO reward function backed by a large-model API (GRPO多模态奖励函数:利用大模型API接入, CSDN blog).
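Before handing the reward functions to the trainer, a quick standalone smoke test can be useful (the dummy completion below is fabricated for illustration):

# Standalone smoke test of the two reward functions with a fabricated completion.
dummy_completions = [[{'role': 'assistant',
                       'content': '<think>图像显示大量圆形大细胞。</think> <answer>诊断为B细胞淋巴瘤。</answer>'}]]
dummy_solutions = ['诊断为B细胞淋巴瘤。']
print(format_reward_func(completions=dummy_completions))                                 # expected: [1.0]
print(levenshtein_reward_func(completions=dummy_completions, solution=dummy_solutions))  # similarity in [0, 1]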

6. Training

Before trying this, check the PR below; if the official trl trainer already supports the Qwen VL series, it is recommended to use the official trl trainer instead.

[GRPO] add vlm training capabilities to the trainer by CompN3rd · Pull Request #3072 · huggingface/trl · GitHub

output_dir="./outputs/Qwevl-Instruct-GRPO"
run_name="Qwen-vl-GRPO-medical"
# from unsloth import is_bfloat16_supported
from trl import GRPOConfig
!git clone https://github.com/auto-Dog/vlm_rft_trainer.git
!cd /kaggle/working/vlm_rft_trainer && git pull    # replace with your own repo path
!cp /kaggle/working/vlm_rft_trainer/grpo_trainer.py grpo_trainer.py    # replace with your own file path

from grpo_trainer import Qwen2VLGRPOTrainer # third-party trainer from open-R1
model.train()
peft_config = LoraConfig(
    r=32, #Rank
    lora_alpha=16,
    target_modules=[
        "q_proj", 
        "k_proj", 
        "v_proj", 
        "o_proj", 
        # "gate_proj", 
        # "up_proj", 
        # "down_proj"
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
)

training_args = GRPOConfig(
    # use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = False,
    fp16 = True,
    per_device_train_batch_size = 1,# keep same with num_generations
    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = 2048,
    max_completion_length = 2048,
    num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 5,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)
trainer = Qwen2VLGRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        format_reward_func, # all reward functions
        levenshtein_reward_func],
    args=training_args,
    train_dataset=dataset_train,
    peft_config = peft_config,
)

trainer.train()

trainer.save_model(output_dir)

Because GRPO training for VLMs is not yet properly supported in the transformers/trl libraries, the code above uses a trainer taken from the open-R1 project and modified to support both image-text and text-only inputs. If you need vLLM-accelerated generation or multi-GPU training, you can use the trainer from https://github.com/Liuziyu77/Visual-RFT/tree/main/src/virft/src/open_r1/trainer instead: just replace the line

cp /kaggle/working/vlm_rft_trainer/grpo_trainer.py grpo_trainer.py

with the path of that trainer.

Alternatively, once the official TRL GRPOTrainer supports the Qwen VL series (see the PR above), the variant below uses it directly, together with a small vision preprocessing step:

from trl import GRPOConfig, GRPOTrainer
from qwen_vl_utils import process_vision_info
import copy

def preprocess_vision_info(examples):
    '''Attach each sample's PIL image to the image placeholder in its prompt and replace the 'image' column with the processed images.'''
    examples_copy = copy.deepcopy(examples)
    batch_size = len(examples["prompt"])
    examples["image"] = []
    for i in range(batch_size):
        prompt_data = examples_copy["prompt"][i]
        image_data = examples_copy["image"][i]
        for message in prompt_data:
            for content in message["content"]:
                if isinstance(content, dict) and content.get("type") == "image":
                    content["image"] = image_data
        processed_images, _ = process_vision_info(prompt_data)
        examples["image"].extend(processed_images)
    return examples

# Important: second preprocessing pass over the dataset (applied lazily via with_transform)
dataset_train = dataset_train.with_transform(preprocess_vision_info)

model.train()
peft_config = LoraConfig(
    r=32, #Rank
    lora_alpha=16,
    target_modules=[
        "q_proj", 
        "k_proj", 
        "v_proj", 
        "o_proj", 
        # "gate_proj", 
        # "up_proj", 
        # "down_proj"
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
)

training_args = GRPOConfig(
    # use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = False,
    fp16 = True,
    per_device_train_batch_size = 1,# keep same with num_generations
    gradient_accumulation_steps = 2, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = 2048,
    max_completion_length = 2048,
    num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 5,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        format_reward_func, # all reward functions
        levenshtein_reward_func],
    args=training_args,
    train_dataset=dataset_train,
    peft_config = peft_config,
)

trainer.train()

trainer.save_model(output_dir)
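Since only the LoRA adapter is trained, one way to reuse it later (a sketch, assuming trainer.save_model wrote the adapter weights to output_dir) is to reload it on top of a freshly loaded base model with PeftModel:

# Reload the saved LoRA adapter onto the base model for later inference (sketch; variable names are illustrative).
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)
model_with_adapter = PeftModel.from_pretrained(base_model, output_dir).eval()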

7. Check the results

Same as the inference step above.

model.eval()
message = dataset_train[0]['prompt']
image = dataset_train[0]['image']
input_text = tokenizer.apply_chat_template(message, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 512,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Issues observed during training:

1. The loss stays at 0 at the beginning, but that does not mean the model is not being updated (see the note after this list). Reference: GRPO Loss初期为0的原因与改进方法 - 张胜东的博客

2. A larger num_generations consumes more VRAM, but it helps RFT explore different answers.

3. Applying RFT directly does not necessarily work well. DeepSeek R1 runs an SFT cold-start stage before RFT to first teach the model how to reason, which is a viable approach to try.

4. Answer accuracy can also be judged with a language model; a classification model that takes the completion text and the ground-truth text as input and outputs a reward score is recommended.
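A short note on point 1, sketching the arithmetic under the usual TRL-style GRPO objective (my summary, not taken from the references above):

\hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}, \qquad
\mathcal{L} \approx -\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(o_i \mid q)}{\big[\pi_\theta(o_i \mid q)\big]_{\text{no-grad}}}\,\hat{A}_i \;+\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)

At the start of training the probability ratio evaluates to 1 and π_θ still equals π_ref, so the KL term is 0 and the surrogate reduces to minus the mean of the group-normalized advantages, which is 0 by construction; the reported loss is therefore near 0 even though the gradient flowing through the ratio's numerator is not.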
