A GRPO Reinforcement Learning Experiment: Training Your Own R1 Reasoning Model


Last modified 2025-02-11 16:41:06


License: CC 4.0 BY-SA

Tags: #python #unsloth #vllm #trl #GRPO

First published 2025-02-11 15:13:27

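DeepSeek-R1 went viral worldwide over the Spring Festival thanks to its low cost: its training cost is reportedly about 1/5 that of overseas models, and its inference cost during the promotional pricing period was about 1/27 that of o1. Would you like to train an R1-style model on your own data? The steps below walk through the training process.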

1. Run the container: nvcr.io/nvidia/pytorch:24.05-py3

2. Install the dependencies:

pip install -I Unsloth vllm trl
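
Before moving on, it can help to confirm that the three packages actually landed in the container's Python environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The three packages installed in step 2.
    for name, ver in installed_versions(["unsloth", "vllm", "trl"]).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, rerun the pip command from step 2 before continuing.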

3. Use the Unsloth training framework

from unsloth import FastLanguageModel, PatchFastRL
# Patch GRPO support into the model class; call this before loading the model
PatchFastRL("GRPO", FastLanguageModel)

4. Choose the model: training starts from Qwen2.5-1.5B-Instruct, since the "aha moment" generally only emerges in models with 1.5B parameters or more.

from unsloth import is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.3, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)
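
To get a feel for what `lora_rank` and `target_modules` cost, the number of trainable LoRA parameters can be estimated by hand: each targeted projection with shape (in, out) gains two small matrices holding rank × (in + out) weights in total. The sketch below does that arithmetic. The layer shapes are assumptions taken from the published Qwen2.5-1.5B configuration, not from this article, so check them against `model.config` before relying on the numbers; `lora_param_count` is a hypothetical helper, not an Unsloth or PEFT function.

```python
# Assumed layer shapes for Qwen2.5-1.5B-Instruct (not stated in the article);
# verify against model.config before relying on them.
HIDDEN = 1536        # hidden_size
INTERMEDIATE = 8960  # intermediate_size of the MLP
NUM_LAYERS = 28      # num_hidden_layers
KV_DIM = 256         # 2 key/value heads x head size 128

# (in_features, out_features) of each projection that LoRA can target
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, target_modules, num_layers=NUM_LAYERS):
    """Count LoRA weights: A is (rank x in) and B is (out x rank) per projection."""
    per_layer = sum(
        rank * (PROJECTIONS[name][0] + PROJECTIONS[name][1])
        for name in target_modules
    )
    return per_layer * num_layers


all_modules = list(PROJECTIONS)
mlp_only = ["gate_proj", "up_proj", "down_proj"]

print(lora_param_count(64, all_modules))  # 73859072
print(lora_param_count(64, mlp_only))     # 56426496
```

Under these assumptions, rank 64 on all seven projections trains roughly 74M parameters, and dropping the QKVO projections (as the comment in the code suggests for out-of-memory cases) leaves roughly 56M. The count scales linearly with the rank, which is why a larger `lora_rank` is slower.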