
DeepSeek-R1, with its low-cost edge (a training cost around 1/5 that of comparable overseas models, and an inference cost about 1/27 of OpenAI o1's during its promotional pricing period), went viral worldwide over the Spring Festival. Want to train an R1-style model on your own data? The steps below walk through the process.
1. Run the container: nvcr.io/nvidia/pytorch:24.05-py3
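For example, assuming Docker with the NVIDIA Container Toolkit is installed (the exact flags are illustrative):
docker run --gpus all -it --rm --shm-size=16g nvcr.io/nvidia/pytorch:24.05-py3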
2. Install the dependencies:
pip install unsloth vllm trl
3. Enable GRPO in the unsloth training framework:
from unsloth import FastLanguageModel, PatchFastRL
# Patch TRL so that GRPO training runs on unsloth's fast model implementation
PatchFastRL("GRPO", FastLanguageModel)

4. Load the model: training builds on Qwen2.5-1.5B-Instruct; only models of about 1.5B parameters and above exhibit the emergent "aha moment".
from unsloth import is_bfloat16_supported
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 64  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # False for LoRA 16bit
    fast_inference = True,  # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.3,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth",  # Enable long context finetuning
    random_state = 3407,
)
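With the model prepared, the GRPO training step itself can follow. Below is a minimal sketch assuming TRL's GRPOTrainer, with GSM8K standing in for your own data; the reward function and all hyperparameters here are illustrative, not prescriptive.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Stand-in dataset: swap in your own data; GRPOTrainer expects a "prompt" column
dataset = load_dataset("openai/gsm8k", "main", split = "train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def length_reward(completions, **kwargs):
    # Toy reward favoring longer reasoning traces; replace with a task-specific
    # reward, e.g. checking the extracted answer against GSM8K's "answer" column
    return [min(len(c) / 1000.0, 1.0) for c in completions]

training_args = GRPOConfig(
    output_dir = "outputs",
    learning_rate = 5e-6,
    per_device_train_batch_size = 8,  # Must be divisible by num_generations in recent TRL versions
    gradient_accumulation_steps = 1,
    num_generations = 8,  # Completions sampled per prompt; their mean reward is the group baseline
    max_prompt_length = 256,
    max_completion_length = 768,  # Leave room for the reasoning trace
    max_steps = 250,
    use_vllm = True,  # Generate with the vLLM engine enabled via fast_inference above
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [length_reward],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()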