How can you preserve the model's thinking ability when fine-tuning Qwen3?

Original article · last modified 2025-08-06 15:45:26

License: CC 4.0 BY-SA

Tags: #Qwen3 #fine-tuning #SFT

First published 2025-08-06 15:19:14

flyfish

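Qwen3 supports two thinking modes:

The first is thinking mode, the "take your time" approach. Faced with a complex problem, the model reasons step by step and gives its final answer only after working the problem through. This mode suits tasks that need deep reasoning, such as solving math problems or analyzing complex logic.

The second is non-thinking mode, the "quick answer" approach. For simple questions the model responds almost immediately with the result. This mode suits cases where speed matters more than depth, such as looking up common knowledge or asking for a simple definition.

When you fine-tune on a dataset that contains no thinking process (only questions and final answers, with no intermediate reasoning steps), the model can over-learn the "output the answer directly" pattern and lose its deep reasoning ability. The two approaches below reduce that damage, one at the level of training strategy and one at the level of data structure.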

Training tool: SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning)

The usual approach

For example, a script for training Qwen3-8B:

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true \
    --attn_impl flash_attn

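Method 1: change the training parameters

Pass the extra argument --loss_scale ignore_empty_think during training so that the empty <think>\n\n</think>\n\n block is excluded from the loss calculation.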

# use `--loss_scale ignore_empty_think`
# Avoid losing the think capability by ignoring the loss of empty `<think>\n\n</think>\n\n`
# This method is also applicable to the Deepseek-R1 series of models.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:empty_think#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --load_from_cache_file false \
    --loss_scale ignore_empty_think \
    --model_author swift \
    --model_name swift-robot
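
To make the idea concrete, here is a minimal sketch of what "ignoring the empty think block" could mean in terms of loss weights. It is not ms-swift's implementation: the helper name loss_weights_for_response is hypothetical, and it splits the response as a string purely for illustration, whereas a real trainer applies the weights to tokens.

```python
EMPTY_THINK = "<think>\n\n</think>\n\n"


def loss_weights_for_response(response: str) -> list[tuple[str, float]]:
    """Split an assistant response into (segment, loss_weight) pairs.

    Hypothetical helper: an empty think block at the start gets weight 0.0,
    so it contributes nothing to the loss; everything else gets weight 1.0.
    """
    if response.startswith(EMPTY_THINK):
        answer = response[len(EMPTY_THINK):]
        segments = [(EMPTY_THINK, 0.0)]
        if answer:
            segments.append((answer, 1.0))
        return segments
    # A response with real reasoning (or no think block) is trained on in full.
    return [(response, 1.0)]


if __name__ == "__main__":
    for text in (
        "<think>\n\n</think>\n\nApples belong to the rose family.",
        "<think>\nLet me reason step by step.\n</think>\n\nThe answer is 4.",
    ):
        print(loss_weights_for_response(text))
```

Because only the empty block is masked, samples that contain genuine reasoning inside the think tags still train the model on that reasoning.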

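Method 2: change how the dataset is built

When building the dataset, apply two normalization steps: first, append the /no_think marker to the end of every user question (query); second, prepend the fixed prefix <think>\n\n</think>\n\n to every assistant answer (response).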

# use `swift/self-cognition:qwen3`
# Avoid losing the thinking capability by appending `/no_think` to the dataset query.
# https://github.com/modelscope/ms-swift/blob/77985c2ccdac8ed4037174ee222e79d1f1d5059d/swift/llm/dataset/dataset/llm.py#L835
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --load_from_cache_file false \
    --model_author swift \
    --model_name swift-robot

swift/self-cognition

dataset_infos.json

{
  "default": {
    "features": {
      "query": { "_type": "Value" },
      "response": { "_type": "Value" },
      "tag": { "_type": "Value" }
    },
    "splits": { "train": { "name": "train", "dataset_name": "self-cognition" } }
  }
}

self_cognition.jsonl

{"query": "你是？", "response": "我是{{NAME}}，由{{AUTHOR}}训练的人工智能助手。我的目标是为用户提供有用、准确和及时的信息，并通过各种方式帮助用户进行有效的沟通。请告诉我有什么可以帮助您的呢？", "tag": "zh"}
{"query": "你是谁!", "response": "您好！我是{{AUTHOR}}开发的人工智能语言模型，名为{{NAME}}。我可以回答您的问题、提供信息、进行对话并帮助解决问题。如果您有任何疑问或需要帮助，请随时告诉我！", "tag": "zh"}
…

register_dataset(
    DatasetMeta(
        ms_dataset_id='swift/self-cognition',
        hf_dataset_id='modelscope/self-cognition',
        subsets=[
            SubsetDataset(preprocess_func=SelfCognitionPreprocessor()),
            SubsetDataset(
                'qwen3',
                preprocess_func=SelfCognitionPreprocessor(
                    query_suffix=' /no_think', response_prefix='<think>\n\n</think>\n\n')),
            SubsetDataset(
                'empty_think', preprocess_func=SelfCognitionPreprocessor(response_prefix='<think>\n\n</think>\n\n')),
        ],
        dataset_name='self-cognition',
        tags=['chat', 'self-cognition', '🔥']))

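Pay particular attention to the qwen3 subset.

The subsets parameter (dataset subsets)
It defines the subsets (SubsetDataset) that the dataset contains. Each subset can have its own preprocessing, specified through preprocess_func:

First subset: unnamed; it uses SelfCognitionPreprocessor() with no extra arguments (the base processing logic).

Second subset: named qwen3; it uses SelfCognitionPreprocessor with query_suffix=' /no_think' (appended to the end of the query) and response_prefix='<think>\n\n</think>\n\n' (prepended to the response). This is the subset selected by the script that uses swift/self-cognition:qwen3, and it matches the Qwen3 input format.

Third subset: named empty_think; it also uses SelfCognitionPreprocessor but sets only response_prefix, leaving the query unchanged. This is the subset selected by the script that passes --loss_scale ignore_empty_think.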
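
The effect of query_suffix and response_prefix can be sketched in plain Python. This is an illustrative stand-in, not the SelfCognitionPreprocessor source: the function preprocess_row, the SUBSET_KWARGS table, and the English sample row are all hypothetical, and the sketch assumes that the {{NAME}} and {{AUTHOR}} placeholders in self_cognition.jsonl are filled from --model_name and --model_author.

```python
def preprocess_row(row: dict, name: str, author: str,
                   query_suffix: str = "", response_prefix: str = "") -> dict:
    """Fill the placeholders, then add the optional suffix and prefix."""
    def fill(text: str) -> str:
        return text.replace("{{NAME}}", name).replace("{{AUTHOR}}", author)

    return {
        "query": fill(row["query"]) + query_suffix,
        "response": response_prefix + fill(row["response"]),
    }


# The three subsets differ only in the suffix/prefix they pass.
# "default" is just a label here for the unnamed first subset.
SUBSET_KWARGS = {
    "default": {},
    "qwen3": {"query_suffix": " /no_think",
              "response_prefix": "<think>\n\n</think>\n\n"},
    "empty_think": {"response_prefix": "<think>\n\n</think>\n\n"},
}

if __name__ == "__main__":
    # Hypothetical sample row, written in English for readability.
    row = {"query": "Who are you?",
           "response": "I am {{NAME}}, an AI assistant trained by {{AUTHOR}}."}
    for subset, kwargs in SUBSET_KWARGS.items():
        print(subset, preprocess_row(row, "swift-robot", "swift", **kwargs))
```

Note the leading space in ' /no_think': it keeps the marker separated from the last word of the query.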

In short: the question gains /no_think, and the answer gains <think>\n\n</think>\n\n.

The processed dataset ends up looking like this:

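{"messages": [{"role": "user", "content": "Which plant family do apples belong to? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nApples belong to the rose family (Rosaceae)."}]}
{"messages": [{"role": "user", "content": "What are some common apple varieties? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nCommon apple varieties include Red Fuji, Gala, Golden Delicious, Red Delicious, and Aksu Bingtangxin."}]}
{"messages": [{"role": "user", "content": "In which season do apples usually ripen? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nApples usually ripen in autumn. The timing varies slightly by variety, but most ripen in September and October."}]}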
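
Before training on a hand-built file in this format, it can help to check every record. Below is a minimal sketch of such a check. The function names check_record and check_jsonl_lines are hypothetical, and the sketch assumes each line is a JSON object with a messages list whose entries carry role and content keys.

```python
import json

NO_THINK_SUFFIX = "/no_think"
EMPTY_THINK = "<think>\n\n</think>\n\n"


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one {"messages": [...]} record."""
    problems = []
    for i, msg in enumerate(record.get("messages", [])):
        content = msg.get("content", "")
        role = msg.get("role")
        if role == "user" and not content.rstrip().endswith(NO_THINK_SUFFIX):
            problems.append(f"message {i}: user query does not end with {NO_THINK_SUFFIX}")
        if role == "assistant" and not content.startswith(EMPTY_THINK):
            problems.append(f"message {i}: assistant answer lacks the empty think prefix")
    return problems


def check_jsonl_lines(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems; clean lines are omitted."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        problems = check_record(json.loads(line))
        if problems:
            report[lineno] = problems
    return report


if __name__ == "__main__":
    good = json.dumps({"messages": [
        {"role": "user", "content": "Which family do apples belong to? /no_think"},
        {"role": "assistant", "content": EMPTY_THINK + "The rose family."},
    ]})
    bad = json.dumps({"messages": [
        {"role": "user", "content": "Which family do apples belong to?"},
        {"role": "assistant", "content": "The rose family."},
    ]})
    for lineno, problems in check_jsonl_lines([good, bad]).items():
        print(lineno, problems)
```

To check a real file, pass an open file handle, for example check_jsonl_lines(open(path, encoding="utf-8")), where path is your own dataset file.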