Basic Information
- 📝 Original link: https://arxiv.org/abs/2502.13173
- 👥 Authors: Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han
- 🏷️ Keywords: Thinking Preference Optimization, Direct Preference Optimization, Supervised Fine-Tuning
- 📚 Categories: Natural Language Processing, Machine Learning
Abstract
Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same question. It then applies direct preference optimization to encourage the model to favor longer reasoning outputs. Experiments show that ThinkPO further improves the reasoning performance of SFT-ed models, e.g. it increases math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%. Notably, ThinkPO is capable of continually boosting the performance of the publicly distilled SFT model, e.g., increasing the official DeepSeek-R1-Distill-Qwen-7B’s performance on MATH500 from 87.4% to 91.2%.
Paper Interpretation
One-Sentence Summary
This paper proposes Thinking Preference Optimization (ThinkPO), a method that enhances the reasoning ability of supervised fine-tuned (SFT) LLMs by using readily available short chain-of-thought (CoT) responses as rejected answers and long CoT responses as chosen answers for the same question, and then applying direct preference optimization.
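To make the recipe concrete, below is a minimal sketch of the two ingredients described above: preference pairs that treat the long-CoT response as the chosen answer and the short-CoT response as the rejected answer, and the standard DPO objective applied to those pairs. The example pair and the `dpo_loss` helper are illustrative assumptions, not the authors' released code; in practice the log-probabilities would come from the SFT-ed policy and a frozen reference copy of it.

```python
# Illustrative sketch of the ThinkPO recipe (not the paper's official code):
# (1) pair each question with a long-CoT answer (chosen) and a short-CoT
#     answer (rejected); (2) optimize the standard DPO objective so the
#     policy learns to prefer the longer reasoning trace.
import torch
import torch.nn.functional as F

# (1) Preference pairs: same prompt, long CoT chosen, short CoT rejected.
pairs = [
    {
        "prompt": "What is 17 * 24?",
        "chosen": "Let's think step by step. 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",  # long CoT
        "rejected": "408.",                                                                   # short CoT
    },
]

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a tensor of summed token log-probabilities of the
    chosen/rejected response under the policy or the frozen reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# (2) Dummy log-probabilities standing in for real model scores, to show shapes.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-42.0]),
    policy_rejected_logps=torch.tensor([-40.0]),
    ref_chosen_logps=torch.tensor([-45.0]),
    ref_rejected_logps=torch.tensor([-41.0]),
)
print(loss.item())
```

Given pairs in this prompt/chosen/rejected format, an off-the-shelf DPO trainer (for example, TRL's DPOTrainer) could be used in place of the hand-written loss above.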
…
Read the full article: https://www.llamafactory.cn/daily-paper/detail/?id=1334