Self-Learning: "Can Large Reasoning Models Self-Train?"

Code and resources:

https://github.com/tajwarfahim/srt
https://self-rewarding-llm-training.github.io/

Chinese-language summary:

"AI math ability surges 100%, self-evolution closes in on the RL ceiling, new CMU work overturns expectations" (36Kr article)

Core idea: the LLM generates the "correct" answers to its training problems by itself, without human labels or externally provided answers. For each problem it samples multiple outputs and takes the majority-vote output as the pseudo "correct answer".
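For example, majority voting over the final answers parsed from the sampled outputs can be as simple as the following (illustrative values only):

```python
from collections import Counter

# Final answers parsed from 8 sampled solutions to the same problem.
answers = ["42", "41", "42", "42", "7", "42", "41", "42"]

# The most common answer becomes the pseudo "correct answer".
pseudo_label = Counter(answers).most_common(1)[0][0]  # "42"
```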

Premise: the generation-verification gap (verifying whether an answer is correct is easier than generating one).

Formulas:

Autoregressive generation of each token of y:
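In standard notation (the exact symbols are my reconstruction, not quoted from the paper):

$$
\pi_\theta(y \mid x) = \prod_{t=1}^{|y|} \pi_\theta\big(y_t \mid x,\, y_{<t}\big)
$$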

Reward:
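The self-reward is binary: 1 if the answer extracted from y matches the majority-vote answer among the outputs sampled for x, 0 otherwise. Written out (again in my own notation):

$$
r(x, y) = \mathbf{1}\big[\operatorname{ans}(y) = y_{\mathrm{maj}}(x)\big]
$$

where $y_{\mathrm{maj}}(x)$ is the most frequent answer among the samples drawn for $x$.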

Objective function (update the model parameters \theta to maximize the expected reward; the problems x are sampled from a fixed question bank):
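With $\mathcal{D}$ denoting the fixed question bank, this is the usual RL objective:

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
$$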

Differentiating with respect to \theta gives the classic policy gradient:
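In the standard advantage form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[A(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]
$$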

The advantage A is the reward of the current sample y minus the expected reward over all outputs sampled for the same problem:
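That is, with the expectation taken over fresh samples $y'$ for the same problem:

$$
A(x, y) = r(x, y) - \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\big[r(x, y')\big]
$$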

Algorithm description:

Sample the model several times on each problem and treat the majority-vote result as the "correct" answer (a minimal sketch of one training step is given below).

Any RL algorithm can be used: PPO, RLOO, REINFORCE, REINFORCE++, and so on.
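A minimal sketch of one self-rewarded training step as described above; `generate`, `extract_answer`, and `rl_update` are hypothetical helpers standing in for the actual sampling, answer-parsing, and policy-update code:

```python
from collections import Counter

def self_rewarded_step(model, prompts, generate, extract_answer, rl_update, n_samples=8):
    """One self-rewarded training step: majority vote supplies the pseudo-label.

    generate / extract_answer / rl_update are hypothetical helpers; any
    policy-gradient method (PPO, RLOO, REINFORCE, ...) can serve as rl_update.
    """
    batch = []
    for x in prompts:
        # Sample several complete solutions for the same problem.
        outputs = [generate(model, x) for _ in range(n_samples)]
        answers = [extract_answer(y) for y in outputs]

        # Majority vote over the extracted final answers = pseudo ground truth.
        pseudo_label = Counter(answers).most_common(1)[0][0]

        # Binary reward: 1 if a sample agrees with the majority answer.
        rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
        batch.append((x, outputs, rewards))

    # Hand the (prompt, samples, rewards) triples to the chosen RL algorithm.
    rl_update(model, batch)
```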

Experimental results

The baseline for comparison is RL trained with the ground-truth answers.

At evaluation time, the LLM produces 32 outputs for each test problem and the average of their 0/1 correctness scores is reported; this gives a lower-variance estimate than scoring a single output per problem.
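A small sketch of this averaged metric, assuming hypothetical `generate` and `is_correct` helpers for sampling an answer and checking it against the reference:

```python
def average_at_k(model, problems, generate, is_correct, k=32):
    """Mean 0/1 accuracy over k sampled outputs per test problem."""
    per_problem = []
    for x in problems:
        scores = [1.0 if is_correct(generate(model, x), x) else 0.0 for _ in range(k)]
        per_problem.append(sum(scores) / k)
    return sum(per_problem) / len(per_problem)
```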

On DAPO the model collapsed. Reason: the model reward-hacks, drifting toward producing many mutually consistent outputs rather than correct ones, because consistency is easier to achieve than correctness.

Three ways to prevent the collapse caused by this reward hacking:

1. Early stopping. Experiments show that different validation sets give roughly the same stopping point; the validation set does not have to match the training distribution, as long as the domain matches (e.g. use a math validation set when training on math), so the requirement is fairly loose.

2. Use the outputs of a fixed (frozen, not being trained) model as the "correct" answers, instead of taking pseudo-labels from the model currently being trained. The training model then tends to fit these fixed "correct" answers rather than fit toward making all of its own outputs agree with each other (see the sketch after this list).

3. Curriculum learning. The DAPO dataset is much harder, so there the LLM finds output consistency easier to achieve than correctness; ordering training from easier to harder problems counters this.
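A minimal sketch of mitigation 2, using the same hypothetical helpers as above: the pseudo-labels come from a frozen snapshot and stay fixed, so the reward no longer depends on the current model agreeing with itself.

```python
from collections import Counter

def offline_pseudo_labels(frozen_model, prompts, generate, extract_answer, n_samples=8):
    """Majority-vote pseudo-labels from a frozen model (not updated during training)."""
    labels = {}
    for x in prompts:
        answers = [extract_answer(generate(frozen_model, x)) for _ in range(n_samples)]
        labels[x] = Counter(answers).most_common(1)[0][0]
    return labels

def reward_against_fixed_label(x, output, labels, extract_answer):
    # The training model is rewarded for matching the fixed pseudo-label,
    # not for agreeing with its own current samples.
    return 1.0 if extract_answer(output) == labels[x] else 0.0
```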

Models: Qwen2.5-Math-7B; Qwen3-14B-Base

Training data: MATH-12K, AIME, DAPO

Evaluation data: MATH-12K, AIME, DAPO

RL algorithm: RLOO (chosen for simplicity)
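For reference, RLOO estimates the advantage with a leave-one-out baseline over the $k$ samples drawn for the same problem (standard formulation, not quoted from the paper):

$$
A(x, y_i) = r(x, y_i) - \frac{1}{k-1} \sum_{j \neq i} r(x, y_j)
$$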
