Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

This article is part of the LLM series and provides a translation of Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization.

Abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning abilities of large language models (LLMs). Despite RL's success in many scenarios, many challenges remain in improving LLM reasoning. One challenge is sparse rewards, which make optimization difficult for RL and require large numbers of data samples. Another stems from the inherent instability of RL, particularly when using Actor-Critic (AC) methods to derive the optimal policy, which often leads to unstable training. To address these issues, we introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline RL algorithm. Unlike standard alignment methods (such as DPO) that rely solely on outcome rewards to optimize the policy, DAPO employs a critic function to predict the reasoning accuracy at each step, thereby producing dense signals to refine the generation policy. In addition, the actor and critic components of DAPO are trained independently, avoiding the co-training instability observed in standard AC algorithms such as PPO. We train DAPO on mathematics and code query datasets and then evaluate its performance on multiple benchmarks. Our results show that DAPO effectively enhances the mathematical and coding capabilities of both SFT models and RL models, demonstrating its effectiveness.
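As a rough illustration of the step-level idea described in the abstract, the sketch below (plain PyTorch) shows how a separately trained critic's per-step success estimates could be turned into dense advantages that weight a log-likelihood policy loss. The function names, the difference-of-scores advantage, and the toy numbers are assumptions for illustration only, not the paper's exact formulation.

```python
# Minimal sketch of step-level advantage-weighted policy optimization,
# assuming: (1) a reasoning trace is split into steps, (2) a separately
# trained critic scores the probability that each partial trace still leads
# to a correct final answer, and (3) the policy is updated with an
# advantage-weighted log-likelihood loss. All names and the exact loss form
# are illustrative assumptions, not the paper's formulation.
import torch

def step_advantages(critic_scores: torch.Tensor) -> torch.Tensor:
    """Advantage of step t = critic(prefix after step t) - critic(prefix before step t).

    critic_scores: shape (T + 1,), predicted success probability after 0..T steps.
    """
    return critic_scores[1:] - critic_scores[:-1]

def step_level_loss(step_logprobs: torch.Tensor, critic_scores: torch.Tensor) -> torch.Tensor:
    """Advantage-weighted negative log-likelihood over reasoning steps.

    step_logprobs: shape (T,), log pi(step_t | prefix), summed over the step's tokens.
    """
    adv = step_advantages(critic_scores).detach()  # critic is trained separately; no gradient here
    return -(adv * step_logprobs).mean()

# Toy example: 3 reasoning steps; the critic thinks step 2 hurt the chance of success.
logprobs = torch.tensor([-1.2, -0.8, -1.5], requires_grad=True)
scores = torch.tensor([0.40, 0.65, 0.30, 0.90])  # success estimate after 0, 1, 2, 3 steps
loss = step_level_loss(logprobs, scores)
loss.backward()
print(loss.item(), logprobs.grad)
```

In this toy run, the step with a negative advantage (the one that lowered the critic's success estimate) is pushed down, while the other steps are reinforced; this is the dense, per-step signal that a single outcome reward cannot provide.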

1 Introduction

2 Preliminaries

3 Direct Advantage Policy Optimization

4 Experiments

5 Discussion, Limitations, and Future Work

In this work, we propose an offline step-level RLHF method called Direct Advantage Policy Optimization (DAPO), which aims to optimize the generation of reasoning steps. DAPO achieves notable performance improvements on both mathematics and coding benchmarks, demonstrating its effectiveness. Compared with standard response-level methods such as DPO and DRO, DAPO leverages a critic function to enable finer-grained policy optimization. Compared with other step-level RLHF …
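For contrast with the response-level methods mentioned above, the following sketch shows a standard DPO-style loss, which applies a single preference signal to an entire response rather than a per-step advantage. The sigmoid form and the `beta` temperature follow the published DPO objective; the variable names and toy values are illustrative.

```python
# Response-level DPO loss for contrast: one scalar preference signal per
# (chosen, rejected) response pair, rather than a dense per-step signal.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard response-level DPO objective on summed log-probabilities."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy example with a single preference pair (summed log-probs of full responses).
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```

Because the whole response is scored at once, every step in a chosen response is reinforced equally, including flawed intermediate steps; the step-level formulation sketched after the abstract is what allows individual reasoning steps to be credited or penalized separately.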

### Chain-of-Thought Prompting Mechanism in Large Language Models

In large language models, chain-of-thought prompting serves as a method to enhance reasoning capabilities by guiding the model through structured thought processes. This approach involves breaking down complex problems into simpler components and providing step-by-step guidance that mirrors human cognitive processing. The creation of these prompts typically includes selecting examples from training datasets where each example represents part of an overall problem-solving process[^2]. By decomposing tasks into multiple steps, this technique encourages deeper understanding and more accurate predictions compared to traditional methods.

For instance, when faced with multi-hop question answering or logical deduction challenges, using such chains allows models not only to generate correct answers but also to articulate the intermediate thoughts leading up to those conclusions. Such transparency facilitates better interpretability while improving performance on various NLP benchmarks.

```python
def create_chain_of_thought_prompt(task_description, examples):
    """
    Creates a chain-of-thought prompt based on a given task description and examples.

    Args:
        task_description (str): Description of the task at hand.
        examples (list): List of tuples of input-output pairs used for demonstration purposes.

    Returns:
        str: Formatted string representing the final prompt, including both instructions and sample cases.
    """
    formatted_examples = "\n".join([f"Input: {ex[0]}, Output: {ex[1]}" for ex in examples])
    return f"""
Task: {task_description}
Examples:
{formatted_examples}
Now try solving similar questions following the above pattern.
"""

# Example usage
examples = [
    ("What color do you get mixing red and blue?", "Purple"),
    ("If it rains tomorrow, will we have our picnic?", "No")
]
print(create_chain_of_thought_prompt("Solve logic puzzles", examples))
```