A comparison of REINFORCE, ReMax, GRPO, DR.GRPO, DAPO, REINFORCE++, GPG, OPO, GSPO, VC-PPO, CLIP-COV, and VAPO

REINFORCE

The REINFORCE algorithm estimates the policy gradient with Monte Carlo sampling: the agent interacts with the environment under the current policy and generates a state-action-reward sequence (a trajectory), then computes a return for each trajectory, typically the discounted cumulative reward. REINFORCE's main drawback is high variance. As the paper ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models puts it: "In theory, this variance can be attributed to two main sources: the external randomness inherent in MDP's transitions and the internal randomness from the policy decisions of the language model (i.e., token generation)."
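To make the update concrete, here is a minimal single-trajectory REINFORCE loss in PyTorch (a sketch for illustration only; log_probs and trajectory_return are assumed to come from your own rollout code):

import torch

def reinforce_loss(log_probs: torch.Tensor, trajectory_return: float) -> torch.Tensor:
    # log_probs: shape (T,), log pi_theta(a_t | s_t) for each step of one sampled trajectory.
    # REINFORCE gradient: E[ R(tau) * sum_t grad log pi_theta(a_t | s_t) ],
    # so we minimize the negative of R(tau) * sum_t log pi_theta(a_t | s_t).
    return -trajectory_return * log_probs.sum()

Note that a single scalar R(τ) scales the gradient of every token in the trajectory, which is exactly where the high variance comes from.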

ReMax (REINFORCE + argmax)

From ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models. It adds a baseline on top of REINFORCE and drops the critic entirely: "This baseline value can be obtained by greedily sampling a response and calculating the associated reward value."
In the VeRL paper (HybridFlow: A Flexible and Efficient RLHF Framework), ReMax is described as a dataflow in which the Actor has to generate twice, because the baseline does require one extra sampling pass, and that baseline response is produced by greedy decoding.
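Conceptually, each prompt therefore needs two generations and two reward evaluations per update. A minimal per-prompt sketch is shown below; generate and reward_fn are hypothetical stand-ins for the actor's decoding and the reward model, not VeRL APIs:

def remax_advantage_for_prompt(prompt, generate, reward_fn):
    # Pass 1: sample a response from the current policy.
    sampled_response = generate(prompt, greedy=False)
    r_sampled = reward_fn(prompt, sampled_response)

    # Pass 2: greedy-decode a baseline response and score it.
    greedy_response = generate(prompt, greedy=True)
    r_baseline = reward_fn(prompt, greedy_response)

    # ReMax advantage: reward of the sampled response minus the greedy baseline.
    return r_sampled - r_baseline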
Let's look at the implementation of the ReMax advantage in VeRL:

@register_adv_est(AdvantageEstimator.REMAX)  # or simply: @register_adv_est("remax")
def compute_remax_outcome_advantage(
    token_level_rewards: torch.Tensor,
    reward_baselines: torch.Tensor,
    response_mask: torch.Tensor,
    config: Optional[AlgoConfig] = None,
    **kwargs,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute advantage for ReMax, operating only on Outcome reward
    This implementation is based on the paper: https://arxiv.org/abs/2310.10505
    (with only one scalar reward for each response).

    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        reward_baselines: `(torch.Tensor)`
            shape: (bs,)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)
        config: (AlgoConfig) algorithm config

    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        Returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """

    with torch.no_grad():
        # Reverse cumulative sum over the response dimension gives the
        # (undiscounted) return-to-go at every token position.
        returns = (token_level_rewards * response_mask).flip(dims=[-1]).cumsum(dim=-1).flip(dims=[-1])
        # Subtract the greedy-baseline reward (broadcast across tokens) to get the advantage.
        advantages = returns - reward_baselines.unsqueeze(-1) * response_mask

    return advantages, returns
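
As a quick sanity check (a standalone sketch with dummy tensors that simply repeats the two core lines above): with a single outcome reward on the last response token, the reverse cumulative sum broadcasts that scalar back to every unmasked position, so each token's advantage is simply r − r_baseline:

import torch

bs, response_length = 2, 4
token_level_rewards = torch.zeros(bs, response_length)
token_level_rewards[:, -1] = torch.tensor([1.0, 0.2])  # outcome reward on the final token
reward_baselines = torch.tensor([0.5, 0.5])            # rewards of the greedy baseline responses
response_mask = torch.ones(bs, response_length)

returns = (token_level_rewards * response_mask).flip(dims=[-1]).cumsum(dim=-1).flip(dims=[-1])
advantages = returns - reward_baselines.unsqueeze(-1) * response_mask
print(advantages)  # row 0: 0.5 at every position, row 1: -0.3 at every position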

Question 1: Why is subtracting a baseline unbiased for the policy gradient?

Subtracting a baseline changes the gradient to:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( R(\tau) - b \right) \nabla_\theta \log \pi_\theta(\tau) \right]$$

where $b$ is the baseline (often the state-value function $V(s)$). The key question is: why does this modification not introduce bias?
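Before the formal argument, a quick Monte Carlo check on a toy two-arm softmax policy (a self-contained sketch; the arm rewards and baseline value are made up for illustration) shows both properties empirically: the mean gradient is unchanged by the baseline, while its standard deviation shrinks:

import torch

torch.manual_seed(0)
probs = torch.softmax(torch.tensor([0.2, -0.1]), dim=0)  # two-arm softmax policy
rewards = torch.tensor([1.0, 3.0])                        # fixed reward of each arm

def grad_sample(baseline: float) -> torch.Tensor:
    a = torch.multinomial(probs, 1).item()  # sample an action from pi_theta
    grad_log_pi = torch.eye(2)[a] - probs   # grad_theta log pi(a) for a softmax policy
    return (rewards[a] - baseline) * grad_log_pi

for b in (0.0, 2.0):  # 2.0 is roughly the average reward
    samples = torch.stack([grad_sample(b) for _ in range(20000)])
    print(b, samples.mean(dim=0), samples.std(dim=0))
# The mean gradient is statistically the same for both baselines,
# but its standard deviation is clearly smaller with b = 2.0.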

The mathematical essence of unbiasedness

Unbiasedness requires that the expectation of the modified gradient equals that of the original:

$$\mathbb{E}_{\tau \sim \pi_\theta} \left[ \left( R(\tau) - b \right) \nabla_\theta \log \pi_\theta(\tau) \right] = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \nabla_\theta \log \pi_\theta(\tau) \right]$$

By linearity of expectation, it suffices to show that the baseline term has zero expectation:

$$\mathbb{E}_{\tau \sim \pi_\theta} \left[ b \cdot \nabla_\theta \log \pi_\theta(\tau) \right] = 0$$
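The standard log-derivative argument completes this step, assuming the baseline $b$ does not depend on the sampled trajectory $\tau$:

$$\mathbb{E}_{\tau \sim \pi_\theta} \left[ b \cdot \nabla_\theta \log \pi_\theta(\tau) \right] = b \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} \, d\tau = b \, \nabla_\theta \int \pi_\theta(\tau) \, d\tau = b \, \nabla_\theta 1 = 0$$

Hence any trajectory-independent baseline (such as ReMax's greedy-reward baseline) leaves the gradient estimate unbiased while potentially reducing its variance substantially.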
