Offline Reinforcement Learning (Offline RL) Series 3: (Algorithms) CQL (Conservative Q-Learning) Explained with Implementation


[Update Log]

Paper: Conservative Q-Learning for Offline Reinforcement Learning
[Code]

The paper comes from Sergey Levine's group at UC Berkeley (first author Aviral Kumar) and was published at NeurIPS 2020. Its main idea is to add a regularizer on top of the $Q$-values in order to learn a conservative Q-function; the authors prove that CQL yields a lower bound on the true value of the current policy, and that policy evaluation and policy improvement remain well-defined under this conservative estimate. From an implementation perspective, the regularizer takes only about 20 lines of code, yet it substantially improves the experimental results. The authors have also open-sourced all of the code, which is highly recommended reading.

Abstract: Before CQL, the standard approach to the distribution-shift problem in offline RL was to constrain the actions of the policy being optimized to the action distribution of the offline dataset, so that out-of-distribution actions cannot trigger Q-value overestimation and unknown actions have less influence on training; this family of methods is known as policy constraint, with BCQ and BEAR as representative offline RL algorithms. CQL instead modifies the value-function backup: it adds a regularizer on the $Q$-values so as to obtain a lower-bound estimate of the true action-value function. Experiments show that CQL performs very well, especially when learning from complex and multi-modal data distributions.
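For concreteness, the basic conservative policy-evaluation objective from the paper (using notation introduced in the preliminaries below, with $\mu$ a chosen action distribution and $\alpha$ a trade-off coefficient) takes roughly the following form:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\; \alpha\, \mathbb{E}_{\mathbf{s} \sim \mathcal{D},\, \mathbf{a} \sim \mu(\mathbf{a} \mid \mathbf{s})}\left[Q(\mathbf{s}, \mathbf{a})\right] + \frac{1}{2}\, \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime} \sim \mathcal{D}}\left[\left(Q(\mathbf{s}, \mathbf{a}) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(\mathbf{s}, \mathbf{a})\right)^{2}\right]$$

The first term pushes Q-values down on actions sampled from $\mu$ (typically the current policy), while the second term is the usual Bellman error; the practical variants discussed later also push Q-values up on actions that actually appear in $\mathcal{D}$.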

1. Preliminaries

1.1 Sample Error

The offline dataset $\mathcal{D}$ is collected by sampling with a behavior policy $\pi_{\beta}(\mathbf{a} \mid \mathbf{s})$; letting $d^{\pi_{\beta}}(\mathbf{s})$ denote the discounted marginal state distribution, we have $\mathcal{D} \sim d^{\pi_{\beta}}(\mathbf{s})\, \pi_{\beta}(\mathbf{a} \mid \mathbf{s})$. Because many state-action pairs are sampled rarely or not at all, backups computed from $\mathcal{D}$ incur a sample error.
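As a toy illustration of where the sample error comes from, the sketch below (all sizes and names are hypothetical, not from the paper) collects a small tabular dataset with a skewed behavior policy and counts how often each state-action pair is visited; rarely-chosen actions end up with few or zero samples, and backups through those pairs are correspondingly unreliable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP (hypothetical sizes, purely for illustration)
n_states, n_actions = 5, 3

# A behavior policy pi_beta(a|s) that strongly prefers action 0 in every state
pi_beta = np.full((n_states, n_actions), 0.1)
pi_beta[:, 0] = 0.8

# "Collect" an offline dataset: states from a stand-in state marginal d^{pi_beta}(s),
# actions from pi_beta(a|s)
d_beta = np.ones(n_states) / n_states
states = rng.choice(n_states, size=200, p=d_beta)
actions = np.array([rng.choice(n_actions, p=pi_beta[s]) for s in states])

# Count how often each (s, a) pair appears in D
counts = np.zeros((n_states, n_actions), dtype=int)
np.add.at(counts, (states, actions), 1)
print(counts)  # rarely-chosen actions have few or zero samples -> large sample error there
```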

1.2 Operator

For background on the Bellman operator and the policy iteration process, see the article "Understanding Dynamic Programming via the Bellman Operator".

1.2.1 Bellman operator

$$\mathcal{B}^{\pi} Q = r + \gamma P^{\pi} Q$$

$$P^{\pi} Q(\mathbf{s}, \mathbf{a}) = \mathbb{E}_{\mathbf{s}^{\prime} \sim T\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right),\, \mathbf{a}^{\prime} \sim \pi\left(\mathbf{a}^{\prime} \mid \mathbf{s}^{\prime}\right)}\left[Q\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]$$
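As a quick sanity check of the definition, here is a minimal tabular sketch of $\mathcal{B}^{\pi}$ (array shapes and names are illustrative assumptions); repeatedly applying this operator converges to $Q^{\pi}$.

```python
import numpy as np

def bellman_operator(Q, r, T, pi, gamma=0.99):
    """Tabular Bellman operator: (B^pi Q)(s,a) = r(s,a) + gamma * E_{s'~T, a'~pi}[Q(s',a')].

    Q:  (S, A) current Q-table
    r:  (S, A) reward table
    T:  (S, A, S) transition probabilities T(s'|s,a)
    pi: (S, A) policy probabilities pi(a|s)
    """
    v_next = (pi * Q).sum(axis=1)                     # E_{a'~pi}[Q(s', a')], shape (S,)
    return r + gamma * np.einsum("sap,p->sa", T, v_next)
```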

1.2.2 Empirical Bellman operator

The offline dataset does not contain transitions for every action, so the backup can only be performed with the data actually present in $\mathcal{D}$; this empirical Bellman operator is denoted $\hat{\mathcal{B}}^{\pi}$.
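In code, the only difference from the exact operator is that the expectation over $T(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a})$ is replaced by the single next states stored in the dataset; a toy sketch with assumed names:

```python
import numpy as np

def empirical_backup(Q, rewards, next_states, pi, gamma=0.99):
    """Empirical Bellman backup for a batch of dataset transitions (s, a, r, s').

    Q:           (S, A) current Q-table
    rewards:     (N,) rewards from the dataset
    next_states: (N,) next-state indices from the dataset
    pi:          (S, A) policy probabilities pi(a'|s')
    """
    v_next = (pi[next_states] * Q[next_states]).sum(axis=1)  # E_{a'~pi}[Q(s', a')]
    return rewards + gamma * v_next                          # one target per dataset transition
```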

1.2.3 Optimal Bellman operator

$$\mathcal{B}^{*} Q(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a}) + \gamma\, \mathbb{E}_{\mathbf{s}^{\prime} \sim P\left(\mathbf{s}^{\prime} \mid \mathbf{s}, \mathbf{a}\right)}\left[\max_{\mathbf{a}^{\prime}} Q\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]$$
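Compared with $\mathcal{B}^{\pi}$, the expectation over the policy is simply replaced by a max over actions; in the same illustrative tabular setting:

```python
import numpy as np

def optimal_bellman_operator(Q, r, T, gamma=0.99):
    """Tabular optimal Bellman operator: (B* Q)(s,a) = r(s,a) + gamma * E_{s'~P}[max_{a'} Q(s',a')]."""
    v_next = Q.max(axis=1)                            # max_{a'} Q(s', a'), shape (S,)
    return r + gamma * np.einsum("sap,p->sa", T, v_next)
```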

1.3 Policy Iteration

1.3.1 Policy Evaluation

While optimizing the current policy, we maintain the value function of that policy and use it to estimate how good the policy is. Each evaluation step regresses $Q$ onto the Bellman target computed with the previous estimate $\hat{Q}^{k}$:

$$\hat{Q}^{k+1} \leftarrow \arg\min_{Q}\, \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime} \sim \mathcal{D}}\left[\left(\left(r(\mathbf{s}, \mathbf{a}) + \gamma\, \mathbb{E}_{\mathbf{a}^{\prime} \sim \hat{\pi}^{k}\left(\mathbf{a}^{\prime} \mid \mathbf{s}^{\prime}\right)}\left[\hat{Q}^{k}\left(\mathbf{s}^{\prime}, \mathbf{a}^{\prime}\right)\right]\right) - Q(\mathbf{s}, \mathbf{a})\right)^{2}\right] \quad \text{(policy evaluation)}$$
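A minimal PyTorch sketch of this regression step might look as follows (the network and batch names are assumptions for illustration; in practice a target network is often used for $\hat{Q}^{k}$):

```python
import torch
import torch.nn.functional as F

def policy_evaluation_step(q_net, q_target, policy, batch, optimizer, gamma=0.99):
    """One fitted policy-evaluation update on a batch of offline transitions.

    q_net, q_target: map states (B, state_dim) -> Q-values (B, num_actions)
    policy:          maps states -> action probabilities (B, num_actions)
    batch:           dict with 'states', 'actions' (long, (B,)), 'rewards' (B,), 'next_states'
    """
    s, a, r, s2 = batch["states"], batch["actions"], batch["rewards"], batch["next_states"]

    with torch.no_grad():
        # Bellman target: r + gamma * E_{a'~pi^k}[Q^k(s', a')]
        target = r + gamma * (policy(s2) * q_target(s2)).sum(dim=1)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for dataset actions
    loss = F.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```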

1.3.2 Policy Improvement

The policy is then improved by acting greedily with respect to this $Q$-function, i.e., searching for actions that maximize the newly estimated $\hat{Q}^{k+1}$:

$$\hat{\pi}^{k+1} \leftarrow \arg\max_{\pi}\, \mathbb{E}_{\mathbf{s} \sim \mathcal{D},\, \mathbf{a} \sim \pi^{k}(\mathbf{a} \mid \mathbf{s})}\left[\hat{Q}^{k+1}(\mathbf{s}, \mathbf{a})\right] \quad \text{(policy improvement)}$$
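For a parametric policy, this improvement step is usually approximated by a gradient step that pushes the policy toward high-Q actions, for example (names are illustrative):

```python
import torch

def policy_improvement_step(policy, q_net, states, optimizer):
    """One approximate policy-improvement update: maximize E_{a~pi}[Q(s, a)].

    policy: maps states (B, state_dim) -> action probabilities (B, num_actions)
    q_net:  maps states -> Q-values (B, num_actions)
    """
    probs = policy(states)                          # pi(a|s)
    q_values = q_net(states).detach()               # treat Q as fixed during the actor step
    loss = -(probs * q_values).sum(dim=1).mean()    # negative expected Q under pi

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```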

2. Conservative Q-Learning (CQL)

Conservative Q-Learning (CQL) is an offline reinforcement learning algorithm designed to address the challenges of learning from a fixed dataset without any further interaction with the environment. Unlike traditional online RL methods, which rely on continuous exploration, CQL operates entirely on pre-collected data.

2.1 Key Concepts of Conservative Q-Learning

Conservative Q-learning pursues two objectives at the same time:

- **Maximizing expected return**: the primary goal is still to optimize policy performance by maximizing cumulative reward.
- **Minimizing overestimation bias**: a central problem in off-policy evaluation is that the learned Q-function tends to overestimate the values of actions that are poorly supported by the data, which leads to poor behavior on states and actions outside the dataset.

To mitigate this problem, CQL adds a regularization term to the standard Bellman backup. Rather than trusting the raw Q-estimates, the regularizer pushes Q-values down on out-of-distribution actions (via a log-sum-exp over all actions) while pushing them up on the actions that actually appear in the training set. This makes the learned Q-function robust to the distributional shift between the behavior policy that collected the data and the policy being trained.

A discrete-action version of this loss can be sketched in a few lines of PyTorch (tensor shapes and names here are illustrative, not the authors' reference implementation):

```python
import torch
import torch.nn.functional as F


def cql_loss(q_values, actions, td_target, alpha=0.5):
    """Compute a CQL(H)-style loss for discrete actions.

    Args:
        q_values:  Q(s, .) for all actions, shape (batch, num_actions)
        actions:   dataset actions, long tensor of shape (batch, 1)
        td_target: Bellman targets r + gamma * E[Q'(s', a')], shape (batch, 1)
        alpha:     conservatism coefficient

    Returns:
        Scalar loss tensor.
    """
    # Q-values of the actions actually stored in the offline dataset
    data_q = q_values.gather(1, actions)

    # Standard TD error component (ordinary policy evaluation)
    td_error = F.mse_loss(data_q, td_target)

    # Log-sum-exp penalty: push down Q on all actions, push up Q on dataset actions
    logsumexp_penalty = alpha * (
        torch.logsumexp(q_values, dim=1).mean() - data_q.mean()
    )

    return td_error + logsumexp_penalty
```

By incorporating this penalty, CQL discourages overly optimistic value estimates for unseen actions and extrapolates more safely beyond the available samples.
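For reference, the full CQL(H) objective from the paper, which the sketch above approximates, combines this log-sum-exp penalty with the Bellman error (here $\hat{\pi}_{\beta}$ is the empirical behavior policy and $\alpha$ the trade-off coefficient):

$$\min_{Q}\; \alpha\, \mathbb{E}_{\mathbf{s} \sim \mathcal{D}}\left[\log \sum_{\mathbf{a}} \exp Q(\mathbf{s}, \mathbf{a}) - \mathbb{E}_{\mathbf{a} \sim \hat{\pi}_{\beta}(\mathbf{a} \mid \mathbf{s})}\left[Q(\mathbf{s}, \mathbf{a})\right]\right] + \frac{1}{2}\, \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}^{\prime} \sim \mathcal{D}}\left[\left(Q(\mathbf{s}, \mathbf{a}) - \hat{\mathcal{B}}^{\pi_{k}} \hat{Q}^{k}(\mathbf{s}, \mathbf{a})\right)^{2}\right]$$

The first term corresponds to `logsumexp_penalty` in the code above (with dataset actions standing in for samples from $\hat{\pi}_{\beta}$), and the second term corresponds to `td_error`.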