Conservative Q-Learning (CQL), Part 3: Implementing CQL on Top of DDPG and Its Performance in Practice

The author implements and compares CQL-augmented DDPG and plain DDPG on the BipedalWalker-v3 environment. After training a model with online DDPG, its weights are used to collect a large batch of samples for training offline DDPG and offline CQL-DDPG. The results show that CQL+DDPG is slightly more stable, but its reward is not significantly different from DDPG's, while the variances of the two differ greatly. The author suggests that dataset size and sampling batch size may affect the results, and invites readers to verify them and share feedback.


I have implemented CQL and modified the code from the original article; everything is available at the link below, and comments and corrections are very welcome.
Code: https://github.com/outongyiLv/CQLwithDDPG
The following is a short description of the code to make it easier to understand and run.
1. First, I trained an agent with online DDPG; the resulting model parameters are stored in the file named DDPG_weight in the repository.
2. Using the weights obtained from the online DDPG, I collected a large number of (s, a, r, s′) transition pairs, 100,000 samples in total (you can choose your own sample size: run Sampling_data.py and change the number of iterations).
3. I implemented offline DDPG and offline CQL-DDPG on top of this sampled data; the main models are in OffLineDDPG.py and OffLineCQLDDPG.py respectively (a minimal sketch of the kind of CQL critic penalty involved follows this list).
4. You can directly run OFFLineT_CQL.py and OFFLineT_QL.py to train the DDPG variant with CQL and the plain DDPG variant respectively, obtain the results, and plot them.
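To give a concrete sense of what the CQL part of step 3 adds, here is a minimal sketch of a DDPG critic update with a CQL-style penalty on an offline batch. It is not the repository's actual code: names such as `cql_ddpg_critic_update`, `critic`, `critic_target`, `actor`, `alpha`, and `num_sampled_actions` are illustrative assumptions, and the log-sum-exp over the continuous action space is approximated with uniformly sampled random actions, a simplification of the importance-sampled estimate used in the CQL paper.

```python
import torch
import torch.nn.functional as F

def cql_ddpg_critic_update(critic, critic_target, actor, batch,
                           gamma=0.99, alpha=1.0, num_sampled_actions=10):
    """Sketch of a DDPG critic loss with a CQL penalty on an offline batch.

    Assumes critic(s, a) returns a (batch, 1) tensor, actions lie in [-1, 1],
    and r / done are (batch, 1) tensors; all of this is illustrative.
    """
    s, a, r, s_next, done = batch

    # Ordinary DDPG TD loss: regress Q(s, a) toward the bootstrapped target.
    with torch.no_grad():
        target_q = r + gamma * (1.0 - done) * critic_target(s_next, actor(s_next))
    q_data = critic(s, a)
    td_loss = F.mse_loss(q_data, target_q)

    # CQL penalty: approximate the log-sum-exp over the continuous action space
    # with uniformly sampled actions, then push those Q-values down relative to
    # the Q-values of the dataset actions.
    batch_size, action_dim = a.shape
    rand_actions = torch.empty(batch_size, num_sampled_actions, action_dim).uniform_(-1.0, 1.0)
    s_rep = s.unsqueeze(1).expand(-1, num_sampled_actions, -1).reshape(-1, s.shape[-1])
    q_rand = critic(s_rep, rand_actions.reshape(-1, action_dim)).reshape(batch_size, num_sampled_actions)
    cql_penalty = (torch.logsumexp(q_rand, dim=1) - q_data.squeeze(-1)).mean()

    return td_loss + alpha * cql_penalty
```

The first term is the ordinary DDPG critic loss; the second pushes the Q-values of unsupported actions down relative to those of the dataset actions, which is the conservatism that the offline setting calls for.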
Below are the results of my runs. Both were trained with batch = 256 for 100 × 2000 = 200,000 epochs in total; each unit on the x-axis corresponds to 2000 epochs. The result plot is shown below. The DDPG variant with CQL added is somewhat more stable, but on "BipedalWalker-v3" I did not see a clear advantage for it: the rewards of the two methods do not differ much, and the gap may partly be down to chance. One possible reason is that sampling a 100,000-transition dataset with batch = 256 is somewhat ineffective and introduces a degree of randomness. You are welcome to reproduce the experiment and share your findings with me.
The maximum rewards of my two runs are:
CQL+DDPG: 22.1141
DDPG: 25.9134

However, their variances differ greatly:
CQL+DDPG: 251.3043
DDPG: 2929.3041
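The numbers above were read off the logged reward curves. If you log the per-interval rewards of both runs, they can be recomputed and plotted roughly as follows; the `.npy` file names here are hypothetical stand-ins, not something the scripts are guaranteed to produce:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative only: load the two reward curves (one entry per 2000 epochs).
ddpg = np.load("ddpg_rewards.npy")
cql_ddpg = np.load("cql_ddpg_rewards.npy")

for name, curve in [("DDPG", ddpg), ("CQL+DDPG", cql_ddpg)]:
    print(f"{name}: max reward = {curve.max():.4f}, variance = {curve.var():.4f}")

plt.plot(ddpg, label="DDPG")          # x-axis: 1 unit = 2000 epochs
plt.plot(cql_ddpg, label="CQL+DDPG")
plt.xlabel("training progress (x 2000 epochs)")
plt.ylabel("reward")
plt.legend()
plt.show()
```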

[Figure: reward curves of CQL+DDPG and plain DDPG on BipedalWalker-v3]

### Conservative Q-Learning in Reinforcement Learning Algorithms

Conservative Q-Learning (CQL) is an offline reinforcement learning algorithm designed to address the challenges of learning from a fixed dataset without further interaction with the environment[^1]. Unlike traditional online RL methods, which require continuous exploration, CQL operates on pre-collected datasets.

#### Key Concepts of Conservative Q-Learning

In conservative Q-learning, two main objectives are pursued simultaneously:

- **Maximizing Expected Return**: The primary goal remains optimizing policy performance by maximizing cumulative rewards.
- **Minimizing Overestimation Bias**: A critical issue in off-policy evaluation is that learned policies tend to overestimate the values of actions that are not well supported by the data, which leads to poor generalization outside the observed states and actions.

To mitigate this problem, CQL introduces a regularization term into the standard Bellman backup. Instead of simply selecting the maximum predicted Q-value during updates, CQL penalizes high value estimates that lack sufficient support in the training set, comparing a soft maximum over all actions against the values the data actually supports. This makes the learned Q-function more robust to the distributional shift between the data-collecting policy and the learned policy, and prevents the catastrophic overestimation that can occur when the policy queries actions far outside the dataset.

```python
import torch

def cql_loss(q_values, next_q_values, alpha=0.5):
    """
    Compute a Conservative Q-Learning style loss.

    Args:
        q_values: Q-value estimates for the current states over candidate actions
        next_q_values: Q-value estimates for the next states over candidate actions
        alpha: Regularization coefficient (temperature of the log-sum-exp term)

    Returns:
        Scalar loss tensor
    """
    # Standard TD-error component (left as a placeholder here; in a full
    # training loop it is computed from rewards and target Q-values)
    td_error = ...

    # Log-sum-exp penalty encouraging conservatism: pushes down a soft maximum
    # of the Q-values while pushing up their average, discouraging overestimated
    # values for poorly supported actions
    logsumexp_penalty = alpha * (
        torch.logsumexp(next_q_values / alpha, dim=-1).mean()
        - next_q_values.mean()
    )
    return td_error + logsumexp_penalty
```

By incorporating such penalties, CQL effectively discourages overly optimistic value estimates for unseen scenarios, leading to safer extrapolation beyond the available samples.
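As a quick sanity check, the conservatism term above can be evaluated in isolation on dummy Q-values (the TD component in `cql_loss` is only a placeholder), for example:

```python
import torch

# Dummy batch of 2 states with Q-values for 3 candidate actions each.
next_q_values = torch.tensor([[1.0, 5.0, 2.0],
                              [0.5, 0.4, 0.3]])
alpha = 0.5

penalty = alpha * (
    torch.logsumexp(next_q_values / alpha, dim=-1).mean()
    - next_q_values.mean()
)
print(penalty)  # the penalty grows when a single action's Q-value dominates the others
```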