Offline RL: Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

ICML 2021
paper
code
Uses the variance of the Q-value as a weight, reducing the influence of OOD data.

Intro

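In offline reinforcement learning, the goal is to learn from a static dataset without any exploration or interaction with the environment. Existing Q-learning and actor-critic algorithms struggle with out-of-distribution (OOD) actions or states, which can cause large errors in value estimates and destabilize training.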

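To address this, the paper proposes a new algorithm, Uncertainty Weighted Actor-Critic (UWAC). The key idea behind UWAC is to detect OOD state-action pairs and reduce their contribution to the training objective accordingly. This is achieved with a practical dropout-based uncertainty estimation method, which keeps the Q-function from learning overly optimistic values on OOD (high-uncertainty) data. Compared with existing RL algorithms, the method adds very little overhead.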
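The down-weighting idea can be illustrated with a small NumPy sketch. It assumes an inverse-variance weight of the form beta / Var[Q], clipped from above; the names `uncertainty_weights` and `weighted_td_loss`, the parameter `beta`, and the clipping value are hypothetical choices made for illustration and are not taken from the post or the released code.

```python
import numpy as np


def uncertainty_weights(q_variance, beta=1.0, max_weight=10.0):
    """Inverse-variance weights: high-variance (likely OOD) samples get small weights.

    beta and max_weight are illustrative assumptions, not values from the paper.
    """
    q_variance = np.asarray(q_variance, dtype=float)
    # Guard against division by zero, then clip so confident samples cannot dominate.
    return np.minimum(beta / np.maximum(q_variance, 1e-8), max_weight)


def weighted_td_loss(q_pred, td_target, target_variance, beta=1.0):
    """Mean squared Bellman error with each sample scaled by its uncertainty weight."""
    w = uncertainty_weights(target_variance, beta)
    err = np.asarray(q_pred, dtype=float) - np.asarray(td_target, dtype=float)
    return float(np.mean(w * err ** 2))


# Three samples with the same TD error but increasing uncertainty.
variances = np.array([0.1, 1.0, 10.0])
weights = uncertainty_weights(variances)
loss = weighted_td_loss([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], variances)
print(weights, loss)
```

The three samples have identical TD errors, so any difference in their contribution to the loss comes only from the weights: the sample with the largest variance contributes the least.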

Method

Uncertainty estimation through dropout

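Monte-Carlo Dropout is used to estimate the uncertainty of the Q-value: dropout is applied to the output of every hidden layer during training and is also kept active at test time. The same input is then passed through the network T times, and the variance is estimated from those T predictions: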
$$\mathrm{Var}[Q(s,a)] \approx \sigma^2 + \frac{1}{T}\sum_{t=1}^{T}\hat{Q}_t(s,a)^\top \hat{Q}_t(s,a) - E[\hat{Q}(s,a)]^\top E[\hat{Q}(s,a)]$$
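As a rough illustration of the estimate above, the following NumPy sketch runs T stochastic forward passes through a tiny one-hidden-layer Q-network and computes the sample part of the variance. The network shape, the function name `mc_dropout_q`, and the parameters `p_drop` and `num_samples` are assumptions made for this example; the constant \sigma^2 term is omitted, and this is not the authors' implementation.

```python
import numpy as np


def mc_dropout_q(state_action, weights, rng, p_drop=0.1, num_samples=100):
    """Return the mean and variance of Q(s, a) over stochastic forward passes."""
    w1, b1, w2, b2 = weights
    samples = np.empty(num_samples)
    for t in range(num_samples):
        h = np.maximum(0.0, state_action @ w1 + b1)  # hidden layer with ReLU
        mask = rng.random(h.shape) >= p_drop         # dropout stays on at test time
        h = h * mask / (1.0 - p_drop)                # inverted-dropout rescaling
        samples[t] = float(h @ w2 + b2)              # scalar Q-value for this pass
    mean = samples.mean()
    # (1/T) * sum_t Q_t^2 - (E[Q])^2; for scalar Q, Q^T Q is just Q^2.
    var = np.mean(samples ** 2) - mean ** 2
    return mean, var


rng = np.random.default_rng(0)
dim, hidden = 6, 32
weights = (rng.normal(size=(dim, hidden)) * 0.3, np.zeros(hidden),
           rng.normal(size=hidden) * 0.3, 0.0)
x = rng.normal(size=dim)  # a concatenated (state, action) vector, made up here

mean, var = mc_dropout_q(x, weights, rng, p_drop=0.1, num_samples=200)
mean_det, var_det = mc_dropout_q(x, weights, rng, p_drop=0.0, num_samples=10)
print(mean, var, var_det)
```

With `p_drop=0.0` every pass is identical, so the estimated variance collapses to zero up to floating-point error; any non-zero variance in the first call comes purely from the dropout masks.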

The source code for the dropout part is:

def forward(self, input, return_preactivations=False):
        h = input
        for i, fc 