Conservative Q-Learning (CQL), Part 1: CQL1 (Lower-Bound Q-Value Estimation)

This post introduces a NeurIPS 2020 paper that I consider well worth a careful read for RL researchers: it combines solid theory with practical applicability. The original CQL paper is linked here. Because the notation in the original is somewhat inconsistent and there are a few small errors in the derivations and symbol definitions, I have read it carefully and restate the results in my own words, hoping this serves as a reference for readers and researchers in the same position, and that we can discuss and learn together. This post is heavily theoretical; readers who would rather skip the proofs and apply CQL directly can jump straight to the application material indicated below. I will keep updating this series along with the CQL code. Since CQL involves both theory and application, one post cannot hold everything, so the material is split into two parts: one on the Q-value lower bound and one on the V-value lower bound. If there are mistakes in my derivations, corrections and suggestions are welcome.
Part 1 (this post): Conservative Q-Learning (CQL), Part 1: CQL1 (Lower-Bound Q-Value Estimation), covering the basic concepts and CQL's first lower bound, on the Q-values.
Part 2: Conservative Q-Learning (CQL), Part 2: CQL2 (Lower-Bound V-Value Estimation), together with CQL(R) and CQL(H).
Readers not interested in the theory can jump directly to Section 2.4 of Part 2 for how to apply $CQL(R)$ and $CQL(H)$, with no need to go through the derivations.

The original authors' code is linked here: CQL original code. I believe there are a few small issues in that code; I leave this as an open question for now, state my concerns at the end of the article, and welcome discussion. I have emailed these questions about the paper and the code to the CQL author, Aviral Kumar.

To begin, I first fix the notation and restate the problem. Reading the original paper without this setup is quite easy to get lost in, so I hope this section serves as a guide for the rest of the post.

1. Preliminaries

1.1 Notation

Agent: the agent (the entity doing the exploration).
state: the state the agent is in; the state at time $t$ is written $s_t$.
$a$: the action the agent takes; the action taken at time $t$ is written $a_t$.
$r$: the reward the agent receives for taking action $a_t$ in state $s_t$, written $r(s_t,a_t)$.

| Symbol | Meaning |
| --- | --- |
| $\pi_{\beta}(a_t\mid s_t)$ | behavior (prior) distribution: the true probability, in the data-collection process, of taking action $a_t$ in state $s_t$ |
| $\hat{\pi}_{\beta}(a_t\mid s_t)$ | empirical behavior distribution: the probability of taking $a_t$ in $s_t$ as reflected by the samples in the dataset |
| $\hat{\pi}^{k}(a_t\mid s_t)$ | probability that the agent takes action $a_t$ in state $s_t$ at iteration $k$ |
| $\pi(a_t\mid s_t)$ | probability that the agent takes action $a_t$ in state $s_t$ after the iterations have converged |
| $Q^{k}(s_t,a_t)$ | true Q-value of $(s_t,a_t)$ at iteration $k$ |
| $\hat{Q}^{k}(s_t,a_t)$ | estimated Q-value of $(s_t,a_t)$ at iteration $k$ |
| $Q^{\pi}(s_t,a_t)$ | true Q-value of $(s_t,a_t)$ after convergence |
| $\hat{Q}^{\pi}(s_t,a_t)$ | estimated Q-value of $(s_t,a_t)$ after convergence |
| $T(s_{t+1}\mid s_t,a_t)$ | true transition probability when the agent takes $a_t$ in $s_t$ |
| $\hat{T}(s_{t+1}\mid s_t,a_t)$ | empirical transition probability when the agent takes $a_t$ in $s_t$ |
| $r(s_t,a_t)$ | true reward for taking $a_t$ in $s_t$, based on $T$ |
| $\hat{r}(s_t,a_t)$ | empirical reward for taking $a_t$ in $s_t$, based on $\hat{T}$ |
| $B^{\pi}Q(s_t,a_t)$ | $r(s_t,a_t)+\gamma\,E_{s_{t+1}\sim T,\,a_{t+1}\sim\pi(\cdot\mid s_{t+1})}[Q(s_{t+1},a_{t+1})]$ |
| $\hat{B}^{\pi}Q(s_t,a_t)$ | $\hat{r}(s_t,a_t)+\gamma\,E_{s_{t+1}\sim \hat{T},\,a_{t+1}\sim\pi(\cdot\mid s_{t+1})}[Q(s_{t+1},a_{t+1})]$ |
| $V^{\pi}(s_t)$ | $E_{a_t\sim\pi(a_t\mid s_t)}[Q^{\pi}(s_t,a_t)]$ |
| $\hat{V}^{k}(s_t)$ | $E_{a_t\sim\pi(a_t\mid s_t)}[\hat{Q}^{k}(s_t,a_t)]$ |
| $d^{\pi_{\beta}}(s_t)$ | state marginal distribution induced by $\pi_{\beta}(a\mid s)$ |
| $\hat{d}^{\pi_{\beta}}(s_t)$ | state marginal distribution induced by $\hat{\pi}_{\beta}(a\mid s)$ |

1.2 Preliminaries and Problem Setup

1.2.1 Composition of the offline dataset $D$

Consider an offline dataset $D$ collected in advance, consisting of transition tuples $D=\{(s_t,a_t,s_{t+1})\}$; let $|D|$ denote the total number of tuples. Each tuple is generated in three steps:
1. sample $s_t$ from the marginal behavior distribution $d^{\pi_{\beta}}(s_t)$;
2. sample $a_t$ from the behavior policy $\pi_{\beta}(a_t|s_t)$;
3. sample $s_{t+1}$ from the true transition distribution $T(s_{t+1}|s_t,a_t)$;
so that
$$P\big((s_t,a_t,s_{t+1})\big)=T(s_{t+1}|s_t,a_t)\,\pi_{\beta}(a_t|s_t)\,d^{\pi_{\beta}}(s_t)$$
Neither the behavior policy nor the true transition distribution is known to us, however; we can only estimate them. In practice we only have access to the following:
1. $s_t$ sampled from the empirical marginal $\hat{d}^{\pi_{\beta}}(s_t)$;
2. $a_t$ sampled from the empirical behavior policy $\hat{\pi}_{\beta}(a_t|s_t)$;
3. $s_{t+1}$ sampled from the empirical transition distribution $\hat{T}(s_{t+1}|s_t,a_t)$.
By elementary probability, these three quantities are defined through the indicator function $1(\cdot)$ as
$$\hat{d}^{\pi_{\beta}}(s_t)=\frac{\sum_{s \in D}1(s=s_t)}{|D|}$$
$$\hat{\pi}_{\beta}(a_t|s_t)=\frac{P(s_t,a_t)}{\hat{d}^{\pi_{\beta}}(s_t)}=\frac{\sum_{s,a \in D}1(s=s_t,a=a_t)}{\sum_{s \in D}1(s=s_t)}$$
$$\hat{T}(s_{t+1}|s_t,a_t)=\frac{P(s_t,a_t,s_{t+1})}{P(s_t,a_t)}=\frac{\sum_{s,a,s'\in D}1(s=s_t,a=a_t,s'=s_{t+1})}{\sum_{s,a \in D}1(s=s_t,a=a_t)}$$
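
To make these count-based definitions concrete, here is a minimal sketch (my own illustration, not code from the paper) that estimates $\hat{d}^{\pi_\beta}$, $\hat{\pi}_\beta$ and $\hat{T}$ from a list of transition tuples with hashable states and actions:

```python
from collections import Counter

def empirical_estimates(dataset):
    """Count-based estimates of the empirical state marginal d_hat, the empirical
    behavior policy pi_beta_hat, and the empirical transition model T_hat,
    from a list of (s, a, s_next) tuples."""
    n = len(dataset)
    s_count = Counter(s for s, _, _ in dataset)
    sa_count = Counter((s, a) for s, a, _ in dataset)
    sas_count = Counter(dataset)  # counts of full (s, a, s_next) triples

    d_hat = {s: c / n for s, c in s_count.items()}
    pi_beta_hat = {(s, a): c / s_count[s] for (s, a), c in sa_count.items()}
    T_hat = {(s, a, s2): c / sa_count[(s, a)] for (s, a, s2), c in sas_count.items()}
    return d_hat, pi_beta_hat, T_hat

# Tiny illustrative dataset of (s, a, s') transitions
D = [("s0", "a0", "s1"), ("s0", "a0", "s1"), ("s0", "a1", "s0"), ("s1", "a0", "s0")]
d_hat, pi_beta_hat, T_hat = empirical_estimates(D)
print(pi_beta_hat[("s0", "a0")])  # 2/3: two of the three visits to s0 took a0
print(T_hat[("s0", "a0", "s1")])  # 1.0: both (s0, a0) transitions went to s1
```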

1.2.2 The Bellman optimality operator (Q-learning) and the Bellman operator (actor-critic)

The Bellman optimality operator, written $B^{*}$, is the Q-value update used by Q-learning (QL); $\gamma$ is the discount factor:
$$B^{*}Q(s_t,a_t)=r(s_t,a_t)+\gamma\, E_{s_{t+1}\sim T}\big[\max_a Q(s_{t+1},a)\big]$$
The Bellman operator, written $B^{\pi}$, is the Q-value update used by actor-critic (AC) methods:
$$B^{\pi}Q(s_t,a_t)=r(s_t,a_t)+\gamma\, E_{s_{t+1}\sim T,\,a_{t+1}\sim\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
In the offline setting, however, the term $s_{t+1}\sim T$ cannot be evaluated exactly, because the dataset does not contain every possible successor state $s_{t+1}$. The authors therefore introduce the empirical Bellman operator $\hat{B}^{\pi}$:
$$\hat{B}^{\pi}Q(s_t,a_t)=\hat{r}(s_t,a_t)+\gamma\, E_{s_{t+1}\sim \hat{T},\,a_{t+1}\sim\pi}\big[Q(s_{t+1},a_{t+1})\big]$$
where $\hat{r}(s_t,a_t)$ is the empirical average reward
$$\hat{r}(s_t,a_t)=\frac{\sum_{s,a \in D}1(s=s_t,a=a_t)\,r(s,a)}{\sum_{s,a \in D}1(s=s_t,a=a_t)}$$
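
For intuition, here is a small sketch (again my own, assuming a small finite state and action space with tabular quantities stored in dictionaries) of one application of the empirical Bellman operator $\hat{B}^{\pi}$ using the count-based $\hat{r}$ and $\hat{T}$ above:

```python
def empirical_bellman_backup(Q, r_hat, T_hat, pi, states, actions, gamma=0.99):
    """One application of the empirical Bellman operator B_hat^pi to a tabular Q-function.

    Q and r_hat are dicts keyed by (s, a); T_hat is keyed by (s, a, s_next);
    pi is keyed by (s, a) and gives pi(a|s). Missing keys count as 0.
    """
    Q_new = {}
    for s in states:
        for a in actions:
            # E_{s' ~ T_hat(.|s,a), a' ~ pi(.|s')}[Q(s', a')]
            expected_next_q = sum(
                T_hat.get((s, a, s2), 0.0) * pi.get((s2, a2), 0.0) * Q.get((s2, a2), 0.0)
                for s2 in states
                for a2 in actions
            )
            Q_new[(s, a)] = r_hat.get((s, a), 0.0) + gamma * expected_next_q
    return Q_new
```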

1.2.3 Bellman iteration (optional for readers not interested in the details)

Section 1.2.2 defined $\hat{B}^\pi$ and $B^\pi$. The corresponding Bellman iterations are:
$$\hat{Q}^{k+1}(s_t,a_t)=B^\pi \hat{Q}^k(s_t,a_t)=r(s_t,a_t)+\gamma E_{s_{t+1}\sim T,\,a_{t+1}\sim\pi}\big[\hat{Q}^k(s_{t+1},a_{t+1})\big]$$

$$\hat{Q}^{k+1}(s_t,a_t)=\hat{B}^\pi \hat{Q}^k(s_t,a_t)=\hat{r}(s_t,a_t)+\gamma E_{s_{t+1}\sim \hat{T},\,a_{t+1}\sim\pi}\big[\hat{Q}^k(s_{t+1},a_{t+1})\big]$$
I first show where this Bellman iteration comes from; this matters, as it is one of the theoretical foundations of CQL.

Theorem 1: the following two Bellman update rules are equivalent.
$$(1)\quad Q^{k+1}(s,a)\leftarrow \arg\min_Q E_{s,a,s'}\big[(r(s,a)+\gamma E_{a'\sim\pi}[Q^k(s',a')]-Q(s,a))^2\big]$$
$$(2)\quad Q^{k+1}(s,a)\leftarrow r(s,a)+\gamma E_{s'\sim T,\,a'\sim\pi}[Q^k(s',a')]$$
Proof:
Let
$$L(Q)=E_{s,a,s'}\big[(r(s,a)+\gamma E_{a'\sim\pi}[Q^k(s',a')]-Q(s,a))^2\big]$$
$$L(Q)=\sum_{s,a}\sum_{s'}T(s'|s,a)\,P(s,a)\Big[r(s,a)+\gamma \sum_{a'}\pi(a'|s')Q^k(s',a')-Q(s,a)\Big]^2$$
Setting $\nabla_Q L(Q)=0$ gives, for each $(s,a)$,
$$P(s,a)\sum_{s'}T(s'|s,a)\Big[r(s,a)+\gamma \sum_{a'}\pi(a'|s')Q^k(s',a')-Q(s,a)\Big]=0,$$
which characterizes the $\arg\min_Q$:
$$r(s,a)+\gamma \sum_{s'}\sum_{a'}T(s'|s,a)\,\pi(a'|s')\,Q^k(s',a')=Q(s,a)$$
Rewriting the sums as expectations shows that this is exactly (2):
$$r(s,a)+\gamma E_{s'\sim T,\,a'\sim\pi}[Q^k(s',a')]=Q(s,a)\ \rightarrow\ Q^{k+1}$$
Q.E.D.
Theorem 2: if $|r(s,a)|\leq R$ for all $(s,a)$, then $Q(s,a)\leq\frac{R}{1-\gamma}$.
Proof:
From the Bellman iteration we already have
$$Q(s,a)\leftarrow r(s,a)+\gamma E_{s'\sim T,\,a'\sim\pi}[Q(s',a')]$$
$$Q(s_0,a_0)=r(s_0,a_0)+\gamma E_{s_1\sim T,\,a_1\sim\pi}[Q(s_1,a_1)]$$
$$Q(s_1,a_1)=r(s_1,a_1)+\gamma E_{s_2\sim T,\,a_2\sim\pi}[Q(s_2,a_2)]$$
$$Q(s_2,a_2)=r(s_2,a_2)+\gamma E_{s_3\sim T,\,a_3\sim\pi}[Q(s_3,a_3)]$$
Unrolling these recursions gives
$$Q(s_t,a_t)=r(s_t,a_t)+\gamma\, r(s_{t+1},a_{t+1})+\gamma^2 r(s_{t+2},a_{t+2})+\cdots$$
(with each reward term taken in expectation over the trajectory). Bounding every reward by $R$ and summing the geometric series $R(1+\gamma+\gamma^2+\cdots)=\frac{R}{1-\gamma}$ yields
$$\forall(s,a),\quad Q(s,a) \leq\frac{R}{1-\gamma}$$
Q.E.D.
With these two results in hand, we can state the paper's first lemma, whose purpose is to quantify how far the empirical Bellman operator can deviate from the true Bellman operator.
First, a remark that helps explain what follows. Why define an "empirical Bellman operator" at all? Because $T$ and $\hat{T}$ differ: the dataset $D$ does not contain every possible successor state $s_{t+1}$.

Lemma 1: suppose the following conditions hold, the first two with high probability (probability at least $1-\delta$) and the third bounding the reward; then the error between $\hat{B}^\pi$ and $B^\pi$ is controlled.

1. The error between $\hat{r}(s_t,a_t)$ and $r(s_t,a_t)$ is small, in the sense that with high probability (the inequality need not hold everywhere, only with probability at least $1-\delta$)
$$|\hat{r}(s_t,a_t)-r(s_t,a_t)|\leq\frac{C_{r,\delta}}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}},$$
where $C_{r,\delta}$ is a constant depending on $r$ and $\delta$.
2. The error between $\hat{T}$ and $T$ is small, in the sense that with high probability
$$\sum_{s_{t+1}}\big|\hat{T}(s_{t+1}|s_t,a_t)-T(s_{t+1}|s_t,a_t)\big|\leq\frac{C_{T,\delta}}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}},$$
where $C_{T,\delta}$ is a constant depending on $T$ and $\delta$.
3. $|r(s,a)|\leq R$ for all $(s,a)$.

Under the high-probability conditions 1 and 2, together with the reward bound in condition 3, the sampling error satisfies
$$\big|\hat{B}^\pi Q(s_t,a_t)- B^\pi Q(s_t,a_t)\big|\leq \frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}$$
Proof:
Write $B=\big|\hat{B}^\pi Q(s_t,a_t)- B^\pi Q(s_t,a_t)\big|$. Expanding the two operators gives
$$B=\Big|r-\hat{r}+\gamma \sum_{s_{t+1}}\big(\hat{T}-T\big)\, E_{a_{t+1}\sim\pi}[Q(s_{t+1},a_{t+1})]\Big|$$
$$B\leq|r-\hat{r}|+\Big|\gamma \sum_{s_{t+1}}\big(\hat{T}-T\big)\, E_{a_{t+1}\sim\pi}[Q(s_{t+1},a_{t+1})]\Big|$$
Applying condition 1 to the first term and the bound $Q\leq\frac{R}{1-\gamma}$ from Theorem 2 to the second term,
$$B\leq\frac{C_{r,\delta}}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}+\gamma \sum_{s_{t+1}}\big|\hat{T}-T\big|\cdot\frac{R}{1-\gamma}$$
and then applying condition 2,
$$B\leq\frac{C_{r,\delta}}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}+\frac{\gamma C_{T,\delta}}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}\cdot\frac{R}{1-\gamma}$$
$$\big|\hat{B}^\pi Q(s_t,a_t)- B^\pi Q(s_t,a_t)\big|\leq\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}$$
Q.E.D.
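
As a quick numerical illustration of this bound (the constants $C_{r,\delta}$, $C_{T,\delta}$ and $R$ below are placeholder values I chose, not values from the paper):

```python
import math


def bellman_sampling_error_bound(n_sa, c_r, c_t, reward_bound, gamma):
    """Right-hand side of Lemma 1: bound on |B_hat^pi Q - B^pi Q| at a state-action
    pair that appears n_sa times in the dataset D."""
    return (abs(c_r) + gamma * c_t * reward_bound / (1.0 - gamma)) / math.sqrt(n_sa)


# The bound shrinks like 1/sqrt(n) as the visitation count n(s, a) grows.
for n in (1, 10, 100):
    print(n, bellman_sampling_error_bound(n, c_r=1.0, c_t=1.0, reward_bound=1.0, gamma=0.99))
```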

1.2.4 The actor-critic update

My earlier post on PPO covered the policy-gradient update. Combining it with the Q-value update above gives the standard actor-critic updates:
$$Q^{k+1}(s,a)\leftarrow \arg\min_Q E_{s,a,s'\sim D}\big[(r(s,a)+\gamma E_{a'\sim{\pi}^k}[Q^k(s',a')]-Q(s,a))^2\big]$$
$$\hat{\pi}^{k+1}(a|s)\leftarrow \arg\max_\pi E_{s\sim D,\,a\sim\pi}\big[Q^{k+1}(s,a)\big]$$
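
A minimal PyTorch sketch of this alternating critic/actor update (my own illustration for a discrete-action setting; the network, optimizer and batch names are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F


def actor_critic_step(batch, q_net, q_target_net, policy, q_opt, pi_opt, gamma=0.99):
    """One alternating critic/actor update on a batch drawn from the offline dataset D.

    Assumed interfaces: batch is a dict of tensors with keys "s", "a", "r", "s_next";
    q_net(s) and q_target_net(s) return Q(s, .) of shape [B, num_actions];
    policy(s) returns action logits of shape [B, num_actions]; "a" holds integer actions.
    """
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic step: minimize E[(r + gamma * E_{a'~pi}[Q^k(s', a')] - Q(s, a))^2]
    with torch.no_grad():
        next_probs = torch.softmax(policy(s_next), dim=-1)
        target = r + gamma * (next_probs * q_target_net(s_next)).sum(dim=-1)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q_sa, target)
    q_opt.zero_grad()
    critic_loss.backward()
    q_opt.step()

    # Actor step: maximize E_{s~D, a~pi}[Q^{k+1}(s, a)] (minimize the negative)
    probs = torch.softmax(policy(s), dim=-1)
    actor_loss = -(probs * q_net(s).detach()).sum(dim=-1).mean()
    pi_opt.zero_grad()
    actor_loss.backward()
    pi_opt.step()

    return critic_loss.item(), actor_loss.item()
```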

1.2.5 Problem statement

The most evident problem in offline RL is that the dataset $D$ is fixed. During training, every tuple $(s,a,s')$ comes from $D$, which was collected with the behavior policy $\pi_\beta(a|s)$, while the policy $\pi^k(a|s)$ we are training is chosen to maximize the Q-values:
$$\hat{\pi}^{k}(a'|s')\leftarrow \arg\max_\pi E_{s'\sim D,\,a'\sim\pi}\big[Q^{k}(s',a')\big]$$
Ordinarily, after updating the policy we would let the current policy collect a new batch of transitions $(s,a,s')$ and then update the critic again with
$$Q^{k+1}(s,a)\leftarrow \arg\min_Q E_{s,a,s'\sim D}\big[(r(s,a)+\gamma E_{a'\sim\hat{\pi}^k}[Q^k(s',a')]-Q(s,a))^2\big]$$
and so on. In offline RL, however, the following problem appears:
the reward $r(s,a)$ for actions proposed by the new policy cannot be obtained, because the agent cannot interact with the environment. The actions sampled from $\hat{\pi}^k$ may lie outside the data distribution, and since $\hat{\pi}^k$ has already been optimized against the estimated Q-function, it is precisely these out-of-distribution actions whose values tend to be inflated; the true values achievable are very likely lower than what the estimate suggests. This is the most prominent and fundamental weakness of offline RL: because the agent cannot interact with the environment to correct itself, the estimated Q-values end up higher than the true Q-values. This is the well-known Q-value overestimation problem.

2. CQL: Idea, Proofs, and Application

This part involves a fair amount of theory. Readers who do not care about the proofs, or about why CQL works, can skip straight to how to apply CQL; those who want the details can follow the derivations with me.
To set up the theory, recall the standard Q update introduced in Section 1:
$$Q^{k+1}(s,a)\leftarrow \arg\min_Q E_{s,a,s'\sim D}\big[(r(s,a)+\gamma E_{a'\sim{\pi}}[Q^k(s',a')]-Q(s,a))^2\big]$$
or, written with the Bellman operator,
$$Q^{k+1}(s,a)\leftarrow \arg\min_Q E_{s,a}\big[(B^{\pi}Q^k(s,a)-Q(s,a))^2\big]$$

As discussed above, this update overestimates the Q-values. To remedy this, the original authors propose learning a lower bound on the true Q-function:
they give two versions of CQL, one yielding a pointwise lower bound on the Q-values and one yielding a lower bound on the V-values. I introduce and re-prove each of them in turn.

2.1 CQL, version 1

Note: the support of a function $f(x)$ is defined as $\mathrm{supp}(f)=\{x \mid f(x)\neq 0\}$.
CQL Theorem 1: let $\mu(a|s)$ be any distribution and $\alpha>0$ a coefficient such that $\mathrm{supp}(\mu)\subset \mathrm{supp}(\pi_\beta)$ (that is, $\pi_\beta(a|s)=0 \Rightarrow \mu(a|s)=0$), and suppose the high-probability conditions of Lemma 1 hold. Then, for $\alpha$ large enough, the Q-function produced by the CQL1 update below satisfies $\hat{Q}^\pi(s,a) \leq Q^\pi(s,a)$ for all $(s,a)$. In addition, if $\hat{B}^\pi=B^\pi$, i.e. there is no sampling error, then none of the Lemma 1 conditions are needed and $\hat{Q}^\pi(s,a) \leq Q^\pi(s,a)$ holds for all $(s,a)$ and every $\alpha>0$.
The CQL1 update is
$$Q^{k+1}(s,a)\leftarrow \arg\min_Q\Big[\tfrac{1}{2}E_{s,a\sim D}\big[(\hat{B}^{\pi}Q^k(s,a)-Q(s,a))^2\big]+\alpha\, E_{s\sim D,\,a\sim\mu(a|s)}\big[Q(s,a)\big]\Big]$$
Proof:
Following the same approach as before, let $L(Q)=\frac{1}{2}E_{s,a}\big[(\hat{B}^{\pi}Q^k(s,a)-Q(s,a))^2\big]+\alpha\, E_{s\sim D,\,a\sim\mu(a|s)}[Q(s,a)]$, set $\nabla_Q L(Q)=0$, and solve for $Q$:
$$\nabla_Q L(Q)=-\sum_{s'}\hat{T}(s'|s,a)\,P(s,a)\Big[\hat{r}(s,a)+\gamma \sum_{a'}{\pi}(a'|s')Q^k(s',a')-Q(s,a)\Big]+\alpha\, d^{\pi_\beta}(s)\,\mu(a|s)$$
Setting this to zero gives
$$\frac{\alpha\, d^{\pi_\beta}(s)\,\mu(a|s)}{P(s,a)}=\hat{r}(s,a)+\gamma E_{s'\sim\hat{T},\,a'\sim{\pi}}[Q^k(s',a')]-Q(s,a),$$
that is, using $P(s,a)=d^{\pi_\beta}(s)\,\pi_\beta(a|s)$,
$$\alpha\frac{\mu(a|s)}{\pi_\beta(a|s)}=\hat{B}^{\pi}Q^k(s,a)-Q(s,a)$$
Rearranging yields the CQL1 Q update
$$(\mathrm{CQL1})\quad Q^{k+1}(s,a)=\hat{B}^{\pi}Q^k(s,a)-\alpha\frac{\mu(a|s)}{\pi_\beta(a|s)}$$
whereas the ordinary RL update is
$$(\mathrm{RL})\quad Q^{k+1}(s,a)=B^{\pi}Q^k(s,a)$$
We now compare the two. By Lemma 1 we already have the estimate
$$\big|\hat{B}^\pi Q(s_t,a_t)- B^\pi Q(s_t,a_t)\big|\leq\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}},$$
so
$$(\mathrm{CQL1})\quad Q^{k+1}(s_t,a_t)\leq B^{\pi}Q^k(s_t,a_t)+\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}$$
Letting $k\rightarrow\infty$ so that the iterates stabilize gives
$$\hat{Q}^{\pi}(s_t,a_t)\leq B^{\pi}\hat{Q}^\pi(s_t,a_t)+\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}$$
The true Q-function satisfies the Bellman equation
$$B^{\pi}Q(s_t,a_t)=r(s_t,a_t)+\gamma E_{s_{t+1}\sim T,\,a_{t+1}\sim\pi}[Q(s_{t+1},a_{t+1})]$$
Define $P^\pi Q(s_t,a_t)=E_{s_{t+1}\sim T,\,a_{t+1}\sim\pi}[Q(s_{t+1},a_{t+1})]$,
so that $B^{\pi}Q(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q(s_t,a_t)$. At the fixed point,
$$Q^\pi(s_t,a_t)=r(s_t,a_t)+\gamma P^\pi Q^\pi(s_t,a_t)\ \Rightarrow\ Q^\pi(s_t,a_t)=(I-\gamma P^\pi)^{-1}r(s_t,a_t)$$
Therefore
$$\hat{Q}^{\pi}(s_t,a_t)\leq r(s_t,a_t)+\gamma P^\pi \hat{Q}^\pi(s_t,a_t)+\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}$$
$$\hat{Q}^{\pi}(s_t,a_t)\leq(I-\gamma P^\pi)^{-1}\Big[r(s_t,a_t)+\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}\Big]$$
(the second step uses the fact that $(I-\gamma P^\pi)^{-1}=\sum_{k\geq 0}\gamma^k (P^\pi)^k$ has non-negative entries and therefore preserves the inequality). The two cases below correspond to whether or not sampling error is present:
$$\hat{Q}^{\pi}(s_t,a_t)\leq Q^\pi(s_t,a_t)+(I-\gamma P^\pi)^{-1}\Big[\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}\Big]$$
$$\hat{Q}^{\pi}(s_t,a_t)\leq Q^\pi(s_t,a_t)+(I-\gamma P^\pi)^{-1}\Big[-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}\Big]$$
1. When sampling error is present, taking $\alpha$ large enough makes the bracketed term negative, and then
$$\hat{Q}^{\pi}(s_t,a_t)\leq Q^\pi(s_t,a_t)$$
holds everywhere.
Interestingly, this "large enough" $\alpha$ can be computed explicitly. The bracketed term is negative whenever
$$\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}-\alpha\frac{\mu(a_t|s_t)}{\pi_\beta(a_t|s_t)}<0,$$
which holds for every $(s_t,a_t)$ as soon as
$$\alpha \geq \max_{s_t,a_t}\frac{|C_{r,\delta}|+\big|\frac{\gamma C_{T,\delta}R}{1-\gamma}\big|}{\sqrt{\sum_{s,a \in D}1(s=s_t,a=a_t)}}\cdot\max_{s_t,a_t}\frac{\pi_\beta(a_t|s_t)}{\mu(a_t|s_t)}$$
2. When there is no sampling error, the bracketed term is already non-positive without any tuning of $\alpha$, and again
$$\hat{Q}^{\pi}(s_t,a_t)\leq Q^\pi(s_t,a_t)$$
holds everywhere.
Q.E.D.
This part of my derivation corresponds to Theorem 3.1 of the original paper, shown below; my proof differs slightly from the original, but it is essentially the same.
[Figure: Theorem 3.1 as stated in the original CQL paper]
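
To make the CQL1 update concrete, here is a minimal sketch of the regularized objective $\frac{1}{2}E_{s,a\sim D}\big[(\hat{B}^{\pi}Q^k(s,a)-Q(s,a))^2\big]+\alpha\,E_{s\sim D,\,a\sim\mu}\big[Q(s,a)\big]$ for a discrete-action Q-network (my own illustration; the choice of $\mu$, the tensor shapes and the precomputed targets are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F


def cql1_q_loss(q_values, actions, bellman_target, mu_probs, alpha=1.0):
    """CQL1 objective: 0.5 * Bellman error + alpha * E_{s~D, a~mu}[Q(s, a)].

    q_values: Q(s, .) for the batch states, shape [B, num_actions]
    actions: actions stored in the dataset, integer tensor of shape [B]
    bellman_target: precomputed empirical backup B_hat^pi Q^k(s, a), shape [B]
    mu_probs: mu(.|s) for the batch states, shape [B, num_actions]
    """
    q_data = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    bellman_error = 0.5 * F.mse_loss(q_data, bellman_target)

    # Regularizer: expected Q-value under mu(a|s). Minimizing it pushes these
    # Q-values down, which is what yields the lower bound of CQL Theorem 1.
    regularizer = alpha * (mu_probs * q_values).sum(dim=-1).mean()

    return bellman_error + regularizer
```

Note how the regularizer only pushes Q-values down under $\mu$; this is exactly the term that produces the lower bound in CQL Theorem 1, and choosing $\alpha$ large enough compensates for the sampling-error term of Lemma 1.
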
The next post, Conservative Q-Learning (CQL), Part 2: CQL2 (Lower-Bound V-Value Estimation), introduces CQL's second lower bound, on the V-values (the iteration-wise lower bound); together, these two results are the foundation for applying CQL. Thank you for reading.

### Conservative Q-Learning in Reinforcement Learning Algorithms

Conservative Q-Learning (CQL) is an offline reinforcement learning algorithm designed to address the challenges of learning from a fixed dataset without further interaction with the environment[^1]. Unlike traditional online RL methods, which require continuous exploration, CQL operates on pre-collected datasets.

#### Key Concepts of Conservative Q-Learning

In conservative Q-learning, two main objectives are pursued simultaneously:

- **Maximizing expected return**: the primary goal remains optimizing policy performance by maximizing cumulative reward.
- **Minimizing overestimation bias**: a critical issue in off-policy evaluation is that learned policies tend to overestimate the values of actions that are not well supported by the data, which leads to poor generalization outside the observed states and actions.

To mitigate this problem, CQL introduces a regularization term into the standard Bellman backup. Instead of simply trusting the maximum predicted Q-values during updates, CQL penalizes high value estimates that lack sufficient support in the training set while keeping the values of actions actually observed in the data anchored to the Bellman targets. This makes the learned Q-function more robust to the distributional shift between the data-collection policy and the learned policy and prevents over-optimistic extrapolation beyond the observed data.

The sketch below assumes a discrete-action Q-network whose outputs cover all actions and TD targets that are computed outside this function:

```python
import torch
import torch.nn.functional as F


def cql_loss(q_values, q_data, td_target, alpha=0.5):
    """
    Compute a Conservative Q-Learning style loss (sketch).

    Args:
        q_values: Q(s, a) over all actions for the batch states, shape [batch, num_actions]
        q_data: Q(s, a) for the actions actually stored in the dataset, shape [batch]
        td_target: precomputed TD targets r + gamma * Q(s', a'), shape [batch]
        alpha: regularization coefficient

    Returns:
        Loss scalar tensor
    """
    # Standard TD error component on the dataset actions
    td_error = F.mse_loss(q_data, td_target)

    # Log-sum-exp penalty encouraging conservatism: push down Q-values over
    # all actions while pushing up Q-values of the actions seen in the data
    logsumexp_penalty = alpha * (
        torch.logsumexp(q_values, dim=-1).mean() - q_data.mean()
    )

    return td_error + logsumexp_penalty
```

By incorporating such a penalty, CQL discourages overly optimistic assessments of unseen state-action pairs, leading to safer extrapolation beyond the available samples.
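
A quick sanity check of the loss above with random tensors (all shapes and values are illustrative only):

```python
import torch

batch, num_actions = 32, 4
q_values = torch.randn(batch, num_actions, requires_grad=True)  # Q(s, .) per state
actions = torch.randint(0, num_actions, (batch,))               # dataset actions
q_data = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a) on dataset actions
td_target = torch.randn(batch)                                  # stand-in for r + gamma * Q(s', a')

loss = cql_loss(q_values, q_data, td_target, alpha=0.5)
loss.backward()  # gradients flow into q_values as they would into a Q-network
print(loss.item())
```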