Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data

ICLR 2024 Spotlight
paper

Method

The method combines contrastive learning with goal-conditioned offline visual RL to obtain a self-supervised training signal. Consider an MDP with a goal-conditioned policy $\pi(a\mid s,g)$, and let $\mathbb{P}^{\pi(\cdot\mid\cdot,g)}(s_t=s\mid s_0,a_0)$ denote the probability density of being in state $s$ after $t$ steps when starting from $(s_0,a_0)$ and then following the policy. The discounted state occupancy measure is defined as
$$p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=s\mid s_0,a_0)\triangleq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}^{\pi(\cdot\mid\cdot,g)}(s_t=s\mid s_0,a_0).$$
The policy is optimized to maximize the expected occupancy of the commanded goal:
$$\mathbb{E}_{p_g(g)}\big[p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=g)\big]=\mathbb{E}_{p_g(g)\,p_0(s_0)\,\pi(a_0\mid s_0,g)}\big[p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=g\mid s_0,a_0)\big].$$
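Substituting the definition of the occupancy measure makes the interpretation explicit: up to the constant $(1-\gamma)$, the objective is the expected discounted probability (density) of sitting at the commanded goal,
$$\mathbb{E}_{p_g(g)}\big[p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=g)\big]=(1-\gamma)\,\mathbb{E}_{p_g(g)\,p_0(s_0)\,\pi(a_0\mid s_0,g)}\left[\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}^{\pi(\cdot\mid\cdot,g)}(s_t=g\mid s_0,a_0)\right],$$
so maximizing it drives the policy to reach $g$ as quickly and as reliably as possible.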

Following prior work [C-learning], this objective is optimized with contrastive representation learning. The critic is parameterized as an inner product of two representations, $f(s,a,s_{t+})=\phi(s,a)^{\top}\psi(s_{t+})$, which measures how strongly the current state-action pair is associated with a candidate future state.
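A minimal sketch of this inner-product critic, assuming flat (already-encoded) state and action vectors and simple MLP encoders; the class name, hidden sizes, and `repr_dim` below are illustrative choices, not taken from the paper's code:

```python
import torch
import torch.nn as nn

class InnerProductCritic(nn.Module):
    """f(s, a, s_f) = phi(s, a)^T psi(s_f)  (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, repr_dim: int = 64):
        super().__init__()
        # phi(s, a): encodes the current state-action pair
        self.phi = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )
        # psi(s_f): encodes a candidate future / goal state
        self.psi = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, s, a, s_f):
        # Returns the logit f(s, a, s_f) for each row of the batch.
        z_sa = self.phi(torch.cat([s, a], dim=-1))   # [B, repr_dim]
        z_f = self.psi(s_f)                          # [B, repr_dim]
        return (z_sa * z_f).sum(dim=-1)              # [B]
```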

The contrastive RL objective is to distinguish a future state $s_f^{+}$ drawn from the policy's average discounted future-state distribution from an arbitrary future state $s_f^{-}$ drawn from the marginal:
$$s_f^{+}\sim p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)=\int p^{\pi(\cdot\mid\cdot,g)}(s_{t+}\mid s,a)\,p^{\pi}(g\mid s,a)\,dg$$
$$s_f^{-}\sim p(s_{t+})=\int p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)\,p(s,a)\,ds\,da$$
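How these samples are obtained is an implementation detail not spelled out above, so treat the following as an assumption: a common choice is to draw $s_f^{+}$ from the same trajectory as $(s,a)$ with a geometrically distributed offset matching the $\gamma$-discounting, and to form $s_f^{-}$ by shuffling the positives across the batch (which approximates the marginal $p(s_{t+})$):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_positive_future(traj_obs, t, gamma=0.99):
    """Sample s_f^+ from the discounted future of a single trajectory.

    traj_obs: [T, obs_dim] observations of one trajectory; t: index of (s, a).
    The offset k >= 1 is geometric, P(k) = (1 - gamma) * gamma**(k - 1),
    matching the gamma-discounted occupancy (truncated at the episode end).
    """
    offset = rng.geometric(p=1.0 - gamma)
    return traj_obs[min(t + offset, len(traj_obs) - 1)]

def sample_negatives(batch_future_obs):
    """Sample s_f^- from the marginal p(s_{t+}) by shuffling positives across the batch."""
    return batch_future_obs[rng.permutation(len(batch_future_obs))]
```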
The critic is trained with the binary NCE loss:
$$\mathbb{E}_{s_f^{+}\sim p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)}\Big[\underbrace{\log\sigma\big(\phi(s,a)^{\top}\psi(s_f^{+})\big)}_{\mathcal{L}_1(\phi(s,a),\,\psi(s_f^{+}))}\Big]+\mathbb{E}_{s_f^{-}\sim p(s_{t+})}\Big[\underbrace{\log\big(1-\sigma(\phi(s,a)^{\top}\psi(s_f^{-}))\big)}_{\mathcal{L}_2(\phi(s,a),\,\psi(s_f^{-}))}\Big].$$
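A direct translation of the two expectation terms into a Monte Carlo loss, reusing the `InnerProductCritic` sketch above; pairing each batch element with exactly one positive and one shuffled negative is my implementation choice, not necessarily the authors':

```python
import torch
import torch.nn.functional as F

def binary_nce_critic_loss(critic, s, a, s_f_pos, s_f_neg):
    """Binary NCE critic loss (sketch of the objective above)."""
    f_pos = critic(s, a, s_f_pos)   # logits f(s, a, s_f^+), shape [B]
    f_neg = critic(s, a, s_f_neg)   # logits f(s, a, s_f^-), shape [B]
    # Maximize log sigma(f_pos) + log(1 - sigma(f_neg)); equivalently,
    # minimize binary cross-entropy with labels 1 (positives) and 0 (negatives).
    loss_pos = F.binary_cross_entropy_with_logits(f_pos, torch.ones_like(f_pos))
    loss_neg = F.binary_cross_entropy_with_logits(f_neg, torch.zeros_like(f_neg))
    return loss_pos + loss_neg
```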
In the off-policy setting, this objective can be rewritten in a TD form [C-learning], where $\lfloor\cdot\rfloor_{\mathrm{sg}}$ denotes a stop-gradient:
$$\max_f\;\mathbb{E}_{(s,a)\sim p(s,a),\,s'\sim p(s'\mid s,a)}\Big[(1-\gamma)\log\sigma\big(f(s,a,s')\big)+\log\big(1-\sigma(f(s,a,s_f))\big)+\gamma\,\lfloor w(s',a',s_f)\rfloor_{\mathrm{sg}}\,\log\sigma\big(f(s,a,s_f)\big)\Big]$$
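A hedged sketch of this TD variant. Following C-learning, I take the stop-gradient weight to be the classifier odds $w=\sigma(f)/(1-\sigma(f))=e^{f}$ evaluated at the next state-action pair with $a'\sim\pi(\cdot\mid s',g=s_f)$; the `policy(s, g)` distribution interface and the clamping constant are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def td_nce_critic_loss(critic, policy, s, a, s_next, s_f, gamma=0.99):
    """TD-style (C-learning-like) critic loss, minimal sketch."""
    f_next_state = critic(s, a, s_next)          # f(s, a, s')
    f_rand = critic(s, a, s_f)                   # f(s, a, s_f), s_f a random future state

    with torch.no_grad():                        # ⌊.⌋_sg : weight treated as a constant
        a_next = policy(s_next, s_f).sample()    # a' ~ pi(. | s', g = s_f)  (assumed API)
        w = torch.exp(critic(s_next, a_next, s_f)).clamp(max=20.0)

    log_sig_next = F.logsigmoid(f_next_state)    # log sigma(f(s, a, s'))
    log_sig_rand = F.logsigmoid(f_rand)          # log sigma(f(s, a, s_f))
    log_one_minus = F.logsigmoid(-f_rand)        # log(1 - sigma(f(s, a, s_f)))

    objective = (1 - gamma) * log_sig_next + log_one_minus + gamma * w * log_sig_rand
    return -objective.mean()                     # maximize objective -> minimize loss
```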
The policy is optimized in the offline setting by maximizing the critic $f$, combined with a behavior-cloning (BC) regularizer on the dataset action $a_{\mathrm{orig}}$:
$$\max_{\pi(\cdot\mid\cdot,\cdot)}\;\mathbb{E}_{p_g(g)\,p(s,a_{\mathrm{orig}})\,\pi(a\mid s,g)}\big[(1-\lambda)\cdot f(s,a,g)+\lambda\,\log\pi(a_{\mathrm{orig}}\mid s,g)\big]$$
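A corresponding actor-loss sketch; the policy's distribution interface (`rsample`, `log_prob`) and the default $\lambda$ are assumptions, not values from the paper:

```python
import torch

def actor_loss(critic, policy, s, a_orig, g, lam=0.5):
    """Goal-conditioned actor loss with BC regularization (sketch)."""
    dist = policy(s, g)                           # pi(. | s, g), a torch distribution
    a = dist.rsample()                            # reparameterized action sample
    q = critic(s, a, g)                           # f(s, a, g): higher = more likely to reach g
    bc = dist.log_prob(a_orig)                    # log pi(a_orig | s, g), dataset action
    return -((1 - lam) * q + lam * bc).mean()     # maximize objective -> minimize negative
```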

Network Architecture

[Figure: network architecture]
