ICLR 2024 Spotlight
paper
Method
Contrastive learning is combined with goal-conditioned offline visual reinforcement learning to achieve self-supervised learning. Consider the following MDP setting, where the policy is $\pi(a\mid s,g)$ and $\mathbb{P}^{\pi(\cdot\mid\cdot,g)}(s_{t}=s\mid s_{0},a_{0})$ denotes the probability density of being in state $s$ after $t$ steps when the policy is executed from the initial state-action pair $(s_0,a_0)$. The discounted state occupancy measure is then
$$p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=s\mid s_0,a_0)\triangleq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{P}^{\pi(\cdot\mid\cdot,g)}(s_{t}=s\mid s_0,a_0)$$
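In implementations, a sample from this discounted occupancy is typically obtained by drawing a future time offset from a geometric distribution over the remainder of the trajectory. Below is a minimal Python sketch of that sampling step, assuming trajectories are stored as arrays of states; the function name `sample_future_state` and the boundary clipping are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_future_state(traj_states, t, gamma=0.99, rng=None):
    """Sample s_{t+} with P(offset = k) = (1 - gamma) * gamma^k for one stored trajectory."""
    rng = rng or np.random.default_rng()
    # numpy's geometric distribution counts trials starting at 1, so subtract 1 to get k >= 0.
    k = rng.geometric(p=1.0 - gamma) - 1
    # Clip to the trajectory end -- a common approximation for finite-length trajectories.
    idx = min(t + k, len(traj_states) - 1)
    return traj_states[idx]

# Usage: draw a discounted future state for the first state of a toy 20-step trajectory.
traj = np.arange(20, dtype=np.float32).reshape(20, 1)
s_future = sample_future_state(traj, t=0, gamma=0.9)
```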
The policy-optimization objective is then to maximize the expected state occupancy measure evaluated at the goal:
$$\mathbb{E}_{p_g(g)}\big[p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=g)\big]=\mathbb{E}_{p_g(g)\,p_0(s_0)\,\pi(a_0\mid s_0,g)}\Big[p^{\pi(\cdot\mid\cdot,g)}(s_{t+}=g\mid s_0,a_0)\Big]$$
Following prior work [C-learning], this objective is optimized via contrastive representation learning. First, define the critic $f(s,a,s_{t+})=\phi(s,a)^{\top}\psi(s_{t+})$: the inner product of the two representations measures how strongly the current state-action pair is associated with a future state.
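A minimal PyTorch sketch of such an inner-product critic is given below; the two-layer MLP encoders and the names `phi` / `psi` / `InnerProductCritic` are placeholder choices, not the paper's actual (visual) architecture.

```python
import torch
import torch.nn as nn

class InnerProductCritic(nn.Module):
    """f(s, a, s_future) = phi(s, a)^T psi(s_future)."""

    def __init__(self, state_dim, action_dim, repr_dim=64):
        super().__init__()
        # phi encodes the (state, action) pair; psi encodes a candidate future state.
        self.phi = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, repr_dim))
        self.psi = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, repr_dim))

    def forward(self, s, a, s_future):
        z_sa = self.phi(torch.cat([s, a], dim=-1))  # (B, repr_dim)
        z_f = self.psi(s_future)                    # (B, repr_dim)
        return (z_sa * z_f).sum(dim=-1)             # (B,) inner products f(s, a, s_future)
```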
The contrastive RL objective is then to discriminate the average future state $s_{f}^{+}$ from an arbitrarily sampled future state $s_{f}^{-}$:
$$s_{f}^{+}\sim p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)=\int p^{\pi(\cdot\mid\cdot,g)}(s_{t+}\mid s,a)\,p^{\pi}(g\mid s,a)\,dg$$
$$s_{f}^{-}\sim p(s_{t+})=\int p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)\,p(s,a)\,da\,ds$$
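With a replay batch, these two samplers are often realized by pairing each $(s,a)$ with a future state from its own trajectory (the positive) and with the future states of other batch elements (the negatives). A hedged sketch of that in-batch construction, which may differ from the paper's exact sampling procedure:

```python
import torch

def make_positive_negative(s_future_batch):
    """Build in-batch positives/negatives from future states aligned with their (s, a) pairs.

    s_future_batch: (B, state_dim) tensor; row i is a future state drawn from the
                    trajectory of the i-th (s, a) pair, i.e. the positive s_f^+.
    """
    positives = s_future_batch
    # Shuffling rows pairs each (s, a) with someone else's future state,
    # approximating draws from the marginal p(s_{t+}) used for s_f^-.
    negatives = s_future_batch[torch.randperm(s_future_batch.shape[0])]
    return positives, negatives
```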
The critic is optimized with the binary NCE loss:
$$\mathbb{E}_{s_f^{+}\sim p^{\pi(\cdot\mid\cdot)}(s_{t+}\mid s,a)}\Big[\underbrace{\log\sigma\big(\phi(s,a)^{\top}\psi(s_{f}^{+})\big)}_{\mathcal{L}_{1}(\phi(s,a),\,\psi(s_{f}^{+}))}\Big]+\mathbb{E}_{s_f^{-}\sim p(s_{t+})}\Big[\underbrace{\log\big(1-\sigma\big(\phi(s,a)^{\top}\psi(s_{f}^{-})\big)\big)}_{\mathcal{L}_{2}(\phi(s,a),\,\psi(s_{f}^{-}))}\Big].$$
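A minimal sketch of this binary NCE objective in PyTorch (minimizing its negative), reusing the hypothetical `InnerProductCritic` from above:

```python
import torch
import torch.nn.functional as F

def binary_nce_loss(critic, s, a, s_pos, s_neg):
    """Negative of  E[log sigma(f(s,a,s_f^+))] + E[log(1 - sigma(f(s,a,s_f^-)))]."""
    logits_pos = critic(s, a, s_pos)  # f(s, a, s_f^+)
    logits_neg = critic(s, a, s_neg)  # f(s, a, s_f^-)
    # Binary cross-entropy with logits: label 1 for positives, label 0 for negatives.
    loss_pos = F.binary_cross_entropy_with_logits(logits_pos, torch.ones_like(logits_pos))
    loss_neg = F.binary_cross_entropy_with_logits(logits_neg, torch.zeros_like(logits_neg))
    return loss_pos + loss_neg
```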
In the off-policy setting, the objective above can be rewritten in a TD form [C-learning]:
$$\begin{aligned}\max_{f}\;\mathbb{E}_{(s,a)\sim p(s,a),\,s'\sim p(s'\mid s,a)}\Big[&(1-\gamma)\log\sigma\big(f(s,a,s')\big)\\&+\log\big(1-\sigma\big(f(s,a,s_{f})\big)\big)\\&+\gamma\,\lfloor w(s',a',s_{f})\rfloor_{\mathrm{sg}}\log\sigma\big(f(s,a,s_{f})\big)\Big]\end{aligned}$$
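A sketch of this TD-form update is below. Following C-learning-style implementations, I assume the stop-gradient weight is the classifier odds at the next state-action pair, $w(s',a',s_f)=\sigma(f)/(1-\sigma(f))=\exp(f(s',a',s_f))$; the paper may define $w$ differently, so treat this as an assumption.

```python
import torch
import torch.nn.functional as F

def td_nce_loss(critic, s, a, s_next, a_next, s_f, gamma=0.99):
    """Negative of  (1-gamma) log sigma(f(s,a,s')) + log(1 - sigma(f(s,a,s_f)))
                    + gamma * sg[w(s',a',s_f)] * log sigma(f(s,a,s_f))."""
    logits_next = critic(s, a, s_next)  # f(s, a, s')
    logits_f = critic(s, a, s_f)        # f(s, a, s_f)
    with torch.no_grad():
        # Assumed importance weight w = exp(f) at (s', a'), wrapped in stop-gradient.
        w = torch.exp(critic(s_next, a_next, s_f))
    objective = ((1.0 - gamma) * F.logsigmoid(logits_next)
                 + F.logsigmoid(-logits_f)              # log(1 - sigma(x)) = log sigma(-x)
                 + gamma * w * F.logsigmoid(logits_f))
    return -objective.mean()
```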
The policy is optimized in the offline setting by maximizing the critic $f$, combined with a behavior-cloning (BC) regularizer:
$$\max_{\pi(\cdot\mid\cdot,\cdot)}\;\mathbb{E}_{p_g(g)\,p(s,a_{\mathrm{orig}})\,\pi(a\mid s,g)}\Big[(1-\lambda)\cdot f(s,a,g)+\lambda\log\pi(a_{\mathrm{orig}}\mid s,g)\Big]$$
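A minimal sketch of the corresponding actor update, assuming the policy returns a reparameterizable `torch.distributions` object over actions conditioned on $(s,g)$ and that `lam` plays the role of $\lambda$; this is an illustrative rendering rather than the paper's exact implementation.

```python
import torch

def actor_loss(policy, critic, s, a_orig, g, lam=0.5):
    """Negative of  (1 - lambda) * f(s, a, g) + lambda * log pi(a_orig | s, g)."""
    dist = policy(s, g)                      # assumed: returns a torch.distributions object
    a = dist.rsample()                       # a ~ pi(. | s, g), reparameterized for gradients
    q_term = critic(s, a, g)                 # critic with the commanded goal as the future state
    bc_term = dist.log_prob(a_orig).sum(-1)  # BC log-likelihood of the dataset action a_orig
    return -((1.0 - lam) * q_term + lam * bc_term).mean()
```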