IEEE TAI 2024
paper
code
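MOUP estimates the uncertainty of state-action pairs with MC Dropout applied inside an ensemble model, and uses that estimate as a regularization term on the reward function. Policy optimization is then carried out under an MMD constraint.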
Intro
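Explicit uncertainty estimation is a promising way to address the overestimation caused by distribution shift in offline RL. The paper proposes a new algorithm called MOUP (Model-based Offline RL with Uncertainty estimation and Policy constraint). It introduces Monte Carlo (MC) dropout into the ensemble networks to obtain reliable uncertainty estimates, and it integrates a maximum mean discrepancy (MMD) constraint into policy optimization to limit state mismatch.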
Method
Ensemble Dropout Network
The environment dynamics are modeled with an ensemble of Gaussian models:
$$
\begin{aligned}
T_{\phi}(s_{t+1}, r_{t} \mid s_{t}, a_{t}) &= \mathcal{N}\big(\mu_{\phi}(s_{t},a_{t}),\, \Sigma_{\phi}(s_{t},a_{t})\big), \\
\mu_{\phi}(s_{t},a_{t}) &= \sum_{i=1}^{K} \mu_{\phi^{i}}(s_{t},a_{t}), \\
\Sigma_{\phi}(s_{t},a_{t}) &= \sum_{i=1}^{K} \Sigma_{\phi^{i}}(s_{t},a_{t}).
\end{aligned}
$$
For each model, the estimation variance is modeled with MC Dropout:
$$
\mathrm{Var}[\phi] = \frac{1}{K}\sum_{i=1}^{K} \mu_{\phi^{i}}(s_{t},a_{t})^{T}\, \mu_{\phi^{i}}(s_{t},a_{t}) - E\big[\mu_{\phi^{i}}(s_{t},a_{t})\big]^{T} E\big[\mu_{\phi^{i}}(s_{t},a_{t})\big]. \tag{12}
$$
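Eq. (12) is a second moment minus a squared mean, which can be sketched in a few lines. The function name `mc_dropout_variance` and the list-of-vectors input are hypothetical, and treating the K terms as K stochastic forward passes with dropout left active is an assumption of this sketch, not something the notes spell out.

```python
def mc_dropout_variance(samples):
    """Eq. (12): (1/K) * sum_i mu_i^T mu_i - E[mu]^T E[mu].

    `samples` holds K predicted mean vectors mu_i(s_t, a_t), assumed here
    to come from K forward passes with dropout active (hypothetical input).
    """
    k = len(samples)
    dim = len(samples[0])
    # Second moment: average squared norm of the sampled means.
    second_moment = sum(sum(x * x for x in mu) for mu in samples) / k
    # Empirical mean vector E[mu], then its squared norm.
    mean = [sum(mu[d] for mu in samples) / k for d in range(dim)]
    return second_moment - sum(m * m for m in mean)


# Two passes predicting [1, 0] and [3, 0]: (1 + 9) / 2 - 2**2 = 1.0
variance = mc_dropout_variance([[1.0, 0.0], [3.0, 0.0]])
```

The result is a scalar: it equals the trace of the empirical covariance of the sampled means, so it is zero when every pass returns the same prediction.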
The estimation variance is combined with the standard deviation predicted by the model to give the uncertainty estimate:
$$
\hat{\Sigma}_{\phi}(s_{t},a_{t}) = \mathrm{Var}[\phi] + \Sigma_{\phi}(s_{t},a_{t}).
$$
The largest uncertainty estimate among the ensemble models is used as a regularization term on the predicted reward (MOPO uses the same form of penalty):
$$
\tilde{r}(s_{t},a_{t}) = \hat{r}(s_{t},a_{t}) - \lambda \max_{i=1,\dots,H} \big\|\hat{\Sigma}_{\phi^{i}}(s_{t},a_{t})\big\|_{\mathrm{F}}. \tag{14}
$$
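The penalty in Eq. (14) can be sketched as follows. The function names and the nested-list matrix layout are hypothetical, and the per-model uncertainty matrices are taken as given inputs rather than computed from a trained model.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def penalized_reward(r_hat, sigma_hats, lam):
    """Eq. (14): predicted reward minus lam times the largest Frobenius norm.

    `sigma_hats` holds one uncertainty matrix per ensemble member
    (hypothetical inputs for illustration).
    """
    penalty = max(frobenius_norm(m) for m in sigma_hats)
    return r_hat - lam * penalty


# Norms are 5.0 and 1.0, so the penalty uses 5.0: 1.0 - 0.1 * 5.0 = 0.5
reward = penalized_reward(
    1.0,
    [[[3.0, 0.0], [0.0, 4.0]], [[1.0, 0.0], [0.0, 0.0]]],
    lam=0.1,
)
```

Taking the maximum rather than the mean makes the penalty pessimistic: a single ensemble member that is uncertain about a state-action pair is enough to lower its reward.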
Policy constraint
The RL problem under the MMD constraint is formulated as follows:
$$
\begin{aligned}
\pi(\cdot\mid s_{t}) := \max_{\pi}\ & \mathbb{E}_{a_{t}\sim\pi(\cdot\mid s_{t})}\big[Q(s_{t},a_{t})\big] \\
\mathrm{s.t.}\quad & \mathbb{E}_{\tilde{s}_{t},\hat{s}_{t}\sim D}\Big[\mathbf{MMD}\big(\widehat{T}(\tilde{s}_{t+1}\mid\tilde{s}_{t},\pi(\cdot\mid\tilde{s}_{t})),\ \hat{s}_{t}\big)\Big] \leq \eta.
\end{aligned}
$$
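For intuition, the MMD term compares two sets of states: next states predicted by the learned model under the policy, and states drawn from the dataset. The sketch below uses the biased sample estimate of the squared MMD with a Gaussian kernel. The kernel choice, the bandwidth, and the function names are assumptions, since the notes do not say which kernel the paper uses.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""

    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 1-D states: one predicted by the model, one from the dataset.
predicted_states = [[0.0]]
dataset_states = [[1.0]]
gap = mmd_squared(predicted_states, dataset_states)  # 2 - 2 * exp(-0.5)
```

The estimate is zero when both sample sets coincide and grows as the predicted states drift away from the dataset states, which is the mismatch the constraint bounds by η.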
Using a Lagrange multiplier, this is converted into minimizing the following objective:
$$
L_{\pi} := -\mathbb{E}_{a_{t}\sim\pi(\cdot\mid s_{t})}\big[Q(s_{t},a_{t})\big] + \alpha\Big(\mathbb{E}_{\tilde{s}_{t},\hat{s}_{t}\sim D}\Big[\mathbf{MMD}\big(\widehat{T}(\tilde{s}_{t+1}\mid\tilde{s}_{t},\pi(\cdot\mid\tilde{s}_{t})),\ \hat{s}_{t}\big)\Big] - \eta\Big).
$$
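As a rough sketch, the Lagrangian objective reduces to a batch average of Q-values plus a weighted constraint violation. The function name `policy_loss` and the scalar inputs are hypothetical. The MMD value is assumed to be computed elsewhere, and how α is chosen or updated is not covered in these notes.

```python
def policy_loss(q_values, mmd_value, alpha, eta):
    """L_pi = -mean(Q) + alpha * (MMD - eta), with batch means as expectations.

    `q_values` are Q(s_t, a_t) for actions sampled from the policy;
    `mmd_value` is the batch estimate of the MMD term (both hypothetical inputs).
    """
    mean_q = sum(q_values) / len(q_values)
    return -mean_q + alpha * (mmd_value - eta)


# mean(Q) = 2.0 and the constraint is violated by 0.4: -2.0 + 2.0 * 0.4 = -1.2
loss = policy_loss([1.0, 3.0], mmd_value=0.5, alpha=2.0, eta=0.1)
```

When the MMD estimate is below η the second term is negative, so the constraint only pushes back on the policy once the state mismatch exceeds the threshold.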