Reinforcement Learning: Trust Region Policy Optimization (TRPO)

Latest recommended article published 2024-04-17 13:17:04

Original

License: CC 4.0 BY-SA

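This article introduces the Trust Region Policy Optimization (TRPO) algorithm in detail. Its core idea: given the old policy parameters $\theta_{old}$, construct a function $L(\theta|\theta_{old})$ that approximates $J(\theta)$ within a trust region around $\theta_{old}$, then search that region for the improved parameters $\theta_{new}$. The article covers the state-value function, the Monte Carlo approximation of the objective function, and the practical optimization steps.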

1 Trust Region Algorithm

Problem:
$\theta^\star=\mathop{\mathrm{argmax}}\limits_{\theta} J(\theta)$
Repeat:

Approximation: given $\theta_{old}$, construct $L(\theta|\theta_{old})$ to approximate $J(\theta)$. The approximation only needs to be accurate for $\theta$ inside the trust region of $\theta_{old}$, that is, the neighborhood $N(\theta_{old})$.
Maximization: within the trust region, find the improved $\theta$: $\theta_{new}=\mathop{\mathrm{argmax}}\limits_{\theta\in N(\theta_{old})}L(\theta|\theta_{old})$
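The two-step loop above can be sketched on a toy one-dimensional problem. This is only an illustration under assumptions that are not in the original text: the objective `J` is a made-up concave function, `L` is taken to be a second-order Taylor expansion, and the trust region $N(\theta_{old})$ is an interval of fixed radius. All function names are hypothetical.

```python
import math

def J(theta):
    # Toy objective (an assumption for illustration); maximized at theta = 2.
    return -math.cosh(theta - 2.0)

def build_L(theta_old):
    # Approximation step: second-order Taylor expansion of J around theta_old.
    g = -math.sinh(theta_old - 2.0)  # J'(theta_old)
    h = -math.cosh(theta_old - 2.0)  # J''(theta_old), always negative here
    def L(theta):
        d = theta - theta_old
        return J(theta_old) + g * d + 0.5 * h * d * d
    return L, g, h

def maximize_in_region(theta_old, g, h, radius):
    # Maximization step: L is a concave quadratic, so its unconstrained
    # maximizer is theta_old - g / h. Clipping it to the interval
    # N(theta_old) = [theta_old - radius, theta_old + radius] gives the
    # maximizer of L inside the trust region.
    candidate = theta_old - g / h
    return min(max(candidate, theta_old - radius), theta_old + radius)

def trust_region_maximize(theta0, radius=0.5, iters=20):
    theta = theta0
    history = [theta]
    for _ in range(iters):
        _, g, h = build_L(theta)                          # Approximation
        theta = maximize_in_region(theta, g, h, radius)   # Maximization
        history.append(theta)
    return theta, history

theta_star, history = trust_region_maximize(0.0)
print(theta_star)
```

Each iterate moves at most `radius` away from the previous one, so the local model `L` is only ever trusted near the point where it was built.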

State-value function:
$V_{\pi}(s)=\sum_{a}\pi(a|s;\theta)\cdot Q_\pi(s,a)=E_{A\sim\pi(\cdot|s;\theta)}[Q_\pi(s,A)]$
Objective function:
$J(\theta)=E_S[V_\pi(S)]$
Approximation (multiply and divide by $\pi(a|s;\theta_{old})$ so that the sum becomes an expectation over actions drawn from the old policy):
$V_\pi(s)=\sum_a\frac{\pi(a|s;\theta)}{\pi(a|s;\theta_{old})}\cdot Q_\pi(s,a)\cdot \pi(a|s;\theta_{old})=E_{A\sim\pi(\cdot|s;\theta_{old})}\left[\frac{\pi(A|s;\theta)}{\pi(A|s;\theta_{old})}\cdot Q_\pi(s,A)\right]$
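The rewriting of $V_\pi(s)$ as an expectation under the old policy can be checked numerically. The sketch below uses made-up probabilities and action values for a single fixed state with three actions; none of the numbers or function names come from the original text. It compares the direct sum, the exact reweighted sum, and a Monte Carlo estimate that samples actions from the old policy.

```python
import random

# Hypothetical numbers for one fixed state s with three actions.
pi_new = [0.2, 0.5, 0.3]  # pi(a|s; theta)
pi_old = [0.4, 0.4, 0.2]  # pi(a|s; theta_old), must be > 0 wherever pi_new > 0
q = [1.0, 2.0, 3.0]       # Q_pi(s, a)

def value_direct(pi, q):
    # V_pi(s) = sum_a pi(a|s; theta) * Q_pi(s, a)
    return sum(p * qa for p, qa in zip(pi, q))

def value_reweighted(pi_new, pi_old, q):
    # Exact expectation under the old policy of (ratio * Q).
    return sum(po * (pn / po) * qa for pn, po, qa in zip(pi_new, pi_old, q))

def value_monte_carlo(pi_new, pi_old, q, n, seed=0):
    # Sample actions from the old policy and average ratio * Q.
    rng = random.Random(seed)
    actions = rng.choices(range(len(q)), weights=pi_old, k=n)
    return sum(pi_new[a] / pi_old[a] * q[a] for a in actions) / n

v_direct = value_direct(pi_new, q)
v_exact = value_reweighted(pi_new, pi_old, q)
v_mc = value_monte_carlo(pi_new, pi_old, q, n=100_000)
print(v_direct, v_exact, v_mc)
```

The first two values agree up to floating-point rounding, while the Monte Carlo estimate fluctuates around them with an error that shrinks as `n` grows.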