Reinforcement Learning: Trust Region Policy Optimization (TRPO)
1 Trust Region Algorithm
problem:
$\theta^\star=\mathop{argmax}\limits_{\theta} J(\theta)$
repeat:
- Approximation: given $\theta_{old}$, construct $L(\theta|\theta_{old})$ to approximate $J(\theta)$, where $\theta$ is restricted to a trust region around $\theta_{old}$, denoted $N(\theta_{old})$.
- Maximization: within the trust region, find the optimized parameters $\theta_{new}=\mathop{argmax}\limits_{\theta\in N(\theta_{old})}L(\theta|\theta_{old})$ (a minimal numerical sketch of this loop follows the list).
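As a concrete illustration of this approximate-then-maximize loop, here is a minimal Python sketch on a toy one-dimensional problem. Everything in it is a made-up stand-in (the objective $J(\theta)=-(\theta-3)^2$, the interval-shaped trust region, and the grid search) for whatever surrogate and constrained solver a real algorithm would use.

```python
import numpy as np

def trust_region_optimize(theta_init, build_surrogate, maximize_within, num_iters=50):
    """Generic trust-region loop: build L(theta | theta_old) around theta_old,
    then maximize it inside the trust region N(theta_old)."""
    theta_old = theta_init
    for _ in range(num_iters):
        L = build_surrogate(theta_old)             # Approximation step
        theta_old = maximize_within(L, theta_old)  # Maximization step
    return theta_old

# Toy example: J(theta) = -(theta - 3)^2. The "surrogate" here is J itself, so
# each iteration simply moves toward the maximizer, but never farther than the
# trust-region radius delta.
def build_surrogate(theta_old):
    return lambda theta: -(theta - 3.0) ** 2

def maximize_within(L, theta_old, delta=0.5):
    # Trust region N(theta_old) = [theta_old - delta, theta_old + delta];
    # maximize the surrogate by brute-force grid search over that interval.
    grid = np.linspace(theta_old - delta, theta_old + delta, 101)
    return grid[np.argmax([L(t) for t in grid])]

print(trust_region_optimize(0.0, build_surrogate, maximize_within))  # ≈ 3.0
```

In TRPO the same two steps reappear: the surrogate is built from importance sampling under the old policy (next section), and the trust region is usually measured by the distance between the old and new policies rather than by a raw parameter interval.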
2 Trust Region Policy Optimization (TRPO)
- state-value function:
$V_{\pi}(s)=\sum_{a}\pi(a|s;\theta)\cdot Q_\pi(s,a)=E_{A\sim\pi}[Q_\pi(s,A)]$
- objective function:
$J(\theta)=E_S[V_\pi(S)]$
- approximation:
$V_\pi(s)=\sum_a\frac{\pi(a|s;\theta)}{\pi(a|s;\theta_{old})}\cdot Q_\pi(s,a)\cdot \pi(a|s;\theta_{old})=E_{A\sim\pi(\cdot|s;\theta_{old})}\left[\frac{\pi(A|s;\theta)}{\pi(A|s;\theta_{old})}\cdot Q_\pi(s,A)\right]$ (a small numerical check of this identity is sketched below)
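The point of this rewrite is that the expectation is now taken under the old policy, so $V_\pi(s)$, and hence the surrogate $L(\theta|\theta_{old})$, can be estimated from actions sampled with $\theta_{old}$ and re-weighted by the probability ratio. Below is a small NumPy check of the identity on a made-up single-state, three-action example; the logits, Q-values, and softmax policy are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Hypothetical single-state, 3-action example: theta parameterizes action
# logits, and q_values stands in for Q_pi(s, a).
q_values = np.array([1.0, 2.0, 0.5])
theta_old = np.array([0.1, 0.0, -0.1])
theta_new = np.array([0.3, 0.4, -0.2])

pi_old = softmax(theta_old)
pi_new = softmax(theta_new)

# Exact value under the new policy: V_pi(s) = sum_a pi(a|s;theta) * Q_pi(s,a)
v_exact = np.sum(pi_new * q_values)

# Importance-sampling estimate using actions drawn from the OLD policy:
# V_pi(s) = E_{A ~ pi_old}[ pi(A|s;theta) / pi(A|s;theta_old) * Q_pi(s, A) ]
actions = rng.choice(len(q_values), size=100_000, p=pi_old)
ratios = pi_new[actions] / pi_old[actions]
v_estimate = np.mean(ratios * q_values[actions])

print(v_exact, v_estimate)  # the two numbers should agree closely
```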

This article details the Trust Region Policy Optimization (TRPO) algorithm. Its core idea: starting from the old policy parameters $\theta_{old}$, build a surrogate $L(\theta|\theta_{old})$ that approximates $J(\theta)$, then maximize it within the trust region $N(\theta_{old})$ to obtain the updated parameters $\theta_{new}$. The material covers the state-value function, the Monte Carlo approximation of the objective function, and the practical optimization steps.
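For reference, the Monte Carlo step mentioned above is usually implemented by collecting a trajectory with the old policy and averaging the importance-weighted returns. A hedged sketch, where `policy_prob(theta, s, a)` is a hypothetical helper returning $\pi(a|s;\theta)$ and the observed discounted returns $u_i$ stand in for $Q_\pi(s_i,a_i)$:

```python
import numpy as np

def surrogate_L(theta, theta_old, states, actions, returns, policy_prob):
    """Monte Carlo estimate of L(theta | theta_old) from one trajectory
    (s_1, a_1, ..., s_n, a_n) collected under pi(.|.; theta_old).
    `returns[i]` is the observed discounted return u_i, used in place of
    Q_pi(s_i, a_i)."""
    ratios = np.array([policy_prob(theta, s, a) / policy_prob(theta_old, s, a)
                       for s, a in zip(states, actions)])
    return float(np.mean(ratios * np.asarray(returns)))

# The maximization step then solves, e.g. with a few gradient steps plus a
# line search:  theta_new = argmax_theta surrogate_L(theta, theta_old, ...)
# subject to theta staying inside the trust region N(theta_old), commonly a
# bound on the distance (or KL divergence) between the old and new policies.
```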