Reference: the book 王树森《深度强化学习》 (Wang Shusen, *Deep Reinforcement Learning*) and its accompanying code
1、Trust Region Algorithm
1. Approximation: given $\theta_{\mathrm{old}}$, construct $L(\theta \mid \theta_{\mathrm{old}})$ as an approximation to $J(\theta)$ in the neighborhood $\mathcal{N}(\theta_{\mathrm{old}})$.
2. Maximization: within the trust region, find $\theta_{\mathrm{new}}$ by
$$\theta_{\mathrm{new}} \leftarrow \underset{\theta \in \mathcal{N}(\theta_{\mathrm{old}})}{\operatorname{argmax}}\, L(\theta \mid \theta_{\mathrm{old}}).$$
Note: the approximation $L$ can be built with, e.g., a second-order Taylor expansion or a Monte Carlo approximation; any method may be used to maximize $L$ within the trust region (a concrete sketch follows).
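As an illustration only (not the book's code), here is a minimal Python sketch of one trust-region iteration under the simplest concrete choices: a second-order Taylor surrogate $L(\theta\mid\theta_{\mathrm{old}}) = g^{\top}d - \tfrac12 d^{\top}Hd$ with $d=\theta-\theta_{\mathrm{old}}$, maximized inside an $\ell_2$ ball. The function name `trust_region_step`, the step-scaling shortcut, and the toy numbers are all assumptions.

```python
import numpy as np

def trust_region_step(theta_old, grad, hess, radius):
    """One trust-region update for a local quadratic surrogate (sketch).

    Surrogate: L(theta | theta_old) = g^T d - 0.5 * d^T H d, d = theta - theta_old,
    i.e. a second-order Taylor model of J around theta_old (H positive definite).
    The unconstrained maximizer is d* = H^{-1} g; if it leaves the ball
    ||d|| <= radius, the step is scaled back onto the boundary -- a common
    practical shortcut, not the exact constrained solution.
    """
    d = np.linalg.solve(hess, grad)      # unconstrained maximizer of L
    norm = np.linalg.norm(d)
    if norm > radius:                    # keep theta inside N(theta_old)
        d *= radius / norm
    return theta_old + d

# Toy usage with hand-picked numbers (illustrative only).
theta_old = np.zeros(2)
grad = np.array([1.0, 0.5])                  # gradient of J at theta_old
hess = np.array([[2.0, 0.0], [0.0, 1.0]])    # local curvature estimate
theta_new = trust_region_step(theta_old, grad, hess, radius=0.1)
```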
2、Derivation of the Objective Function
Use a policy network, $\pi(a \mid s; \theta)$, for controlling the agent.
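For concreteness, here is a minimal sketch of such a discrete-action policy network in PyTorch; the class name `PolicyNet`, the layer sizes, and the dummy dimensions are assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi(a | s; theta): maps a state to a probability distribution over actions."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)   # action probabilities

# Control the agent by sampling A ~ pi(. | s; theta).
policy = PolicyNet(state_dim=4, action_dim=2)
state = torch.randn(1, 4)                      # a dummy observation
probs = policy(state)
action = torch.distributions.Categorical(probs=probs).sample()
```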
State-value function:
$$
\begin{aligned}
V_{\pi}(s) &= \sum_{a} \pi(a \mid s; \theta) \cdot Q_{\pi}(s, a) \\
&= \sum_{a} \pi(a \mid s; \theta_{\mathrm{old}}) \cdot \frac{\pi(a \mid s; \theta)}{\pi(a \mid s; \theta_{\mathrm{old}})} \cdot Q_{\pi}(s, a) \\
&= \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta_{\mathrm{old}})}\!\left[ \frac{\pi(A \mid s; \theta)}{\pi(A \mid s; \theta_{\mathrm{old}})} \cdot Q_{\pi}(s, A) \right].
\end{aligned}
$$

The second line multiplies and divides by $\pi(a \mid s; \theta_{\mathrm{old}})$, which lets the sum be read as an expectation over actions drawn from the old policy (importance sampling).
Objective function:
$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{S}\big[ V_{\pi}(S) \big] \\
&= \mathbb{E}_{S}\!\left[ \mathbb{E}_{A}\!\left[ \frac{\pi(A \mid S; \theta)}{\pi(A \mid S; \theta_{\mathrm{old}})} \cdot Q_{\pi}(S, A) \right] \right].
\end{aligned}
$$
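A hedged sketch of how this objective is typically estimated in practice: sample states and actions with the old policy, then average the ratio-weighted value estimates. The helper name `surrogate_objective` and the assumption that `q_values` (estimates of $Q_{\pi}(S,A)$, e.g. from discounted returns) are already available are mine, not the book's code; the networks are assumed to return action probabilities as in the `PolicyNet` sketch above.

```python
import torch

def surrogate_objective(policy, policy_old, states, actions, q_values):
    """Monte Carlo estimate of E[ pi(A|S;theta) / pi(A|S;theta_old) * Q_pi(S,A) ].

    states / actions are collected by the old policy; q_values holds estimates
    of Q_pi(S, A) (how they are computed is outside this sketch).
    """
    probs_new = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                    # theta_old is held fixed
        probs_old = policy_old(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = probs_new / probs_old            # importance-sampling ratio
    return (ratio * q_values).mean()         # sample average over (S, A)
```

In a trust-region method this estimate plays the role of $L(\theta \mid \theta_{\mathrm{old}})$, and the maximization is then carried out subject to the constraint that the new policy stays close to the old one.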

Following the book 《深度强化学习》 and its code, this article introduces the trust region algorithm, including the approximation and maximization steps, and derives the objective function. It then develops the TRPO algorithm in detail, covering the objective function, Monte Carlo approximation, and discounted returns, explains how the new parameters are found within the trust region, and ends with the code implementation.