In a previous post, 深度强化学习(DRL)算法 1 —— REINFORCE, I introduced the REINFORCE algorithm and pointed out two of its drawbacks. The first of those drawbacks is addressed by a DRL algorithm proposed later, PPO, so today let's take a look at the famous PPO algorithm.
Recap
In REINFORCE, the policy that generates $\tau$ and the policy being learned are one and the same (on-policy), so each $\tau$ cannot be reused and training is very inefficient. Intuitively, what if we separate the two roles: one policy to generate $\tau$ and another to learn from it (off-policy)? Would that work? The answer is yes, and this is exactly what PPO does. Let's look at the algorithm (it is recommended to read 深度强化学习(DRL)算法 1 —— REINFORCE first; the rest of this post refers to it as Post 1).
Algorithm Description
Following Post 1, the expected return to be maximized is:
$$\bar{R}_{\theta} = E_{\tau\sim p_{\theta}(\tau)}[R(\tau)] = \sum_{\tau}p_{\theta}(\tau)R(\tau)$$
Now suppose we introduce a new policy $q$ that is used to generate $\tau$, while the original policy $p$ is the one being learned. We can then rewrite $\bar{R}_{\theta}$ as:
$$\bar{R}_{\theta} = \sum_{\tau}q_{\theta'}(\tau)\frac{p_{\theta}(\tau)}{q_{\theta'}(\tau)}R(\tau) = E_{\tau\sim q_{\theta'}(\tau)}\left[\frac{p_{\theta}(\tau)}{q_{\theta'}(\tau)}R(\tau)\right]$$
(Importance Sampling)
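To make the importance-sampling identity concrete, here is a minimal NumPy sketch. The toy trajectory space and the names `p_probs`, `q_probs`, and `returns` are illustrative assumptions, not part of the original post: it estimates $E_{\tau\sim p_{\theta}}[R(\tau)]$ from samples drawn under $q_{\theta'}$, reweighted by $p_{\theta}(\tau)/q_{\theta'}(\tau)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy discrete "trajectory" space with 3 possible trajectories.
p_probs = np.array([0.2, 0.5, 0.3])   # target policy p_theta(tau)
q_probs = np.array([0.4, 0.4, 0.2])   # behaviour policy q_theta'(tau)
returns = np.array([1.0, 3.0, -2.0])  # R(tau) for each trajectory

# Exact expectation under p: sum_tau p(tau) R(tau)
exact = np.sum(p_probs * returns)

# Monte-Carlo estimate using samples from q, corrected by the ratio p/q.
samples = rng.choice(3, size=100_000, p=q_probs)
weights = p_probs[samples] / q_probs[samples]
is_estimate = np.mean(weights * returns[samples])

print(exact, is_estimate)  # the two values should be close
```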
For brevity, let $g$ denote the $\nabla \bar{R}(\theta)$ from Post 1. The new $g$ is then:
$$
\begin{aligned}
g &= \nabla_{\theta} \sum_{\tau}q_{\theta'}(\tau)\frac{p_{\theta}(\tau)}{q_{\theta'}(\tau)}R(\tau) \\
&= \frac{1}{m}\sum_{i=1}^{m}R(\tau^{(i)})\sum_{t=1}^{T}\frac{p_{\theta}(\tau^{(i)})}{q_{\theta'}(\tau^{(i)})}\,\nabla_{\theta}\log p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)}) \\
&= \frac{1}{m}\sum_{i=1}^{m}R(\tau^{(i)})\sum_{t=1}^{T}\frac{p_{\theta}(\tau^{(i)})}{q_{\theta'}(\tau^{(i)})}\,\frac{\nabla_{\theta}p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})}{p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})} \\
&= \frac{1}{m}\sum_{i=1}^{m}R(\tau^{(i)})\sum_{t=1}^{T}\frac{\prod_{t'=0}^{T}p_{\theta}(a_{t'}^{(i)}|s_{t'}^{(i)})\,p(s_{t'+1}^{(i)}|s_{t'}^{(i)},a_{t'}^{(i)})}{\prod_{t'=0}^{T}q_{\theta'}(a_{t'}^{(i)}|s_{t'}^{(i)})\,q(s_{t'+1}^{(i)}|s_{t'}^{(i)},a_{t'}^{(i)})}\,\frac{\nabla_{\theta}p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})}{p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})} \\
&\approx \frac{1}{m}\sum_{i=1}^{m}R(\tau^{(i)})\sum_{t=1}^{T}\frac{\nabla_{\theta}p_{\theta}(a_{t}^{(i)}|s_{t}^{(i)})}{q_{\theta'}(a_{t}^{(i)}|s_{t}^{(i)})}
\end{aligned}
$$

In the last step, the environment transition terms $p(s_{t'+1}|s_{t'},a_{t'})$ and $q(s_{t'+1}|s_{t'},a_{t'})$ are identical (the dynamics do not depend on the policy) and cancel, and the remaining product of action-probability ratios is approximated by the single ratio at step $t$, which is reasonable as long as the two policies do not differ too much.
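One way to read the last line is as a surrogate objective: since $\nabla_{\theta}p_{\theta}/q_{\theta'} = (p_{\theta}/q_{\theta'})\nabla_{\theta}\log p_{\theta}$, differentiating $\frac{1}{m}\sum_i R(\tau^{(i)})\sum_t p_{\theta}(a_t^{(i)}|s_t^{(i)})/q_{\theta'}(a_t^{(i)}|s_t^{(i)})$ with respect to $\theta$ (treating $q_{\theta'}$ as fixed) yields exactly $g$. Below is a minimal PyTorch sketch under that reading; the tensor names (`logp_new`, `logp_old`, `returns`) and shapes are illustrative, and the actor network and data collection are omitted.

```python
import torch

def surrogate_loss(logp_new, logp_old, returns):
    """Surrogate whose gradient w.r.t. theta equals g above.

    logp_new : (m, T) log p_theta(a_t | s_t), tracks gradients
    logp_old : (m, T) log q_theta'(a_t | s_t), from the sampling policy
    returns  : (m,)   R(tau^(i)) for each sampled trajectory
    """
    # ratio_t = p_theta(a_t|s_t) / q_theta'(a_t|s_t); gradients flow only through logp_new
    ratio = torch.exp(logp_new - logp_old.detach())
    # objective = (1/m) * sum_i R(tau_i) * sum_t ratio_t ; negate so we can minimise
    return -(returns * ratio.sum(dim=1)).mean()

# Illustrative shapes only: 4 trajectories of length 5 from a hypothetical actor.
logp_new = torch.randn(4, 5, requires_grad=True)
logp_old = torch.randn(4, 5)
returns = torch.randn(4)

loss = surrogate_loss(logp_new, logp_old, returns)
loss.backward()  # logp_new.grad now holds the importance-weighted policy gradient
```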
