Reinforcement Learning: Policy Gradient with a Baseline (REINFORCE and A2C)
1. Derivation of the baseline
- The policy network is $\pi(a|s;\theta)$.
- The state-value function is
$$V_\pi(s)=E_{A\sim\pi}\left[Q_\pi(s,A)\right]=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a).$$
- The policy gradient is
$$\frac{\partial V_\pi(s)}{\partial \theta}=E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right].$$
- Let $b$ be any quantity that does not depend on the action $A$. Then
$$\begin{aligned}
E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]
&=b\cdot E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]\\
&=b\cdot \sum_a \pi(a|s;\theta)\cdot\frac{\partial \log \pi(a|s;\theta)}{\partial \theta}\\
&=b\cdot \sum_a \pi(a|s;\theta)\cdot\frac{1}{\pi(a|s;\theta)}\cdot\frac{\partial \pi(a|s;\theta)}{\partial \theta}\\
&=b\cdot \frac{\partial \sum_a \pi(a|s;\theta)}{\partial \theta}
=b\cdot \frac{\partial 1}{\partial \theta}
=0.
\end{aligned}$$
Therefore, whenever $b$ is independent of the action $A$, $E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=0$.
- The policy gradient with a baseline is therefore
$$\frac{\partial V_\pi(s)}{\partial \theta}
=E_{A\sim\pi}\left[Q_\pi(s,A)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]
-E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]
=E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\cdot\left(Q_\pi(s,A)-b\right)\right].$$
The baseline $b$ does not change the expectation, but a well-chosen $b$ reduces the variance of the Monte Carlo approximation of this gradient and speeds up convergence; the numerical sketch below illustrates this.
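The following is a minimal numerical sketch, not from the original post, that checks the baseline identity: a toy softmax policy over three actions with made-up Q values. The names (`theta`, `Q`, `b`) are illustrative assumptions. Subtracting a constant $b$ leaves the exact expected gradient unchanged, while the Monte Carlo estimate of the gradient has visibly smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.5])      # policy parameters (logits), made up
Q = np.array([1.0, 3.0, 2.0])           # hypothetical Q_pi(s, a) values
b = Q.mean()                            # a baseline independent of the action A

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a):
    # d/dtheta log softmax(theta)[a] = one_hot(a) - softmax(theta)
    g = -softmax(theta)
    g[a] += 1.0
    return g

pi = softmax(theta)

# Exact expectations over A ~ pi: identical with and without the baseline
exact_plain = sum(pi[a] * Q[a] * grad_log_pi(a) for a in range(3))
exact_base = sum(pi[a] * (Q[a] - b) * grad_log_pi(a) for a in range(3))
print(exact_plain, exact_base)

# Monte Carlo estimates: same mean, but smaller variance with the baseline
plain = np.array([Q[a] * grad_log_pi(a) for a in rng.choice(3, 10000, p=pi)])
based = np.array([(Q[a] - b) * grad_log_pi(a) for a in rng.choice(3, 10000, p=pi)])
print(plain.var(axis=0).sum(), based.var(axis=0).sum())
```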
2. Monte Carlo approximation of the policy gradient
- The policy gradient with a baseline at time step $t$ is
$$\frac{\partial V_\pi(s_t)}{\partial \theta}=E_{A_t\sim\pi}\left[\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\left(Q_\pi(s_t,A_t)-b\right)\right],$$
and we define
$$g(A_t)=\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\left(Q_\pi(s_t,A_t)-b\right).$$
- Sample the action at time $t$ from the policy: $a_t\sim\pi(\cdot|s_t;\theta)$.
- Then $g(a_t)$ is an unbiased estimate of the policy gradient.
- The stochastic policy gradient is
$$g(a_t)=\left(Q_\pi(s_t,a_t)-b\right)\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta},$$
which is the quantity a single update ascends along; a minimal implementation sketch follows this list.
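Below is a minimal sketch of one update driven by the sampled gradient $g(a_t)$; the network, the function name, and the numbers in the example call are illustrative assumptions, not code from the original post. Here `q_value` and `baseline` stand in for $Q_\pi(s_t,a_t)$ and $b$; how these two quantities are estimated (a Monte Carlo return in REINFORCE versus a critic in A2C) is what distinguishes the two algorithms.

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def policy_gradient_step(state, q_value, baseline):
    logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
    dist = torch.distributions.Categorical(logits=logits)
    a_t = dist.sample()                              # a_t ~ pi(.|s_t; theta)
    # Minimizing -(Q - b) * log pi(a_t|s_t; theta) performs gradient ascent
    # on V_pi(s_t) along the sampled direction g(a_t).
    loss = -(q_value - baseline) * dist.log_prob(a_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return a_t.item()

# Example call with made-up numbers for the state, the Q estimate, and the baseline
policy_gradient_step([0.1, 0.0, -0.2, 0.3], q_value=1.5, baseline=0.8)
```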

This post walks through the fundamentals of the REINFORCE algorithm and the derivation of the policy gradient, as well as the network architecture, training procedure, and underlying mathematics of A2C. It focuses on the choice of baseline in the policy gradient, its Monte Carlo approximation, and A2C's one-step and multi-step TD targets. Finally, it compares REINFORCE with A2C and shows that REINFORCE is a special case of A2C.
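As a hedged sketch of the one-step TD target mentioned above, under the usual A2C setup: a critic $V(s;w)$ acts as the baseline, and the TD error $\delta_t=r_t+\gamma V(s_{t+1};w)-V(s_t;w)$ takes the place of $Q_\pi(s_t,a_t)-b$ when weighting the actor's log-probability term. The `value_net` and function name below are illustrative assumptions; the multi-step variant would simply accumulate several discounted rewards before bootstrapping.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

def one_step_td(s_t, r_t, s_next, done, gamma=0.99):
    v_t = value_net(torch.as_tensor(s_t, dtype=torch.float32))
    with torch.no_grad():
        v_next = torch.zeros(1) if done else value_net(torch.as_tensor(s_next, dtype=torch.float32))
        td_target = r_t + gamma * v_next             # y_t = r_t + gamma * V(s_{t+1})
    advantage = (td_target - v_t).detach()           # weights the actor's log-prob term
    critic_loss = (v_t - td_target).pow(2).mean()    # regress V(s_t) toward y_t
    return advantage, critic_loss
```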