Reinforcement Learning: Policy Gradient with a Baseline (REINFORCE and A2C)

This post covers the fundamentals of the REINFORCE algorithm, the derivation of the policy gradient, and the A2C algorithm's network structure, training procedure, and underlying math. The focus is on the choice of baseline in the policy gradient, the Monte Carlo approximation, and the one-step and multi-step TD targets in A2C. Finally, REINFORCE and A2C are compared, and REINFORCE is shown to be a special case of A2C.

1. Derivation of the baseline

  • Policy network: $\pi(a|s;\theta)$
  • State-value function: $V_\pi(s)=E_{A\sim\pi}[Q_\pi(A,s)]=\sum_a\pi(a|s;\theta)\cdot Q_\pi(a,s)$
  • Policy gradient: $\frac{\partial V_\pi(s)}{\partial \theta}=E_{A\sim\pi}\left[Q_\pi(A,s)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]$
  • Let $b$ be any quantity that does not depend on the action $A$. Then
    $$
    \begin{aligned}
    E_{A\sim\pi}\left[b\cdot \frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]
    &= b\cdot E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right] \\
    &= b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{\partial \log \pi(a|s;\theta)}{\partial \theta} \\
    &= b\cdot \sum_a \pi(a|s;\theta)\cdot \frac{1}{\pi(a|s;\theta)}\cdot \frac{\partial \pi(a|s;\theta)}{\partial \theta} \\
    &= b\cdot \frac{\partial \sum_a \pi(a|s;\theta)}{\partial \theta}
     = b\cdot \frac{\partial 1}{\partial \theta} = 0.
    \end{aligned}
    $$
    Hence, if $b$ is independent of the action $A$, then $E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]=0$.
  • The policy gradient with a baseline is therefore
    $$
    \begin{aligned}
    \frac{\partial V_\pi(s)}{\partial \theta}
    &=E_{A\sim\pi}\left[Q_\pi(A,s)\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right]-E_{A\sim\pi}\left[b\cdot\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\right] \\
    &=E_{A\sim\pi}\left[\frac{\partial \log \pi(A|s;\theta)}{\partial \theta}\cdot\big(Q_\pi(A,s)-b\big)\right].
    \end{aligned}
    $$
    The baseline $b$ does not change the expectation, but a well-chosen $b$ lowers the variance of the Monte Carlo approximation and speeds up convergence; a numerical check of this invariance is sketched after this list.
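
The zero-expectation property of the baseline term can be checked numerically. The following is a minimal NumPy sketch, not part of the original derivation: the softmax policy, the logits `theta`, the values `Q`, and the baseline `b = 3.7` are all made-up toy quantities. It computes the exact expected gradient with and without the baseline and confirms that they coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 4 discrete actions, softmax policy with logits theta.
n_actions = 4
theta = rng.normal(size=n_actions)   # policy parameters (logits)
Q = rng.normal(size=n_actions)       # arbitrary stand-ins for Q_pi(a, s)
b = 3.7                              # any baseline independent of the action

def pi(theta):
    """Softmax policy pi(a|s; theta)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """d log pi(a|s; theta) / d theta for a softmax policy: e_a - pi."""
    g = -pi(theta)
    g[a] += 1.0
    return g

p = pi(theta)
# Exact expectations over A ~ pi of (Q(A) - b) * grad log pi(A)  vs.  Q(A) * grad log pi(A).
grad_with_baseline = sum(p[a] * (Q[a] - b) * grad_log_pi(theta, a) for a in range(n_actions))
grad_without_baseline = sum(p[a] * Q[a] * grad_log_pi(theta, a) for a in range(n_actions))

# The two expected gradients coincide, i.e. E[b * grad log pi] = 0.
print(np.allclose(grad_with_baseline, grad_without_baseline))  # True
```

Because the policy probabilities sum to one, the baseline contribution cancels exactly, which is just the $b\cdot\frac{\partial 1}{\partial \theta}=0$ step in the derivation above.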

2. Monte Carlo approximation of the policy gradient

  • The policy gradient with a baseline at time $t$ is $\frac{\partial V_\pi(s_t)}{\partial \theta}=E_{A_t\sim\pi}\left[\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(A_t,s_t)-b\big)\right]$; define $g(A_t)=\frac{\partial \log \pi(A_t|s_t;\theta)}{\partial \theta}\cdot\big(Q_\pi(A_t,s_t)-b\big)$
  • Sample the action at time $t$ from the policy: $a_t\sim\pi(\cdot|s_t;\theta)$
  • Then $g(a_t)$ is an unbiased estimate of the policy gradient.
  • Stochastic policy gradient: $g(a_t)=\big(Q_\pi(s_t,a_t)-b\big)\cdot\frac{\partial \log \pi(a_t|s_t;\theta)}{\partial \theta}$; a sketch of one such stochastic update follows this list.
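
To make the Monte Carlo step concrete, here is a minimal PyTorch-style sketch of one stochastic gradient ascent step along $g(a_t)$. The network architecture, the helper `policy_gradient_step`, and the way `q_estimate` and `baseline` are obtained are illustrative assumptions: in REINFORCE, `q_estimate` would be the observed discounted return, and the baseline could come from a learned state-value network as in A2C.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed architecture): a small softmax policy pi(a|s; theta)
# over a discrete action space.
state_dim, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(s_t, q_estimate, baseline):
    """One stochastic ascent step along g(a_t) = (Q_pi(s_t,a_t) - b) * d log pi(a_t|s_t;theta) / d theta."""
    logits = policy(s_t)
    dist = torch.distributions.Categorical(logits=logits)
    a_t = dist.sample()                     # a_t ~ pi(.|s_t; theta)
    advantage = q_estimate - baseline       # (Q_pi(s_t, a_t) - b), treated as a constant
    # Minimizing -advantage * log pi(a_t|s_t; theta) is gradient ascent on V_pi(s_t).
    loss = -(advantage.detach() * dist.log_prob(a_t))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return a_t

# Example call with made-up numbers for the Q estimate and the baseline:
a = policy_gradient_step(torch.randn(state_dim), torch.tensor(2.0), torch.tensor(0.5))
```

Since $g(a_t)$ is an unbiased estimate of $\frac{\partial V_\pi(s_t)}{\partial \theta}$, repeating such updates performs stochastic gradient ascent on the state value.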