Policy-gradient-based reinforcement learning algorithms

Advantages and disadvantages of policy-gradient methods compared with value-function methods:
Advantages:

  • Policy search parameterizes the policy directly; compared with parameterizing a value function, parameterizing the policy is simpler and converges more easily.
  • Value-function methods are hard to apply when the state space is very large or continuous.
  • Policy-based methods can use stochastic policies, and a stochastic policy builds exploration directly into the algorithm.

Disadvantages:

  • Policy-search methods tend to converge to local optima.
  • Evaluating an individual policy is difficult, and the estimates tend to have high variance.

[Figure: a complete MDP process]
The figure above shows a complete MDP process. For a complete sampled trajectory $\tau$, we have

$$p_{\theta}(\tau) = p(s_1)\prod_{t=1}^{T}p_{\theta}(a_t \mid s_t)\,p(s_{t+1} \mid s_t, a_t)$$
where $\theta$ denotes the policy parameters; a policy is fully determined by its parameters. In practice this mapping is usually represented by a neural network.
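As a concrete illustration (not from the original post), here is a minimal sketch that samples one trajectory under a tabular softmax policy; the weight matrix `theta`, the `softmax` helper, and the `env.reset()` / `env.step()` interface are all illustrative assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(env, theta, T):
    """Roll out one trajectory tau = (s_1, a_1, r_1, ..., s_T, a_T, r_T) under pi_theta.

    theta is a (n_states, n_actions) weight matrix for a tabular softmax policy,
    and env.reset() / env.step(a) is a hypothetical environment interface.
    Returns the trajectory and sum_t log pi_theta(a_t | s_t); the p(s_1) and
    p(s_{t+1} | s_t, a_t) factors of p_theta(tau) belong to the environment
    and do not depend on theta.
    """
    s = env.reset()
    traj, log_pi = [], 0.0
    for _ in range(T):
        probs = softmax(theta[s])                  # pi_theta(. | s_t)
        a = np.random.choice(len(probs), p=probs)  # a_t ~ pi_theta(. | s_t)
        log_pi += np.log(probs[a])
        s_next, r, done = env.step(a)              # environment supplies p(s_{t+1} | s_t, a_t)
        traj.append((s, a, r))
        s = s_next
        if done:
            break
    return traj, log_pi
```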
Having defined the probability of a sampled trajectory, we can now define the expected return:

$$\bar{R}_{\theta} = \sum_{\tau}R(\tau)\,p_{\theta}(\tau) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)]$$
The process is illustrated in the figure below.
[Figure: the expected-return computation]
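Summing over all trajectories is intractable in practice, so $\bar{R}_{\theta}$ is typically estimated by sampling $N$ trajectories from $p_{\theta}(\tau)$ and averaging their returns. A minimal sketch, reusing the hypothetical `sample_trajectory` helper from the snippet above:

```python
def estimate_expected_return(env, theta, N=100, T=200):
    """Monte Carlo estimate of R_bar(theta) = E_{tau ~ p_theta(tau)}[R(tau)]."""
    returns = []
    for _ in range(N):
        traj, _ = sample_trajectory(env, theta, T)
        returns.append(sum(r for (_, _, r) in traj))  # R(tau): total reward along the trajectory
    return float(np.mean(returns))
```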
Now that the expected return is expressed as a function of the policy, the objective is clear: we only need to optimize this function so that it is maximized. We can do so with gradient ascent (the ascent counterpart of the familiar gradient descent), which requires the gradient of $\bar{R}_{\theta}$ with respect to $\theta$:
$$\nabla \bar{R}_{\theta} = \sum_{\tau}R(\tau)\nabla p_{\theta}(\tau) = \sum_{\tau}R(\tau)\,p_{\theta}(\tau)\frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} = E_{\tau \sim p_{\theta}}\big[R(\tau)\nabla \log p_{\theta}(\tau)\big] \approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\,\nabla \log p_{\theta}(\tau^{n})$$

Since $p(s_1)$ and $p(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$, we have $\nabla \log p_{\theta}(\tau) = \sum_{t=1}^{T}\nabla \log p_{\theta}(a_t \mid s_t)$, so the gradient can be computed from the policy alone, without knowing the environment dynamics; the last step replaces the expectation with a Monte Carlo average over $N$ sampled trajectories.
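The final expression is a Monte Carlo (REINFORCE-style) estimator: sample $N$ trajectories and weight each trajectory's score function $\nabla \log p_{\theta}(\tau^{n})$ by its return $R(\tau^{n})$. Below is a minimal sketch under the same tabular-softmax assumptions as the earlier snippets; it uses the fact that for a softmax policy $\nabla_{\theta[s]} \log \pi_{\theta}(a \mid s) = \mathrm{onehot}(a) - \pi_{\theta}(\cdot \mid s)$:

```python
def policy_gradient_estimate(env, theta, N=20, T=200):
    """Estimate grad R_bar(theta) ~ (1/N) sum_n R(tau^n) * grad log p_theta(tau^n).

    Since p(s_1) and p(s_{t+1} | s_t, a_t) do not depend on theta,
    grad log p_theta(tau) = sum_t grad log pi_theta(a_t | s_t).
    """
    grad = np.zeros_like(theta)
    for _ in range(N):
        traj, _ = sample_trajectory(env, theta, T)
        R = sum(r for (_, _, r) in traj)    # R(tau^n)
        for (s, a, _) in traj:
            g = -softmax(theta[s])          # -pi_theta(. | s_t)
            g[a] += 1.0                     # onehot(a_t) - pi_theta(. | s_t)
            grad[s] += R * g
    return grad / N

# One step of gradient ascent on R_bar(theta):
# theta += learning_rate * policy_gradient_estimate(env, theta)
```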

Deterministic policy optimization has three advantages over stochastic policy optimization:

  1. Value estimates from on-policy samples have lower variance (zero when the system dynamics are deterministic).
  2. The policy gradient can be written in a simpler form, which makes it computationally more attractive.
  3. In some cases, a stochastic policy can lead to poor and unpredictable performance.

On the other hand, deterministic policy gradient methods require a good exploration strategy since, unlike stochastic policy gradient, they have no built-in mechanism for exploring the state space. In addition, in our experience stochastic policy gradient methods tend to solve a broader class of problems.
We suspect this is because stochastic policy gradient suffers from fewer of the local optima that are present in deterministic policy gradient.
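To make the exploration point concrete, here is a small sketch with a 1-D continuous action (the Gaussian action-noise scheme for the deterministic case is an illustrative assumption; DDPG-style methods inject noise of this kind, originally an Ornstein-Uhlenbeck process): a stochastic policy explores by sampling from its own action distribution, while a deterministic policy outputs a single action and must rely on externally added noise.

```python
import numpy as np

def act_stochastic(mean, std):
    """Gaussian stochastic policy pi_theta(a | s) = N(mean, std^2):
    exploration is built in, since the action is sampled from the policy itself."""
    return np.random.normal(mean, std)

def act_deterministic(mu, noise_scale=0.1):
    """Deterministic policy a = mu_theta(s): exploration must be added externally,
    here as simple Gaussian action noise (an illustrative choice)."""
    return mu + noise_scale * np.random.randn()
```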

Reference:
https://blog.youkuaiyun.com/weixin_41679411/article/details/82414400
https://blog.youkuaiyun.com/qq_29462849/article/details/82966672
