Content of the previous lecture and this lecture
Specifically:
Finite Difference Policy Gradient
Monte-Carlo Policy Gradient
Actor-Critic Policy Gradient
Differences and connections:
Advantages of Policy-Based RL:
Better convergence properties
Effective in high-dimensional or continuous action spaces
Can learn stochastic policies (the slides include an Aliased Gridworld example that illustrates this well)
Disadvantages of Policy-Based RL:
Typically converge to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance
Mathematical formulation of the Policy-Gradient RL problem:
All three methods above estimate the same policy gradient; the approach is as follows (essentially, finding the optimal θ by gradient ascent on a policy objective).
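As a reminder of what "optimal θ" means here, a sketch of the standard setup (paraphrased from the lecture's notation): J(θ) can be the start value, the average value, or the average reward per time-step, where d^{πθ}(s) is the stationary distribution induced by πθ, and θ is then improved by gradient ascent:

$$J_1(\theta) = V^{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_s^a$$

$$\Delta\theta = \alpha\, \nabla_\theta J(\theta)$$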
Finite Difference Policy Gradient (the same numerical-perturbation idea we usually see in gradient checking):
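A minimal sketch of the idea, assuming the objective J(θ) can be evaluated (noisily) by rolling out the policy; the function name and step size below are my own illustration, not from the lecture. Each of the n parameter dimensions is perturbed in turn, so one gradient estimate costs n extra evaluations, but the method works even for non-differentiable policies.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate dJ/dtheta by perturbing each parameter dimension in turn.

    J     : callable returning a (possibly noisy) estimate of the policy
            objective for a given parameter vector, e.g. average return
            over a few rollouts of the policy.
    theta : 1-D numpy array of policy parameters.
    """
    grad = np.zeros_like(theta, dtype=float)
    base = J(theta)
    for k in range(len(theta)):
        unit = np.zeros_like(theta, dtype=float)
        unit[k] = eps                       # perturb only the k-th component
        grad[k] = (J(theta + unit) - base) / eps
    return grad
```

Simple and general, but noisy and inefficient when θ is high-dimensional, which is why the analytic approaches below are preferred.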
Monte-Carlo Policy Gradient:
From here on, treat πθ(s,a) as a function of θ: it gives the probability of taking action a in state s, i.e. πθ(s,a) = P(a | s, θ).
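A concrete example of such a parameterisation, assuming discrete actions and linear features φ(s,a) (the softmax policy from the slides): actions are weighted by an exponentiated linear score, and the score function ∇θ log πθ(s,a) has a simple closed form, which is what the updates below actually use.

$$\pi_\theta(s,a) = \frac{\exp\!\big(\phi(s,a)^\top \theta\big)}{\sum_b \exp\!\big(\phi(s,b)^\top \theta\big)}, \qquad \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\big[\phi(s,\cdot)\big]$$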
MC-PG determines the optimal θ by sampling episodes.
1) Take a one-step MDP as an example:
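Sketch of this standard derivation: start in s ~ d(s), take one action a ~ πθ(s,·), receive reward r = R_{s,a}, and terminate. Using the likelihood-ratio trick ∇θ πθ(s,a) = πθ(s,a) ∇θ log πθ(s,a):

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_{s,a}$$

$$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, r\big]$$

The expectation can therefore be estimated by sampling, without knowing d(s) or R.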
2) The Policy Gradient Theorem extends the result from [a one-step MDP] to [any differentiable policy and any of the policy objective functions]:
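The resulting statement (paraphrased): for any differentiable policy πθ(s,a) and any of the objectives above, the long-term action value Q^{πθ}(s,a) simply replaces the one-step reward r:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; Q^{\pi_\theta}(s,a)\big]$$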
3) Monte-Carlo Policy Gradient (REINFORCE):
Its drawbacks: high variance and a slow convergence rate; a minimal sketch of the update is given below.
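The sketch assumes the softmax policy with linear features φ(s,a) from the example above; the Monte-Carlo return v_t serves as an unbiased sample of Q^{πθ}(s_t,a_t). Function and variable names are my own illustration, not from the lecture.

```python
import numpy as np

def softmax_probs(theta, phi):
    """Action probabilities of a softmax policy.

    phi : (num_actions, num_features) feature matrix phi(s, .) for one state.
    """
    prefs = phi @ theta
    prefs = prefs - prefs.max()            # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """One REINFORCE update from a single sampled episode.

    episode : list of (phi, action, reward) tuples, where phi is the
              (num_actions, num_features) feature matrix of the visited state.
    """
    # Monte-Carlo returns v_t for every time step (the source of the high variance).
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # theta <- theta + alpha * grad log pi_theta(s_t, a_t) * v_t
    for (phi, a, _), v_t in zip(episode, returns):
        probs = softmax_probs(theta, phi)
        score = phi[a] - probs @ phi       # score function of a softmax policy
        theta = theta + alpha * score * v_t
    return theta
```

Because v_t is a full return sampled from a single episode, the estimate is unbiased but noisy, which is exactly the high-variance problem noted above.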
Actor-Critic Policy Gradient:
1) Motivation:
To reduce the variance, we can first use the value-function approximation methods from the previous lecture to learn a critic Qw(s,a), and then use Qw(s,a) as an estimate of Qπ(s,a) in the policy-gradient update of θ.
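In other words, the actor ascends an approximate policy gradient in which the critic's estimate Qw(s,a) stands in for the true Q^{πθ}(s,a):

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; Q_w(s,a)\big], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)$$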
2) A simple actor-critic algorithm based on action-value function approximation and TD(0):
Note that both θ (the actor) and w (the critic) are updated.
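A minimal per-step sketch, assuming a linear critic Qw(s,a) = φ(s,a)ᵀw and the softmax actor from before (the function name, argument layout and step sizes α, β are my own illustration): the critic moves towards a TD(0) target while the actor follows the approximate policy gradient above.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, a, r, phi_s_next, a_next, done,
                      alpha=0.01, beta=0.1, gamma=0.99):
    """One TD(0) actor-critic update for a single transition (s, a, r, s', a').

    phi_s, phi_s_next : (num_actions, num_features) feature matrices of s and s'.
    Critic: Q_w(s,a) = phi(s,a) . w      Actor: softmax over phi(s,a) . theta
    """
    q_sa = phi_s[a] @ w
    q_next = 0.0 if done else phi_s_next[a_next] @ w
    delta = r + gamma * q_next - q_sa              # TD(0) error

    # Critic update (w): move the estimate towards the TD target.
    w = w + beta * delta * phi_s[a]

    # Actor update (theta): policy gradient using the critic's estimate Q_w(s,a).
    prefs = phi_s @ theta
    probs = np.exp(prefs - prefs.max())
    probs = probs / probs.sum()
    score = phi_s[a] - probs @ phi_s               # grad log pi_theta(s, a)
    theta = theta + alpha * score * q_sa
    return theta, w
```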
3) Compatible Function Approximation Theorem:
The problem with the actor-critic algorithm:
Approximating the policy gradient introduces bias. A biased policy gradient may not find the right solution.
Luckily, if we choose the value function approximation carefully, then we can avoid introducing any bias, i.e. we can still follow the exact policy gradient.
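Concretely, the theorem (paraphrasing the standard statement) asks two things of the critic: (i) the value-function approximator is compatible with the policy,

$$\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(s,a),$$

and (ii) its parameters w minimise the mean-squared error

$$\varepsilon = \mathbb{E}_{\pi_\theta}\big[(Q^{\pi_\theta}(s,a) - Q_w(s,a))^2\big].$$

Under these two conditions the approximate gradient is exact:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\big].$$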
Proof omitted.
Advantage Actor-Critic and Natural Actor-Critic are not written up in detail here either.
But here is a brief summary:
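Roughly (paraphrasing the lecture's closing summary): define the advantage A^{πθ}(s,a) = Q^{πθ}(s,a) − V^{πθ}(s). Subtracting the baseline V^{πθ}(s) reduces variance without changing the expected gradient, and the TD error δ = r + γV(s') − V(s) is an unbiased sample of the advantage. The policy gradient can then be written in several forms, differing only in what multiplies the score function, and each form yields a stochastic gradient ascent algorithm whose critic estimates the corresponding quantity:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; v_t\big] \quad \text{(REINFORCE)}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; Q_w(s,a)\big] \quad \text{(Q actor-critic)}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; A_w(s,a)\big] \quad \text{(advantage actor-critic)}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\; \delta\big] \quad \text{(TD actor-critic)}$$

Natural Actor-Critic instead follows the natural gradient $G_\theta^{-1} \nabla_\theta J(\theta)$, where $G_\theta$ is the Fisher information matrix of the policy, which makes the update invariant to how the policy is parameterised.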