Chapter 4 Dynamic Programming
Contents
4.1 Policy Evaluation (Prediction)
4.5 Asynchronous Dynamic Programming
4.6 Generalized Policy Iteration
4.7 Efficiency of Dynamic Programming
(The simulation/implementation part is still to be completed.)
Preface:
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically. DP provides an essential foundation for the understanding of the methods presented in the rest of this book.
DP is the foundation for the other reinforcement learning algorithms: those methods can in fact be viewed as attempts to achieve much the same effect as DP, only with less computation and with weaker assumptions about the model of the environment.
We usually assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets, S, A, and R, are finite, and that its dynamics are given by a set of probabilities p(s', r | s, a).
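As a purely illustrative sketch (not from the book), one way to hold such a finite model in code is a dictionary keyed by (state, action); the 2-state MDP below is an assumed toy example.

```python
# Illustrative sketch: store the dynamics p(s', r | s, a) of a finite MDP as a
# dictionary mapping each (state, action) pair to a list of
# (probability, next_state, reward) triples.
# The 2-state MDP below is an assumed toy example, not one from the book.
P = {
    (0, 'stay'): [(1.0, 0, 0.0)],   # staying in state 0 yields reward 0
    (0, 'go'):   [(1.0, 1, 1.0)],   # moving 0 -> 1 yields reward +1
    (1, 'stay'): [(1.0, 1, 0.0)],   # state 1 is absorbing: reward 0 forever
    (1, 'go'):   [(1.0, 1, 0.0)],
}

# Sanity check: for each (s, a), p(s', r | s, a) must sum to 1 over (s', r).
for (s, a), outcomes in P.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```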
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. In this chapter we show how DP can be used to compute the value functions defined in Chapter 3. As discussed there, once we have found the optimal value function v* or q* that satisfies the Bellman optimality equations, it is easy to obtain an optimal policy:
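For reference, the Bellman optimality equations from Chapter 3, in the book's notation, are:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$

A policy that is greedy with respect to v* (or q*) is then optimal.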
DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
4.1 Policy Evaluation (Prediction)
First we consider how to compute the state-value function vπ for an arbitrary policy π. This is called policy evaluation in the DP literature. We also refer to it as the prediction problem.
Policy evaluation: compute the state values under an arbitrary given policy.
For a given policy π, the state-value function vπ exists and is unique (provided that either γ < 1 or eventual termination is guaranteed from every state under π).
Iterative policy evaluation:
We proceed iteratively, using the Bellman equation for vπ as an update rule to successively approximate vπ.
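In symbols, each sweep applies the Bellman equation for vπ as an assignment (restated here in the notation of Chapter 3):

$$v_{k+1}(s) \doteq \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]$$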
for all s ∈ S. Clearly, vk = vπ is a fixed point for this update rule, because the Bellman equation for vπ assures us of equality in this case. Indeed, the sequence {vk} can be shown in general to converge to vπ as k → ∞, under the same conditions that guarantee the existence of vπ.