Chapter 4 Dynamic Programming: Study Notes


Contents

Chapter 4 Dynamic Programming

Preface:

 

4.1 Policy Evaluation (Prediction)

4.2 Policy Improvement

4.3 Policy Iteration

4.4 Value Iteration

4.5 Asynchronous Dynamic Programming

4.6 Generalized Policy Iteration

4.7 Efficiency of Dynamic Programming

4.8 Summary


The simulation/implementation part is still to be completed.

Preface:

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
 
Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically.
 
DP provides an essential foundation for the understanding of the methods presented in the rest of this book.
 
The DP algorithms are the foundation of the other reinforcement learning methods; in effect, the other algorithms try to achieve the same result as DP with less computation and with weaker assumptions about the model of the environment.
 
We usually assume that the environment is a finite MDP. That is, we assume that its state, action, and reward sets, S, A, and R, are finite, and that its dynamics are given by a set of probabilities p(s', r | s, a).
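For reference, the four-argument dynamics function is defined (as in Chapter 3) by

$$
p(s', r \mid s, a) \doteq \Pr\{S_t = s',\, R_t = r \mid S_{t-1} = s,\, A_{t-1} = a\},
\qquad \sum_{s'} \sum_{r} p(s', r \mid s, a) = 1 \quad \text{for all } s, a.
$$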
 
 
The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. In this chapter we show how DP can be used to compute the value functions defined in Chapter 3.
 
As discussed there, once we have found the optimal value functions, v* or q*, which satisfy the Bellman optimality equations, it is easy to obtain an optimal policy:
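The Bellman optimality equations referred to here (from Chapter 3) are

$$
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_*(s') \big],
$$

$$
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Big[ r + \gamma \max_{a'} q_*(s', a') \Big].
$$

Given q*, for example, an optimal policy can be read off directly by choosing, in each state s, any action that maximizes q*(s, a).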
 
 

 
DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
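As a concrete illustration, here is a minimal sketch (my own, not from the book) of a single Bellman expected update written as an assignment, assuming the dynamics are stored as P[s][a], a list of (prob, next_state, reward) triples, and the policy as pi[s], a dict mapping actions to probabilities:

```python
def bellman_update(V, s, pi, P, gamma):
    """One expected update for state s under policy pi:
    V[s] <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma V[s']]."""
    new_value = 0.0
    for a, action_prob in pi[s].items():          # weight by pi(a|s)
        for prob, s_next, reward in P[s][a]:      # weight by p(s',r|s,a)
            new_value += action_prob * prob * (reward + gamma * V[s_next])
    V[s] = new_value   # the Bellman equation for v_pi, turned into an assignment
    return V[s]
```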

 

4.1 Policy Evaluation (Prediction)

First we consider how to compute the state-value function for an arbitrary policy π . This is called policy evaluation in the DP literature. We also refer to it as the prediction problem .
 

Policy evaluation: computing the state-value function v_π for an arbitrary given policy π.

This state-value function exists and is unique, as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π.

Iterative policy evaluation:

We approximate v_π iteratively, using the Bellman equation for v_π as an update rule, applied successively to a sequence of approximate value functions v_0, v_1, v_2, …
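The update rule (the expected update of iterative policy evaluation) is

$$
v_{k+1}(s) \doteq \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_k(s') \big],
$$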

for all s ∈ S. Clearly, v_k = v_π is a fixed point for this update rule, because the Bellman equation for v_π assures us of equality in this case. Indeed, the sequence {v_k} can be shown in general to converge to v_π as k → ∞, under the same conditions that guarantee the existence of v_π.
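Putting the sweeps together, a minimal sketch of iterative policy evaluation (using the same assumed P[s][a] / pi[s] representation as above; not an interface from the book):

```python
def iterative_policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Approximate v_pi by repeated sweeps of the expected update.

    P[s][a] : list of (prob, next_state, reward) triples (finite MDP dynamics)
    pi[s]   : dict mapping each action a to pi(a|s)
    theta   : stop once the largest change in a sweep falls below this threshold
    """
    num_states = len(P)
    V = [0.0] * num_states                 # v_0 is arbitrary; terminal states stay at 0
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = sum(
                action_prob * prob * (reward + gamma * V[s_next])
                for a, action_prob in pi[s].items()
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                   # in-place sweep: new values used immediately
        if delta < theta:
            return V
```

This sketch updates the values in place, overwriting V[s] as soon as its new value is computed; the book notes that this in-place variant also converges and usually does so faster than keeping two separate arrays.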