Planning by Dynamic Programming


Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP).

Dynamic—sequential or temporal component to the problem
Programming—optimising a “program”, i.e. a policy

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.

As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
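
For example, in standard notation, the Bellman expectation equation for $v_\pi$ becomes the iterative update used for policy evaluation in the next section:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\;\;\Longrightarrow\;\;
v_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]
$$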

1. Policy Evaluation (Prediction)

Policy evaluation (the prediction problem) refers to the problem of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$.

Solution: iterative application of the Bellman expectation backup.
A complete in-place version of iterative policy evaluation is shown in the box below.

[Figure: pseudocode box for in-place iterative policy evaluation]
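
For concreteness, here is a minimal Python sketch of the in-place version; the transition table `P[s][a]` of `(prob, next_state, reward)` tuples and the `policy[s][a]` probability matrix are representations assumed here for illustration, not part of the original notes:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation.

    P[s][a]  -- list of (prob, next_state, reward) tuples (assumed model format)
    policy   -- policy[s][a] = probability of taking action a in state s
    theta    -- stop once no state value changes by more than theta in a sweep
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # Bellman expectation backup applied as an assignment (in place):
            # updated values are reused immediately within the same sweep.
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V
```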

2. Policy Improvement (Policy Iteration)

The basis of policy improvement:
[Figure: basis of policy improvement]
The process of policy improvement:
[Figures: process of policy improvement]
The proof of policy improvement:
[Figure: proof of policy improvement]
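
For reference, the policy improvement theorem in standard notation: if a deterministic policy $\pi'$ satisfies $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for every state $s$, then $v_{\pi'}(s) \ge v_\pi(s)$ for every $s$. The improvement step therefore takes the greedy policy:

$$
\pi'(s) = \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
$$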
At every iteration, policy iteration first evaluates the value function of the current policy until convergence. Can we improve the policy before the evaluation has fully converged? For example, we could introduce an epsilon convergence threshold, or simply improve the policy after k evaluation sweeps, or even improve it after a single sweep (k = 1). As we will see below, the k = 1 case is exactly value iteration.
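
A rough sketch of the full policy-iteration loop described above, reusing the `policy_evaluation` sketch from Section 1 (the helper name `q_from_v` is introduced here for illustration):

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.9):
    """One-step lookahead: action values q(s, a) from a state-value estimate V."""
    return np.array([
        sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
        for a in range(len(P[s]))
    ])

def policy_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_states = len(P)
    policy = np.ones((n_states, n_actions)) / n_actions  # start from the uniform random policy
    while True:
        # 1. Policy evaluation: run to convergence under the current policy.
        V = policy_evaluation(P, policy, gamma, theta)
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(n_states):
            old_action = np.argmax(policy[s])
            best_action = np.argmax(q_from_v(P, V, s, gamma))
            if best_action != old_action:
                stable = False
            policy[s] = np.eye(n_actions)[best_action]
        if stable:  # the greedy policy no longer changes, so it is optimal
            return policy, V
```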

3. Value Iteration
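
Continuing with the same assumed `P[s][a]` model format, a minimal sketch of value iteration, i.e. the k = 1 case raised above: the evaluation sweep and the greedy improvement collapse into a single Bellman optimality backup per state.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: one Bellman optimality backup per state per sweep."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # max over actions replaces the expectation under a fixed policy
            V[s] = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([
            sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
            for a in range(len(P[s]))
        ]))
        for s in range(n_states)
    ])
    return policy, V
```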

4. Asynchronous DP

5. Generalized Policy Iteration

Generalized policy iteration (GPI) refers to the general idea of letting the policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
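
To make the point about granularity concrete, here is a rough sketch of a truncated variant in which the number of evaluation sweeps per improvement step is an explicit parameter (the function name, `eval_sweeps`, and `n_iters` are illustrative choices, not from the original notes):

```python
import numpy as np

def truncated_policy_iteration(P, n_actions, eval_sweeps=3, n_iters=1000, gamma=0.9):
    """GPI with an explicit granularity knob.

    eval_sweeps -- evaluation sweeps per improvement step; a large value
                   approaches policy iteration, eval_sweeps=1 behaves like
                   value iteration (the k = 1 case discussed in Section 2).
    """
    n_states = len(P)
    V = np.zeros(n_states)
    policy = np.ones((n_states, n_actions)) / n_actions
    for _ in range(n_iters):
        # Partial policy evaluation: only a few Bellman expectation sweeps.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = sum(
                    policy[s][a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                    for a in range(len(P[s]))
                )
        # Greedy policy improvement with respect to the current (inexact) V.
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                 for a in range(len(P[s]))]
            policy[s] = np.eye(n_actions)[int(np.argmax(q))]
    return policy, V
```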
