Planning by Dynamic Programming


Dynamic Programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov Decision Process (MDP).

Dynamic—sequential or temporal component to the problem
Programming—optimising a “program”, i.e. a policy

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies.

As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
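
For example, in standard notation, the Bellman expectation equation for $v_\pi$ becomes the iterative update used for policy evaluation in the next section:

$$
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
\;\;\Longrightarrow\;\;
v_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_k(s')\bigr]
$$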

1. Policy Evaluation (Prediction)

Policy evaluation (the prediction problem) refers to the problem of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$.

Solution: iterative application of the Bellman expectation backup.
A complete in-place version of iterative policy evaluation is shown in the box below.

[Figure: pseudocode box for in-place iterative policy evaluation]
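
For concreteness, here is a minimal Python sketch of the in-place version; the transition table `P[s][a]` of `(prob, next_state, reward)` tuples and the `policy[s][a]` probability matrix are representations assumed here for illustration, not part of the original notes:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """In-place iterative policy evaluation.

    P[s][a]  -- list of (prob, next_state, reward) tuples (assumed model format)
    policy   -- policy[s][a] = probability of taking action a in state s
    theta    -- stop once no state value changes by more than theta in a sweep
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # Bellman expectation backup applied as an assignment (in place):
            # updated values are reused immediately within the same sweep.
            V[s] = sum(
                policy[s][a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    return V
```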

2. Policy Improvement (Policy Iteration)

The basis of policy improvement:
[Figure: basis of policy improvement]
The process of policy improvement:
[Figures: process of policy improvement]
The proof of policy improvement:
[Figure: proof of policy improvement]
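
For reference, the policy improvement theorem in standard notation: if a deterministic policy $\pi'$ satisfies $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for every state $s$, then $v_{\pi'}(s) \ge v_\pi(s)$ for every $s$. The improvement step therefore takes the greedy policy:

$$
\pi'(s) = \arg\max_{a} q_\pi(s, a) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
$$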
At every iteration, policy iteration first evaluates the value function of the current policy until convergence. Can we improve the policy before the evaluation has fully converged? For example, we could introduce an epsilon convergence threshold, or simply improve the policy after k evaluation sweeps, or even improve it after a single sweep (k = 1). As we will see below, the k = 1 case is exactly value iteration.
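
A rough sketch of the full policy-iteration loop described above, reusing the `policy_evaluation` sketch from Section 1 (the helper name `q_from_v` is introduced here for illustration):

```python
import numpy as np

def q_from_v(P, V, s, gamma=0.9):
    """One-step lookahead: action values q(s, a) from a state-value estimate V."""
    return np.array([
        sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
        for a in range(len(P[s]))
    ])

def policy_iteration(P, n_actions, gamma=0.9, theta=1e-8):
    """Alternate full policy evaluation with greedy policy improvement."""
    n_states = len(P)
    policy = np.ones((n_states, n_actions)) / n_actions  # start from the uniform random policy
    while True:
        # 1. Policy evaluation: run to convergence under the current policy.
        V = policy_evaluation(P, policy, gamma, theta)
        # 2. Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(n_states):
            old_action = np.argmax(policy[s])
            best_action = np.argmax(q_from_v(P, V, s, gamma))
            if best_action != old_action:
                stable = False
            policy[s] = np.eye(n_actions)[best_action]
        if stable:  # the greedy policy no longer changes, so it is optimal
            return policy, V
```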

3. Value Iteration
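
Continuing with the same assumed `P[s][a]` model format, a minimal sketch of value iteration, i.e. the k = 1 case raised above: the evaluation sweep and the greedy improvement collapse into a single Bellman optimality backup per state.

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: one Bellman optimality backup per state per sweep."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_old = V[s]
            # max over actions replaces the expectation under a fixed policy
            V[s] = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.array([
        int(np.argmax([
            sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
            for a in range(len(P[s]))
        ]))
        for s in range(n_states)
    ])
    return policy, V
```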

4. Asynchronous DP

5. Generalized Policy Iteration

Generalized policy iteration (GPI) refers to the general idea of letting the policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes.
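
To make the point about granularity concrete, here is a rough sketch of a truncated variant in which the number of evaluation sweeps per improvement step is an explicit parameter (the function name, `eval_sweeps`, and `n_iters` are illustrative choices, not from the original notes):

```python
import numpy as np

def truncated_policy_iteration(P, n_actions, eval_sweeps=3, n_iters=1000, gamma=0.9):
    """GPI with an explicit granularity knob.

    eval_sweeps -- evaluation sweeps per improvement step; a large value
                   approaches policy iteration, eval_sweeps=1 behaves like
                   value iteration (the k = 1 case discussed in Section 2).
    """
    n_states = len(P)
    V = np.zeros(n_states)
    policy = np.ones((n_states, n_actions)) / n_actions
    for _ in range(n_iters):
        # Partial policy evaluation: only a few Bellman expectation sweeps.
        for _ in range(eval_sweeps):
            for s in range(n_states):
                V[s] = sum(
                    policy[s][a] * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                    for a in range(len(P[s]))
                )
        # Greedy policy improvement with respect to the current (inexact) V.
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                 for a in range(len(P[s]))]
            policy[s] = np.eye(n_actions)[int(np.argmax(q))]
    return policy, V
```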
