Lect3_Dynamic_Programming

This post covers how to use dynamic programming to solve the prediction and control problems in a Markov decision process (MDP), walks through policy evaluation, policy iteration and value iteration in detail, and illustrates how the algorithms run on a concrete example.


Planning by Dynamic Programming

Introduction

  • Dynamic: a sequential or temporal component to the problem
  • Programming: optimising a "program", i.e. a policy

Requirements for DP

  1. Optimal substructure: the problem can be decomposed into subproblems, and combining the optimal solutions of the subproblems yields an optimal solution of the original problem
  2. Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused

MDPs satisfy both requirements:

  1. The Bellman equation gives a recursive decomposition.
  2. The value function stores and reuses solutions.

DP is used for planning in an MDP:

Prediction: input an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$; output the value function $\operatorname{v}_\pi$.

Control: input an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$; output the optimal value function $\operatorname{v}_*$ and an optimal policy $\pi_*$.

Policy Evaluation


Iterative Policy Evaluation

Iterative application of the Bellman expectation backup: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_\pi$

  • Using synchronous backups:
    • at each iteration $k+1$,
    • for all states $s \in \mathcal{S}$,
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$,
    • where $s'$ is a successor state of $s$
  • Convergence to $\operatorname{v}_\pi$ can be proved

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
This is reminiscent of fixed-point iteration in numerical analysis for solving $x = f(x)$: initialise $x_0$ to some number, then loop $x_{k+1} = f(x_k)$.

Matrix form:
$$\mathbf{v}^{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi \mathbf{v}^k$$
where
$$\begin{aligned} \mathcal{R}^\pi_s &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{R}_s^a \\ \mathcal{P}^\pi_{ss'} &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{P}_{ss'}^a \end{aligned}$$
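As a quick illustration, below is a minimal sketch of this matrix-form backup on a hypothetical 2-state MDP; the arrays `R_pi` and `P_pi` (i.e. $\mathcal{R}^\pi$ and $\mathcal{P}^\pi$) and the discount are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical 2-state MDP under some fixed policy pi (numbers are made up).
gamma = 0.9
R_pi = np.array([1.0, 0.0])              # R^pi_s   = sum_a pi(a|s) R_s^a
P_pi = np.array([[0.5, 0.5],             # P^pi_ss' = sum_a pi(a|s) P^a_ss'
                 [0.0, 1.0]])

v = np.zeros(2)                          # v_0 = 0
for _ in range(1000):
    v_next = R_pi + gamma * P_pi @ v     # Bellman expectation backup
    if np.max(np.abs(v_next - v)) < 1e-10:
        v = v_next
        break
    v = v_next

print(v)  # fixed point of v = R^pi + gamma * P^pi v, i.e. v_pi (approx. [1.818, 0.0])
```

Just like the $x_{k+1} = f(x_k)$ analogy above, the loop simply iterates the backup until the value vector stops changing.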


Example

[Figure: 4×4 gridworld; the shaded top-left and bottom-right corners are the single terminal state, and the nonterminal states are numbered 1–14 row by row]

  • $\gamma = 1$
  • Nonterminal states 1, …, 14; one terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged, e.g. when $s = 4$ and $a = \text{west}$, the next state is $s' = 4$
  • Transitions are deterministic given the action, e.g. $\mathcal{P}_{62}^{\text{north}} = \mathbb{P}\left[s' = 2 \mid s = 6, a = \text{north}\right] = 1$
  • The reward is $-1$ on every step until the terminal state is reached
  • Uniform random policy: $\pi(n \mid \cdot) = \pi(e \mid \cdot) = \pi(w \mid \cdot) = \pi(s \mid \cdot) = 0.25$

Initialise the value function of every state to 0 and keep iterating:

[Figure: the value function $\operatorname{v}_k$ under the random policy after $k = 0, 1, 2, 3, \ldots$ sweeps]

The computation proceeds as follows:

For k=0:
$$\operatorname{v}_0(s) = 0 \qquad \forall s$$
For k = 1: e.g.
$$\begin{aligned} \operatorname{v}_1(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_0(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_0(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_0(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_0(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1+0)+0.25 \times (-1+0)+0.25 \times (-1+0) = -1.0 \end{aligned}$$
For k = 2: e.g.
$$\begin{aligned} \operatorname{v}_2(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_1(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_1(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1) = -1.75 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_2(s{=}8) ={}& \pi(a{=}n \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}w \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w}\operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}s \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_1(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_1(s'{=}9) \right) \\ ={}& 0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1) = -2 \end{aligned}$$
For k=3: e.g.
$$\begin{aligned} \operatorname{v}_3(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_2(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_2(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1-1.75)+0.25 \times (-1-2)+0.25 \times (-1-2) = -2.4375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}8) ={}& \pi(a{=}n \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}w \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}s \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_2(s'{=}9) \right) \\ ={}& 0.25 \times (-1-1.75)+0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2) = -2.9375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}12) ={}& \pi(a{=}n \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=n} + \mathcal{P}_{s=12,s'=8}^{a=n}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}w \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=w} + \mathcal{P}_{s=12,s'=12}^{a=w}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}s \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=s} + \mathcal{P}_{s=12,s'=12}^{a=s}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=e} + \mathcal{P}_{s=12,s'=13}^{a=e}\operatorname{v}_2(s'{=}13) \right) \\ ={}& 0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2) = -3.0 \end{aligned}$$
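These hand computations are easy to check with a few lines of code. Below is a small sketch of synchronous iterative policy evaluation on this gridworld; the 0–15 state indexing (with 0 and 15 as the terminal corners, so the text's state 4 is index 4, state 8 is index 8, and so on) is just a convention chosen for the sketch.

```python
import numpy as np

N = 4
TERMINAL = {0, 15}                              # the shaded corners (one terminal state, shown twice)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # north, south, west, east

def step(s, a):
    """Deterministic transition; actions leading off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    r2, c2 = r + a[0], c + a[1]
    return s if not (0 <= r2 < N and 0 <= c2 < N) else r2 * N + c2

def backup(v, gamma=1.0):
    """One synchronous Bellman expectation backup under the uniform random policy."""
    v_next = np.zeros_like(v)
    for s in range(N * N):
        if s in TERMINAL:
            continue                             # the terminal state keeps value 0
        v_next[s] = sum(0.25 * (-1 + gamma * v[step(s, a)]) for a in ACTIONS)
    return v_next

v = np.zeros(N * N)                              # v_0(s) = 0 for all s
for k in range(1, 4):
    v = backup(v)
    print(f"k={k}: v(4)={v[4]:.4f}  v(8)={v[8]:.4f}  v(12)={v[12]:.4f}")
# k=1: v(4)=-1.0000 ...  k=2: v(4)=-1.7500 v(8)=-2.0000 ...  k=3: v(4)=-2.4375 v(8)=-2.9375 v(12)=-3.0000
```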

Policy Iteration

[Figure: policy iteration alternates policy evaluation ($\operatorname{v} \rightarrow \operatorname{v}_\pi$) and greedy policy improvement ($\pi \rightarrow \operatorname{greedy}(\operatorname{v}_\pi)$), converging to $\operatorname{v}_*$ and $\pi_*$]

Algorithm:

  1. Given a policy $\pi$

  2. Repeat until the policy no longer changes:

    1. Evaluate the policy $\pi$:
      $\operatorname{v}_\pi(s) = \mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+ \ldots \mid S_t = s \right]$

    2. Improve the policy by acting greedily with respect to $\operatorname{v}_\pi$:
      $\pi' = \operatorname{greedy}(\operatorname{v}_\pi)$
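For concreteness, here is a minimal sketch of this loop for a generic tabular MDP. The model layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for expected rewards) and the stopping tolerances are conventions chosen for the sketch, not something specified in the lecture.

```python
import numpy as np

def policy_iteration(P, R, gamma=1.0, eval_sweeps=1000, tol=1e-8):
    """P[s, a, s']: transition probabilities; R[s, a]: expected immediate rewards."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial deterministic policy

    while True:
        # 1. Policy evaluation: iterate the Bellman expectation backup under pi.
        v = np.zeros(n_states)
        for _ in range(eval_sweeps):
            P_pi = P[np.arange(n_states), pi]     # rows of P for the actions pi chooses
            R_pi = R[np.arange(n_states), pi]
            v_next = R_pi + gamma * P_pi @ v
            if np.max(np.abs(v_next - v)) < tol:
                v = v_next
                break
            v = v_next

        # 2. Policy improvement: act greedily w.r.t. q_pi(s, a) = R_s^a + gamma * sum_s' P^a_ss' v(s').
        q = R + gamma * P @ v                     # shape (n_states, n_actions)
        pi_next = np.argmax(q, axis=1)
        if np.array_equal(pi_next, pi):           # improvement stopped -> pi is optimal
            return pi, v
        pi = pi_next
```

With $\gamma = 1$ this relies on every policy eventually reaching a terminal state (as in the gridworld above); otherwise use $\gamma < 1$.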

Policy improvement

Proof that acting greedily gives $\pi' \geq \pi$:

  1. Consider a deterministic policy, $a = \pi(s)$

  2. Improve the policy by acting greedily:
    $\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_\pi(s,a)$

  3. This improves the value from any state $s$ over one step:
    $q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}\geq} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$

  4. It therefore improves the value function, $\operatorname{v}_{\pi'}(s) \geq \operatorname{v}_\pi(s)$:
    $$\begin{aligned} \operatorname{v}_\pi(s) &\leq q_\pi\left(s,\pi'(s) \right) = \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma \operatorname{v}_{\color{blue}\pi} \left(S_{t+1} \right) \mid S_t = s \right] \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma q_{\color{blue}\pi} \left(S_{t+1}, \pi'(S_{t+1}) \right) \mid S_t = s \right] \qquad \text{by step 3, which holds for all } s \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\color{blue}\pi} \left(S_{t+2}, \pi'(S_{t+2}) \right) \mid S_t = s \right] \\ &\leq \dots \leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = \operatorname{v}_{\pi'}(s) \end{aligned}$$
    Why ${\color{red}\pi'}$ and ${\color{blue}\pi}$?

    Go back to the definition of the action-value function: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma \operatorname{v}_\pi(S_{t+1}) \mid S_t = s, A_t = a\right]$. In $q_\pi\left(s, \pi'(s)\right)$ the action taken at each step is the one chosen by ${\color{red}\pi'}$, so the expectation over actions (the weights of the weighted average) is taken under ${\color{red}\pi'}$, the policy actually being followed. The value of the successor state, however, is still measured by $\operatorname{v}_{\color{blue}\pi}$, which was computed under the old policy $\pi$ and has not yet been updated. Hence both ${\color{red}\pi'}$ and ${\color{blue}\pi}$ appear.

Proof that this process converges to $\pi^*$:

  1. If improvements stop,
    $q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}=} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$

  2. then the Bellman optimality equation is satisfied:
    $\operatorname{v}_\pi(s) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a)$

  3. Therefore $\operatorname{v}_\pi(s) = \operatorname{v}_*(s) \quad \forall s \in \mathcal{S}$,

  4. so $\pi$ is an optimal policy.
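To see these two results numerically, the sketch below reuses the gridworld conventions from the earlier sketch (my own 0–15 indexing): it evaluates the uniform random policy, improves it greedily once, and checks that the improved policy is at least as good in every state.

```python
import numpy as np

N, TERMINAL = 4, {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

def step(s, a):
    r, c = divmod(s, N)
    r2, c2 = r + a[0], c + a[1]
    return s if not (0 <= r2 < N and 0 <= c2 < N) else r2 * N + c2

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation; policy[s][i] is the probability of action i in state s."""
    v = np.zeros(N * N)
    for _ in range(sweeps):
        v = np.array([0.0 if s in TERMINAL else
                      sum(policy[s][i] * (-1 + v[step(s, a)]) for i, a in enumerate(ACTIONS))
                      for s in range(N * N)])
    return v

def greedy(v):
    """Greedy deterministic policy w.r.t. v, returned as one-hot action probabilities."""
    pi = []
    for s in range(N * N):
        q = [-1 + v[step(s, a)] for a in ACTIONS]     # q(s, a) by one-step lookahead
        best = int(np.argmax(q))
        pi.append([1.0 if i == best else 0.0 for i in range(len(ACTIONS))])
    return pi

random_pi = [[0.25] * 4 for _ in range(N * N)]
v_pi = evaluate(random_pi)                  # v_pi of the uniform random policy
pi_improved = greedy(v_pi)                  # one step of greedy policy improvement
v_improved = evaluate(pi_improved)
assert np.all(v_improved >= v_pi - 1e-9)    # policy improvement theorem: v_pi'(s) >= v_pi(s) for all s
```

Repeating the evaluate/greedy pair until the greedy policy stops changing is exactly policy iteration, and at that point the Bellman optimality equation above holds.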

Value Iteration

Principle of Optimality

A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $\operatorname{v}_\pi(s) = \operatorname{v}_*(s)$, if and only if

  • for any state $s'$ reachable from $s$,
  • $\pi$ achieves the optimal value from state $s'$, i.e. $\operatorname{v}_\pi(s') = \operatorname{v}_*(s')$

Deterministic Value Iteration

Compared with iterative policy evaluation above, the only real difference is how the update is done: a max over actions instead of an expectation under the policy.

  • If we know the solution to the subproblems $\operatorname{v}_*(s')$,

  • then the solution $\operatorname{v}_*(s)$ can be found by a one-step lookahead:
    $$\operatorname{v}_*(s) \leftarrow \underset{a \in \mathcal{A}}{\operatorname{max}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_*(s') \right)$$

  • The idea of value iteration is to apply these updates iteratively

  • Intuition: start with final rewards and work backwards

Iterative application of the Bellman optimality backup: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_*$

  • Using synchronous backups:
    • at each iteration $k+1$,
    • for all states $s \in \mathcal{S}$,
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$
  • Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \underset{a \in \mathcal{A}}{\operatorname{max}} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
Matrix form:
$$\mathbf{v}_{k+1} = \underset{a \in \mathcal{A}}{\operatorname{max}}\left( \mathcal{R}^{\mathbf{a}} + \gamma \mathcal{P}^{\mathbf{a}} \mathbf{v}_k \right)$$
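A minimal sketch of this backup, reusing the same assumed `P[s, a, s']` / `R[s, a]` model layout as the policy-iteration sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=1.0, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v                  # one-step lookahead q(s, a), shape (n_states, n_actions)
        v_next = q.max(axis=1)                 # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_next - v)) < tol:
            v = v_next
            break
        v = v_next
    pi = np.argmax(R + gamma * P @ v, axis=1)  # a greedy policy is extracted only once, at the end
    return v, pi
```

Note that, as the bullet above says, no policy is maintained during the iterations; a greedy policy is read off from the final value function.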


A live demo

GridWorld: Dynamic Programming Demo

Summary of DP Algorithms

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
  • These algorithms are based on the state-value function $\operatorname{v}_\pi(s)$ or $\operatorname{v}_*(s)$
  • With $m$ actions and $n$ states, each state's backup looks at $m$ actions and up to $n$ successor states, so the complexity is $O(m \cdot n \cdot n) = O(mn^2)$ per iteration