Lect3_Dynamic_Programming

This post covers how to use dynamic programming to solve the prediction and control problems in a Markov decision process (MDP), walks through policy evaluation, policy iteration and value iteration in detail, and illustrates how the algorithms run on a concrete example.


Planning by Dynamic Programming

Introduction

  • Dynamic: a sequential or temporal component to the problem
  • Programming: optimising a "program", i.e. a policy

Requirements for DP

  1. Optimal substructure: the problem can be decomposed into subproblems, and combining the optimal solutions of the subproblems yields an optimal solution of the original problem
  2. Overlapping subproblems: subproblems recur many times, so their solutions can be cached and reused

MDPs satisfy both requirements:

  1. The Bellman equation gives a recursive decomposition.
  2. The value function stores and reuses solutions.

DP is used for planning in an MDP:

Prediction: input an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$; output the value function $\operatorname{v}_\pi$.

Control: input an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$; output the optimal value function $\operatorname{v}_*$ and an optimal policy $\pi_*$.

Policy Evaluation


Iterative Policy Evaluation

Iterative application of the Bellman expectation backup: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_\pi$

  • Using synchronous backups:
    • at each iteration $k+1$,
    • for all states $s \in \mathcal{S}$,
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$,
    • where $s'$ is a successor state of $s$
  • Convergence to $\operatorname{v}_\pi$ can be proved

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
This is reminiscent of fixed-point iteration in numerical analysis for solving $x = f(x)$: initialise $x_0$ to some number, then loop $x_{k+1} = f(x_k)$.

Matrix form:
$$\mathbf{v}^{k+1} = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi \mathbf{v}^k$$
where
$$\begin{aligned} \mathcal{R}^\pi_s &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{R}_s^a \\ \mathcal{P}^\pi_{ss'} &= \sum_{a \in \mathcal{A}}\pi(a \mid s)\, \mathcal{P}_{ss'}^a \end{aligned}$$
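As a quick illustration, below is a minimal sketch of this matrix-form backup on a hypothetical 2-state MDP; the arrays `R_pi` and `P_pi` (i.e. $\mathcal{R}^\pi$ and $\mathcal{P}^\pi$) and the discount are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical 2-state MDP under some fixed policy pi (numbers are made up).
gamma = 0.9
R_pi = np.array([1.0, 0.0])              # R^pi_s   = sum_a pi(a|s) R_s^a
P_pi = np.array([[0.5, 0.5],             # P^pi_ss' = sum_a pi(a|s) P^a_ss'
                 [0.0, 1.0]])

v = np.zeros(2)                          # v_0 = 0
for _ in range(1000):
    v_next = R_pi + gamma * P_pi @ v     # Bellman expectation backup
    if np.max(np.abs(v_next - v)) < 1e-10:
        v = v_next
        break
    v = v_next

print(v)  # fixed point of v = R^pi + gamma * P^pi v, i.e. v_pi (approx. [1.818, 0.0])
```

Just like the $x_{k+1} = f(x_k)$ analogy above, the loop simply iterates the backup until the value vector stops changing.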


Example

[Figure: 4×4 gridworld; the shaded top-left and bottom-right corners are the single terminal state, and the nonterminal states are numbered 1–14 row by row]

  • $\gamma = 1$
  • Nonterminal states 1, …, 14; one terminal state (shown twice, as the shaded squares)
  • Actions leading out of the grid leave the state unchanged, e.g. when $s = 4$ and $a = \text{west}$, the next state is $s' = 4$
  • Transitions are deterministic given the action, e.g. $\mathcal{P}_{62}^{\text{north}} = \mathbb{P}\left[s' = 2 \mid s = 6, a = \text{north}\right] = 1$
  • The reward is $-1$ on every step until the terminal state is reached
  • Uniform random policy: $\pi(n \mid \cdot) = \pi(e \mid \cdot) = \pi(w \mid \cdot) = \pi(s \mid \cdot) = 0.25$

Initialise the value function of every state to 0 and keep iterating:

[Figure: the value function $\operatorname{v}_k$ under the random policy after $k = 0, 1, 2, 3, \ldots$ sweeps]

The computation proceeds as follows:

For k=0:
$$\operatorname{v}_0(s) = 0 \qquad \forall s$$
For k = 1: e.g.
$$\begin{aligned} \operatorname{v}_1(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_0(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_0(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_0(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_0(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1+0)+0.25 \times (-1+0)+0.25 \times (-1+0) = -1.0 \end{aligned}$$
For k = 2: e.g.
$$\begin{aligned} \operatorname{v}_2(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_1(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_1(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1) = -1.75 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_2(s{=}8) ={}& \pi(a{=}n \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_1(s'{=}4) \right) \\ &+ \pi(a{=}w \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w}\operatorname{v}_1(s'{=}8) \right) \\ &+ \pi(a{=}s \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_1(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_1(s'{=}9) \right) \\ ={}& 0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1)+0.25 \times (-1-1) = -2 \end{aligned}$$
For k=3: e.g.
$$\begin{aligned} \operatorname{v}_3(s{=}4) ={}& \pi(a{=}n \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=n} + \mathcal{P}_{s=4,s'=\text{terminal}}^{a=n}\operatorname{v}_2(s'{=}\text{terminal}) \right) \\ &+ \pi(a{=}w \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=w} + \mathcal{P}_{s=4,s'=4}^{a=w}\operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}s \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=s} + \mathcal{P}_{s=4,s'=8}^{a=s}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}e \mid s{=}4)\left(\mathcal{R}_{s=4}^{a=e} + \mathcal{P}_{s=4,s'=5}^{a=e}\operatorname{v}_2(s'{=}5) \right) \\ ={}& 0.25 \times (-1+0)+0.25 \times (-1-1.75)+0.25 \times (-1-2)+0.25 \times (-1-2) = -2.4375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}8) ={}& \pi(a{=}n \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=n} + \mathcal{P}_{s=8,s'=4}^{a=n}\operatorname{v}_2(s'{=}4) \right) \\ &+ \pi(a{=}w \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=w} + \mathcal{P}_{s=8,s'=8}^{a=w}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}s \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=s} + \mathcal{P}_{s=8,s'=12}^{a=s}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}8)\left(\mathcal{R}_{s=8}^{a=e} + \mathcal{P}_{s=8,s'=9}^{a=e}\operatorname{v}_2(s'{=}9) \right) \\ ={}& 0.25 \times (-1-1.75)+0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2) = -2.9375 \end{aligned}$$
$$\begin{aligned} \operatorname{v}_3(s{=}12) ={}& \pi(a{=}n \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=n} + \mathcal{P}_{s=12,s'=8}^{a=n}\operatorname{v}_2(s'{=}8) \right) \\ &+ \pi(a{=}w \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=w} + \mathcal{P}_{s=12,s'=12}^{a=w}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}s \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=s} + \mathcal{P}_{s=12,s'=12}^{a=s}\operatorname{v}_2(s'{=}12) \right) \\ &+ \pi(a{=}e \mid s{=}12)\left(\mathcal{R}_{s=12}^{a=e} + \mathcal{P}_{s=12,s'=13}^{a=e}\operatorname{v}_2(s'{=}13) \right) \\ ={}& 0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2)+0.25 \times (-1-2) = -3.0 \end{aligned}$$
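These hand computations are easy to check with a few lines of code. Below is a small sketch of synchronous iterative policy evaluation on this gridworld; the 0–15 state indexing (with 0 and 15 as the terminal corners, so the text's state 4 is index 4, state 8 is index 8, and so on) is just a convention chosen for the sketch.

```python
import numpy as np

N = 4
TERMINAL = {0, 15}                              # the shaded corners (one terminal state, shown twice)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # north, south, west, east

def step(s, a):
    """Deterministic transition; actions leading off the grid leave the state unchanged."""
    r, c = divmod(s, N)
    r2, c2 = r + a[0], c + a[1]
    return s if not (0 <= r2 < N and 0 <= c2 < N) else r2 * N + c2

def backup(v, gamma=1.0):
    """One synchronous Bellman expectation backup under the uniform random policy."""
    v_next = np.zeros_like(v)
    for s in range(N * N):
        if s in TERMINAL:
            continue                             # the terminal state keeps value 0
        v_next[s] = sum(0.25 * (-1 + gamma * v[step(s, a)]) for a in ACTIONS)
    return v_next

v = np.zeros(N * N)                              # v_0(s) = 0 for all s
for k in range(1, 4):
    v = backup(v)
    print(f"k={k}: v(4)={v[4]:.4f}  v(8)={v[8]:.4f}  v(12)={v[12]:.4f}")
# k=1: v(4)=-1.0000 ...  k=2: v(4)=-1.7500 v(8)=-2.0000 ...  k=3: v(4)=-2.4375 v(8)=-2.9375 v(12)=-3.0000
```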

Policy Iteration

[Figure: policy iteration alternates policy evaluation ($\operatorname{v} \rightarrow \operatorname{v}_\pi$) and greedy policy improvement ($\pi \rightarrow \operatorname{greedy}(\operatorname{v}_\pi)$), converging to $\operatorname{v}_*$ and $\pi_*$]

Algorithm:

  1. Given a policy $\pi$

  2. Repeat until the policy no longer changes:

    1. Evaluate the policy $\pi$:
      $\operatorname{v}_\pi(s) = \mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+ \ldots \mid S_t = s \right]$

    2. Improve the policy by acting greedily with respect to $\operatorname{v}_\pi$:
      $\pi' = \operatorname{greedy}(\operatorname{v}_\pi)$
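For concreteness, here is a minimal sketch of this loop for a generic tabular MDP. The model layout (`P[s, a, s']` for transition probabilities, `R[s, a]` for expected rewards) and the stopping tolerances are conventions chosen for the sketch, not something specified in the lecture.

```python
import numpy as np

def policy_iteration(P, R, gamma=1.0, eval_sweeps=1000, tol=1e-8):
    """P[s, a, s']: transition probabilities; R[s, a]: expected immediate rewards."""
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial deterministic policy

    while True:
        # 1. Policy evaluation: iterate the Bellman expectation backup under pi.
        v = np.zeros(n_states)
        for _ in range(eval_sweeps):
            P_pi = P[np.arange(n_states), pi]     # rows of P for the actions pi chooses
            R_pi = R[np.arange(n_states), pi]
            v_next = R_pi + gamma * P_pi @ v
            if np.max(np.abs(v_next - v)) < tol:
                v = v_next
                break
            v = v_next

        # 2. Policy improvement: act greedily w.r.t. q_pi(s, a) = R_s^a + gamma * sum_s' P^a_ss' v(s').
        q = R + gamma * P @ v                     # shape (n_states, n_actions)
        pi_next = np.argmax(q, axis=1)
        if np.array_equal(pi_next, pi):           # improvement stopped -> pi is optimal
            return pi, v
        pi = pi_next
```

With $\gamma = 1$ this relies on every policy eventually reaching a terminal state (as in the gridworld above); otherwise use $\gamma < 1$.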

Policy improvement

Proof that acting greedily gives $\pi' \geq \pi$:

  1. Consider a deterministic policy, $a = \pi(s)$

  2. Improve the policy by acting greedily:
    $\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{arg\,max}}\ q_\pi(s,a)$

  3. This improves the value from any state $s$ over one step:
    $q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}\geq} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$

  4. It therefore improves the value function, $\operatorname{v}_{\pi'}(s) \geq \operatorname{v}_\pi(s)$:
    $$\begin{aligned} \operatorname{v}_\pi(s) &\leq q_\pi\left(s,\pi'(s) \right) = \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma \operatorname{v}_{\color{blue}\pi} \left(S_{t+1} \right) \mid S_t = s \right] \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma q_{\color{blue}\pi} \left(S_{t+1}, \pi'(S_{t+1}) \right) \mid S_t = s \right] \qquad \text{by step 3, which holds for all } s \\ &\leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \gamma^2 q_{\color{blue}\pi} \left(S_{t+2}, \pi'(S_{t+2}) \right) \mid S_t = s \right] \\ &\leq \dots \leq \mathbb{E}_{\color{red}\pi'} \left[R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \right] = \operatorname{v}_{\pi'}(s) \end{aligned}$$
    Why ${\color{red}\pi'}$ and ${\color{blue}\pi}$?

    Go back to the definition of the action-value function: $q_\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma \operatorname{v}_\pi(S_{t+1}) \mid S_t = s, A_t = a\right]$. In $q_\pi\left(s, \pi'(s)\right)$ the action taken at each step is the one chosen by ${\color{red}\pi'}$, so the expectation over actions (the weights of the weighted average) is taken under ${\color{red}\pi'}$, the policy actually being followed. The value of the successor state, however, is still measured by $\operatorname{v}_{\color{blue}\pi}$, which was computed under the old policy $\pi$ and has not yet been updated. Hence both ${\color{red}\pi'}$ and ${\color{blue}\pi}$ appear.

Proof that this process converges to $\pi^*$:

  1. If improvements stop,
    $q_\pi\left(s,\pi'(s) \right) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a) \ {\color{red}=} \ q_\pi \left(s, \pi(s) \right) = \operatorname{v}_\pi(s) \qquad \forall \, s$

  2. then the Bellman optimality equation is satisfied:
    $\operatorname{v}_\pi(s) = \underset{a \in \mathcal{A}}{\operatorname{max}}\ q_\pi(s,a)$

  3. Therefore $\operatorname{v}_\pi(s) = \operatorname{v}_*(s) \quad \forall s \in \mathcal{S}$,

  4. so $\pi$ is an optimal policy.
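To see these two results numerically, the sketch below reuses the gridworld conventions from the earlier sketch (my own 0–15 indexing): it evaluates the uniform random policy, improves it greedily once, and checks that the improved policy is at least as good in every state.

```python
import numpy as np

N, TERMINAL = 4, {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # north, south, west, east

def step(s, a):
    r, c = divmod(s, N)
    r2, c2 = r + a[0], c + a[1]
    return s if not (0 <= r2 < N and 0 <= c2 < N) else r2 * N + c2

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation; policy[s][i] is the probability of action i in state s."""
    v = np.zeros(N * N)
    for _ in range(sweeps):
        v = np.array([0.0 if s in TERMINAL else
                      sum(policy[s][i] * (-1 + v[step(s, a)]) for i, a in enumerate(ACTIONS))
                      for s in range(N * N)])
    return v

def greedy(v):
    """Greedy deterministic policy w.r.t. v, returned as one-hot action probabilities."""
    pi = []
    for s in range(N * N):
        q = [-1 + v[step(s, a)] for a in ACTIONS]     # q(s, a) by one-step lookahead
        best = int(np.argmax(q))
        pi.append([1.0 if i == best else 0.0 for i in range(len(ACTIONS))])
    return pi

random_pi = [[0.25] * 4 for _ in range(N * N)]
v_pi = evaluate(random_pi)                  # v_pi of the uniform random policy
pi_improved = greedy(v_pi)                  # one step of greedy policy improvement
v_improved = evaluate(pi_improved)
assert np.all(v_improved >= v_pi - 1e-9)    # policy improvement theorem: v_pi'(s) >= v_pi(s) for all s
```

Repeating the evaluate/greedy pair until the greedy policy stops changing is exactly policy iteration, and at that point the Bellman optimality equation above holds.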

Value Iteration

Principle of Optimality

A policy $\pi(a \mid s)$ achieves the optimal value from state $s$, i.e. $\operatorname{v}_\pi(s) = \operatorname{v}_*(s)$, if and only if

  • for any state $s'$ reachable from $s$,
  • $\pi$ achieves the optimal value from state $s'$, i.e. $\operatorname{v}_\pi(s') = \operatorname{v}_*(s')$

Deterministic Value Iteration

Compared with iterative policy evaluation above, the only real difference is how the update is done: a max over actions instead of an expectation under the policy.

  • If we know the solution to the subproblems $\operatorname{v}_*(s')$,

  • then the solution $\operatorname{v}_*(s)$ can be found by a one-step lookahead:
    $$\operatorname{v}_*(s) \leftarrow \underset{a \in \mathcal{A}}{\operatorname{max}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_*(s') \right)$$

  • The idea of value iteration is to apply these updates iteratively

  • Intuition: start with final rewards and work backwards

Iterative application of the Bellman optimality backup: $\operatorname{v}_1 \rightarrow \operatorname{v}_2 \rightarrow \ldots \rightarrow \operatorname{v}_*$

  • Using synchronous backups:
    • at each iteration $k+1$,
    • for all states $s \in \mathcal{S}$,
    • update $\operatorname{v}_{k+1}(s)$ from $\operatorname{v}_{k}(s')$
  • Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy

How to update:



$$\operatorname{v}_{{\color{red}k+1}}(s) = \underset{a \in \mathcal{A}}{\operatorname{max}} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a \operatorname{v}_{{\color{red}k}}(s') \right)$$
Matrix form:
$$\mathbf{v}_{k+1} = \underset{a \in \mathcal{A}}{\operatorname{max}}\left( \mathcal{R}^{\mathbf{a}} + \gamma \mathcal{P}^{\mathbf{a}} \mathbf{v}_k \right)$$
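A minimal sketch of this backup, reusing the same assumed `P[s, a, s']` / `R[s, a]` model layout as the policy-iteration sketch above:

```python
import numpy as np

def value_iteration(P, R, gamma=1.0, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function stops changing."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v                  # one-step lookahead q(s, a), shape (n_states, n_actions)
        v_next = q.max(axis=1)                 # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_next - v)) < tol:
            v = v_next
            break
        v = v_next
    pi = np.argmax(R + gamma * P @ v, axis=1)  # a greedy policy is extracted only once, at the end
    return v, pi
```

Note that, as the bullet above says, no policy is maintained during the iterations; a greedy policy is read off from the final value function.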


A live demo

GridWorld: Dynamic Programming Demo

Summary of DP Algorithms

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
  • These algorithms are based on the state-value function $\operatorname{v}_\pi(s)$ or $\operatorname{v}_*(s)$
  • With $m$ actions and $n$ states, each state's backup looks at $m$ actions and up to $n$ successor states, so the complexity is $O(m \cdot n \cdot n) = O(mn^2)$ per iteration