Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
Before solving this problem, we have to deduce the expression of $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,
\begin{aligned}
q_*(s,a) &= \mathbb E[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a')|S_t=s,A_t=a] \\
&= \sum_{s',r}\Bigl \{p(s',r|s,a) \bigl [ r + \gamma \max_{a'}q_*(s',a') \bigr ] \Bigr \} \\
&= \sum_{s', r} \bigl [ rp(s',r|s,a) \bigr ] + \sum_{s',r} \bigl [ p(s',r|s,a) \gamma \max_{a'}q_*(s',a') \bigr ] \\
&= \sum_r \bigl [ rp(r|s,a) \bigr ] + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \mathbb E(r|s,a) + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \sum_{s'} \bigl [ \mathbb E(r|s', s, a)p(s'|s,a) \bigr ] + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \sum_{s'} \Bigl \{ \bigl [ \mathbb E(r|s',s,a) + \gamma \max_{a'} q_*(s',a') \bigr ] p(s'|s,a) \Bigr \}
\end{aligned}
Denoting $\mathbb E(r|s',s,a) = R_{s,s'}^a$ and $p(s'|s,a) = P_{s,s'}^a$, we get the expression we wanted:
q_*(s,a)=\sum_{s'} \Bigl \{ \bigl [ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr ] P_{s,s'}^a \Bigr \} \tag{1}
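Equation (1) maps directly onto a one-line tabular backup. Below is a minimal sketch (my own illustration, not from the book), assuming the dynamics are stored as NumPy arrays `P[s, a, s']` and `R[s, a, s']` of shape `(n_states, n_actions, n_states)`:

```python
import numpy as np

def bellman_optimality_backup(q, P, R, gamma):
    """One application of equation (1) to every (s, a) pair.

    q: current estimate of q_*, shape (n_states, n_actions)
    P: transition probabilities P[s, a, s'], shape (n_states, n_actions, n_states)
    R: expected rewards R[s, a, s'], same shape as P
    """
    v = q.max(axis=1)                                    # max_{a'} q(s', a') for every s'
    return (P * (R + gamma * v[None, None, :])).sum(axis=2)
```

Iterating this backup from any initial `q` converges to $q_*$ when $\gamma < 1$.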
Next, we name the three states in circles $s_A$, $s_B$, $s_C$, and denote the left action as $a_l$ and the right action as $a_r$.
According to equation (1), we can write the Bellman optimality equation for $q_*$ at each of the three states.
\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l)&=\Bigl \{R_{s_A, s_B}^{a_l}+\gamma \max_{a'} \bigl [ q_*(s_B, a')\bigr ] \Bigr \} P_{s_A, s_B}^{a_l} + \Bigl \{R_{s_A, s_C}^{a_l}+\gamma \max_{a'} \bigl [ q_*(s_C, a') \bigr ] \Bigr \} P_{s_A, s_C}^{a_l}\\
&= \bigl [ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_l} + \bigl [ R_{s_A, s_C}^{a_l} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&=\Bigl \{R_{s_A, s_B}^{a_r}+\gamma \max_{a'} \bigl [ q_*(s_B, a') \bigr ] \Bigr \} P_{s_A, s_B}^{a_r} + \Bigl \{R_{s_A, s_C}^{a_r}+\gamma \max_{a'} \bigl [ q_*(s_C, a')\bigr ] \Bigr \} P_{s_A, s_C}^{a_r}\\
&= \bigl [ R_{s_A, s_B}^{a_r} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_r} + \bigl [ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_r} \\
q_*(s_B, a)&=\Bigl \{R_{s_B, s_A}^{a}+\gamma \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} P_{s_B, s_A}^{a} \\
q_*(s_C, a)&=\Bigl \{R_{s_C, s_A}^{a}+\gamma \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} P_{s_C, s_A}^{a} \\
\end{aligned}
\because P_{s_A, s_B}^{a_r} = 0, P_{s_A, s_C}^{a_l} = 0\\
\begin{aligned}
\therefore q_{*, \pi_{left}}(s_A, a_l)&=\bigl [ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&= \bigl [ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_r} \\
\end{aligned}
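Since every transition here is deterministic ($P = 1$ along each arrow), these expressions can be consolidated. Writing $v_* = \max \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]$ and using the rewards from the figure ($+1$ for left, $0$ for right, $0$ from $s_B$ back to the top state, $+2$ from $s_C$ back to the top state), we get:

\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l) &= 1 + \gamma q_*(s_B, a) = 1 + \gamma^2 v_* \\
q_{*, \pi_{right}}(s_A, a_r) &= 0 + \gamma q_*(s_C, a) = \gamma (2 + \gamma v_*) \\
q_{*, \pi_{right}}(s_A, a_r) - q_{*, \pi_{left}}(s_A, a_l) &= 2\gamma - 1
\end{aligned}

So we should expect $\pi_{left}$ to be better when $\gamma < 0.5$, the two policies to tie when $\gamma = 0.5$, and $\pi_{right}$ to be better when $\gamma > 0.5$; the case-by-case analysis below confirms this.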
Now, let's discuss the cases for different values of $\gamma$.
For $\gamma = 0$:
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0 \cdot q_*(s_B, a) \bigr ] \cdot 1 = 1\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0 \cdot q_*(s_C,a) \bigr ] \cdot 1 = 0
\end{aligned}
So, $\pi_{left}$ is the optimal policy when $\gamma = 0$.
For $\gamma = 0.5$:
\begin{aligned}
q_*(s_B, a)&=\Bigl \{0+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot 1 \\
&=0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_*(s_C, a)&=\Bigl \{2+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot1 \\
&= 2+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0.5 \cdot q_*(s_B, a) \bigr ] \cdot 1 \\
&= 1 + 0.5 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0.5 \cdot q_*(s_C,a) \bigr ] \cdot 1 \\
&= 0.5 \cdot q_*(s_C,a)
\end{aligned}
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {4}{3}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl [ 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {4}{3}
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$, so the assumption holds (with equality) and both actions have the same optimal value.
Likewise, if we assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$, then we have:
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl [ 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {4}{3}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {4}{3}\\
\end{aligned}
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$ again, so both cases agree. Both $\pi_{left}$ and $\pi_{right}$ are optimal policies for $\gamma = 0.5$.
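As a quick sanity check, plugging $\frac{4}{3}$ back into both expressions confirms it is the common fixed point:

\begin{aligned}
1 + 0.5 \cdot 0.5 \cdot \tfrac{4}{3} &= 1 + \tfrac{1}{3} = \tfrac{4}{3}\\
0.5 \cdot \bigl ( 2 + 0.5 \cdot \tfrac{4}{3} \bigr ) &= 0.5 \cdot \tfrac{8}{3} = \tfrac{4}{3}
\end{aligned}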
For $\gamma = 0.9$:
\begin{aligned}
q_*(s_B, a)&=\Bigl \{0+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot 1 \\
&=0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_*(s_C, a)&=\Bigl \{2+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot1 \\
&= 2+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0.9 \cdot q_*(s_B, a) \bigr ] \cdot 1 \\
&= 1 + 0.9 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0.9 \cdot q_*(s_C,a) \bigr ] \cdot 1 \\
&= 0.9 \cdot q_*(s_C,a)
\end{aligned}
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {100}{19} = \frac {500}{95}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl [ 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {576}{95}
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, which contradicts the assumption, so the assumption fails.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl [ 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {180}{19}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {1648}{190}\\
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, consistent with the assumption. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
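As a final sanity check, the three cases can be verified numerically by iterating the Bellman optimality backups for this specific MDP. The sketch below is my own verification code (the function and variable names are assumptions); the rewards follow the figure: $+1$ for left, $0$ for right, $0$ from $s_B$ back to the top state, and $+2$ from $s_C$ back to the top state.

```python
def optimal_action_values(gamma, sweeps=1000):
    """Iterate the Bellman optimality backups for q_*(s_A, left) and q_*(s_A, right)."""
    q_left = q_right = q_B = q_C = 0.0      # initial guesses
    for _ in range(sweeps):
        v_A = max(q_left, q_right)          # v_*(s_A) = max_a q_*(s_A, a)
        q_left  = 1 + gamma * q_B           # A --left-->  B, reward +1
        q_right = 0 + gamma * q_C           # A --right--> C, reward  0
        q_B     = 0 + gamma * v_A           # B ---------> A, reward  0
        q_C     = 2 + gamma * v_A           # C ---------> A, reward +2
    return q_left, q_right

for gamma in (0.0, 0.5, 0.9):
    ql, qr = optimal_action_values(gamma)
    print(f"gamma={gamma}: q_left={ql:.4f}, q_right={qr:.4f}")
# Expected: gamma=0.0 -> 1.0000 vs 0.0000 (left wins)
#           gamma=0.5 -> 1.3333 vs 1.3333 (tie)
#           gamma=0.9 -> 8.6737 vs 9.4737 (right wins)
```

The printed values match the hand-derived results: $\pi_{left}$ wins for $\gamma = 0$, the two actions tie at $\frac{4}{3}$ for $\gamma = 0.5$, and $\pi_{right}$ wins for $\gamma = 0.9$.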