Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?

Before solving this problem, we need to derive an expression for $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,
$$
\begin{aligned}
q_*(s,a) &= \mathbb E\bigl[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t=s, A_t=a\bigr] \\
&= \sum_{s',r}\Bigl\{p(s',r|s,a) \bigl[ r + \gamma \max_{a'}q_*(s',a') \bigr] \Bigr\} \\
&= \sum_{s', r} \bigl[ rp(s',r|s,a) \bigr] + \sum_{s',r} \bigl[ p(s',r|s,a) \gamma \max_{a'}q_*(s',a') \bigr] \\
&= \sum_r \bigl[ rp(r|s,a) \bigr] + \sum_{s'} \bigl[ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr] \\
&= \mathbb E(r|s,a) + \sum_{s'} \bigl[ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr] \\
&= \sum_{s'} \bigl[ \mathbb E(r|s', s, a)\,p(s'|s,a) \bigr] + \sum_{s'} \bigl[ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr] \\
&= \sum_{s'} \Bigl\{ \bigl[ \mathbb E(r|s',s,a) + \gamma \max_{a'} q_*(s',a') \bigr] p(s'|s,a) \Bigr\}
\end{aligned}
$$
Denoting $\mathbb E(r|s',s,a) = R_{s,s'}^a$ and $p(s'|s,a)=P_{s,s'}^a$, we get the expression we wanted:
$$
q_*(s,a)=\sum_{s'} \Bigl\{ \bigl[ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr] P_{s,s'}^a \Bigr\} \tag{1}
$$
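Equation (1) can be read as an operator on $q$: one sweep replaces every $q(s,a)$ by its right-hand side. Below is a minimal sketch of that backup for a tabular MDP; the array conventions `P[s, a, s2]` (transition probabilities) and `R[s, a, s2]` (expected rewards) are assumptions for illustration, not part of the exercise.

```python
import numpy as np

def q_backup(P, R, q, gamma):
    """One sweep of equation (1) over a tabular q of shape (S, A):
    q[s, a] <- sum_{s2} P[s, a, s2] * (R[s, a, s2] + gamma * max_{a2} q[s2, a2])."""
    v = q.max(axis=1)                                  # v*(s2) = max_{a2} q*(s2, a2)
    return (P * (R + gamma * v[None, None, :])).sum(axis=2)
```

Repeating this backup from any initial $q$ converges to $q_*$ when $\gamma < 1$, since the operator is a contraction.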
Next, we name the three states in the circles $s_A$ (the top state), $s_B$ (the left state), and $s_C$ (the right state). We denote the left action by $a_l$, the right action by $a_r$, and the single action available in $s_B$ and $s_C$ simply by $a$.

According to equation (1), we can write the Bellman optimality equations for $q_*$ at the three states.
$$
\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l)&=\Bigl\{R_{s_A, s_B}^{a_l}+\gamma \max_{a'} \bigl[ q_*(s_B, a')\bigr] \Bigr\} P_{s_A, s_B}^{a_l} + \Bigl\{R_{s_A, s_C}^{a_l}+\gamma \max_{a'} \bigl[ q_*(s_C, a') \bigr] \Bigr\} P_{s_A, s_C}^{a_l}\\
&= \bigl[ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_l} + \bigl[ R_{s_A, s_C}^{a_l} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&=\Bigl\{R_{s_A, s_B}^{a_r}+\gamma \max_{a'} \bigl[ q_*(s_B, a') \bigr] \Bigr\} P_{s_A, s_B}^{a_r} + \Bigl\{R_{s_A, s_C}^{a_r}+\gamma \max_{a'} \bigl[ q_*(s_C, a')\bigr] \Bigr\} P_{s_A, s_C}^{a_r}\\
&= \bigl[ R_{s_A, s_B}^{a_r} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_r} + \bigl[ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_r} \\
q_*(s_B, a)&=\Bigl\{R_{s_B, s_A}^{a}+\gamma \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} P_{s_B, s_A}^{a} \\
q_*(s_C, a)&=\Bigl\{R_{s_C, s_A}^{a}+\gamma \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} P_{s_C, s_A}^{a}
\end{aligned}
$$

Since $P_{s_A, s_B}^{a_r} = 0$ and $P_{s_A, s_C}^{a_l} = 0$, the first two equations reduce to

$$
\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l)&=\bigl[ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr] P_{s_A, s_B}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&= \bigl[ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr] P_{s_A, s_C}^{a_r}
\end{aligned}
$$
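These coupled equations can also be checked numerically. The following sketch (a verification aid, not part of the solution itself) uses the state and action names defined above and iterates the four update rules to a fixed point:

```python
# Minimal numerical check of the Bellman optimality equations above.
# States: s_A (top), s_B (left), s_C (right); gamma is the discount factor.

def solve_q(gamma, iters=1000):
    """Iterate the four Bellman optimality equations until they converge."""
    q_left = q_right = q_B = q_C = 0.0
    for _ in range(iters):
        v_A = max(q_left, q_right)       # optimal value of the top state s_A
        q_B = 0.0 + gamma * v_A          # only action in s_B: reward 0, back to s_A
        q_C = 2.0 + gamma * v_A          # only action in s_C: reward 2, back to s_A
        q_left = 1.0 + gamma * q_B       # left from s_A: reward 1, go to s_B
        q_right = 0.0 + gamma * q_C      # right from s_A: reward 0, go to s_C
    return q_left, q_right

for gamma in (0.0, 0.5, 0.9):
    ql, qr = solve_q(gamma)
    print(f"gamma={gamma}: q*(s_A, left)={ql:.4f}, q*(s_A, right)={qr:.4f}")
```

Its output for $\gamma \in \{0, 0.5, 0.9\}$ can be compared against the case-by-case analysis that follows.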
Now, let's discuss the cases for different values of $\gamma$.
For $\gamma = 0$:
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0 \cdot q_*(s_C,a) \bigr] \cdot 1 = 0
\end{aligned}
$$
So $\pi_{left}$ is the optimal policy when $\gamma = 0$.
For $\gamma = 0.5$:
$$
\begin{aligned}
q_*(s_B, a)&=\Bigl\{0+0.5 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&=0.5 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr]\\
q_*(s_C, a)&=\Bigl\{2+0.5 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 2+0.5 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.5 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.5 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.5 \cdot q_*(s_C,a) \bigr] \cdot 1 = 0.5 \cdot q_*(s_C,a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$, so the assumption holds (with equality). For completeness, check the other direction as well.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl[ 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{4}{3}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{4}{3}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$, so this assumption also holds. Both $\pi_{left}$ and $\pi_{right}$ are therefore optimal policies for $\gamma = 0.5$.
For $\gamma = 0.9$:
$$
\begin{aligned}
q_*(s_B, a)&=\Bigl\{0+0.9 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&=0.9 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr]\\
q_*(s_C, a)&=\Bigl\{2+0.9 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr] \Bigr\} \cdot 1 \\
&= 2+0.9 \max \bigl[ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl[ 1 + 0.9 \cdot q_*(s_B, a) \bigr] \cdot 1 = 1 + 0.9 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl[ 0 + 0.9 \cdot q_*(s_C,a) \bigr] \cdot 1 = 0.9 \cdot q_*(s_C,a)
\end{aligned}
$$
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{100}{19} = \frac{500}{95}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{576}{95}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, which contradicts the assumption, so this case fails.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
$$
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
$$
therefore,
$$
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl[ 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac{180}{19}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac{1648}{190}
\end{aligned}
$$
Here $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, which is consistent with the assumption. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
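As a final cross-check, each deterministic policy can also be evaluated directly. Starting from $s_A$, $\pi_{left}$ produces the reward sequence $1, 0, 1, 0, \dots$ and $\pi_{right}$ produces $0, 2, 0, 2, \dots$, so

$$
v_{\pi_{left}}(s_A) = \frac{1}{1-\gamma^2}, \qquad
v_{\pi_{right}}(s_A) = \frac{2\gamma}{1-\gamma^2}.
$$

Hence $\pi_{right}$ is strictly better exactly when $2\gamma > 1$: $\pi_{left}$ is optimal at $\gamma = 0$, the two policies tie at $\gamma = 0.5$, and $\pi_{right}$ is optimal at $\gamma = 0.9$, in agreement with the results above.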