Exercise 3.22 Consider the continuing MDP shown to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, $\pi_{left}$ and $\pi_{right}$. What policy is optimal if $\gamma = 0$? If $\gamma = 0.9$? If $\gamma = 0.5$?
Before solving this problem, we have to deduce the expression of $q_*(s,a)$ in terms of $R_{s,s'}^a$ and $P_{s,s'}^a$.
First,
\begin{aligned}
q_*(s,a) &= \mathbb E[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a')|S_t=s,A_t=a] \\
&= \sum_{s',r}\Bigl \{p(s',r|s,a) \bigl [ r + \gamma \max_{a'}q_*(s',a') \bigr ] \Bigr \} \\
&= \sum_{s', r} \bigl [ rp(s',r|s,a) \bigr ] + \sum_{s',r} \bigl [ p(s',r|s,a) \gamma \max_{a'}q_*(s',a') \bigr ] \\
&= \sum_r \bigl [ rp(r|s,a) \bigr ] + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \mathbb E(r|s,a) + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \sum_{s'} \bigl [ \mathbb E(r|s', s, a)p(s'|s,a) \bigr ] + \sum_{s'} \bigl [ p(s'|s,a) \gamma \max_{a'} q_*(s', a') \bigr ] \\
&= \sum_{s'} \Bigl \{ \bigl [ \mathbb E(r|s',s,a) + \gamma \max_{a'} q_*(s',a') \bigr ] p(s'|s,a) \Bigr \}
\end{aligned}
Denoting $\mathbb E(r|s',s,a) = R_{s,s'}^a$ and $p(s'|s,a) = P_{s,s'}^a$, we get the expression we wanted:
q_*(s,a)=\sum_{s'} \Bigl \{ \bigl [ R_{s,s'}^a + \gamma \max_{a'} q_*(s',a') \bigr ] P_{s,s'}^a \Bigr \} \tag{1}
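Equation (1) maps directly onto a one-line tabular backup. Below is a minimal sketch (my own illustration, not from the book), assuming the dynamics are stored as NumPy arrays `P[s, a, s']` and `R[s, a, s']` of shape `(n_states, n_actions, n_states)`:

```python
import numpy as np

def bellman_optimality_backup(q, P, R, gamma):
    """One application of equation (1) to every (s, a) pair.

    q: current estimate of q_*, shape (n_states, n_actions)
    P: transition probabilities P[s, a, s'], shape (n_states, n_actions, n_states)
    R: expected rewards R[s, a, s'], same shape as P
    """
    v = q.max(axis=1)                                    # max_{a'} q(s', a') for every s'
    return (P * (R + gamma * v[None, None, :])).sum(axis=2)
```

Iterating this backup from any initial `q` converges to $q_*$ when $\gamma < 1$.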
Next, we name the three states in circles $s_A$, $s_B$, $s_C$, and denote the left action as $a_l$ and the right action as $a_r$.
According to equation (1), we can write the Bellman optimality equation for $q_*$ at each of the three states.
\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l)&=\Bigl \{R_{s_A, s_B}^{a_l}+\gamma \max_{a'} \bigl [ q_*(s_B, a')\bigr ] \Bigr \} P_{s_A, s_B}^{a_l} + \Bigl \{R_{s_A, s_C}^{a_l}+\gamma \max_{a'} \bigl [ q_*(s_C, a') \bigr ] \Bigr \} P_{s_A, s_C}^{a_l}\\
&= \bigl [ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_l} + \bigl [ R_{s_A, s_C}^{a_l} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&=\Bigl \{R_{s_A, s_B}^{a_r}+\gamma \max_{a'} \bigl [ q_*(s_B, a') \bigr ] \Bigr \} P_{s_A, s_B}^{a_r} + \Bigl \{R_{s_A, s_C}^{a_r}+\gamma \max_{a'} \bigl [ q_*(s_C, a')\bigr ] \Bigr \} P_{s_A, s_C}^{a_r}\\
&= \bigl [ R_{s_A, s_B}^{a_r} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_r} + \bigl [ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_r} \\
q_*(s_B, a)&=\Bigl \{R_{s_B, s_A}^{a}+\gamma \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} P_{s_B, s_A}^{a} \\
q_*(s_C, a)&=\Bigl \{R_{s_C, s_A}^{a}+\gamma \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} P_{s_C, s_A}^{a} \\
\end{aligned}
\because P_{s_A, s_B}^{a_r} = 0, P_{s_A, s_C}^{a_l} = 0\\
\begin{aligned}
\therefore q_{*, \pi_{left}}(s_A, a_l)&=\bigl [ R_{s_A, s_B}^{a_l} + \gamma q_*(s_B, a) \bigr ] P_{s_A, s_B}^{a_l} \\
q_{*, \pi_{right}}(s_A, a_r)&= \bigl [ R_{s_A, s_C}^{a_r} + \gamma q_*(s_C, a) \bigr ] P_{s_A, s_C}^{a_r} \\
\end{aligned}
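Since every transition here is deterministic ($P = 1$ along each arrow), these expressions can be consolidated. Writing $v_* = \max \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]$ and using the rewards from the figure ($+1$ for left, $0$ for right, $0$ from $s_B$ back to the top state, $+2$ from $s_C$ back to the top state), we get:

\begin{aligned}
q_{*, \pi_{left}}(s_A, a_l) &= 1 + \gamma q_*(s_B, a) = 1 + \gamma^2 v_* \\
q_{*, \pi_{right}}(s_A, a_r) &= 0 + \gamma q_*(s_C, a) = \gamma (2 + \gamma v_*) \\
q_{*, \pi_{right}}(s_A, a_r) - q_{*, \pi_{left}}(s_A, a_l) &= 2\gamma - 1
\end{aligned}

So we should expect $\pi_{left}$ to be better when $\gamma < 0.5$, the two policies to tie when $\gamma = 0.5$, and $\pi_{right}$ to be better when $\gamma > 0.5$; the case-by-case analysis below confirms this.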
Now, let's discuss the cases for different values of $\gamma$.
For $\gamma = 0$:
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0 \cdot q_*(s_B, a) \bigr ] \cdot 1 = 1\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0 \cdot q_*(s_C,a) \bigr ] \cdot 1 = 0
\end{aligned}
So, $\pi_{left}$ is the optimal policy when $\gamma = 0$.
For $\gamma = 0.5$:
\begin{aligned}
q_*(s_B, a)&=\Bigl \{0+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot 1 \\
&=0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_*(s_C, a)&=\Bigl \{2+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot1 \\
&= 2+0.5 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0.5 \cdot q_*(s_B, a) \bigr ] \cdot 1 \\
&= 1 + 0.5 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0.5 \cdot q_*(s_C,a) \bigr ] \cdot 1 \\
&= 0.5 \cdot q_*(s_C,a)
\end{aligned}
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {4}{3}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl [ 2+0.5 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {4}{3}
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r) = \frac{4}{3}$, so the assumption holds (with equality) and both actions have the same optimal value.
Likewise, if we assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$, then we have:
\begin{aligned}
q_*(s_B, a) &= 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.5 \cdot \bigl [ 2+0.5 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {4}{3}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.5 \cdot 0.5 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {4}{3}\\
\end{aligned}
Here $q_{*,\pi_{left}}(s_A, a_l) = q_{*,\pi_{right}}(s_A, a_r)$ again, so both cases agree. Both $\pi_{left}$ and $\pi_{right}$ are optimal policies for $\gamma = 0.5$.
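As a quick sanity check, plugging $\frac{4}{3}$ back into both expressions confirms it is the common fixed point:

\begin{aligned}
1 + 0.5 \cdot 0.5 \cdot \tfrac{4}{3} &= 1 + \tfrac{1}{3} = \tfrac{4}{3}\\
0.5 \cdot \bigl ( 2 + 0.5 \cdot \tfrac{4}{3} \bigr ) &= 0.5 \cdot \tfrac{8}{3} = \tfrac{4}{3}
\end{aligned}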
For $\gamma = 0.9$:
\begin{aligned}
q_*(s_B, a)&=\Bigl \{0+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot 1 \\
&=0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_*(s_C, a)&=\Bigl \{2+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ] \Bigr \} \cdot1 \\
&= 2+0.9 \max_{a'} \bigl [ q_{*, \pi_{left}}(s_A, a_l), q_{*, \pi_{right}}(s_A, a_r) \bigr ]\\
q_{*,\pi_{left}}(s_A, a_l) &= \bigl [ 1 + 0.9 \cdot q_*(s_B, a) \bigr ] \cdot 1 \\
&= 1 + 0.9 \cdot q_*(s_B, a)\\
q_{*,\pi_{right}}(s_A, a_r) &= \bigl [ 0 + 0.9 \cdot q_*(s_C,a) \bigr ] \cdot 1 \\
&= 0.9 \cdot q_*(s_C,a)
\end{aligned}
Assume $q_{*,\pi_{left}}(s_A, a_l) \geq q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{left}}(s_A, a_l)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {100}{19} = \frac {500}{95}\\
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl [ 2+0.9 \cdot q_{*,\pi_{left}}(s_A, a_l) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {576}{95}
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, which contradicts the assumption, so the assumption fails.
Assume $q_{*,\pi_{left}}(s_A, a_l) \le q_{*,\pi_{right}}(s_A, a_r)$; then we have:
\begin{aligned}
q_*(s_B, a) &= 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_*(s_C, a) &= 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)
\end{aligned}
therefore,
\begin{aligned}
q_{*,\pi_{right}}(s_A, a_r) &= 0.9 \cdot \bigl [ 2+0.9 \cdot q_{*,\pi_{right}}(s_A, a_r) \bigr ] \\
q_{*,\pi_{right}}(s_A, a_r) &= \frac {180}{19}\\
q_{*,\pi_{left}}(s_A, a_l) &= 1 + 0.9 \cdot 0.9 \cdot q_{*,\pi_{right}}(s_A, a_r)\\
q_{*,\pi_{left}}(s_A, a_l) &= \frac {1648}{190}\\
\end{aligned}
Here, $q_{*,\pi_{left}}(s_A, a_l) \lt q_{*,\pi_{right}}(s_A, a_r)$, consistent with the assumption. So $\pi_{right}$ is the optimal policy for $\gamma = 0.9$.
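As a final sanity check, the three cases can be verified numerically by iterating the Bellman optimality backups for this specific MDP. The sketch below is my own verification code (the function and variable names are assumptions); the rewards follow the figure: $+1$ for left, $0$ for right, $0$ from $s_B$ back to the top state, and $+2$ from $s_C$ back to the top state.

```python
def optimal_action_values(gamma, sweeps=1000):
    """Iterate the Bellman optimality backups for q_*(s_A, left) and q_*(s_A, right)."""
    q_left = q_right = q_B = q_C = 0.0      # initial guesses
    for _ in range(sweeps):
        v_A = max(q_left, q_right)          # v_*(s_A) = max_a q_*(s_A, a)
        q_left  = 1 + gamma * q_B           # A --left-->  B, reward +1
        q_right = 0 + gamma * q_C           # A --right--> C, reward  0
        q_B     = 0 + gamma * v_A           # B ---------> A, reward  0
        q_C     = 2 + gamma * v_A           # C ---------> A, reward +2
    return q_left, q_right

for gamma in (0.0, 0.5, 0.9):
    ql, qr = optimal_action_values(gamma)
    print(f"gamma={gamma}: q_left={ql:.4f}, q_right={qr:.4f}")
# Expected: gamma=0.0 -> 1.0000 vs 0.0000 (left wins)
#           gamma=0.5 -> 1.3333 vs 1.3333 (tie)
#           gamma=0.9 -> 8.6737 vs 9.4737 (right wins)
```

The printed values match the hand-derived results: $\pi_{left}$ wins for $\gamma = 0$, the two actions tie at $\frac{4}{3}$ for $\gamma = 0.5$, and $\pi_{right}$ wins for $\gamma = 0.9$.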