Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_\pi(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_\pi(15)$ for the equiprobable random policy in this case?
Under the first assumption, where the transitions from the original states are unchanged, equation (4.4) can be expanded as follows:
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \bigl [ r + \gamma v_\pi(s')\bigr ] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl \{ \sum_r \Bigl [ r \cdot p(s',r \mid s, a) \Bigr ] + \sum_r \Bigl [ p(s', r \mid s,a ) \cdot \gamma v_\pi(s') \Bigr ]\biggr \} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl \{ \sum_r \Bigl [ r \cdot p(r \mid s', s, a) \cdot p(s' \mid s,a) \Bigr ] + p(s' \mid s,a ) \cdot \gamma v_\pi(s') \biggr \} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl \{ p(s' \mid s,a) \Bigl [ \sum_r r \cdot p(r \mid s', s, a) + \gamma v_\pi(s') \Bigr ]\biggr \} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl \{ P_{s,s'}^a \Bigl [ R_{s,s'}^a + \gamma v_\pi(s') \Bigr ] \biggr \}
\end{aligned}
Here $P_{s,s'}^a = p(s' \mid s,a)$ is the probability of moving from $s$ to $s'$ under action $a$, and $R_{s,s'}^a = \sum_r r \cdot p(r \mid s', s, a)$ is the expected reward for that transition. Applying this to state 15, where each action leads to exactly one successor state:
\begin{aligned}
v_\pi(15) &= \pi(left \mid 15) \, P_{15,12}^{left} \Bigl[ R_{15,12}^{left} + \gamma v_\pi(12) \Bigr ] + \pi(up \mid 15) \, P_{15,13}^{up} \Bigl[ R_{15,13}^{up} + \gamma v_\pi(13) \Bigr ] \\
& \quad + \pi(right \mid 15) \, P_{15,14}^{right} \Bigl[ R_{15,14}^{right} + \gamma v_\pi(14) \Bigr ] + \pi(down \mid 15) \, P_{15,15}^{down} \Bigl[ R_{15,15}^{down} + \gamma v_\pi(15) \Bigr ]
\end{aligned}
Because the agent follows the equiprobable random policy, $\pi(a \mid s) = 1/4$ for every action. The transitions are deterministic, so:
P_{s,s'}^a =
\begin{cases}
1 & \text{ if $a$ leads to $s'$} \\
0 & \text{if $a$ doesn't lead to $s'$}
\end{cases}
According to Figure 4.2, $v_\pi(12) = -22$, $v_\pi(13) = -20$, and $v_\pi(14) = -14$, and every transition has reward $-1$, so:
\begin{aligned}
v_\pi(15) &= \frac {1}{4} \biggl \{ 1 \cdot \Bigl [ -1 + \gamma (-22) \Bigr ] + 1 \cdot \Bigl [ -1 + \gamma (-20) \Bigr ] \\
& \quad + 1 \cdot \Bigl [ -1 + \gamma (-14) \Bigr ] + 1 \cdot \Bigl [ -1 + \gamma v_\pi(15) \Bigr ]\biggr \} \\
&= -1 - 14 \gamma + \frac{1}{4} \gamma v_\pi(15)
\end{aligned}
Collecting the $v_\pi(15)$ terms gives $(1 - \frac{1}{4}\gamma) v_\pi(15) = -1 - 14 \gamma$, and therefore:
\therefore v_\pi(15) = \frac {4 + 56 \gamma} {\gamma - 4}
Example 4.1 is an undiscounted task, so with $\gamma = 1$ this gives $v_\pi(15) = -20$.
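The result for the first case can be checked numerically. The following is a minimal sketch, not part of the original solution: it runs iterative policy evaluation on the 4x4 gridworld of Example 4.1 with the extra state 15 attached below state 13. It assumes $\gamma = 1$, a reward of $-1$ on every transition, and off-grid moves that leave the state unchanged. The names `build_transitions`, `evaluate`, and `TERMINAL` are hypothetical helpers introduced only for this illustration.

```python
# Sketch: iterative policy evaluation for the equiprobable random policy
# on the gridworld of Example 4.1, extended with the new state 15.
# Assumptions: gamma = 1, reward -1 per step, deterministic moves.

TERMINAL = "T"  # the two shaded corner cells form one terminal state


def build_transitions():
    """Return {state: [successor for left, up, right, down]}."""

    def cell_to_state(cell):
        # Grid cells 0 and 15 are the terminal corners.
        return TERMINAL if cell in (0, 15) else cell

    transitions = {}
    for cell in range(1, 15):  # original nonterminal states 1..14
        row, col = divmod(cell, 4)
        moves = [(row, col - 1), (row - 1, col), (row, col + 1), (row + 1, col)]
        successors = []
        for r, c in moves:
            if 0 <= r < 4 and 0 <= c < 4:
                successors.append(cell_to_state(4 * r + c))
            else:
                successors.append(cell)  # off-grid move: stay in place
        transitions[cell] = successors
    # New state 15: left, up, right, down lead to 12, 13, 14, 15.
    transitions[15] = [12, 13, 14, 15]
    return transitions


def evaluate(transitions, gamma=1.0, tol=1e-10):
    """In-place sweeps of the Bellman expectation update until convergence."""
    values = {s: 0.0 for s in transitions}
    values[TERMINAL] = 0.0
    while True:
        delta = 0.0
        for s, successors in transitions.items():
            new_v = sum(0.25 * (-1.0 + gamma * values[n]) for n in successors)
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


if __name__ == "__main__":
    v = evaluate(build_transitions())
    print(round(v[15], 4))  # close to -20
```

Because no original state transitions into state 15 in this case, the values of states 1 to 14 are the same as in the unmodified gridworld, which is why the values read from the figure can be plugged in directly.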
Under the second assumption, where action down from state 13 now leads to state 15, $v_\pi(13)$ depends on $v_\pi(15)$ as well, so the equations for both states are needed:
\begin{aligned}
v_\pi(13) &= \pi(left \mid 13) \, P_{13,12}^{left} \Bigl[ R_{13,12}^{left} + \gamma v_\pi(12) \Bigr ] + \pi(up \mid 13) \, P_{13,9}^{up} \Bigl[ R_{13,9}^{up} + \gamma v_\pi(9) \Bigr ] \\
& \quad + \pi(right \mid 13) \, P_{13,14}^{right} \Bigl[ R_{13,14}^{right} + \gamma v_\pi(14) \Bigr ] + \pi(down \mid 13) \, P_{13,15}^{down} \Bigl[ R_{13,15}^{down} + \gamma v_\pi(15) \Bigr ] \\
v_\pi(15) &= \pi(left \mid 15) \, P_{15,12}^{left} \Bigl[ R_{15,12}^{left} + \gamma v_\pi(12) \Bigr ] + \pi(up \mid 15) \, P_{15,13}^{up} \Bigl[ R_{15,13}^{up} + \gamma v_\pi(13) \Bigr ] \\
& \quad + \pi(right \mid 15) \, P_{15,14}^{right} \Bigl[ R_{15,14}^{right} + \gamma v_\pi(14) \Bigr ] + \pi(down \mid 15) \, P_{15,15}^{down} \Bigl[ R_{15,15}^{down} + \gamma v_\pi(15) \Bigr ]
\end{aligned}
Substituting $\pi(a \mid s) = 1/4$, the deterministic transitions, and the values from Figure 4.2:
\begin{aligned}
v_\pi(13) &= \frac{1}{4} \cdot \biggl \{ 1 \Bigl[ -1 + \gamma(-22) \Bigr ] + 1 \Bigl[ -1 + \gamma (-20) \Bigr ] \\
& \quad + 1 \Bigl[ -1 + \gamma (-14) \Bigr ] + 1 \Bigl[ -1 + \gamma v_\pi(15) \Bigr ]\biggr \} \\
&= -1 - 14 \gamma + \frac {1}{4} \gamma v_\pi(15) \qquad (1) \\
v_\pi(15) &= \frac{1}{4} \cdot \biggl \{ 1 \Bigl[ -1 + \gamma (-22) \Bigr ] + 1 \Bigl[ -1 + \gamma v_\pi(13) \Bigr ] \\
& \quad + 1 \Bigl[ -1 + \gamma (-14) \Bigr ] + 1 \Bigl[ -1 + \gamma v_\pi(15) \Bigr ]\biggr \} \\
&= -1 - 9 \gamma + \frac{1}{4} \gamma v_\pi(13) + \frac{1}{4}\gamma v_\pi(15) \qquad (2)
\end{aligned}
Moving the unknowns $v_\pi(13)$ and $v_\pi(15)$ to the left-hand side gives the linear system:
\begin{aligned}
v_\pi(13) - \frac {1}{4} \gamma v_\pi(15)&= -1 - 14 \gamma \qquad \qquad \qquad \qquad{(3)}\\
-\frac{1}{4} \gamma v_\pi(13) +(1-\frac{1}{4}\gamma )v_\pi(15) & = -1 - 9 \gamma \qquad \qquad \qquad \qquad{(4)}
\end{aligned}
Solving equations (3) and (4) gives:
\begin{aligned}
v_\pi(15) &= \frac{56\gamma^2 + 148 \gamma + 16}{\gamma^2 + 4\gamma - 16} \\
v_\pi(13) &= \frac{-20\gamma^2 + 224\gamma + 16}{\gamma^2 + 4\gamma - 16}
\end{aligned}
For the undiscounted task of Example 4.1 ($\gamma = 1$), both expressions evaluate to $-20$, so $v_\pi(15) = -20$ in this case as well.
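The two-equation system (3) and (4) can also be solved exactly for a concrete discount factor. The sketch below is an illustration rather than part of the original solution: it applies Cramer's rule with exact rational arithmetic from Python's standard `fractions` module. The function name `solve_v13_v15` is hypothetical, and the coefficients are copied from equations (3) and (4), so the check inherits the assumption that the other states keep the values read from Figure 4.2.

```python
# Sketch: solve equations (3) and (4) exactly for a given gamma.
from fractions import Fraction


def solve_v13_v15(gamma):
    """Return (v13, v15) from the 2x2 linear system via Cramer's rule."""
    g = Fraction(gamma)
    # Equation (3):        1 * v13 + (-g/4)    * v15 = -1 - 14g
    a, b, e = Fraction(1), -g / 4, -1 - 14 * g
    # Equation (4): (-g/4) * v13 + (1 - g/4)   * v15 = -1 - 9g
    c, d, f = -g / 4, 1 - g / 4, -1 - 9 * g
    det = a * d - b * c
    v13 = (e * d - b * f) / det
    v15 = (a * f - e * c) / det
    return v13, v15


if __name__ == "__main__":
    v13, v15 = solve_v13_v15(1)  # undiscounted case of Example 4.1
    print(v13, v15)  # -20 -20
```

With $\gamma = 1$ the value of state 13 stays at $-20$, which is consistent with treating the values of the remaining states as unchanged.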