Exercise 4.2 In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is $v_\pi(15)$ for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is $v_\pi(15)$ for the equiprobable random policy in this case?
First, assume the transitions from the original states are unchanged. By equation (4.4):
$$
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr] \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(s', r \mid s, a) \Bigr] + \sum_r \Bigl[ p(s', r \mid s, a) \cdot \gamma v_\pi(s') \Bigr] \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} \biggl\{ \sum_r \Bigl[ r \cdot p(r \mid s', s, a) \cdot p(s' \mid s, a) \Bigr] + p(s' \mid s, a) \cdot \gamma v_\pi(s') \biggr\} \\
&= \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \Bigl[ \sum_r r \cdot p(r \mid s', s, a) + \gamma v_\pi(s') \Bigr] \\
&= \sum_a \pi(a \mid s) \sum_{s'} P_{s,s'}^{a} \Bigl[ R_{s,s'}^{a} + \gamma v_\pi(s') \Bigr]
\end{aligned}
$$
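Here the last line introduces the shorthand implicit in the step above: $P_{s,s'}^{a}$ is the transition probability and $R_{s,s'}^{a}$ is the expected reward on that transition,

$$
P_{s,s'}^{a} \doteq p(s' \mid s, a), \qquad R_{s,s'}^{a} \doteq \sum_r r \cdot p(r \mid s', s, a) = \mathbb{E}\bigl[ R_{t+1} \mid S_t = s,\, A_t = a,\, S_{t+1} = s' \bigr].
$$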
So, for state 15 (each action deterministically selects one successor, so only the matching $(a, s')$ terms survive the double sum):

$$
\begin{aligned}
v_\pi(15) &= \pi(\text{left} \mid 15)\, P_{15,12}^{\text{left}} \Bigl[ R_{15,12}^{\text{left}} + \gamma v_\pi(12) \Bigr] + \pi(\text{up} \mid 15)\, P_{15,13}^{\text{up}} \Bigl[ R_{15,13}^{\text{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\text{right} \mid 15)\, P_{15,14}^{\text{right}} \Bigl[ R_{15,14}^{\text{right}} + \gamma v_\pi(14) \Bigr] + \pi(\text{down} \mid 15)\, P_{15,15}^{\text{down}} \Bigl[ R_{15,15}^{\text{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
Because the agent follows the equiprobable random policy, $\pi(a \mid s) = 1/4$ for every action. The transitions are deterministic, so:
$$
P_{s,s'}^{a} =
\begin{cases}
1 & \text{if } a \text{ leads from } s \text{ to } s' \\
0 & \text{otherwise}
\end{cases}
$$
Every transition yields reward $-1$, and because the original transitions are unchanged, the original state values still hold. Reading $v_\pi(12) = -22$, $v_\pi(13) = -20$, and $v_\pi(14) = -14$ from Figure 4.1, we have:
$$
\begin{aligned}
v_\pi(15) &= \frac{1}{4} \biggl\{ 1 \cdot \Bigl[ -1 + \gamma(-22) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma(-20) \Bigr] \\
&\quad + 1 \cdot \Bigl[ -1 + \gamma(-14) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma v_\pi(15) \Bigr] \biggr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma\, v_\pi(15)
\end{aligned}
$$
$$
\therefore\; v_\pi(15) = \frac{4 + 56\gamma}{\gamma - 4}
$$
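As a sanity check, Example 4.1 is undiscounted ($\gamma = 1$), which gives $v_\pi(15) = \frac{4 + 56}{1 - 4} = -20$: the same value as $v_\pi(13)$, as expected, since state 15's four successors then have the same values as state 13's. The fixed point can also be verified numerically. Below is a minimal sketch of iterative policy evaluation (Section 4.1) for the extended gridworld; the transition table is my reconstruction of Example 4.1's grid plus the new state 15, so treat the state numbering as an assumption rather than the book's own code:

```python
# Iterative policy evaluation for Exercise 4.2, first variant:
# original transitions unchanged, state 15 added below state 13.
# Assumed grid layout (T = terminal):  T  1  2  3
#                                      4  5  6  7
#                                      8  9 10 11
#                                     12 13 14  T
GAMMA = 1.0    # Example 4.1 is undiscounted
THETA = 1e-10  # convergence threshold

# next_state[s] = successors for (left, up, right, down); moves off the
# grid leave the state unchanged, and every transition has reward -1.
next_state = {
    1:  ('T', 1, 2, 5),   2: (1, 2, 3, 6),     3: (2, 3, 3, 7),
    4:  (4, 'T', 5, 8),   5: (4, 1, 6, 9),     6: (5, 2, 7, 10),
    7:  (6, 3, 7, 11),    8: (8, 4, 9, 12),    9: (8, 5, 10, 13),
    10: (9, 6, 11, 14),  11: (10, 7, 11, 'T'),
    12: (12, 8, 13, 12), 13: (12, 9, 14, 13), 14: (13, 10, 'T', 14),
    15: (12, 13, 14, 15),  # the new state: left, up, right, down
}

V = {s: 0.0 for s in next_state}
V['T'] = 0.0  # terminal state has value 0 by definition

while True:
    delta = 0.0
    for s in next_state:
        # Bellman backup under the equiprobable random policy:
        # v(s) = sum_a (1/4) * [-1 + gamma * v(s')]
        v_new = sum(0.25 * (-1 + GAMMA * V[s2]) for s2 in next_state[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < THETA:
        break

print({s: round(V[s], 1) for s in (12, 13, 14, 15)})
# -> {12: -22.0, 13: -20.0, 14: -14.0, 15: -20.0}
```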
Now suppose the dynamics of state 13 are also changed, so that action down from state 13 takes the agent to state 15. Then $v_\pi(13)$ and $v_\pi(15)$ are coupled, and similarly we have:
$$
\begin{aligned}
v_\pi(13) &= \pi(\text{left} \mid 13)\, P_{13,12}^{\text{left}} \Bigl[ R_{13,12}^{\text{left}} + \gamma v_\pi(12) \Bigr] + \pi(\text{up} \mid 13)\, P_{13,9}^{\text{up}} \Bigl[ R_{13,9}^{\text{up}} + \gamma v_\pi(9) \Bigr] \\
&\quad + \pi(\text{right} \mid 13)\, P_{13,14}^{\text{right}} \Bigl[ R_{13,14}^{\text{right}} + \gamma v_\pi(14) \Bigr] + \pi(\text{down} \mid 13)\, P_{13,15}^{\text{down}} \Bigl[ R_{13,15}^{\text{down}} + \gamma v_\pi(15) \Bigr] \\
v_\pi(15) &= \pi(\text{left} \mid 15)\, P_{15,12}^{\text{left}} \Bigl[ R_{15,12}^{\text{left}} + \gamma v_\pi(12) \Bigr] + \pi(\text{up} \mid 15)\, P_{15,13}^{\text{up}} \Bigl[ R_{15,13}^{\text{up}} + \gamma v_\pi(13) \Bigr] \\
&\quad + \pi(\text{right} \mid 15)\, P_{15,14}^{\text{right}} \Bigl[ R_{15,14}^{\text{right}} + \gamma v_\pi(14) \Bigr] + \pi(\text{down} \mid 15)\, P_{15,15}^{\text{down}} \Bigl[ R_{15,15}^{\text{down}} + \gamma v_\pi(15) \Bigr]
\end{aligned}
$$
$$
\begin{aligned}
v_\pi(13) &= \frac{1}{4} \biggl\{ 1 \cdot \Bigl[ -1 + \gamma(-22) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma(-20) \Bigr] \\
&\quad + 1 \cdot \Bigl[ -1 + \gamma(-14) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma v_\pi(15) \Bigr] \biggr\} \\
&= -1 - 14\gamma + \frac{1}{4}\gamma\, v_\pi(15) \qquad (1) \\
v_\pi(15) &= \frac{1}{4} \biggl\{ 1 \cdot \Bigl[ -1 + \gamma(-22) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma v_\pi(13) \Bigr] \\
&\quad + 1 \cdot \Bigl[ -1 + \gamma(-14) \Bigr] + 1 \cdot \Bigl[ -1 + \gamma v_\pi(15) \Bigr] \biggr\} \\
&= -1 - 9\gamma + \frac{1}{4}\gamma\, v_\pi(13) + \frac{1}{4}\gamma\, v_\pi(15) \qquad (2)
\end{aligned}
$$
Rearranging (1) and (2) gives the linear system:
$$
\begin{aligned}
v_\pi(13) - \frac{1}{4}\gamma\, v_\pi(15) &= -1 - 14\gamma \qquad (3) \\
-\frac{1}{4}\gamma\, v_\pi(13) + \Bigl(1 - \frac{1}{4}\gamma\Bigr) v_\pi(15) &= -1 - 9\gamma \qquad (4)
\end{aligned}
$$
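As a cross-check, this 2×2 system can be solved symbolically. Here is a minimal sketch using SymPy (my addition, not part of the original solution; the expected outputs in the comments are up to SymPy's formatting):

```python
import sympy as sp

g = sp.symbols('gamma')
v13, v15 = sp.symbols('v13 v15')

# Equations (3) and (4) from the text.
eq3 = sp.Eq(v13 - g / 4 * v15, -1 - 14 * g)
eq4 = sp.Eq(-g / 4 * v13 + (1 - g / 4) * v15, -1 - 9 * g)

sol = sp.solve([eq3, eq4], [v13, v15])
print(sp.cancel(sol[v15]))  # -> (56*gamma**2 + 148*gamma + 16)/(gamma**2 + 4*gamma - 16)
print(sp.cancel(sol[v13]))  # -> (-20*gamma**2 + 224*gamma + 16)/(gamma**2 + 4*gamma - 16)
print(sol[v13].subs(g, 1), sol[v15].subs(g, 1))  # -> -20 -20
```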
Solving equations (3) and (4), we obtain:
$$
\begin{aligned}
v_\pi(15) &= \frac{56\gamma^2 + 148\gamma + 16}{\gamma^2 + 4\gamma - 16} \\
v_\pi(13) &= \frac{-20\gamma^2 + 224\gamma + 16}{\gamma^2 + 4\gamma - 16}
\end{aligned}
$$
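In the undiscounted case $\gamma = 1$, both expressions reduce to the same value:

$$
v_\pi(15) = \frac{56 + 148 + 16}{1 + 4 - 16} = -20, \qquad v_\pi(13) = \frac{-20 + 224 + 16}{1 + 4 - 16} = -20,
$$

which is consistent: with $v_\pi(13) = -20$ unchanged, the Bellman equations of all the original states still hold with their Figure 4.1 values.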