Exercise 3.29 Rewrite the four Bellman equations for the four value functions ($v_\pi$, $v_*$, $q_\pi$, and $q_*$) in terms of the three-argument function $p$ (3.4) and the two-argument function $r$ (3.5).
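Both of these functions are marginals of the four-argument dynamics $p(s', r|s, a)$. Below is a minimal sketch of computing them, assuming a hypothetical two-state toy MDP stored as a Python dict. The states, actions, rewards and the names `P4`, `transition_prob` and `expected_reward` are illustrative choices, not from the book.

```python
from collections import defaultdict

# Hypothetical toy MDP (illustration only): the four-argument dynamics
# p(s', r | s, a), stored as {(s, a): [(s_next, reward, prob), ...]}.
P4 = {
    ("A", "stay"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "go"): [("B", 2.0, 1.0)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("A", -1.0, 0.25), ("A", 3.0, 0.75)],
}


def transition_prob(p4):
    """Three-argument p(s'|s,a) = sum_r p(s', r|s, a), eq. (3.4).

    Returns {(s, a): {s_next: prob}}.
    """
    p3 = {}
    for (s, a), outcomes in p4.items():
        row = defaultdict(float)
        for s_next, _reward, prob in outcomes:
            row[s_next] += prob  # sum out the reward
        p3[(s, a)] = dict(row)
    return p3


def expected_reward(p4):
    """Two-argument r(s,a) = sum_r r * sum_{s'} p(s', r|s, a), eq. (3.5)."""
    return {
        (s, a): sum(reward * prob for _s_next, reward, prob in outcomes)
        for (s, a), outcomes in p4.items()
    }


print(transition_prob(P4)[("B", "go")])  # {'A': 1.0}
print(expected_reward(P4)[("B", "go")])  # 2.0
```

Both rewards of `("B", "go")` lead to state `A`, so summing out $r$ merges them into a single transition with probability 1, while $r(s,a)$ keeps their probability-weighted average.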
For $v_\pi$:

$$
\begin{aligned}
v_\pi(s) &= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) \bigl[ r + \gamma v_\pi(s') \bigr] \\
&= \sum_a \pi(a|s) \bigl[ \sum_{s', r} r\, p(s', r | s, a) + \sum_{s', r} \gamma v_\pi(s')\, p(s', r | s, a) \bigr] \\
&= \sum_a \pi(a|s) \bigl[ \sum_{r} r\, p(r | s, a) + \sum_{s'} \gamma v_\pi(s')\, p(s' | s, a) \bigr] \\
&= \sum_a \pi(a|s) \bigl[ r(s,a) + \sum_{s'} \gamma v_\pi(s')\, p(s' | s, a) \bigr]
\end{aligned}
$$

The third line sums out the variable each term does not use: $p(s'|s,a) = \sum_r p(s', r|s,a)$ by (3.4), and with $p(r|s,a) = \sum_{s'} p(s', r|s,a)$ we get $\sum_r r\, p(r|s,a) = r(s,a)$ by (3.5).
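The first and last lines describe the same backup, which can be checked numerically. Here is a minimal sketch, assuming the same kind of hypothetical two-state MDP, an equiprobable random policy and $\gamma = 0.9$ (none of these come from the book). It runs iterative policy evaluation with the three-argument form and compares the result against the four-argument form. The names `backup_four_arg`, `backup_three_arg` and `evaluate_policy` are illustrative.

```python
from collections import defaultdict

# Hypothetical toy MDP: p(s', r | s, a) as {(s, a): [(s_next, reward, prob)]}.
P4 = {
    ("A", "stay"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "go"): [("B", 2.0, 1.0)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("A", -1.0, 0.25), ("A", 3.0, 0.75)],
}
STATES, ACTIONS = ["A", "B"], ["stay", "go"]
PI = {(s, a): 0.5 for s in STATES for a in ACTIONS}  # assumed random policy

# p(s'|s,a) from (3.4) and r(s,a) from (3.5), both marginals of P4.
P3 = {}
for sa, outcomes in P4.items():
    P3[sa] = defaultdict(float)
    for s_next, _reward, prob in outcomes:
        P3[sa][s_next] += prob
R2 = {sa: sum(r * p for _, r, p in outs) for sa, outs in P4.items()}


def backup_four_arg(v, gamma):
    """sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')] for every s."""
    return {
        s: sum(
            PI[(s, a)] * sum(p * (r + gamma * v[s2]) for s2, r, p in P4[(s, a)])
            for a in ACTIONS
        )
        for s in STATES
    }


def backup_three_arg(v, gamma):
    """sum_a pi(a|s) [r(s,a) + gamma sum_{s'} p(s'|s,a) v(s')] for every s."""
    return {
        s: sum(
            PI[(s, a)]
            * (R2[(s, a)] + gamma * sum(p * v[s2] for s2, p in P3[(s, a)].items()))
            for a in ACTIONS
        )
        for s in STATES
    }


def evaluate_policy(gamma=0.9, tol=1e-10):
    """Iterative policy evaluation using the three-argument backup."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = backup_three_arg(v, gamma)
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_pi = evaluate_policy()
# v_pi should also be (numerically) a fixed point of the four-argument form.
print(v_pi)
print(backup_four_arg(v_pi, 0.9))
```

Solving the two linear equations by hand for this toy MDP gives $v_\pi(A) = 545/49$ and $v_\pi(B) = 535/49$, which the iteration should approach.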
For $v_*$:

$$
\begin{aligned}
v_*(s) &= \max_a \Bigl\{ \sum_{s',r} p(s',r|s,a) \bigl[ r + \gamma v_*(s') \bigr] \Bigr\} \\
&= \max_a \Bigl\{ \sum_{s',r} r\, p(s',r|s,a) + \sum_{s',r} \gamma v_*(s')\, p(s',r|s,a) \Bigr\} \\
&= \max_a \Bigl\{ \sum_{r} r\, p(r|s,a) + \sum_{s'} \gamma v_*(s')\, p(s'|s,a) \Bigr\} \\
&= \max_a \Bigl\{ r(s,a) + \sum_{s'} \gamma v_*(s')\, p(s'|s,a) \Bigr\}
\end{aligned}
$$
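The last line of the $v_*$ equation is exactly the update that value iteration applies. Here is a minimal sketch on the same hypothetical two-state MDP with $\gamma = 0.9$. The MDP and the names `one_step_lookahead` and `value_iteration` are illustrative assumptions, not the book's code.

```python
from collections import defaultdict

# Hypothetical toy MDP: p(s', r | s, a) as {(s, a): [(s_next, reward, prob)]}.
P4 = {
    ("A", "stay"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "go"): [("B", 2.0, 1.0)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("A", -1.0, 0.25), ("A", 3.0, 0.75)],
}
STATES, ACTIONS = ["A", "B"], ["stay", "go"]

# p(s'|s,a) (3.4) and r(s,a) (3.5) as marginals of P4.
P3 = {}
for sa, outcomes in P4.items():
    P3[sa] = defaultdict(float)
    for s_next, _reward, prob in outcomes:
        P3[sa][s_next] += prob
R2 = {sa: sum(r * p for _, r, p in outs) for sa, outs in P4.items()}


def one_step_lookahead(v, s, a, gamma):
    """r(s,a) + gamma * sum_{s'} p(s'|s,a) v(s'): the term inside max_a."""
    return R2[(s, a)] + gamma * sum(p * v[s2] for s2, p in P3[(s, a)].items())


def value_iteration(gamma=0.9, tol=1e-10):
    """Repeat v(s) <- max_a {r(s,a) + gamma sum_{s'} p(s'|s,a) v(s')}."""
    v = {s: 0.0 for s in STATES}
    while True:
        new_v = {
            s: max(one_step_lookahead(v, s, a, gamma) for a in ACTIONS)
            for s in STATES
        }
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
# Greedy action with respect to v_*: the argmax of the same lookahead.
greedy = {
    s: max(ACTIONS, key=lambda a: one_step_lookahead(v_star, s, a, 0.9))
    for s in STATES
}
print(greedy)  # {'A': 'go', 'B': 'go'}
```

Here `go` is optimal in both states, so each state earns a reward of 2 on every step and $v_*(A) = v_*(B) = 2/(1-0.9) = 20$.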
For $q_\pi$, please see Exercise 3.19: https://blog.youkuaiyun.com/ballade2012/article/details/89164995
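For reference, applying the same steps to the Bellman equation for $q_\pi$ gives the following sketch. It may be written differently in the linked Exercise 3.19 post.

$$
\begin{aligned}
q_\pi(s,a) &= \sum_{s',r} p(s',r|s,a) \bigl[ r + \gamma \sum_{a'} \pi(a'|s')\, q_\pi(s',a') \bigr] \\
&= \sum_{r} r\, p(r|s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, q_\pi(s',a') \\
&= r(s,a) + \gamma \sum_{s'} p(s'|s,a) \sum_{a'} \pi(a'|s')\, q_\pi(s',a')
\end{aligned}
$$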
For $q_*$:

$$
\begin{aligned}
q_*(s,a) &= \sum_{s',r} p(s', r|s,a) \bigl[ r + \gamma \max_{a'} q_*(s', a') \bigr] \\
&= \sum_{s',r} r\, p(s',r|s,a) + \sum_{s',r} p(s',r|s,a)\, \gamma \max_{a'} q_*(s', a') \\
&= \sum_{r} r\, p(r|s,a) + \sum_{s'} p(s'|s,a)\, \gamma \max_{a'} q_*(s', a') \\
&= r(s,a) + \gamma \sum_{s'} p(s'|s,a) \max_{a'} q_*(s', a')
\end{aligned}
$$
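Similarly, the last line of the $q_*$ equation can be iterated directly on action values, without keeping a separate $v_*$ table between sweeps. Below is a minimal sketch on the same hypothetical two-state MDP with $\gamma = 0.9$. The MDP and the name `q_value_iteration` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy MDP: p(s', r | s, a) as {(s, a): [(s_next, reward, prob)]}.
P4 = {
    ("A", "stay"): [("A", 0.0, 0.5), ("B", 1.0, 0.5)],
    ("A", "go"): [("B", 2.0, 1.0)],
    ("B", "stay"): [("B", 0.0, 1.0)],
    ("B", "go"): [("A", -1.0, 0.25), ("A", 3.0, 0.75)],
}
STATES, ACTIONS = ["A", "B"], ["stay", "go"]

# p(s'|s,a) (3.4) and r(s,a) (3.5) as marginals of P4.
P3 = {}
for sa, outcomes in P4.items():
    P3[sa] = defaultdict(float)
    for s_next, _reward, prob in outcomes:
        P3[sa][s_next] += prob
R2 = {sa: sum(r * p for _, r, p in outs) for sa, outs in P4.items()}


def q_value_iteration(gamma=0.9, tol=1e-10):
    """Repeat q(s,a) <- r(s,a) + gamma sum_{s'} p(s'|s,a) max_{a'} q(s',a')."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        # max_{a'} q(s', a') for every possible next state s'
        best = {s: max(q[(s, a)] for a in ACTIONS) for s in STATES}
        new_q = {
            (s, a): R2[(s, a)]
            + gamma * sum(p * best[s2] for s2, p in P3[(s, a)].items())
            for (s, a) in q
        }
        if max(abs(new_q[k] - q[k]) for k in q) < tol:
            return new_q
        q = new_q


q_star = q_value_iteration()
for (s, a), value in q_star.items():
    print(s, a, round(value, 4))
```

On this toy MDP the sketch should give $q_*(A, \text{stay}) = 18.5$, $q_*(B, \text{stay}) = 18$ and $q_*(A, \text{go}) = q_*(B, \text{go}) = 20$. So $\max_a q_*(s,a) = 20$ in both states, which agrees with solving the $v_*$ equation for the same MDP by hand.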