Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.
First, we derive a formula from the multiplication rule of probability theory:
$$
\begin{aligned}
p(x \mid y) &= \frac{p(x,y)}{p(y)} \\
&= \frac{\sum_z p(x,y,z)}{p(y)} \\
&= \frac{\sum_z \bigl[ p(x \mid y,z) \cdot p(z \mid y) \cdot p(y) \bigr]}{p(y)} \\
&= \sum_z \bigl[ p(x \mid y,z) \cdot p(z \mid y) \bigr] \qquad\qquad (1)
\end{aligned}
$$
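Formula (1) can be sanity-checked numerically. The sketch below is only an illustration, not part of the original solution: it builds a random joint distribution over three small discrete variables and compares both sides of (1). The helper names `random_joint`, `marginal`, and `check_formula_1` are hypothetical.

```python
import itertools
import random


def random_joint(nx, ny, nz, seed=0):
    """Return a random joint pmf p(x, y, z) as a dict keyed by (x, y, z)."""
    rng = random.Random(seed)
    keys = itertools.product(range(nx), range(ny), range(nz))
    # Strictly positive weights, so every conditional below is well defined.
    weights = {k: rng.random() + 1e-3 for k in keys}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def marginal(joint, keep):
    """Sum the joint pmf over every axis not listed in `keep`."""
    out = {}
    for key, prob in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + prob
    return out


def check_formula_1(joint, x, y):
    """Return (lhs, rhs) of p(x|y) = sum_z p(x|y,z) * p(z|y)."""
    p_xy = marginal(joint, (0, 1))
    p_yz = marginal(joint, (1, 2))
    p_y = marginal(joint, (1,))
    lhs = p_xy[(x, y)] / p_y[(y,)]
    zs = sorted({k[2] for k in joint})
    rhs = sum(
        (joint[(x, y, z)] / p_yz[(y, z)]) * (p_yz[(y, z)] / p_y[(y,)])
        for z in zs
    )
    return lhs, rhs


joint = random_joint(3, 4, 5)
lhs, rhs = check_formula_1(joint, x=1, y=2)
print(abs(lhs - rhs) < 1e-12)
```

The two sides should agree up to floating-point rounding for any choice of `x` and `y`, since (1) is an identity rather than a property of a particular distribution.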
With formula (1), we can calculate $q_\pi(s,a)$ as below:
$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi(G_t \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi(R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a) + \gamma \mathbb E_\pi(G_{t+1} \mid S_t = s, A_t = a) \\
&= \sum_r r \cdot \Pr(R_{t+1} = r \mid S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr(G_{t+1} = g_{t+1} \mid S_t = s, A_t = a)
\end{aligned}
$$
Here, by definition, $g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}$. Applying formula (1) with $z = S_{t+1}$ to both terms, we can derive:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \Pr(R_{t+1} = r \mid S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr\Bigl(G_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \Bigm| S_t = s, A_t = a\Bigr) \\
&= \sum_r r \cdot \sum_{s'} \Pr(R_{t+1} = r \mid S_t = s, A_t = a, S_{t+1} = s') \cdot \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} \Pr\Bigl(G_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \Bigm| S_t = s, A_t = a, S_{t+1} = s'\Bigr) \cdot \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)
\end{aligned}
$$
By the Markov property, once $S_{t+1} = s'$ is given, the return $G_{t+1}$ depends only on what happens from $s'$ onward, so the information $S_t = s$ and $A_t = a$ has no effect on $G_{t+1}$ and can be dropped from the conditioning. So:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} \Pr(R_{t+1} = r \mid S_t = s, A_t = a, S_{t+1} = s') \cdot \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} \Pr\Bigl(G_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \Bigm| S_{t+1} = s'\Bigr) \cdot \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \\
&= \sum_{s'} \biggl\{ \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \cdot \Bigl[ \sum_r r \cdot \Pr(R_{t+1} = r \mid S_t = s, A_t = a, S_{t+1} = s') \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr\Bigl(G_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \Bigm| S_{t+1} = s'\Bigr) \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s' \mid s,a) \cdot \Bigl[ \mathbb E_\pi(r \mid s,a,s') + \gamma \cdot \mathbb E_\pi(G_{t+1} \mid S_{t+1} = s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s' \mid s,a) \cdot \Bigl[ \mathbb E_\pi(r \mid s,a,s') + \gamma \cdot v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
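The last line of this derivation can be checked numerically against a direct sum over the four-argument $p$. The following is a minimal sketch under stated assumptions, not the author's method: the dynamics are random, the state values `v` are arbitrary illustrative numbers (the identity holds for any values), and all function names are hypothetical.

```python
import random


def random_dynamics(n_states, n_actions, rewards, seed=0):
    """Random four-argument p, stored as p[(s, a)][(s2, r)].

    For each (s, a), the probabilities over (s2, r) sum to 1.
    """
    rng = random.Random(seed)
    p = {}
    for s in range(n_states):
        for a in range(n_actions):
            w = {(s2, r): rng.random() + 1e-3
                 for s2 in range(n_states) for r in rewards}
            total = sum(w.values())
            p[(s, a)] = {k: x / total for k, x in w.items()}
    return p


def q_via_transition_and_reward(p, v, s, a, gamma):
    """sum_{s'} p(s'|s,a) * (E[r|s,a,s'] + gamma * v(s'))."""
    total = 0.0
    for s2 in range(len(v)):
        # p(s'|s,a): marginalize the reward out of p(s', r | s, a).
        p_s2 = sum(prob for (t, _), prob in p[(s, a)].items() if t == s2)
        # E[r|s,a,s']: reward expectation conditioned on the next state.
        exp_r = sum(r * prob
                    for (t, r), prob in p[(s, a)].items() if t == s2) / p_s2
        total += p_s2 * (exp_r + gamma * v[s2])
    return total


def q_via_four_argument_p(p, v, s, a, gamma):
    """sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[s2])
               for (s2, r), prob in p[(s, a)].items())


p = random_dynamics(n_states=3, n_actions=2, rewards=(-1.0, 0.0, 2.0))
v = [0.5, -1.2, 3.0]  # arbitrary illustrative state values
gamma = 0.9
diff = abs(q_via_transition_and_reward(p, v, 0, 1, gamma)
           - q_via_four_argument_p(p, v, 0, 1, gamma))
print(diff < 1e-12)
```

Both functions should return the same number up to rounding, because $p(s' \mid s,a) \cdot \mathbb E(r \mid s,a,s') = \sum_r r \cdot p(s', r \mid s, a)$.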
Denote $p(s' \mid s, a) = P_{s,s'}^a$ and $\mathbb E_\pi(r \mid s, a, s') = R_{s,s'}^a$; then
$$
q_\pi(s,a) = \sum_{s'} \biggl\{ P_{s,s'}^a \cdot \Bigl[ R_{s,s'}^a + \gamma \cdot v_\pi(s') \Bigr] \biggr\} \qquad (2)
$$
Equation (2) is the result.
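Since the exercise asks for the four-argument $p$, it may help to spell out how (2) maps back to it. The step below is a sketch that assumes the standard relations $p(s' \mid s,a) = \sum_r p(s', r \mid s, a)$ and $\mathbb E(r \mid s,a,s') = \sum_r r \cdot p(s', r \mid s, a) / p(s' \mid s, a)$:

$$
\begin{aligned}
P_{s,s'}^a &= \sum_r p(s', r \mid s, a), \qquad P_{s,s'}^a \cdot R_{s,s'}^a = \sum_r r \cdot p(s', r \mid s, a) \\
q_\pi(s,a) &= \sum_{s'} \Bigl[ \sum_r r \cdot p(s', r \mid s, a) + \gamma \cdot v_\pi(s') \sum_r p(s', r \mid s, a) \Bigr] \\
&= \sum_{s'} \sum_r p(s', r \mid s, a) \cdot \bigl[ r + \gamma \cdot v_\pi(s') \bigr]
\end{aligned}
$$

Under these assumptions, the last line expresses $q_\pi$ using only $v_\pi$ and the four-argument $p$.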