Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states.
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.
$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi(G_t \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi ( R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi ( R_{t+1} \mid S_t = s, A_t = a) + \gamma\, \mathbb E_\pi ( G_{t+1} \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi ( R_{t+1} \mid S_t = s, A_t = a) + \gamma\, \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \Bigm| S_t = s, A_t = a \Bigr) \\
&= R_{t+1}(s,a) + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr]
\end{aligned}
$$
Here $R_{t+1}(s,a)$ is shorthand for the expected next reward, $\mathbb E_\pi ( R_{t+1} \mid S_t = s, A_t = a)$.
Denote $\Pr(S_{t+1} = s' \mid S_t = s, A_t = a) = P_{s,s'}^a$. By the Markov property, once $S_{t+1} = s'$ is given, $S_t$ and $A_t$ carry no additional information about $R_{t+k+2}$ for any $k \ge 0$.
$$
\begin{aligned}
\therefore\quad & \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \Bigm| S_{t+1} = s' \Bigr) P_{s,s'}^a \\
&= \upsilon_\pi(s')\, P_{s,s'}^a
\end{aligned}
$$
$$
\therefore\quad q_\pi(s,a) = R_{t+1}(s,a) + \gamma \sum_{s'} \upsilon_\pi(s')\, P_{s,s'}^a \tag{1}
$$
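This is the first equation. Written with an explicit expectation, as the exercise asks, it reads $q_\pi(s,a) = \mathbb E\bigl[ R_{t+1} + \gamma \upsilon_\pi(S_{t+1}) \mid S_t = s, A_t = a \bigr]$, where the expectation is taken over the environment dynamics only and so is not conditioned on following the policy.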
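As a quick numerical illustration of equation (1), here is a minimal sketch that evaluates it for a single state-action pair. All numbers are made up for the example: the next-state probabilities `P`, the expected reward `R_sa`, the state values `v`, and the discount `gamma` are assumptions, not values from the exercise.

```python
# Hypothetical toy numbers for one fixed pair (s, a); illustration only.
gamma = 0.5

# P[s_next] = Pr(S_{t+1} = s_next | S_t = s, A_t = a)
P = {"x": 0.25, "y": 0.75}

# Expected immediate reward R_{t+1}(s, a) for the same pair.
R_sa = 2.0

# Assumed state values v_pi(s') under some policy pi.
v = {"x": 4.0, "y": 8.0}


def q_from_equation_1(R_sa, P, v, gamma):
    """Equation (1): q(s, a) = R(s, a) + gamma * sum_{s'} v(s') * P(s' | s, a)."""
    expected_next_value = sum(P[s_next] * v[s_next] for s_next in P)
    return R_sa + gamma * expected_next_value


q = q_from_equation_1(R_sa, P, v, gamma)
print(q)  # 2.0 + 0.5 * (0.25 * 4.0 + 0.75 * 8.0) = 5.5
```

The policy only enters through the values `v`; the sum itself runs over next states weighted by the dynamics.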
$$
\begin{aligned}
\because\quad R_{t+1}(s,a) &= \mathbb E_\pi (R_{t+1} \mid S_t = s, A_t = a) \\
&= \sum_{s'} \bigl[ \mathbb E_\pi (R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \bigr] \\
&= \sum_r \sum_{s'} \bigl[ \mathbb E_\pi (R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) \bigr]
\end{aligned}
$$
Denote
$$
\begin{aligned}
\mathbb E_\pi (R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') &= R_{s,s'}^a \\
\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) &= p(s', r \mid s, a)
\end{aligned}
$$
$$
\therefore\quad R_{t+1}(s, a) = \sum_r \sum_{s'} R_{s,s'}^a\, p(s', r \mid s, a)
$$
$$
\begin{aligned}
\because\quad \gamma \sum_{s'} \upsilon_\pi(s')\, P_{s,s'}^a
&= \gamma \sum_r \sum_{s'} \upsilon_\pi(s') \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) \\
&= \gamma \sum_r \sum_{s'} \upsilon_\pi(s')\, p(s', r \mid s, a)
\end{aligned}
$$
$$
\begin{aligned}
\therefore\quad q_\pi(s,a) &= \sum_r \sum_{s'} R_{s,s'}^a\, p(s', r \mid s, a) + \gamma \sum_r \sum_{s'} \upsilon_\pi(s')\, p(s', r \mid s, a) \\
&= \sum_r \sum_{s'} \bigl[ R_{s, s'}^a + \gamma \upsilon_\pi(s') \bigr]\, p(s', r \mid s, a)
\end{aligned}
\tag{2}
$$
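This is the second equation. Note that $R_{s,s'}^a$ is itself a conditional expectation. Since $R_{s,s'}^a \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) = \sum_r r\, p(s', r \mid s, a)$, the same result can be written with no expectation hidden in the notation: $q_\pi(s,a) = \sum_{s'} \sum_r p(s', r \mid s, a) \bigl[ r + \gamma \upsilon_\pi(s') \bigr]$.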
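The second equation can be sanity-checked numerically. The sketch below assumes a made-up joint distribution `p` over (next state, reward) pairs for one fixed state-action pair, plus hypothetical values `v` and discount `gamma`. It evaluates equation (2) using $R_{s,s'}^a$ and compares it with summing $r + \gamma \upsilon_\pi(s')$ directly over the joint distribution.

```python
# Hypothetical joint dynamics p(s', r | s, a) for one fixed (s, a).
# Keys are (next_state, reward); the probabilities sum to 1.
p = {
    ("x", 0.0): 0.125,
    ("x", 8.0): 0.125,
    ("y", 1.0): 0.5,
    ("y", 4.0): 0.25,
}
gamma = 0.5
v = {"x": 4.0, "y": 8.0}  # assumed state values v_pi(s')


def marginal_next_state(p):
    """P^a_{s,s'} = sum_r p(s', r | s, a)."""
    P = {}
    for (s_next, _r), prob in p.items():
        P[s_next] = P.get(s_next, 0.0) + prob
    return P


def expected_reward_given_next_state(p):
    """R^a_{s,s'} = sum_r r * p(s', r | s, a) / P^a_{s,s'}."""
    P = marginal_next_state(p)
    weighted = {}
    for (s_next, r), prob in p.items():
        weighted[s_next] = weighted.get(s_next, 0.0) + r * prob
    return {s_next: weighted[s_next] / P[s_next] for s_next in P}


def q_equation_2(p, v, gamma):
    """Equation (2): sum_r sum_{s'} [R^a_{s,s'} + gamma * v(s')] * p(s', r | s, a)."""
    R = expected_reward_given_next_state(p)
    return sum(prob * (R[s_next] + gamma * v[s_next])
               for (s_next, _r), prob in p.items())


def q_reward_form(p, v, gamma):
    """Same quantity with the raw reward: sum_{s', r} p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s_next])
               for (s_next, r), prob in p.items())


print(q_equation_2(p, v, gamma), q_reward_form(p, v, gamma))  # both are 6.0
```

The two functions agree because averaging the reward within each next state first, then weighting by the joint probabilities, gives the same total as weighting each raw reward directly.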