Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
By the definition of the action-value function, and conditioning on the next state $S_{t+1} = s'$:

$$
\begin{aligned}
Q_\pi(s,a) &= \mathbb E_\pi ( G_t \mid S_t = s, A_t = a ) \\
&= \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t = s, A_t = a \Bigr) \\
&= \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) P( S_{t+1} = s' \mid A_t = a, S_t = s ) \Bigr] \\
&= \sum_{s'} \Bigl\{ \Bigl[ \mathbb E_\pi ( R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' ) + \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Bigr] P( S_{t+1} = s' \mid A_t = a, S_t = s ) \Bigr\}
\end{aligned}
$$
Denote the transition probability and the expected immediate reward by

$$
\begin{aligned}
P( S_{t+1} = s' \mid A_t = a, S_t = s ) &= P_{s,s'}^a \\
\mathbb E_\pi ( R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' ) &= R_{s,s'}^a
\end{aligned}
$$

Then:
$$
\begin{aligned}
Q_\pi(s,a) &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_t = s, A_t = a, S_{t+1} = s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1} = s' \Bigr) P_{s,s'}^a \Bigr] \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl\{ \sum_{a'} \Bigl[ \mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1} = s', A_{t+1} = a' \Bigr) P( A_{t+1} = a' \mid S_{t+1} = s' ) \Bigr] P_{s,s'}^a \Bigr\}
\end{aligned}
$$

The second line factors out one $\gamma$, and the third re-indexes the sum from $k=1$ to $k=0$. The fourth line uses the Markov property: given $S_{t+1} = s'$, the rewards from time $t+2$ onward do not depend on $S_t$ or $A_t$. The last line conditions on the next action $A_{t+1} = a'$.
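As a cross-check on the fourth line of the derivation above, the inner expectation there is the state value of $s'$. The sketch below assumes the standard state-value function $V_\pi(s') = \mathbb E_\pi ( G_{t+1} \mid S_{t+1} = s' )$, a symbol that this solution does not otherwise use:

$$
\begin{aligned}
\mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1} = s' \Bigr) &= V_\pi(s') \\
Q_\pi(s,a) &= \sum_{s'} P_{s,s'}^a \bigl[ R_{s,s'}^a + \gamma V_\pi(s') \bigr] \\
V_\pi(s') &= \sum_{a'} \pi(s',a') \, Q_\pi(s',a')
\end{aligned}
$$

Substituting the third relation into the second gives an equation in $Q_\pi$ alone, which is what the remaining steps do directly.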
By definition, since $\sum_{k=0}^\infty \gamma^k R_{t+2+k} = G_{t+1}$:

$$
\begin{aligned}
\mathbb E_\pi \Bigl( \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1} = s', A_{t+1} = a' \Bigr) &= Q_\pi(s',a') \\
P( A_{t+1} = a' \mid S_{t+1} = s' ) &= \pi(s',a')
\end{aligned}
$$
so

$$
Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a') \, \pi(s',a') \Bigr] P_{s,s'}^a
$$
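This is the Bellman equation for the action-value function: the first term is the expected immediate reward, and the second is the discounted expected action value of the successor pair $(s', a')$.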
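The equation can be checked numerically. The following is a minimal sketch, not part of the original solution: the two-state, two-action MDP, its numbers, and the names `P`, `R`, `pi`, `GAMMA`, `bellman_backup`, and `evaluate_policy` are all made up for illustration. It iterates the right-hand side of the equation until it stops changing, then confirms that the result satisfies the equation.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative only.
# P[s][a][s2]  plays the role of P^a_{s,s'}  (each row sums to 1)
# R[s][a][s2]  plays the role of R^a_{s,s'}
# pi[s][a]     plays the role of pi(s, a)    (each row sums to 1)
GAMMA = 0.9
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[-1.0, 0.5], [0.0, 0.0]],
]
pi = [[0.6, 0.4], [0.3, 0.7]]

STATES = range(2)
ACTIONS = range(2)


def bellman_backup(Q, s, a):
    """Right-hand side of the Bellman equation for Q_pi(s, a)."""
    total = 0.0
    for s2 in STATES:
        # Expected action value of the successor state under pi.
        next_value = sum(pi[s2][a2] * Q[s2][a2] for a2 in ACTIONS)
        total += P[s][a][s2] * (R[s][a][s2] + GAMMA * next_value)
    return total


def evaluate_policy(tol=1e-12, max_sweeps=10_000):
    """Apply the backup repeatedly; for GAMMA < 1 it converges to Q_pi."""
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_sweeps):
        new_Q = [[bellman_backup(Q, s, a) for a in ACTIONS] for s in STATES]
        delta = max(abs(new_Q[s][a] - Q[s][a]) for s in STATES for a in ACTIONS)
        Q = new_Q
        if delta < tol:
            break
    return Q


Q_pi = evaluate_policy()
# Largest violation of the Bellman equation at the computed solution.
residual = max(
    abs(Q_pi[s][a] - bellman_backup(Q_pi, s, a)) for s in STATES for a in ACTIONS
)
print("Q_pi =", Q_pi)
print("max Bellman residual =", residual)
```

The residual should be close to zero, which is what it means for $Q_\pi$ to be a fixed point of the equation. Starting the iteration from all zeros is an arbitrary choice; with $\gamma < 1$ any starting table converges to the same values.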