Exercise 3.11 If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function $p$ (3.2)?
\begin{aligned}
Pr(S_t = s, A_t = a) &= Pr(A_t = a | S_t = s) \cdot Pr(S_t = s) \\
&= \pi(a|s) \cdot Pr(S_t = s) \qquad{(1)}
\end{aligned}
\begin{aligned}
\mathbb E(R_{t+1}|S_t = s) &= \sum_{r \in \mathbb R} \bigl [ r \cdot Pr(R_{t+1} = r|S_t = s) \bigr ] \\
&= \sum_{r \in \mathbb R} \bigl [ r \cdot \sum_{s' \in S} Pr (R_{t+1} = r, S_{t+1} = s' | S_t = s) \bigr ] \\
&= \sum_{r \in \mathbb R} \bigl [ r \cdot \sum_{s' \in S} \frac{Pr (R_{t+1} = r, S_{t+1} = s', S_t = s)}{Pr(S_t = s)} \bigr ] \\
&= \sum_{r \in \mathbb R} \bigl [ r \cdot \sum_{s' \in S} \frac{\sum_{a \in \mathcal A}Pr (R_{t+1} = r, S_{t+1} = s', S_t = s, A_t = a)}{Pr(S_t = s)} \bigr ] \\
&= \sum_{r \in \mathbb R} \bigl [ r \cdot \sum_{s' \in S} \frac{\sum_{a \in \mathcal A}Pr (R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot Pr(S_t = s, A_t = a)}{Pr(S_t = s)} \bigr ] \qquad{(2)} \\
\end{aligned}
Substituting equation (1) into (2) and cancelling the common factor $Pr(S_t = s)$ gives:
\begin{aligned}
\mathbb E(R_{t+1}|S_t = s) &= \sum_{r \in \mathbb R} \bigl [ r \cdot \sum_{s' \in S} \frac{\sum_{a \in \mathcal A}Pr (R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot \pi(a|s)\cdot Pr(S_t=s)}{Pr(S_t = s)} \bigr ] \\
&= \sum_{r \in \mathbb R} \Bigl \{ r \cdot \sum_{s' \in S} \sum_{a \in \mathcal A} \bigl [ Pr (R_{t+1} = r, S_{t+1} = s' | S_t = s, A_t = a) \cdot \pi(a|s) \bigr ] \Bigr \} \\
&= \sum_{r \in \mathbb R} \Bigl \{ r \cdot \sum_{s' \in S} \sum_{a \in \mathcal A} \bigl [ p(r, s' | s, a) \cdot \pi(a|s) \bigr ] \Bigr \} \qquad{(3)} \\
\end{aligned}
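Equation (3) is the result: the expected reward given $S_t = s$ is each reward $r$ weighted by $p(r, s' | s, a) \cdot \pi(a|s)$ and summed over all rewards, next states, and actions.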
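As a quick numerical sanity check, equation (3) can be evaluated on a tiny made-up MDP. The sketch below is only illustrative: the state and action names, the probabilities, and the helper `expected_reward` are all hypothetical and do not come from the exercise. The dynamics are stored as `(reward, next_state, probability)` triples to mirror the argument order $p(r, s' | s, a)$ used above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy MDP (names and numbers are assumptions for illustration).
# p[(s, a)] lists (reward, next_state, probability) triples, i.e. p(r, s' | s, a).
Dynamics = Dict[Tuple[str, str], List[Tuple[float, str, float]]]
# pi[s][a] is the policy probability pi(a | s).
Policy = Dict[str, Dict[str, float]]

p: Dynamics = {
    ("s0", "left"): [(0.0, "s0", 0.5), (1.0, "s1", 0.5)],
    ("s0", "right"): [(2.0, "s1", 1.0)],
}
pi: Policy = {"s0": {"left": 0.25, "right": 0.75}}


def expected_reward(s: str, pi: Policy, p: Dynamics) -> float:
    """Evaluate equation (3): sum over a, s', r of r * p(r, s' | s, a) * pi(a | s)."""
    total = 0.0
    for a, prob_a in pi[s].items():
        for r, _s_next, prob in p[(s, a)]:
            total += r * prob * prob_a
    return total


print(expected_reward("s0", pi, p))  # prints 1.625
```

By hand, the `left` action contributes $0.25 \cdot (0 \cdot 0.5 + 1 \cdot 0.5) = 0.125$ and the `right` action contributes $0.75 \cdot 2 = 1.5$, which matches the printed value.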