Exercise 5.6 What is the equation analogous to (5.6) for action values $Q(s, a)$ instead of state values $V(s)$, again given returns generated using $b$?
Given a starting state $S_t$ and starting action $A_t$, the probability of the subsequent state-action trajectory $S_{t+1}, A_{t+1}, \cdots, S_T$ occurring under any policy $\pi$ is
$$
\begin{aligned}
&\Pr(S_{t+1}, A_{t+1}, \cdots, S_{T-1}, A_{T-1}, S_T \mid S_t, A_{t:T-1} \sim \pi)\\
&\qquad = p(S_{t+1} \mid S_t, A_t)\, \pi(A_{t+1} \mid S_{t+1}) \cdots p(S_{T-1} \mid S_{T-2}, A_{T-2})\, \pi(A_{T-1} \mid S_{T-1})\, p(S_T \mid S_{T-1}, A_{T-1})\\
&\qquad = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\pi(A_t \mid S_t)}
\end{aligned}
$$
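As a quick numerical check of the product above, here is a minimal sketch in Python. It assumes a hypothetical tabular model stored as nested lists, `p[s][a][s2]` for $p(s' \mid s, a)$ and `pi[s][a]` for $\pi(a \mid s)$. The function names and the numbers in the toy example are illustrative only and do not come from the text.

```python
def trajectory_prob(p, pi, states, actions):
    """Probability of S_{t+1}, A_{t+1}, ..., S_T given S_t and A_t.

    p[s][a][s2] : transition probability p(s2 | s, a) (assumed tabular model)
    pi[s][a]    : policy probability pi(a | s)
    states      : [S_t, S_{t+1}, ..., S_T]
    actions     : [A_t, A_{t+1}, ..., A_{T-1}]
    """
    prob = 1.0
    for k, (s, a) in enumerate(zip(states[:-1], actions)):
        if k > 0:
            # A_t is given, so pi(A_t | S_t) is not a factor; later actions are.
            prob *= pi[s][a]
        prob *= p[s][a][states[k + 1]]
    return prob


def trajectory_prob_ratio_form(p, pi, states, actions):
    """Same quantity: the full product over k = t..T-1 divided by pi(A_t | S_t)."""
    full = 1.0
    for k, (s, a) in enumerate(zip(states[:-1], actions)):
        full *= pi[s][a] * p[s][a][states[k + 1]]
    return full / pi[states[0]][actions[0]]


# Toy two-state, two-action example (made-up numbers).
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # p(. | s=0, a=0), p(. | s=0, a=1)
    [[0.5, 0.5], [0.3, 0.7]],  # p(. | s=1, a=0), p(. | s=1, a=1)
]
PI = [
    [0.6, 0.4],    # pi(. | s=0)
    [0.25, 0.75],  # pi(. | s=1)
]
STATES = [0, 1, 1, 0]  # S_t, S_{t+1}, S_{t+2}, S_T
ACTIONS = [1, 0, 1]    # A_t, A_{t+1}, A_{T-1}

direct = trajectory_prob(P, PI, STATES, ACTIONS)
ratio = trajectory_prob_ratio_form(P, PI, STATES, ACTIONS)
# Both forms equal 0.8 * 0.25 * 0.5 * 0.75 * 0.3 = 0.0225, up to floating-point rounding.
```

The two functions correspond to the second and third lines of the equation: the first skips the factor $\pi(A_t \mid S_t)$ explicitly, while the second multiplies it in and then divides it out.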