Reinforcement Learning - Exercise 3.17

This post derives the Bellman equation for the action value $Q_\pi(s,a)$ of a state–action pair $(s,a)$ under a policy $\pi$, expressing the value of the current state–action pair as an expectation over the values of its possible successor state–action pairs.


Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
[Figure: backup diagram for $q_\pi(s,a)$]

According to the definition:
$$
\begin{aligned}
Q_\pi(s,a) &= \mathbb{E}_\pi\bigl[ G_t \mid S_t=s, A_t=a \bigr] \\
&= \mathbb{E}_\pi\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a \Bigr] \\
&= \sum_{s'} \mathbb{E}_\pi\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P(S_{t+1}=s' \mid A_t=a, S_t=s) \\
&= \sum_{s'} \Bigl\{ \mathbb{E}_\pi\bigl[ R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' \bigr] + \mathbb{E}_\pi\Bigl[ \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] \Bigr\} P(S_{t+1}=s' \mid A_t=a, S_t=s)
\end{aligned}
$$
Denote
$$P(S_{t+1}=s' \mid A_t=a, S_t=s) = P_{s,s'}^a$$
$$\mathbb{E}_\pi\bigl[ R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' \bigr] = R_{s,s'}^a$$
then:
$$
\begin{aligned}
Q_\pi(s,a) &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \sum_{a'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a' \Bigr] P(A_{t+1}=a' \mid S_{t+1}=s') \, P_{s,s'}^a
\end{aligned}
$$
The fourth step drops the conditioning on $S_t=s, A_t=a$ by the Markov property: given $S_{t+1}=s'$, the rewards from time $t+2$ onward do not depend on the earlier state and action. The last step conditions on the next action $A_{t+1}$.
According to the definitions of the action value and the policy,
$$\mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a' \Bigr] = Q_\pi(s', a')$$
$$P(A_{t+1}=a' \mid S_{t+1}=s') = \pi(s', a')$$
so
$$Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a') \, \pi(s',a') \Bigr] P_{s,s'}^a$$
This is the Bellman equation for action values.
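
For comparison with the four-argument dynamics notation $p(s', r \mid s, a)$ used in equation (3.14) of the textbook, the same equation can be written as
$$
q_\pi(s,a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Bigr],
$$
with $P_{s,s'}^a$ and $R_{s,s'}^a$ absorbed into the joint dynamics $p(s', r \mid s, a)$.

Below is a minimal numerical sketch of the derived equation. It assumes a small made-up MDP: the arrays `P`, `R`, and `pi` are hypothetical stand-ins for $P_{s,s'}^a$, $R_{s,s'}^a$, and $\pi(s',a')$ above, with arbitrary values that do not come from the exercise. It simply iterates the derived backup to a fixed point and checks that the fixed point satisfies the equation.

```python
# Numerical check of Q(s,a) = sum_{s'} P[s,a,s'] * ( R[s,a,s'] + gamma * sum_{a'} pi[s',a'] * Q(s',a') )
# on a small hypothetical MDP (values are arbitrary; only the fixed-point relation matters).
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                # P[s, a, s'] = Pr(S_{t+1}=s' | S_t=s, A_t=a)
R = rng.random((n_states, n_actions, n_states))  # R[s, a, s'] = E[R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)              # pi[s', a'] = Pr(A_{t+1}=a' | S_{t+1}=s')

def bellman_backup(Q):
    """Apply the derived Bellman equation once to a Q table of shape (n_states, n_actions)."""
    expected_next_value = (pi * Q).sum(axis=1)            # sum_{a'} pi(s',a') Q(s',a'), shape (n_states,)
    return (P * (R + gamma * expected_next_value)).sum(axis=2)

# The backup is a gamma-contraction, so repeated application converges to q_pi.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = bellman_backup(Q)

print(np.max(np.abs(Q - bellman_backup(Q))))  # residual of the Bellman equation, ~0
print(Q)                                      # the resulting action values q_pi(s, a)
```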
