Reinforcement Learning Exercise 3.19

Latest recommended article published on 2019-10-04 14:21:04

YeXiang\^-^/

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/ballade2012/article/details/89164995

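This article examines the mathematical representation of the action-value function $q_\pi(s, a)$ in reinforcement learning, which involves the expected next reward $R_{t+1}$ and the expected sum of the remaining rewards. Guided by a small backup diagram, it gives two equations: one written as an expectation that is not conditioned on following the policy, and one that writes the expectation out explicitly in terms of the transition probability $p(s', r | s, a)$.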

Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
[Figure: backup diagram rooted at the state–action pair $(s, a)$, branching to the possible next states]
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r | s, a)$ defined by (3.2), such that no expected value notation appears in the equation.

$\begin{aligned} q_\pi(s,a) &= \mathbb E_\pi(G_t | S_t = s, A_t = a) \\ &= \mathbb E_\pi ( R_{t+1} + \gamma G_{t+1} | S_t = s, A_t = a) \\ &= \mathbb E_\pi ( R_{t+1} | S_t= s, A_t = a) + \gamma \mathbb E_\pi ( G_{t+1} | S_t = s, A_t = a) \\ &= \mathbb E_\pi ( R_{t+1} | S_t= s, A_t = a) + \gamma \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s, A_t = a) \\ &= R_{t+1}(s,a) + \gamma \sum_{s'} \bigl[ \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s, A_t = a, S_{t+1} = s') Pr(S_{t+1} = s'| S_t = s, A_t = a) \bigr] \\ \end{aligned}$
Denote $Pr(S_{t+1} = s'| S_t = s, A_t = a) = P_{s,s'}^a$.
By the Markov property, once $S_{t+1} = s'$ is given, $S_t$ and $A_t$ carry no further information about $R_{t+k+2}$ for any $k \ge 0$, so
$\begin{aligned} &\mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_t = s, A_t = a, S_{t+1} = s') Pr(S_{t+1} = s'| S_t = s, A_t = a) \\ &= \mathbb E_\pi ( \sum_{k=0}^\infty \gamma^k R_{t+k+2} | S_{t+1} = s') P_{s,s'}^a \\ &= \upsilon_\pi(s') P_{s,s'}^a \end{aligned}$
$\therefore \begin{aligned} q_\pi(s,a) = R_{t+1}(s,a) + \gamma \sum_{s'} \upsilon_\pi(s') P_{s,s'}^a \tag{1} \end{aligned}$
This is the first equation. Here $R_{t+1}(s,a)$ stands for $\mathbb E (R_{t+1} | S_t = s, A_t = a)$, and neither it nor $P_{s,s'}^a$ depends on the policy once $A_t = a$ is fixed, so (1) can also be written as $q_\pi(s,a) = \mathbb E \bigl[ R_{t+1} + \gamma \upsilon_\pi(S_{t+1}) | S_t = s, A_t = a \bigr]$, an expectation that is not conditioned on following $\pi$.
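
As a quick sanity check, equation (1) can be evaluated for a single state–action pair. The sketch below is only an illustration: the discount factor, the expected reward, the transition probabilities and the next-state values are hypothetical numbers chosen for this example, and `q_from_equation_1` is a helper name introduced here, not something from the exercise.

```python
# Hypothetical toy numbers for one fixed pair (s, a); none come from the text.
gamma = 0.9

# Plays the role of R_{t+1}(s, a), the expected next reward.
expected_reward = 1.5

# P[s_next] plays the role of P^a_{s,s'}.
P = {"s1": 0.25, "s2": 0.75}

# v[s_next] plays the role of v_pi(s'), assumed to be known already.
v = {"s1": 4.0, "s2": 8.0}


def q_from_equation_1(expected_reward, P, v, gamma):
    """Expected next reward plus the discounted expected next-state value."""
    expected_next_value = sum(P[s_next] * v[s_next] for s_next in P)
    return expected_reward + gamma * expected_next_value


q_value = q_from_equation_1(expected_reward, P, v, gamma)
print(q_value)  # approximately 7.8, up to floating-point rounding
```

The sum runs over next states only, which mirrors the fact that equation (1) needs just the marginal transition probabilities and the expected reward, not the full joint distribution over next state and reward.
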
$\begin{aligned} \because R_{t+1}(s,a) &= \mathbb E_\pi (R_{t+1} | S_t = s, A_t = a) \\ &= \sum_{s'} \bigl[ \mathbb E_\pi (R_{t+1}|S_t = s, A_t = a, S_{t+1} = s') Pr(S_{t+1} = s' | S_t = s, A_t = a) \bigr] \\ &= \sum_r \sum_{s'} \bigl[ \mathbb E_\pi (R_{t+1}|S_t = s, A_t = a, S_{t+1} = s') Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a) \bigr] \end{aligned}$
where the last step uses $Pr(S_{t+1} = s' | S_t = s, A_t = a) = \sum_r Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a)$.
Denote $\mathbb E_\pi (R_{t+1}|S_t = s, A_t = a, S_{t+1} = s') = R_{s,s'}^a$ and $Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a) = p(s', r | s, a)$.
$\therefore R_{t+1}(s, a) = \sum_r \sum_{s'} R_{s,s'}^a p(s', r | s, a)$
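
The marginalization used above can be checked numerically. The following sketch assumes a hypothetical joint distribution $p(s', r | s, a)$ for one fixed pair $(s, a)$; the dictionary layout and all numbers are illustrative choices, not part of the exercise. It recovers $P_{s,s'}^a$ and $R_{s,s'}^a$ from the joint distribution and compares the double sum with the direct expectation of the reward.

```python
# Hypothetical joint distribution p(s', r | s, a) for one fixed pair (s, a),
# keyed by (next_state, reward). The numbers are made up for illustration.
p = {
    ("s1", 0.0): 0.125,
    ("s1", 2.0): 0.125,
    ("s2", 1.0): 0.5,
    ("s2", 3.0): 0.25,
}

next_states = sorted({s_next for s_next, _ in p})

# P^a_{s,s'}: sum the joint distribution over rewards.
P = {
    s_next: sum(prob for (s2, _), prob in p.items() if s2 == s_next)
    for s_next in next_states
}

# R^a_{s,s'}: expected reward given the next state.
R = {
    s_next: sum(r * prob for (s2, r), prob in p.items() if s2 == s_next) / P[s_next]
    for s_next in next_states
}

# Double sum over r and s' with R^a_{s,s'}, as in the derivation.
via_conditional = sum(R[s_next] * prob for (s_next, _), prob in p.items())

# Direct expectation of the reward under the joint distribution.
direct = sum(r * prob for (_, r), prob in p.items())

print(via_conditional, direct)  # both approximately 1.5
```
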
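$\begin{aligned} \because \gamma \sum_{s'} \upsilon_\pi(s') P_{s,s'}^a &= \gamma \sum_r \sum_{s'} \upsilon_\pi(s') Pr(S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a) \\ &= \gamma \sum_r \sum_{s'} \upsilon_\pi(s') p(s', r | s, a) \end{aligned}$
$\begin{aligned} \therefore q_\pi(s,a) &= \sum_r \sum_{s'} R_{s,s'}^a p(s', r | s, a) + \gamma \sum_r \sum_{s'} \upsilon_\pi(s') p(s', r | s, a) \\ &= \sum_r \sum_{s'} \bigl[ R_{s, s'}^a + \gamma \upsilon_\pi(s') \bigr ] p(s', r | s, a) \tag{2} \end{aligned}$
This is the second equation. Note that $R_{s,s'}^a$ is itself a conditional expectation, $R_{s,s'}^a = \sum_r r \, p(s', r | s, a) / P_{s,s'}^a$, so $\sum_r R_{s,s'}^a p(s', r | s, a) = R_{s,s'}^a P_{s,s'}^a = \sum_r r \, p(s', r | s, a)$. Substituting this into (2) removes the last expected value:
$q_\pi(s,a) = \sum_r \sum_{s'} \bigl[ r + \gamma \upsilon_\pi(s') \bigr] p(s', r | s, a)$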
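
To see that the two equations agree, both can be evaluated on the same toy data. This is a minimal sketch under stated assumptions: the joint distribution, the next-state values and the discount factor are hypothetical, and the helper names `q_from_joint` and `q_from_marginals` are introduced here for illustration. The first helper sums over next state and reward jointly, using the raw reward $r$ in place of $R_{s,s'}^a$, which gives the same total because $R_{s,s'}^a$ is the average of $r$ under $p(s', r | s, a)$ for a fixed $s'$. The second helper follows the structure of equation (1).

```python
# Hypothetical inputs for one fixed pair (s, a); none come from the text.
gamma = 0.9
p = {
    ("s1", 0.0): 0.125,
    ("s1", 2.0): 0.125,
    ("s2", 1.0): 0.5,
    ("s2", 3.0): 0.25,
}
v = {"s1": 4.0, "s2": 8.0}  # stands in for v_pi(s')


def q_from_joint(p, v, gamma):
    """Sum over (s', r) of p(s', r | s, a) * [r + gamma * v(s')]."""
    return sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p.items())


def q_from_marginals(p, v, gamma):
    """Expected reward plus discounted expected next-state value, as in (1)."""
    expected_reward = sum(prob * r for (_, r), prob in p.items())
    expected_next_value = sum(prob * v[s_next] for (s_next, _), prob in p.items())
    return expected_reward + gamma * expected_next_value


q_joint = q_from_joint(p, v, gamma)
q_marginal = q_from_marginals(p, v, gamma)
print(q_joint, q_marginal)  # both approximately 7.8
```
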