Reinforcement Learning Exercise 3.19

This post works through the mathematical expression for the action-value function $q_\pi(s,a)$ in reinforcement learning, in terms of the expected next reward $R_{t+1}$ and the expected sum of the remaining rewards. Guided by the intuition of a small backup diagram, two equations are given: one using an expectation that is not conditioned on following the policy, and a second that writes the expectation out explicitly in terms of the transition probabilities $p(s', r \mid s, a)$.


Exercise 3.19 The value of an action, $q_\pi(s, a)$, depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:
[Figure: backup diagram for $q_\pi$, rooted at a state–action pair and branching to the possible next states.]
Give the equation corresponding to this intuition and diagram for the action value, $q_\pi(s, a)$, in terms of the expected next reward, $R_{t+1}$, and the expected next state value, $v_\pi(S_{t+1})$, given that $S_t = s$ and $A_t = a$. This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of $p(s', r \mid s, a)$ defined by (3.2), such that no expected value notation appears in the equation.

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb E_\pi( G_t \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi( R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi( R_{t+1} \mid S_t = s, A_t = a) + \gamma\, \mathbb E_\pi( G_{t+1} \mid S_t = s, A_t = a) \\
&= \mathbb E_\pi( R_{t+1} \mid S_t = s, A_t = a) + \gamma\, \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a \Bigr) \\
&= R_{t+1}(s,a) + \gamma \sum_{s'} \Bigl[ \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr]
\end{aligned}
$$
Denote $\Pr(S_{t+1} = s' \mid S_t = s, A_t = a) = P_{s,s'}^a$, and write $R_{t+1}(s,a) = \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a)$ for the expected next reward used in the last line above.
Because, conditional on $S_{t+1}$, the Markov property means $S_t$ and $A_t$ carry no additional information about $R_{t+k+2}$ for $k \ge 0$,
$$
\begin{aligned}
\therefore\ \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_t = s, A_t = a, S_{t+1} = s' \Bigr) \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)
&= \mathbb E_\pi\Bigl( \sum_{k=0}^\infty \gamma^k R_{t+k+2} \,\Big|\, S_{t+1} = s' \Bigr) P_{s,s'}^a \\
&= v_\pi(s')\, P_{s,s'}^a
\end{aligned}
$$
$$
\therefore\ q_\pi(s,a) = R_{t+1}(s,a) + \gamma \sum_{s'} P_{s,s'}^a\, v_\pi(s') \tag{1}
$$
This is the first equation. Written compactly as a single expectation that is not conditioned on following the policy, it reads $q_\pi(s,a) = \mathbb E\bigl[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = a \bigr]$.
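
To make equation (1) concrete, here is a minimal numerical sketch in Python on a made-up two-state, two-action MDP; the arrays `P`, `r_sa`, and `v_pi` (and all their numbers) are illustrative assumptions, not anything specified by the exercise.

```python
import numpy as np

gamma = 0.9  # discount factor (assumed)

# P[a, s, s'] = Pr(S_{t+1} = s' | S_t = s, A_t = a), i.e. P^a_{s,s'}  (made-up numbers)
P = np.array([
    [[0.8, 0.2],
     [0.1, 0.9]],   # action 0
    [[0.5, 0.5],
     [0.3, 0.7]],   # action 1
])

# r_sa[s, a] = E[R_{t+1} | S_t = s, A_t = a], the quantity written R_{t+1}(s, a) above
r_sa = np.array([
    [1.0, 0.5],
    [0.0, 2.0],
])

# v_pi[s'] = v_pi(s'), assumed to be already known for this illustration
v_pi = np.array([3.0, 5.0])

def q_from_eq1(s, a):
    """Equation (1): q_pi(s, a) = R_{t+1}(s, a) + gamma * sum_{s'} P^a_{s,s'} * v_pi(s')."""
    return r_sa[s, a] + gamma * P[a, s] @ v_pi

for s in range(2):
    for a in range(2):
        print(f"q_pi(s={s}, a={a}) = {q_from_eq1(s, a):.4f}")
```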
$$
\begin{aligned}
\because\ R_{t+1}(s,a) &= \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a) \\
&= \sum_{s'} \Bigl[ \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') \Pr(S_{t+1} = s' \mid S_t = s, A_t = a) \Bigr] \\
&= \sum_r \sum_{s'} \Bigl[ \mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) \Bigr]
\end{aligned}
$$
Denote
$$
\mathbb E_\pi(R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s') = R_{s,s'}^a, \qquad
\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) = p(s', r \mid s, a)
$$
$$
\therefore\ R_{t+1}(s, a) = \sum_r \sum_{s'} R_{s,s'}^a\, p(s', r \mid s, a)
$$
$$
\begin{aligned}
\because\ \gamma \sum_{s'} v_\pi(s')\, P_{s,s'}^a &= \gamma \sum_r \sum_{s'} v_\pi(s') \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) \\
&= \gamma \sum_r \sum_{s'} v_\pi(s')\, p(s', r \mid s, a)
\end{aligned}
$$
$$
\begin{aligned}
\therefore\ q_\pi(s,a) &= \sum_r \sum_{s'} R_{s,s'}^a\, p(s', r \mid s, a) + \gamma \sum_r \sum_{s'} v_\pi(s')\, p(s', r \mid s, a) \\
&= \sum_r \sum_{s'} \bigl[ R_{s,s'}^a + \gamma\, v_\pi(s') \bigr]\, p(s', r \mid s, a)
\end{aligned} \tag{2}
$$
This is the second equation. Since $R_{s,s'}^a$ is itself an expectation, and $\sum_{s'} R_{s,s'}^a \sum_r p(s', r \mid s, a) = \sum_{s'} \sum_r r\, p(s', r \mid s, a)$, equation (2) can also be written with no expectation notation at all as $q_\pi(s,a) = \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[ r + \gamma\, v_\pi(s') \bigr]$.
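
As a check of equation (2), the sketch below starts from a hypothetical four-argument table $p(s', r \mid s, a)$, evaluates $q_\pi(s,a)$ via equation (2), and confirms it matches equation (1) after marginalizing the table; all numbers and the helper names `q_from_eq1` / `q_from_eq2` are illustrative assumptions.

```python
gamma = 0.9  # discount factor (assumed)

# Hypothetical dynamics: p[(s, a)] lists the nonzero entries of p(s', r | s, a)
# from (3.2) as (s_next, r, prob) triples.  All numbers are made up.
p = {
    (0, 0): [(0, 1.0, 0.8), (1, 0.0, 0.2)],
    (0, 1): [(0, 0.0, 0.5), (1, 2.0, 0.5)],
    (1, 0): [(0, -1.0, 0.1), (1, 1.0, 0.9)],
    (1, 1): [(0, 0.5, 0.3), (1, 3.0, 0.7)],
}

# Assumed state values v_pi(s'), taken as given for this check.
v_pi = [3.0, 5.0]

def q_from_eq2(s, a):
    """Equation (2): q_pi(s, a) = sum_{s', r} p(s', r | s, a) * [r + gamma * v_pi(s')]."""
    return sum(prob * (r + gamma * v_pi[s_next]) for s_next, r, prob in p[(s, a)])

def q_from_eq1(s, a):
    """Equation (1), with R_{t+1}(s, a) and P^a_{s,s'} recovered by marginalizing p."""
    expected_reward = sum(prob * r for _, r, prob in p[(s, a)])                      # R_{t+1}(s, a)
    expected_next_value = sum(prob * v_pi[s_next] for s_next, _, prob in p[(s, a)])  # sum_{s'} P^a_{s,s'} v_pi(s')
    return expected_reward + gamma * expected_next_value

for s in range(2):
    for a in range(2):
        q1, q2 = q_from_eq1(s, a), q_from_eq2(s, a)
        assert abs(q1 - q2) < 1e-12
        print(f"q_pi(s={s}, a={a}) = {q2:.4f}   (equations (1) and (2) agree)")
```

The agreement holds for any table whose probabilities sum to one for each $(s, a)$ pair; it follows from the same linearity-of-sums argument used in the derivation above.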
