Reinforcement Learning - Exercise 3.17

This post derives the Bellman equation for the action value $Q_\pi(s,a)$ of a state–action pair $(s,a)$ under a policy $\pi$, expressing the value of the current state–action pair as an expectation over the values of its possible successor state–action pairs.


Exercise 3.17 What is the Bellman equation for action values, that is, for $q_\pi$? It must give the action value $q_\pi(s, a)$ in terms of the action values, $q_\pi(s', a')$, of possible successors to the state–action pair $(s, a)$. Hint: the backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.
[Figure: backup diagram for $q_\pi(s,a)$]

According to the definition:
$$
\begin{aligned}
Q_\pi(s,a) &= \mathbb{E}_\pi\bigl[ G_t \mid S_t=s, A_t=a \bigr] \\
&= \mathbb{E}_\pi\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a \Bigr] \\
&= \sum_{s'} \mathbb{E}_\pi\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P(S_{t+1}=s' \mid A_t=a, S_t=s) \\
&= \sum_{s'} \Bigl\{ \mathbb{E}_\pi\bigl[ R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' \bigr] + \mathbb{E}_\pi\Bigl[ \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] \Bigr\} P(S_{t+1}=s' \mid A_t=a, S_t=s)
\end{aligned}
$$
Denote
$$P(S_{t+1}=s' \mid A_t=a, S_t=s) = P_{s,s'}^a$$
$$\mathbb{E}_\pi\bigl[ R_{t+1} \mid S_t=s, A_t=a, S_{t+1}=s' \bigr] = R_{s,s'}^a$$
then:
$$
\begin{aligned}
Q_\pi(s,a) &= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=1}^\infty \gamma^k R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=1}^\infty \gamma^{k-1} R_{t+1+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_t=s, A_t=a, S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s' \Bigr] P_{s,s'}^a \\
&= \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \sum_{a'} \mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a' \Bigr] P(A_{t+1}=a' \mid S_{t+1}=s') \, P_{s,s'}^a
\end{aligned}
$$
The fourth step drops the conditioning on $S_t=s, A_t=a$ by the Markov property: given $S_{t+1}=s'$, the rewards from time $t+2$ onward do not depend on the earlier state and action. The last step conditions on the next action $A_{t+1}$.
According to the definitions of the action value and the policy,
$$\mathbb{E}\Bigl[ \sum_{k=0}^\infty \gamma^k R_{t+2+k} \Bigm| S_{t+1}=s', A_{t+1}=a' \Bigr] = Q_\pi(s', a')$$
$$P(A_{t+1}=a' \mid S_{t+1}=s') = \pi(s', a')$$
so
$$Q_\pi(s,a) = \sum_{s'} R_{s,s'}^a P_{s,s'}^a + \gamma \sum_{s'} \Bigl[ \sum_{a'} Q_\pi(s',a') \, \pi(s',a') \Bigr] P_{s,s'}^a$$
This is the Bellman equation for action values.
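
For comparison with the four-argument dynamics notation $p(s', r \mid s, a)$ used in equation (3.14) of the textbook, the same equation can be written as
$$
q_\pi(s,a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Bigr],
$$
with $P_{s,s'}^a$ and $R_{s,s'}^a$ absorbed into the joint dynamics $p(s', r \mid s, a)$.

Below is a minimal numerical sketch of the derived equation. It assumes a small made-up MDP: the arrays `P`, `R`, and `pi` are hypothetical stand-ins for $P_{s,s'}^a$, $R_{s,s'}^a$, and $\pi(s',a')$ above, with arbitrary values that do not come from the exercise. It simply iterates the derived backup to a fixed point and checks that the fixed point satisfies the equation.

```python
# Numerical check of Q(s,a) = sum_{s'} P[s,a,s'] * ( R[s,a,s'] + gamma * sum_{a'} pi[s',a'] * Q(s',a') )
# on a small hypothetical MDP (values are arbitrary; only the fixed-point relation matters).
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9

rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                # P[s, a, s'] = Pr(S_{t+1}=s' | S_t=s, A_t=a)
R = rng.random((n_states, n_actions, n_states))  # R[s, a, s'] = E[R_{t+1} | S_t=s, A_t=a, S_{t+1}=s']
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)              # pi[s', a'] = Pr(A_{t+1}=a' | S_{t+1}=s')

def bellman_backup(Q):
    """Apply the derived Bellman equation once to a Q table of shape (n_states, n_actions)."""
    expected_next_value = (pi * Q).sum(axis=1)            # sum_{a'} pi(s',a') Q(s',a'), shape (n_states,)
    return (P * (R + gamma * expected_next_value)).sum(axis=2)

# The backup is a gamma-contraction, so repeated application converges to q_pi.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = bellman_backup(Q)

print(np.max(np.abs(Q - bellman_backup(Q))))  # residual of the Bellman equation, ~0
print(Q)                                      # the resulting action values q_pi(s, a)
```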
