Reinforcement Learning exercise 3.13

This post works through a Reinforcement Learning exercise: expressing $q_\pi$ in terms of $v_\pi$ and the four-argument $p$. It first derives formula (1) from the multiplication rule of probability theory, then uses it together with the Markov property to expand $q_\pi(s,a)$ step by step, arriving at the final expression.


Exercise 3.13 Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument $p$.

First, we derive a helper identity from the multiplication rule of probability theory:
$$
\begin{aligned}
p(x \mid y) &= \frac{p(x,y)}{p(y)} \\
&= \frac{\sum_z p(x,y,z)}{p(y)} \\
&= \frac{\sum_z \bigl[ p(x \mid y,z) \cdot p(z \mid y) \cdot p(y) \bigr]}{p(y)} \\
&= \sum_z \bigl[ p(x \mid y,z) \cdot p(z \mid y) \bigr] \qquad\qquad (1)
\end{aligned}
$$
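As a sanity check, here is a minimal numerical verification of formula (1) on a small, made-up joint distribution (the binary variables, array shapes, and variable names are illustrative assumptions only, not part of the original derivation):

```python
import numpy as np

# Build an arbitrary joint distribution p(x, y, z) over three binary variables.
rng = np.random.default_rng(0)
p_xyz = rng.random((2, 2, 2))
p_xyz /= p_xyz.sum()                 # normalize to a valid joint distribution

p_y = p_xyz.sum(axis=(0, 2))         # p(y)
p_xy = p_xyz.sum(axis=2)             # p(x, y)
p_x_given_y = p_xy / p_y             # left-hand side of (1): p(x|y)

# Right-hand side of (1): sum_z p(x|y,z) * p(z|y)
p_yz = p_xyz.sum(axis=0)             # p(y, z)
p_x_given_yz = p_xyz / p_yz          # p(x|y,z)
p_z_given_y = p_yz / p_y[:, None]    # p(z|y)
rhs = (p_x_given_yz * p_z_given_y).sum(axis=2)

assert np.allclose(p_x_given_y, rhs)  # formula (1) holds numerically
```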
With formula (1), we can compute $q_\pi(s,a)$ as follows:
$$
\begin{aligned}
q_\pi(s,a) &= \mathbb{E}_\pi(G_t \mid S_t=s, A_t=a) \\
&= \mathbb{E}_\pi(R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a) \\
&= \mathbb{E}_\pi(R_{t+1} \mid S_t=s, A_t=a) + \gamma\, \mathbb{E}_\pi(G_{t+1} \mid S_t=s, A_t=a) \\
&= \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr(G_{t+1}=g_{t+1} \mid S_t=s, A_t=a)
\end{aligned}
$$
Here, by definition, $g_{t+1} = \sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k}$. Using formula (1) to condition on the next state $S_{t+1}$, we can derive:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr\Bigl(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \;\Big|\; S_t=s, A_t=a\Bigr) \\
&= \sum_r r \cdot \sum_{s'} \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s') \cdot \Pr(S_{t+1}=s' \mid S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} \Pr\Bigl(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \;\Big|\; S_t=s, A_t=a, S_{t+1}=s'\Bigr) \cdot \Pr(S_{t+1}=s' \mid S_t=s, A_t=a)
\end{aligned}
$$
Because of the Markov property, $G_{t+1}$ is the return starting from the state $S_{t+1}=s'$; once $S_{t+1}=s'$ is given, $S_t=s$ and $A_t=a$ have no further effect on $G_{t+1}$. So:
$$
\begin{aligned}
q_\pi(s,a) &= \sum_r r \cdot \sum_{s'} \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s') \cdot \Pr(S_{t+1}=s' \mid S_t=s, A_t=a) \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \sum_{s'} \Pr\Bigl(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \;\Big|\; S_{t+1}=s'\Bigr) \cdot \Pr(S_{t+1}=s' \mid S_t=s, A_t=a) \\
&= \sum_{s'} \biggl\{ \Pr(S_{t+1}=s' \mid S_t=s, A_t=a) \cdot \Bigl[ \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s, A_t=a, S_{t+1}=s') \\
&\quad + \gamma \sum_{g_{t+1}} g_{t+1} \cdot \Pr\Bigl(G_{t+1}=\sum_{k=0}^{\infty} \gamma^k \cdot r_{t+2+k} \;\Big|\; S_{t+1}=s'\Bigr) \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s' \mid s,a) \cdot \Bigl[ \mathbb{E}_\pi(r \mid s,a,s') + \gamma \cdot \mathbb{E}_\pi(G_{t+1} \mid S_{t+1}=s') \Bigr] \biggr\} \\
&= \sum_{s'} \biggl\{ p(s' \mid s,a) \cdot \Bigl[ \mathbb{E}_\pi(r \mid s,a,s') + \gamma \cdot v_\pi(s') \Bigr] \biggr\}
\end{aligned}
$$
Denote $p(s' \mid s, a) = P_{s,s'}^a$ and $\mathbb{E}_\pi(r \mid s, a, s') = R_{s,s'}^a$; then
$$
q_\pi(s,a) = \sum_{s'} \Bigl\{ P_{s,s'}^a \cdot \bigl[ R_{s,s'}^a + \gamma \cdot v_\pi(s') \bigr] \Bigr\} \qquad (2)
$$
Equation (2) is the result.
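Since the exercise statement asks for $q_\pi$ in terms of the four-argument $p(s', r \mid s, a)$, it may help to note that $P_{s,s'}^a = \sum_r p(s', r \mid s, a)$ and $P_{s,s'}^a \cdot R_{s,s'}^a = \sum_r r \cdot p(s', r \mid s, a)$, so equation (2) is equivalent to:

$$
q_\pi(s,a) = \sum_{s'} \sum_{r} p(s', r \mid s, a) \cdot \bigl[ r + \gamma \cdot v_\pi(s') \bigr]
$$

Below is a minimal numerical sketch, assuming a small randomly generated tabular MDP (the reward set, shapes, and array names are made-up for illustration, not part of the derivation), that evaluates both equation (2) and the four-argument form above and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_rewards = 4, 3, 5
rewards = np.linspace(-1.0, 1.0, n_rewards)   # the finite set of possible reward values
gamma = 0.9

# Four-argument dynamics p[s, a, s', r] = p(s', r | s, a), normalized over (s', r).
p = rng.random((n_states, n_actions, n_states, n_rewards))
p /= p.sum(axis=(2, 3), keepdims=True)

v_pi = rng.random(n_states)                    # any state-value function serves for the check

# Equation (2): q(s,a) = sum_{s'} P^a_{s,s'} * [ R^a_{s,s'} + gamma * v(s') ]
P = p.sum(axis=3)                              # P^a_{s,s'} = sum_r p(s', r | s, a)
R = (p * rewards).sum(axis=3) / P              # R^a_{s,s'} = E[r | s, a, s']
q_eq2 = (P * (R + gamma * v_pi)).sum(axis=2)

# Four-argument form: q(s,a) = sum_{s', r} p(s', r | s, a) * [ r + gamma * v(s') ]
q_four = (p * (rewards[None, None, None, :]
               + gamma * v_pi[None, None, :, None])).sum(axis=(2, 3))

assert np.allclose(q_eq2, q_four)              # both forms give the same q_pi(s, a)
print(q_eq2)
```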
