Reinforcement Learning Exercise 7.4

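This post proves in detail that the n-step return of Sarsa can be written exactly in terms of a novel TD error. The derivation moves step by step from the one-step return to the n-step return, giving a closer look at temporal-difference methods in reinforcement learning.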

Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1},A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
Proof:
First, $G_{t:t+n}$ can be written as a telescoping sum of differences between consecutive returns:
$$
\begin{aligned}
G_{t:t+n} &= G_{t:t+1} - G_{t:t+1} + G_{t:t+2} - G_{t:t+2} + \cdots + G_{t:t+n-2} - G_{t:t+n-2} + G_{t:t+n-1} - G_{t:t+n-1} + G_{t:t+n} \\
&= G_{t:t+1} + (G_{t:t+2} - G_{t:t+1}) + \cdots + (G_{t:t+n} - G_{t:t+n-1}) \\
&= G_{t:t+1} + \sum_{i=2}^n (G_{t:t+i} - G_{t:t+i-1})
\end{aligned} \tag{1}
$$
By the definition of the n-step return of Sarsa (7.4),
$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n \tag{7.4}
$$
so for $n \geq 2$ the difference between consecutive returns is:
$$
\begin{aligned}
G_{t:t+n} - G_{t:t+n-1} &= \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1}, A_{t+n-1}) \\
&= \gamma^{n-1}\bigl[ R_{t+n} + \gamma Q_{t+n-1}(S_{t+n}, A_{t+n}) - Q_{t+n-2}(S_{t+n-1}, A_{t+n-1})\bigr]
\end{aligned} \tag{2}
$$
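The difference identity can also be checked numerically. The following is only a minimal sketch on made-up data: the names `R`, `q`, `n_step_return`, and `difference_rhs` are illustrative assumptions, and the single array `q[j]` stands in for $Q_{j-1}(S_j, A_j)$, which is the only way the action values enter these formulas.

```python
import math
import random

random.seed(0)
gamma, T = 0.9, 8

# Made-up data (assumption): R[k] stands for R_k (R[0] is unused),
# and q[j] stands for Q_{j-1}(S_j, A_j).
R = [0.0] + [random.uniform(-1, 1) for _ in range(T)]
q = [random.uniform(-1, 1) for _ in range(T + 1)]


def n_step_return(t, n):
    """n-step Sarsa return as defined in (7.4); assumes t + n < T."""
    rewards = sum(gamma ** (i - 1) * R[t + i] for i in range(1, n + 1))
    return rewards + gamma ** n * q[t + n]


def difference_rhs(t, n):
    """Right-hand side of the difference identity for n >= 2."""
    return gamma ** (n - 1) * (R[t + n] + gamma * q[t + n] - q[t + n - 1])


# Compare both sides for every (t, n) with n >= 2 and t + n < T.
checks = [
    math.isclose(
        n_step_return(t, n) - n_step_return(t, n - 1),
        difference_rhs(t, n),
        abs_tol=1e-12,
    )
    for t in range(T)
    for n in range(2, T - t)
]
print(all(checks))
```

Agreement on random data is not a proof, but it is a quick guard against sign or index slips in the algebra.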
For $n=1$, adding and subtracting $Q_{t-1}(S_t, A_t)$ gives:
$$
\begin{aligned}
G_{t:t+1} &= \gamma^0 R_{t+1} + \gamma^1 Q_{t}(S_{t+1}, A_{t+1}) \\
&= \gamma^0 R_{t+1} + \gamma^1 Q_{t}(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) + Q_{t-1}(S_t, A_t) \\
&= Q_{t-1}(S_t, A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr]
\end{aligned} \tag{3}
$$
Substituting equations (2) and (3) into (1), with $n$ in (2) replaced by the summation index $i$, we get:
$$
\begin{aligned}
G_{t:t+n} &= Q_{t-1}(S_t,A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr] \\
&\quad + \sum_{i=2}^n \gamma^{i-1}\bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1})\bigr] \\
&= Q_{t-1}(S_t,A_t) + \sum_{i=1}^n \gamma^{i-1}\bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1})\bigr]
\end{aligned} \tag{4}
$$
Let $k = i+t-1$, so that $i = k-t+1$. Equation (4) can then be written as:
$$
G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{t+n-1}\gamma^{k-t}\bigl[ R_{k+1} + \gamma Q_{k}(S_{k+1}, A_{k+1}) - Q_{k-1}(S_{k}, A_{k})\bigr] \tag{5}
$$
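As a sanity check, here is a worked instance of the TD-error form for $n=2$, assuming $t+2 < T$. The intermediate action values cancel in pairs and the definition (7.4) is recovered:

```latex
\begin{aligned}
&Q_{t-1}(S_t,A_t)
 + \bigl[R_{t+1} + \gamma Q_t(S_{t+1},A_{t+1}) - Q_{t-1}(S_t,A_t)\bigr]
 + \gamma\bigl[R_{t+2} + \gamma Q_{t+1}(S_{t+2},A_{t+2}) - Q_t(S_{t+1},A_{t+1})\bigr] \\
&\quad = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q_{t+1}(S_{t+2},A_{t+2}) \\
&\quad = G_{t:t+2}
\end{aligned}
```

The term $\gamma Q_t(S_{t+1},A_{t+1})$ from the first bracket cancels against $-\gamma Q_t(S_{t+1},A_{t+1})$ from the second, and $Q_{t-1}(S_t,A_t)$ cancels against its own negative.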
Definition (7.4) applies only while $t+n < T$. If $t+n \geq T$, the n-step return is the full return $G_t$, and with the usual convention that the action value at the terminal state $S_T$ is zero, the same telescoping argument simply stops at $k = T-1$. Both cases are covered by replacing the upper limit in equation (5) with $\min(t+n,T)-1$:
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1},A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
PROVED.
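The final identity, including the truncation at $T$, can be checked numerically on a random episode. This is a minimal sketch under stated assumptions rather than part of the proof: the names `r`, `q`, `n_step_return`, and `td_error_form` are made up for illustration, `q[j]` stands for $Q_{j-1}(S_j, A_j)$, and the terminal action value `q[T]` is set to zero.

```python
import random

random.seed(1)
gamma, T = 0.9, 6

# Made-up episode (assumption): r[k] stands for R_k (r[0] is unused);
# q[j] stands for Q_{j-1}(S_j, A_j), with the terminal value q[T] = 0.
r = [0.0] + [random.uniform(-1, 1) for _ in range(T)]
q = [random.uniform(-1, 1) for _ in range(T)] + [0.0]


def n_step_return(t, n):
    """n-step Sarsa return (7.4); equals the full return when t + n >= T."""
    h = min(t + n, T)
    g = sum(gamma ** (k - t - 1) * r[k] for k in range(t + 1, h + 1))
    if t + n < T:
        g += gamma ** n * q[t + n]  # bootstrap only before termination
    return g


def td_error_form(t, n):
    """Q_{t-1}(S_t, A_t) plus the discounted sum of the novel TD errors."""
    h = min(t + n, T)
    return q[t] + sum(
        gamma ** (k - t) * (r[k + 1] + gamma * q[k + 1] - q[k])
        for k in range(t, h)
    )


# Largest disagreement over all start times t and horizons n,
# including horizons that run past the end of the episode.
max_gap = max(
    abs(n_step_return(t, n) - td_error_form(t, n))
    for t in range(T)
    for n in range(1, T + 3)
)
print(max_gap < 1e-12)
```

Setting `q[T]` to a nonzero value makes the check fail for horizons with $t+n \geq T$, which shows where the zero terminal value is used.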
