Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
Proof:
First, $G_{t:t+n}$ can be written as a telescoping sum of differences between successive returns:
$$
\begin{aligned}
G_{t:t+n} &= G_{t:t+1} - G_{t:t+1} + G_{t:t+2} - G_{t:t+2} + \cdots + G_{t:t+n-2} - G_{t:t+n-2} + G_{t:t+n-1} - G_{t:t+n-1} + G_{t:t+n} \\
&= G_{t:t+1} + (G_{t:t+2} - G_{t:t+1}) + \cdots + (G_{t:t+n} - G_{t:t+n-1}) \\
&= G_{t:t+1} + \sum_{i=2}^{n} (G_{t:t+i} - G_{t:t+i-1}) \tag{1}
\end{aligned}
$$
By the definition of the n-step Sarsa return (7.4),
$$
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n \tag{7.4}
$$
we have, for $n \geq 2$:
$$
\begin{aligned}
G_{t:t+n} - G_{t:t+n-1} &= \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1}, A_{t+n-1}) \\
&= \gamma^{n-1} \bigl[ R_{t+n} + \gamma Q_{t+n-1}(S_{t+n}, A_{t+n}) - Q_{t+n-2}(S_{t+n-1}, A_{t+n-1}) \bigr] \tag{2}
\end{aligned}
$$
and for $n=1$ we have:
$$
\begin{aligned}
G_{t:t+1} &= \gamma^0 R_{t+1} + \gamma^1 Q_{t+1-1}(S_{t+1}, A_{t+1}) \\
&= \gamma^0 R_{t+1} + \gamma^1 Q_{t+1-1}(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) + Q_{t-1}(S_t, A_t) \\
&= Q_{t-1}(S_t, A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr] \tag{3}
\end{aligned}
$$
Substituting equations (2) and (3) into (1), with $n$ in (2) replaced by each $i = 2, \dots, n$, we get:
$$
\begin{aligned}
G_{t:t+n} &= Q_{t-1}(S_t, A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr] \\
&\quad + \sum_{i=2}^{n} \gamma^{i-1} \bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1}) \bigr] \\
&= Q_{t-1}(S_t, A_t) + \sum_{i=1}^{n} \gamma^{i-1} \bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i}, A_{t+i}) - Q_{t+i-2}(S_{t+i-1}, A_{t+i-1}) \bigr] \tag{4}
\end{aligned}
$$
Let $k = i+t-1$, so that $i = k-t+1$ and $k$ runs from $t$ to $t+n-1$ as $i$ runs from $1$ to $n$. Equation (4) can then be written as:
$$
G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t} \bigl[ R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k) \bigr] \tag{5}
$$
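As a quick sanity check on equation (5), the case $n=2$ can be expanded by hand. This worked instance is an illustration added here and is not part of the proof itself; it only shows that the intermediate action-value terms cancel and the definition (7.4) is recovered.

$$
\begin{aligned}
G_{t:t+2} &= Q_{t-1}(S_t, A_t) + \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr] \\
&\quad + \gamma \bigl[ R_{t+2} + \gamma Q_{t+1}(S_{t+2}, A_{t+2}) - Q_t(S_{t+1}, A_{t+1}) \bigr] \\
&= R_{t+1} + \gamma R_{t+2} + \gamma^2 Q_{t+1}(S_{t+2}, A_{t+2})
\end{aligned}
$$

The last line is exactly (7.4) with $n=2$, since $Q_{t+n-1} = Q_{t+1}$.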
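The time index $t+n$ cannot exceed $T$: for $t+n \geq T$ the n-step return is the full return $G_t$, with the action value at the terminal state taken to be zero. The upper limit of the sum is therefore $\min(t+n,T)-1$, and equation (5) can be written as: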
$$
G_{t:t+n} = Q_{t-1}(S_t,A_t) + \sum_{k=t}^{\min(t+n,T)-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)\bigr]
$$
PROVED.
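The identity can also be checked numerically. The following is a minimal sketch, not part of the original proof: it draws random rewards and random time-indexed action values, then compares the direct definition (7.4) with the TD-error form. The names `n_step_sarsa_return`, `td_error_form`, and `max_discrepancy` are hypothetical, as is the storage convention `q[j][k]` for $Q_j(S_k, A_k)$. It assumes $t \geq 1$ so that $Q_{t-1}$ exists, and it assumes action values at the terminal time $T$ are zero.

```python
import random


def n_step_sarsa_return(rewards, q, gamma, t, n, T):
    """Direct n-step Sarsa return (7.4); equals the full return when t + n >= T."""
    h = min(t + n, T)
    # Discounted rewards R_{t+1}, ..., R_h (rewards[k] holds R_k).
    g = sum(gamma ** (k - t) * rewards[k + 1] for k in range(t, h))
    if t + n < T:
        # Bootstrap term gamma^n * Q_{t+n-1}(S_{t+n}, A_{t+n}).
        g += gamma ** n * q[t + n - 1][t + n]
    return g


def td_error_form(rewards, q, gamma, t, n, T):
    """Same return written as Q_{t-1}(S_t, A_t) plus discounted novel TD errors."""
    h = min(t + n, T)
    total = q[t - 1][t]
    for k in range(t, h):
        # Novel TD error: R_{k+1} + gamma * Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k).
        delta = rewards[k + 1] + gamma * q[k][k + 1] - q[k - 1][k]
        total += gamma ** (k - t) * delta
    return total


def max_discrepancy(T=8, gamma=0.9, seed=0):
    """Largest absolute gap between the two forms over all t >= 1 and n >= 1."""
    rng = random.Random(seed)
    rewards = [0.0] + [rng.uniform(-1, 1) for _ in range(T)]  # rewards[0] is unused
    # q[j][k] stands for Q_j(S_k, A_k); the values are arbitrary random numbers.
    q = [[rng.uniform(-1, 1) for _ in range(T + 1)] for _ in range(T + 1)]
    for row in q:
        row[T] = 0.0  # assumption: terminal action values are zero
    worst = 0.0
    for t in range(1, T):
        for n in range(1, T + 2):  # includes cases where t + n exceeds T
            a = n_step_sarsa_return(rewards, q, gamma, t, n, T)
            b = td_error_form(rewards, q, gamma, t, n, T)
            worst = max(worst, abs(a - b))
    return worst


if __name__ == "__main__":
    print(max_discrepancy())
```

The printed gap should be at floating-point round-off level. Because the action values are random rather than learned, the check exercises only the algebraic identity, which holds for any sequence $Q_{t-1}, Q_t, \dots$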
Reinforcement Learning Exercise 7.4
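This article proves in detail that the n-step return in Sarsa can be written exactly in terms of a novel TD error, working through the derivation from the one-step return to the n-step return.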