Exercise 7.1 In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the n-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don’t change), generalizing the earlier result.
According to equation (7.2), the $n$-step error, written $\delta_t$ here, is:
$$
\delta_t = G_{t:t+n} - V_{t+n-1}(S_t)
$$
The $n$-step return $G_{t:t+n}$ is defined as:
$$
G_{t:t+n} =
\begin{cases}
R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) & (n \geq 1 \text{ and } 0 \leq t < T-n) \\
R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T & (t+n \geq T)
\end{cases}
$$
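The two cases of this definition can be sketched in code for a single recorded episode. This is only an illustrative sketch: the function name `n_step_return` and the list layout (`rewards[i]` holds $R_{i+1}$, `values[i]` holds a fixed estimate $V(S_i)$, with the terminal value set to 0) are assumptions made here, not notation from the book.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Return G_{t:t+n} for one episode (hypothetical helper).

    rewards[i] holds R_{i+1}, so the episode length is T = len(rewards).
    values[i] holds the fixed estimate V(S_i) for i = 0..T, with values[T] == 0.
    """
    T = len(rewards)
    end = min(t + n, T)  # the reward sum is truncated at the end of the episode
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if t + n < T:
        # First case: bootstrap from the estimate of the state reached after n steps.
        g += gamma ** n * values[t + n]
    return g


# Toy episode with T = 3 (numbers chosen arbitrarily for illustration).
rewards = [1.0, 2.0, 3.0]
values = [0.5, 0.4, 0.3, 0.0]
bootstrapped = n_step_return(rewards, values, t=0, n=2, gamma=0.5)  # t + n < T
truncated = n_step_return(rewards, values, t=2, n=2, gamma=0.5)     # t + n >= T
```

In the second call $t+n \geq T$, so the result is simply the full return $G_t$ with no bootstrap term.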
Then, for $t+n \geq T$, the Monte Carlo error is:
$$
\begin{aligned}
G_t - V_{t+n}(S_t) &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T - V_{t+n}(S_t) \\
&= G_{t:t+n} - V_{t+n}(S_t)
\end{aligned}
$$
Because the value estimates do not change, $V_{t+n}(s) = V_{t+n-1}(s)$ for every state $s$, so:
$$
\begin{aligned}
G_t - V_{t+n}(S_t) &= G_{t:t+n} - V_{t+n-1}(S_t) \\
&= \delta_t
\end{aligned}
$$
More generally, the estimate of a state's value is the same at every time step: $V_t(S_t) = V_{t+x}(S_t)$ for any $x > 0$.
Similarly, for $n \geq 1$ and $t + n < T$, let $k \geq 1$ be the largest integer with $t + kn < T$, so that $t + (k+1)n \geq T$. Adding and subtracting $\gamma^n V_{t+n-1}(S_{t+n})$ in the Monte Carlo error gives:
$$
\begin{aligned}
G_t - V_{t+n}(S_t) &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T - V_{t+n}(S_t) \\
&= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) - V_{t+n}(S_t) \\
& \quad - \gamma^n V_{t+n-1}(S_{t+n}) + \gamma^n R_{t+n+1} + \cdots + \gamma^{T-t-1}R_T \\
&= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) - V_{t+n-1}(S_t) \\
& \quad - \gamma^n V_{t+n-1}(S_{t+n}) + \gamma^n R_{t+n+1} + \cdots + \gamma^{T-t-1}R_T \\
&= \delta_t + \gamma^n \bigl[ R_{t+n+1} + \cdots + \gamma^{T-(t+n)-1}R_T - V_{t+n-1}(S_{t+n}) \bigr] \\
&= \delta_t + \gamma^n \bigl[ G_{t+n} - V_{t+2n}(S_{t+n}) \bigr]
\end{aligned}
$$
The bracket is again a Monte Carlo error, now at time $t+n$, so the same step can be repeated as long as the time index stays below $T-n$:
$$
\begin{aligned}
G_t - V_{t+n}(S_t) &= \delta_t + \gamma^n \delta_{t+n} + \gamma^{2n} \bigl[ G_{t+2n} - V_{t+3n}(S_{t+2n}) \bigr] \\
&= \delta_t + \gamma^n \delta_{t+n} + \cdots + \gamma^{(k-1)n}\delta_{t+(k-1)n} + \gamma^{kn} \bigl[ G_{t+kn} - V_{t+(k+1)n}(S_{t+kn}) \bigr]
\end{aligned}
$$
Since $t + (k+1)n \geq T$, the case $t+n \geq T$ treated earlier applies at time $t+kn$: there $G_{t+kn:t+(k+1)n} = G_{t+kn}$, so the remaining bracket is itself the $n$-step error $\delta_{t+kn}$. No residual term is left over, and the Monte Carlo error is:
$$
G_t - V_{t+n}(S_t) = \sum_{p=0}^{k}\gamma^{pn}\delta_{t+pn}
$$
For $n = 1$ we have $k = T-t-1$ and $\delta_t$ is the ordinary one-step TD error, so this reduces to (6.6).
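The final identity can be checked numerically on a random episode. The sketch below is not part of the original derivation: the helper names, the random data, and the convention that `values[T] = 0` for the terminal state are all assumptions chosen for illustration. The value list is held fixed throughout, which mirrors the condition that the estimates do not change.

```python
import random


def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}; rewards[i] holds R_{i+1}, values[i] holds the fixed V(S_i)."""
    T = len(rewards)
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, min(t + n, T)))
    if t + n < T:
        g += gamma ** n * values[t + n]  # bootstrap term of the first case
    return g


def n_step_error(rewards, values, t, n, gamma):
    """delta_t = G_{t:t+n} - V(S_t), the n-step error from (7.2)."""
    return n_step_return(rewards, values, t, n, gamma) - values[t]


def monte_carlo_error(rewards, values, t, gamma):
    """G_t - V(S_t), using the full return to the end of the episode."""
    T = len(rewards)
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, T)) - values[t]


def sum_of_n_step_errors(rewards, values, t, n, gamma):
    """Sum over p of gamma^(p*n) * delta_{t+p*n}, for all p with t + p*n < T."""
    T = len(rewards)
    total, p = 0.0, 0
    while t + p * n < T:
        total += gamma ** (p * n) * n_step_error(rewards, values, t + p * n, n, gamma)
        p += 1
    return total


random.seed(0)
T, gamma = 11, 0.9
rewards = [random.uniform(-1, 1) for _ in range(T)]
values = [random.uniform(-1, 1) for _ in range(T)] + [0.0]  # terminal value is 0

# Largest disagreement between the two sides over several n and every start time t.
max_gap = max(
    abs(monte_carlo_error(rewards, values, t, gamma)
        - sum_of_n_step_errors(rewards, values, t, n, gamma))
    for n in (1, 2, 3, 4, 11, 15)
    for t in range(T)
)
assert max_gap < 1e-9
```

The case `n = 1` in the loop corresponds to the one-step result (6.6), and values of `n` at least as large as the episode length exercise the case $t+n \geq T$, where the sum has a single term.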
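Summary: this post works through how, in reinforcement learning, the $n$-step error relates to a sum of TD errors when the value estimates do not change over time, extending the result from Chapter 6. By analyzing the return $G$ and the value estimate $V$ in the different cases, it moves from the one-step to the multi-step setting and shows that, under this condition, the Monte Carlo error decomposes into a discounted sum of $n$-step errors.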