Reinforcement Learning Exercise 7.4

最新推荐文章于 2020-11-25 20:22:42 发布

原创最新推荐文章于 2020-11-25 20:22:42 发布 · 465 阅读

0 ·

CC 4.0 BY-SA版权

文章标签：

#reinforcement learning

reinforcement learning 专栏收录该内容

37 篇文章

订阅专栏

本文详细证明了Sarsa算法中n步回报可以精确地用一种新颖的TD误差来表示，通过数学推导展示了从一步到n步回报转换的过程，为理解强化学习中的时间差分方法提供了深入解析。

Exercise 7.4 Prove that the n-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as
$G_{t:t+n}=Q_{t-1}(S_t,A_t)+\sum_{k=t}^{min(t+n,T)-1} \gamma^{k-t}[R_{k+1} + \gamma Q_k( S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)]$
Prove:
First $G_{t:t+n}$ can be written in terms of the sum of difference:
$\begin{aligned} G_{t:t+n} &= G_{t:t+1} -G_{t:t+1} +G_{t:t+2} -G_{t:t+2} + \cdots + G_{t:t+n-2} -G_{t:t+n-2} +G_{t:t+n-1} -G_{t:t+n-1} + G_{t:t+n}\\ &=G_{t:t+1} + (G_{t:t+2} - G_{t:t+1}) + \cdots +(G_{t:t+n} - G_{t:t+n-1})\\ &=G_{t:t+1}+\sum_{i=2}^n(G_{t:t+i}-G_{t:t+i-1}) \tag{1} \end{aligned}$
According to Sarsa (7.4)
$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq1, 0 \leq t < T-n \tag{7.4}$
there is:
$\begin{aligned} G_{t:t+n} - G_{t:t+n-1} & = \gamma^{n-1}R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n} , A_{t+n}) - \gamma^{n-1} Q_{t+n-2}(S_{t+n-1}, A_{t+n-1}) \\ &= \gamma^{n-1}\bigl[ R_{t+n} + \gamma Q_{t+n-1}(S_{t+n} , A_{t+n}) -Q_{t+n-2}(S_{t+n-1} , A_{t+n-1})\bigr] \tag{2} \end{aligned}$
and for $n = 1$ , there is:
$\begin{aligned} G_{t:t+1} &=\gamma^0 R_{t+1} + \gamma^1 Q_{t+1-1} (S_{t+1}, A_{t+1}) \\ &=\gamma^0R_{t+1} + \gamma^1 Q_{t+1-1} (S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) + Q_{t-1}(S_t, A_t) \\ &= Q_{t-1}(S_t, A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr ] \tag{3} \end{aligned}$
Substitute equation (2) and (3) into (1), we get:
$\begin{aligned} G_{t:t+n} &= Q_{t-1}(S_t,A_t) + \gamma^0 \bigl[ R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_{t-1}(S_t, A_t) \bigr ] \\ &\quad+ \sum_{i=2}^n \gamma^{i-1}\bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i} , A_{t+i}) -Q_{t+i-2}(S_{t+i-1} , A_{t+i-1})\bigr] \\ &= Q_{t-1}(S_t,A_t) + \sum_{i=1}^n \gamma^{i-1}\bigl[ R_{t+i} + \gamma Q_{t+i-1}(S_{t+i} , A_{t+i}) -Q_{t+i-2}(S_{t+i-1} , A_{t+i-1})\bigr] \tag{4}\\ \end{aligned}$
Let $k = i + t - 1$ , so $i = k - t + 1$ equation (4) can be written as:
$G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{t+n-1}\gamma^{k-t}\bigl[ R_{k+1} + \gamma Q_{k}(S_{k+1} , A_{k+1}) -Q_{k-1}(S_{k} , A_{k})\bigr] \tag{5}$
$t + n$ should not larger than $T$ , so equation (5) can be written as:
$G_{t:t+n}=Q_{t-1}(S_t,A_t)+\sum_{k=t}^{min(t+n,T)-1} \gamma^{k-t}[R_{k+1} + \gamma Q_k( S_{k+1}, A_{k+1}) - Q_{k-1}(S_k,A_k)]$
PROVED.