n-step Bootstrapping: Part 1

This post introduces n-step TD methods for prediction and control, including n-step TD prediction, n-step Sarsa, and n-step Expected Sarsa, and discusses how the choice of n affects learning.

Prediction

The n-step TD method lies between MC and TD(0). It performs an update based on an intermediate number of rewards: more than one (as in TD(0)), but fewer than all of them up to termination (as in MC). In this sense, MC and TD(0) are the two extreme cases of n-step TD.

In MC, the update occurs at the end of an episode, while in TD(0), the update occurs at the next time step.

Some backup diagrams of specific n-step methods are shown in the following figure:
[Figure: backup diagrams of n-step methods, from one-step TD to Monte Carlo]
Notice that all diagrams start and end with a state, because we estimate the state value $v_{\pi}(S)$. In the control part, we estimate the action value $q_{\pi}(S, A)$, whose diagrams all start and end with an action.

In Monte Carlo updates the target is the complete return; in one-step TD updates the target is the first reward plus the discounted estimated value of the next state.

n-step TD

The various returns used as update targets:

MC uses the complete return:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$

one-step return:
$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$
where $V_t$ is the estimate of $v_{\pi}$ at time $t$.

two-step return:
$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$

n-step return:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of transition from $t$ to $t+1$.

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$.
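As a concrete illustration, here is a minimal Python sketch of computing $G_{t:t+n}$ once $R_{t+n}$ and $V_{t+n-1}$ are available. The function name and array layout are my own assumptions, not part of the book's notation.

```python
def n_step_return(rewards, V, states, t, n, T, gamma):
    """Compute G_{t:t+n} from stored rewards and current value estimates.

    rewards[k] holds R_k (index 0 unused), states[k] holds S_k, and V plays
    the role of V_{t+n-1}; T is the time of termination.
    """
    G = 0.0
    # Discounted sum of the sampled rewards R_{t+1}, ..., R_{min(t+n, T)}.
    for k in range(t + 1, min(t + n, T) + 1):
        G += gamma ** (k - t - 1) * rewards[k]
    # If the episode has not terminated within n steps, bootstrap from V(S_{t+n}).
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G
```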

This also means that no updates at all are made during the first $n-1$ steps of an episode.
Here comes the state-value learning algorithm:
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \leq t < T$
This is the n-step TD algorithm. It changes only the value of $S_t$; the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$.

The complete pseudocode is given as:
[Figure: pseudocode for n-step TD for estimating $v_{\pi}$]
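The pseudocode above can be turned into code along the following lines. This is only a minimal sketch: it assumes a generic `env` with `reset()`/`step(action)` returning `(state, reward, done)`, a fixed `policy(state)` function, and integer states; all of these names are illustrative assumptions, not the book's.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes, num_states):
    """n-step TD for estimating v_pi (a sketch of the pseudocode above)."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]          # S_0
        rewards = [0.0]                 # dummy R_0 so that rewards[k] == R_k
        T = float('inf')
        t = 0
        while True:
            if t < T:
                action = policy(states[t])
                next_state, reward, done = env.step(action)
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1             # the time whose estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: sampled rewards plus a bootstrap if not terminal.
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```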

The expectation of the n-step return is guaranteed to be a better estimate of $v_{\pi}$ than $V_{t+n-1}$, in a worst-state sense.
The worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$\max\limits_{s} \left| \mathbb{E}_{\pi}[G_{t:t+n} \mid S_t = s] - v_{\pi}(s) \right| \leq \gamma^n \max\limits_{s} \left| V_{t+n-1}(s) - v_{\pi}(s) \right|$

The choice of n

[Figure: performance of n-step TD methods for different values of n on the 19-state random walk task]
The figure above shows how the estimates behave for different choices of $n$. The measure is the square root of the average squared error between the predictions at the end of the episode for the 19 states and their true values; the lower this average RMS error, the better. As the figure shows, methods with an intermediate value of $n$ worked best, which also implies that neither MC nor TD(0) is the best method.
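As a rough, hedged illustration of such an experiment (not the book's exact setup), the sketch below defines a 19-state random walk with rewards of -1 and +1 at the two ends, and measures the RMS error of the estimates produced by the `n_step_td_prediction` sketch above for a given $n$ and $\alpha$; the environment class, the averaging scheme, and all names are my own simplifications.

```python
import numpy as np

class RandomWalk19:
    """19-state random walk: start in the middle; reward -1 at the left edge,
    +1 at the right edge, 0 elsewhere."""
    def reset(self):
        self.s = 9                        # middle of states 0..18
        return self.s
    def step(self, action):               # action: -1 (left) or +1 (right)
        self.s += action
        if self.s < 0:
            return self.s, -1.0, True
        if self.s > 18:
            return self.s, 1.0, True
        return self.s, 0.0, False

def average_rms_error(n, alpha, episodes=10, runs=50, gamma=1.0):
    """Rough stand-in for the figure's measure: RMS error of V against the
    known true values after a few episodes, averaged over independent runs."""
    true_v = np.arange(1, 20) / 10.0 - 1.0          # analytic v_pi for this walk
    policy = lambda s: np.random.choice((-1, 1))    # equiprobable random policy
    errors = []
    for _ in range(runs):
        V = n_step_td_prediction(RandomWalk19(), policy, n, alpha,
                                 gamma, episodes, num_states=19)
        errors.append(np.sqrt(np.mean((V - true_v) ** 2)))
    return float(np.mean(errors))

# e.g. compare n = 1 (TD(0)), an intermediate n, and a large n (towards MC):
# for n in (1, 4, 512): print(n, average_rms_error(n, alpha=0.2))
```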

Control

n-step Sarsa

The idea is to apply n-step methods to control. As previously mentioned, the updates now focus on state-action pairs, and the action-value function $Q(s, a)$ takes the place of the state-value function $V(s)$.

The n-step return is redefined as:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n$
with $G_{t:t+n} = G_t$ if $t+n \geq T$.

And the update algorithm is then:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \leq t < T$

Also, as in prediction, the values of all other state-action pairs remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$. This algorithm is called n-step Sarsa.
[Figure: pseudocode for n-step Sarsa]
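A minimal Python sketch of the pseudocode above follows. It makes the same assumptions as the prediction sketch (a Gym-style `env`, integer states and actions) and uses an $\varepsilon$-greedy behaviour with respect to the current `Q`; all names are illustrative.

```python
import numpy as np

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes,
                 num_states, num_actions):
    """On-policy n-step Sarsa (a sketch of the pseudocode above)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        states = [env.reset()]
        actions = [eps_greedy(states[0])]
        rewards = [0.0]                      # dummy R_0 so rewards[k] == R_k
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s))    # select A_{t+1}
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```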
And the backup diagram is:
[Figure: backup diagrams for n-step Sarsa and n-step Expected Sarsa]
The following figure shows clearly that learning can be accelerated by n-step Sarsa compared with one-step methods.
[Figure: gridworld example comparing the action values strengthened by one-step Sarsa and by 10-step Sarsa]
The first panel shows the complete path taken by the agent; G is the goal position, the only place where the agent receives a positive reward. The arrows in the other two panels show which action values were strengthened as a result of this path by one-step and by 10-step Sarsa.

The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so much more is learned from the one episode.

n-step Expected Sarsa

According to the backup diagram above, the n-step version of Expected Sarsa has the same form as n-step Sarsa, except that its last element is a branch over all actions. So the n-step return is redefined as:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \overline{V}_{t+n-1}(S_{t+n}), \qquad t < T-n$
with $G_{t:t+n} = G_t$ for $t+n \geq T$, where $\overline{V}_t(s)$ is the expected approximate value of state $s$, using the estimated action values at time $t$, under the target policy:

$\overline{V}_t(s) = \sum_{a} \pi(a \mid s) Q_t(s, a), \quad$ for all $s \in \mathcal{S}$

If $s$ is terminal, then its expected approximate value is defined to be 0.

$\overline{V}_t(s)$ is a weighted average of the action values over all possible actions in the state, which is why it is denoted with a "V".
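As a small sketch, assuming (as in the sketches above) that `Q` is a `(num_states, num_actions)` array and that `pi` is an array of target-policy probabilities of the same shape, the expected approximate value can be computed as:

```python
import numpy as np

def v_bar(Q, pi, s):
    """Expected approximate value of state s under the target policy:
    V_bar(s) = sum_a pi(a|s) * Q(s, a)."""
    return float(np.dot(pi[s], Q[s]))
```

Replacing the bootstrap term `Q[states[tau + n], actions[tau + n]]` in the n-step Sarsa sketch above with `v_bar(Q, pi, states[tau + n])` turns it into a sketch of n-step Expected Sarsa.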

n-step off-policy Learning

Here comes a new concept, the importance sampling ratio, denoted $\rho_{t:t+n-1}$.

It is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:

$\rho_{t:h} = \prod\limits_{k=t}^{\min(h,\,T-1)} \dfrac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$

$\pi$ and $b$ in the formula above represent two different policies: the target policy $\pi$ and the behavior policy $b$.
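A minimal sketch of this ratio, assuming `pi` and `b` are `(num_states, num_actions)` arrays of action probabilities and that `states`/`actions` store the sampled trajectory:

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    return rho
```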

The off-policy version of n-step TD is:
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1} \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right], \qquad 0 \leq t < T$

Consider this: if the two policies are actually the same (the on-policy case), then $\rho$ is always 1, so the new update above can completely replace the earlier n-step TD update. Similarly, the previous n-step Sarsa update can be completely replaced by an off-policy form:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right], \qquad 0 \leq t < T$

The importance sampling ratio here starts and ends one step later than for n-step TD. This is because here we are updating a state-action pair: the action $A_t$ has already been taken, so it needs no importance weighting.

Here comes the pseudocode for the off-policy version of n-step Sarsa:

[Figure: pseudocode for off-policy n-step Sarsa]
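A hedged sketch of a single off-policy n-step Sarsa update at time $\tau$, under the same array-layout assumptions as the sketches above (this is only an illustration of where $\rho_{\tau+1:\tau+n}$ enters, not the book's pseudocode):

```python
def off_policy_sarsa_update(Q, states, actions, rewards, tau, n, T,
                            alpha, gamma, pi, b):
    """One off-policy n-step Sarsa update at time tau (a sketch).

    pi and b are (num_states, num_actions) arrays of target / behaviour
    action probabilities; the ratio used is rho_{tau+1:tau+n}."""
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi[states[k], actions[k]] / b[states[k], actions[k]]
    G = sum(gamma ** (k - tau - 1) * rewards[k]
            for k in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    Q[states[tau], actions[tau]] += alpha * rho * (G - Q[states[tau], actions[tau]])
```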

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

[Figure: the 3-step tree-backup backup diagram]
Down the central spine are three sample states and rewards and two sample actions; these are the events that occur after the initial state-action pair $(S_t, A_t)$. Hanging off to the sides are the actions that were not selected.

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update.

In the tree-backup update, the target includes the rewards along the way, the estimated value of the nodes at the bottom, plus the estimated values of the dangling action nodes hanging off the sides, at all levels.

The interior action nodes, corresponding to the actions actually taken, do not contribute their estimated values to the target; for these actions the actual rewards are known and are already included.

Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.

So, each first-level action $a$ contributes with a weight of $\pi(a \mid S_{t+1})$, except for the action actually taken, $A_{t+1}$; its probability, $\pi(A_{t+1} \mid S_{t+1})$, is instead used to weight all the second-level action values.
So, each non-selected second-level action $a'$ contributes with the weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(a' \mid S_{t+2})$.
Each non-selected third-level action $a''$ contributes with the weight $\pi(A_{t+1} \mid S_{t+1})\,\pi(A_{t+2} \mid S_{t+2})\,\pi(a'' \mid S_{t+3})$.

The following are the detailed equations for the n-step tree-backup algorithm:

The one-step return is the same as that of Expected Sarsa: $G_{t:t+1} = R_{t+1} + \gamma \sum\limits_{a} \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a), \quad t < T-1$

The two-step tree-backup return is:
$G_{t:t+2} = R_{t+1} + \gamma \sum\limits_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) \Big( R_{t+2} + \gamma \sum\limits_{a} \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a) \Big)$
$= R_{t+1} + \gamma \sum\limits_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}, \qquad t < T-1$

This form generalizes to the recursive definition of the general n-step tree-backup return:
$G_{t:t+n} = R_{t+1} + \gamma \sum\limits_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n}, \qquad t < T-1,\ n \geq 2$

The action-value update rule is the same as for n-step Sarsa:
$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right]$
for $0 \leq t < T$, while the values of all other state-action pairs remain unchanged.
And its pseudocode is:
[Figure: pseudocode for n-step Tree Backup]
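To make the recursion concrete, here is a minimal Python sketch of the tree-backup return. The array layout and names are my own assumptions; for simplicity a single `Q` array stands in for the time-indexed estimates $Q_{t+n-1}$ used in the book's definition.

```python
import numpy as np

def tree_backup_return(Q, pi, states, actions, rewards, t, n, T, gamma):
    """Recursive n-step tree-backup return G_{t:t+n} (a sketch).

    pi is a (num_states, num_actions) array of target-policy probabilities;
    states[k], actions[k], rewards[k] hold S_k, A_k, R_k along the sampled path."""
    if t + 1 >= T:                                 # t == T-1: only R_T remains
        return rewards[T]
    s_next = states[t + 1]
    expected = float(np.dot(pi[s_next], Q[s_next]))   # sum_a pi(a|S_{t+1}) Q(S_{t+1}, a)
    if n == 1:                                     # one-step (Expected Sarsa) target
        return rewards[t + 1] + gamma * expected
    a_next = actions[t + 1]
    # Remove the taken action from the expectation; it is replaced by the
    # recursively computed return G_{t+1:t+n}, weighted by pi(A_{t+1}|S_{t+1}).
    expected_others = expected - pi[s_next, a_next] * Q[s_next, a_next]
    return (rewards[t + 1]
            + gamma * expected_others
            + gamma * pi[s_next, a_next]
              * tree_backup_return(Q, pi, states, actions, rewards,
                                   t + 1, n - 1, T, gamma))
```

The returned value would then be plugged into the n-step-Sarsa-style update quoted above, e.g. `Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])`.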

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

If there is any infringement, it will be deleted immediately.
