n-step Bootstrapping: Part 1

This post covers n-step TD methods for prediction and control, including n-step TD prediction, n-step Sarsa, and n-step Expected Sarsa, and discusses how the choice of n affects learning.

Prediction

Actually, n-step TD methods lie between MC and TD(0). They perform an update based on an intermediate number of rewards: more than one (TD(0)), but fewer than all of them up to termination (MC). So both MC and TD(0) are extreme cases of n-step TD.

In MC, the update occurs at the end of an episode, while in TD(0), the update occurs at the next time step.

Some backup diagrams of specific n-step methods are shown in the following figure:
[Figure: backup diagrams of n-step methods, from one-step TD up to Monte Carlo]
Notice that all the diagrams start and end with a state, because we estimate the state value $v_\pi(S)$. In the control part, we estimate the action value $q_\pi(S, A)$, whose diagrams all start and end with an action.

In Monte Carlo updates the target is the return, whereas in one-step updates the target is the first reward plus the discounted estimated value of the next state.

n-step TD

The returns used as targets:

MC uses the complete return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$$

one-step return:
$$G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$$
where $V_t$ is the estimate of $v_\pi$ at time $t$.

two-step return:
$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

n-step return:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$

Note that n-step returns for $n > 1$ involve future rewards and states that are not available at the time of transition from $t$ to $t+1$.

No real algorithm can use the n-step return until after it has seen $R_{t+n}$ and computed $V_{t+n-1}$. The first time these are available is $t+n$.

This also leads to the problem that no changes at all are made during the first $n-1$ steps of each episode.
Here is the state-value learning algorithm:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\,[G_{t:t+n} - V_{t+n-1}(S_t)], \qquad 0 \leq t < T$$
This is the n-step TD algorithm. It only changes the value of $S_t$; the values of all other states remain unchanged: $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$.

The complete pseudocode is given as:
[Figure: pseudocode for n-step TD for estimating $V \approx v_\pi$]
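
Since the pseudocode image did not survive, here is a minimal tabular Python sketch of n-step TD prediction. It assumes a hypothetical environment with `reset() -> state` and `step(action) -> (next_state, reward, done)`, integer state indices, and a fixed policy function; all of these names are illustrative, not from the book.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, num_states,
                         alpha=0.1, gamma=1.0, num_episodes=1000):
    """Tabular n-step TD estimate of v_pi for a fixed policy (sketch)."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]   # states[t] = S_t
        rewards = [0.0]          # dummy entry so that rewards[t+1] = R_{t+1}
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1      # the time whose state estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: up to n rewards, plus a bootstrapped value
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```

Note how the update for time $\tau = t - n + 1$ is delayed by $n-1$ steps, exactly as described above.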

The expectation of the n-step return is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$, in a worst-state sense.
The worst error of the expected n-step return is guaranteed to be less than or equal to $\gamma^n$ times the worst error under $V_{t+n-1}$:
$$\max_s \left| E_\pi[G_{t:t+n} \mid S_t = s] - v_\pi(s) \right| \leq \gamma^n \max_s \left| V_{t+n-1}(s) - v_\pi(s) \right|$$

The choice of n

[Figure: average RMS error of n-step TD methods on the 19-state random walk, for various values of n and α]
The figure above shows how the estimates behave for different choices of n. The performance measure is the square root of the average squared error between the predictions at the end of the episode for the 19 states and their true values; the lower the average RMS error, the better. From the figure, methods with an intermediate value of $n$ worked best, which implies that neither MC nor TD(0) is the best method.

Control

n-step Sarsa

The idea is to apply n-step methods to control. As previously mentioned, the updates now focus on state-action pairs, and the action-value function $Q(s, a)$ takes the place of the state-value function $V(s)$.

The n-step return is redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \geq 1,\ 0 \leq t < T-n$$
with $G_{t:t+n} = G_t$ if $t+n \geq T$.

And the update algorithm is then:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\,[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], \qquad 0 \leq t < T$$

Also, as in prediction, the values of all other state-action pairs remain unchanged: $Q_{t+n}(s, a) = Q_{t+n-1}(s, a)$ for all $s, a$ such that $s \neq S_t$ or $a \neq A_t$. This algorithm is called n-step Sarsa.
[Figure: pseudocode for n-step Sarsa]
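
Under the same illustrative environment interface as the prediction sketch above, a minimal on-policy n-step Sarsa loop might look like the following; the ε-greedy policy, array shapes, and parameter names are assumptions for the sketch, not the book's pseudocode.

```python
import numpy as np

def n_step_sarsa(env, n, num_states, num_actions,
                 alpha=0.1, gamma=1.0, epsilon=0.1, num_episodes=500):
    """Tabular on-policy n-step Sarsa (sketch)."""
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        # behave greedily w.r.t. Q with probability 1 - epsilon
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        states = [env.reset()]
        actions = [eps_greedy(states[0])]
        rewards = [0.0]                      # dummy so rewards[t+1] = R_{t+1}
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(actions[t])
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s_next))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```

Setting n = 1 in this sketch recovers ordinary one-step Sarsa.
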
And the backup diagram is:
[Figure: backup diagram for n-step Sarsa]
The following figure makes it clear that learning can be accelerated by n-step Sarsa compared with one-step methods.
[Figure: gridworld example — path taken by the agent, and the action values strengthened by one-step Sarsa versus 10-step Sarsa]
The first panel shows the complete path taken by the agent; G is the goal position, the only place where the agent receives a positive reward. The arrows in the other two panels show which action values were strengthened as a result of this path by the one-step and 10-step Sarsa methods, respectively.

The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so much more is learned from the one episode.

n-step Expected Sarsa

According to the backup diagram above, the n-step version of Expected Sarsa has the same form as n-step Sarsa, except that its last element is a branch over all possible actions. So the n-step return should be redefined as:
$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \overline{V}_{t+n-1}(S_{t+n}), \qquad t < T-n$$
with $G_{t:t+n} = G_t$ for $t+n \geq T$, where $\overline{V}_t(s)$ is the expected approximate value of state $s$, using the estimated action values at time $t$, under the target policy:

$$\overline{V}_t(s) = \sum_a \pi(a|s)\,Q_t(s, a), \qquad \text{for all } s \in \mathcal{S}$$

If $s$ is terminal, then its expected approximate value is defined to be 0.

$\overline{V}_t(s)$ weights all the possible actions of the state, which is why it is written in the form of a "V".
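
As a small illustration, the expected approximate value can be computed directly from a tabular Q and a policy table; storing the policy as `pi[s]` holding the probabilities $\pi(\cdot|s)$ is an assumed representation for this sketch.

```python
import numpy as np

def expected_state_value(Q, pi, s, terminal=False):
    # V̄_t(s) = Σ_a π(a|s) Q_t(s, a); defined to be 0 if s is terminal
    return 0.0 if terminal else float(np.dot(pi[s], Q[s]))
```

Replacing the bootstrap term $Q_{t+n-1}(S_{t+n}, A_{t+n})$ in the n-step Sarsa sketch with this quantity gives n-step Expected Sarsa.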

n-step off-policy Learning

Here comes a new concept, the importance sampling ratio, denoted $\rho_{t:t+n-1}$.

It is the relative probability under the two policies of taking the n actions from $A_t$ to $A_{t+n-1}$:

$$\rho_{t:h} = \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

In the formula above, $\pi$ is the target policy and $b$ is the behaviour policy.
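
A small sketch of computing this truncated ratio, assuming both policies are stored as probability tables indexed by state and action (illustrative names, not a fixed API):

```python
def importance_ratio(pi, b, states, actions, t, h, T):
    """rho_{t:h} = prod_{k=t}^{min(h, T-1)} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for k in range(t, min(h, T - 1) + 1):
        rho *= pi[states[k]][actions[k]] / b[states[k]][actions[k]]
    return rho
```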

The off-policy version of n-step TD is:
$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\,\rho_{t:t+n-1}\,[G_{t:t+n} - V_{t+n-1}(S_t)], \qquad 0 \leq t < T$$

Consider this: if the two policies are actually the same (the on-policy case), then $\rho$ is always 1, so the new update above can completely replace the earlier n-step TD update. Similarly, the previous n-step Sarsa update can be completely replaced by an off-policy form:
$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\,\rho_{t+1:t+n}\,[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], \qquad 0 \leq t < T$$

The importance sampling ratio here starts and ends one step later than for n-step TD. This is because here we are updating a state-action pair: having already selected $A_t$, we want to learn fully from what happens, with importance sampling only for the subsequent actions.

Here comes the pseudocode for the off-policy version of n-step Sarsa:
[Figure: pseudocode for off-policy n-step Sarsa]
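
Since the pseudocode image is missing, here is a sketch of just the update step, showing how it differs from on-policy n-step Sarsa: the TD error is reweighted by $\rho_{\tau+1:\tau+n}$. The variable names mirror the earlier sketches and are assumptions, not the book's notation.

```python
def off_policy_sarsa_update(Q, pi, b, states, actions, G, tau, n, T, alpha):
    """Apply one off-policy n-step Sarsa update for the pair at time tau."""
    # rho_{tau+1 : tau+n}: product of pi/b ratios over the subsequent actions
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi[states[k]][actions[k]] / b[states[k]][actions[k]]
    s, a = states[tau], actions[tau]
    Q[s][a] += alpha * rho * (G - Q[s][a])
```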

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

[Figure: the 3-step tree-backup backup diagram]
Down the central spine are three sample states and rewards and two sample actions; these are the events that occur after the initial state-action pair $(S_t, A_t)$. Hanging off to the sides are the actions that were not selected.

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update.

In the tree-backup update, the target includes the rewards along the way, the estimated value of the nodes at the bottom, plus the estimated values of the dangling action nodes hanging off the sides, at all levels.

Meanwhile, the action nodes in the interior, corresponding to the actions actually taken, do not contribute their estimated values to the update, because their sampled rewards are already included along the spine.

Each leaf node contributes to the target with a weight proportional to its probability of occurring under the target policy $\pi$.

So each first-level action $a$ contributes with a weight of $\pi(a|S_{t+1})$, except for the action actually taken, $A_{t+1}$; its probability, $\pi(A_{t+1}|S_{t+1})$, is instead used to weight all the second-level action values.
Thus each non-selected second-level action $a'$ contributes with the weight $\pi(A_{t+1}|S_{t+1})\,\pi(a'|S_{t+2})$.
Each non-selected third-level action $a''$ contributes with the weight $\pi(A_{t+1}|S_{t+1})\,\pi(A_{t+2}|S_{t+2})\,\pi(a''|S_{t+3})$.

The following are the detailed equations for the n-step tree-backup algorithm.

The one-step return is the same as that of Expected Sarsa:
$$G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a|S_{t+1})\,Q_t(S_{t+1}, a), \qquad t < T-1$$

The two-step tree-backup return is:
$$\begin{aligned} G_{t:t+2} &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a|S_{t+1})\,Q_{t+1}(S_{t+1}, a) + \gamma\,\pi(A_{t+1}|S_{t+1})\Big(R_{t+2} + \gamma \sum_{a} \pi(a|S_{t+2})\,Q_{t+1}(S_{t+2}, a)\Big) \\ &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a|S_{t+1})\,Q_{t+1}(S_{t+1}, a) + \gamma\,\pi(A_{t+1}|S_{t+1})\,G_{t+1:t+2} \end{aligned}$$

for $t < T-1$. The last form suggests the general recursive definition of the n-step tree-backup return:
$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a|S_{t+1})\,Q_{t+n-1}(S_{t+1}, a) + \gamma\,\pi(A_{t+1}|S_{t+1})\,G_{t+1:t+n}, \qquad t < T-1,\ n \geq 2$$
with $G_{T-1:t+n} \doteq R_T$.

The action-value update rule is the same as for n-step Sarsa:
$$Q_{t+n}(S_t, A_t) \doteq Q_{t+n-1}(S_t, A_t) + \alpha\,[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]$$
for $0 \leq t < T$, while the values of all other state-action pairs remain unchanged.
And its pseudocode is:
[Figure: pseudocode for n-step Tree Backup]
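
As the pseudocode image is missing, here is a small recursive sketch of the n-step tree-backup return built directly from the equations above. It assumes a tabular `Q`, a policy table `pi`, and trajectory lists indexed so that `states[k]` is $S_k$, `actions[k]` is $A_k$, and `rewards[k]` is $R_k$ (with a dummy entry at index 0); these conventions are illustrative.

```python
def tree_backup_return(Q, pi, states, actions, rewards, t, n, T, gamma=1.0):
    """G_{t:t+n} for the n-step tree-backup algorithm (sketch)."""
    # G_{T-1 : T-1+n} = R_T: the episode has ended, nothing left to bootstrap
    if t == T - 1:
        return rewards[T]
    s1 = states[t + 1]
    expected = sum(pi[s1][a] * Q[s1][a] for a in range(len(Q[s1])))
    # base case: the one-step (Expected Sarsa) target
    if n == 1:
        return rewards[t + 1] + gamma * expected
    a1 = actions[t + 1]
    # leave the taken action out of the expectation and recurse along it
    leaves = expected - pi[s1][a1] * Q[s1][a1]
    G_next = tree_backup_return(Q, pi, states, actions, rewards,
                                t + 1, n - 1, T, gamma)
    return rewards[t + 1] + gamma * leaves + gamma * pi[s1][a1] * G_next
```

The returned value can then be plugged into the n-step Sarsa-style update of $Q(S_t, A_t)$ given above; no importance sampling ratio is needed.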

References

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.

