Original paper: https://arxiv.org/pdf/2405.09999
Reward Centering
Abhishek Naik$^{1,2}$, Yi Wan$^{3}$, Manan Tomar$^{1,2}$, Richard S. Sutton$^{1,2}$
{abhishek.naik,mtomar,rsutton}@ualberta.ca, yiwan@meta.com
$^{1}$University of Alberta, $^{2}$Alberta Machine Intelligence Institute, $^{3}$Meta AI
Abstract
We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards’ empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem’s rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering.
Reinforcement learning is a computational approach to learning from interaction, where the goal of a learning agent is to obtain as much reward as possible (Sutton & Barto, 2018). In many problems of interest, the stream of interaction between the agent and the environment is continuing and cannot be naturally separated into disjoint subsequences or episodes. In continuing problems, agents experience infinitely many rewards, hence a viable way of evaluating performance is to measure the average reward obtained per step, or the rate of reward, with equal weight given to immediate and delayed rewards. The discounted-reward formulation offers another way to interpret a sum of infinite rewards by discounting delayed rewards in favor of immediate rewards. The two problem formulations are typically studied separately, each having a set of solution methods or algorithms.
In this paper, we show that the simple idea of estimating and subtracting the average reward from the observed rewards can lead to a significant improvement in performance (as in Figure 1) when using common discounted methods such as actor-critic methods (Barto et al., 1983) or Q-learning (Watkins & Dayan, 1992). The underlying theory dates back to 1962 with Blackwell’s seminal work on dynamic programming in discrete Markov decision processes (MDPs). We are still realizing some of its deeper implications, and we discuss the following two in particular:
Figure 1: Learning curves showing the difference in performance of Q-learning with and without reward centering for different discount factors on the Access-Control Queuing problem (Sutton & Barto, 1998). Plotted is the average per-step reward obtained by the agent across 50 runs w.r.t. the number of time steps of interaction. The shaded region denotes one standard error. See Section 4.
- Mean-centering the rewards removes a state-independent constant (that scales inversely with $1-\gamma$, where $\gamma$ denotes the discount factor) from the value estimates, enabling the value-function approximator to focus on the relative differences between the states and actions. As a result, values corresponding to discount factors arbitrarily close to one can be estimated relatively easily (e.g., without any degradation in performance; see Figure 1).
- Furthermore, mean-centering the rewards (unsurprisingly) makes standard methods robust to any constant offset in the rewards. This can be useful in reinforcement learning applications in which the reward signal is unknown or changing.
We begin with what reward centering is and why it can be beneficial (Section 1). We then show how reward centering can be done, starting with the simplest form (within the prediction problem), and show that it can be highly effective when used with discounted-reward temporal difference algorithms (Section 2). The off-policy setting requires more sophistication; for it we propose another way of reward centering based on recent advances in the average-reward formulation for reinforcement learning (Section 3). Next, we present a case study of using reward centering with Q-learning, in which we (a) propose a convergence result based on recent work by Devraj and Meyn (2021) and (b) showcase consistent trends across a series of control problems that require tabular, linear, and non-linear function approximation (Section 4). Finally, we discuss the limitations of the proposed methods and propose directions of future work (Section 5).
1 Theory of Reward Centering
We formalize the interaction between the agent and the environment by a finite MDP $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p)$, where $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, $\mathcal{R}$ denotes the set of rewards, and $p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$ denotes the transition dynamics. At time step $t$, the agent is in state $S_t \in \mathcal{S}$, takes action $A_t \in \mathcal{A}$ using a behavior policy $b : \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, observes the next state $S_{t+1} \in \mathcal{S}$ and reward $R_{t+1} \in \mathcal{R}$ according to the transition dynamics $p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$. We consider continuing problems, where the agent-environment interaction goes on ad infinitum. The agent's goal is to maximize the average reward obtained over a long time (formally defined in (2)). We consider methods that try to achieve this goal by estimating the expected discounted sum of rewards from each state for $\gamma \in [0,1)$: $v_\pi^\gamma(s) \doteq \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_t = s,\, A_{t:\infty} \sim \pi\big], \forall s$. Here, the discount factor is not part of the problem but an algorithm parameter (see Naik et al. (2019) or Sutton & Barto's (2018) Section 10.4 for an extended discussion on objectives for continuing problems).
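For reference, these discounted values satisfy the standard Bellman equation (a textbook identity, not restated elsewhere in this excerpt); the small numerical sketch after Figure 2 below solves its matrix form directly:
$$
v_\pi^{\gamma}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi^{\gamma}(s') \,\big], \qquad \forall s \in \mathcal{S}.
$$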
Reward centering is a simple idea: subtract the empirical average of the observed rewards from the rewards. Doing so makes the modified rewards appear mean centered. The effect of mean-centered rewards is well known in the bandit setting. For instance, Sutton and Barto (2018, Section 2.8) demonstrated that estimating and subtracting the average reward from the observed rewards can significantly improve the rate of learning. Here we show that the benefits extend to the full reinforcement learning problem and are magnified as the discount factor $\gamma$ approaches one.
The reason underlying the benefits of reward centering is revealed by the Laurent-series decomposition of the discounted value function. The discounted value function can be decomposed into two parts, one of which is a constant that does not depend on states or actions and hence is not involved in, say, action selection. Mathematically, for the tabular discounted value function $v_\pi^\gamma : \mathcal{S} \rightarrow \mathbb{R}$ of a policy $\pi$ corresponding to a discount factor $\gamma$:
$$v_\pi^\gamma(s) = \frac{r(\pi)}{1-\gamma} + \tilde{v}_\pi(s) + e_\pi^\gamma(s), \qquad \forall s, \tag{1}$$
where $r(\pi)$ is the state-independent average reward obtained by policy $\pi$ and $\tilde{v}_\pi(s)$ is the differential value of state $s$, each defined for ergodic MDPs (for ease of exposition) as (e.g., Wan et al., 2021)
$$r(\pi) \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big], \tag{2}$$
$$\tilde{v}_\pi(s) \doteq \sum_{t=0}^{\infty} \mathbb{E}\big[R_{t+1} - r(\pi) \mid S_0 = s,\, A_{0:\infty} \sim \pi\big], \tag{3}$$
and $e_\pi^\gamma(s)$ denotes an error term that goes to zero as the discount factor goes to one (Blackwell, 1962: Theorem 4a; also see Puterman's (1994) Corollary 8.2.4). This decomposition of the state values also implies a similar decomposition for state-action values.
The Laurent-series decomposition explains how reward centering can help learning in bandit problems such as the one in Sutton & Barto’s (2018) Figure 2.5. There, the action-value estimates are initialized to zero and the true values are centered around +4 . The actions are selected based on their relative values, but each action-value estimate must independently learn the same constant offset. Approximation errors in estimating the offset can easily mask the relative differences in actions, especially if the offset is large.
In the full reinforcement learning problem, the state-independent offset can be quite large. For example, consider the three-state Markov reward process shown in Figure 2 (induced by some policy $\pi$ in some MDP). The reward is +3 on the transition from state A to state B, and 0 otherwise. The average reward is $r(\pi) = 1$. The discounted state values for three discount factors are shown in the table. Note the magnitude of the standard discounted values and especially the jump when the discount factor is increased. Now consider the discounted values with the constant offset subtracted from each state, $v_\pi^\gamma(s) - r(\pi)/(1-\gamma)$, which we call the centered discounted values. The centered values are much smaller in magnitude and change only slightly when the discount factor is increased. The differential values are also shown for reference. These trends hold in general: for any problem, the magnitude of the discounted values increases dramatically as the discount factor approaches one, whereas the centered discounted values change little and approach the differential values.
Figure 2: Comparison of the standard and the centered discounted values on a simple example.
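To make the comparison in Figure 2 concrete, here is a minimal sketch that computes the standard and centered discounted values numerically. It assumes the simplest process consistent with the description, a deterministic cycle A → B → C → A with a reward of +3 on the A → B transition, which yields the stated average reward $r(\pi) = 1$; the paper's exact transition table is not reproduced in this excerpt.

```python
import numpy as np

# Assumed structure (consistent with the text): deterministic cycle A -> B -> C -> A,
# with a reward of +3 on the A -> B transition and 0 otherwise.
P = np.array([[0.0, 1.0, 0.0],   # A -> B
              [0.0, 0.0, 1.0],   # B -> C
              [1.0, 0.0, 0.0]])  # C -> A
r = np.array([3.0, 0.0, 0.0])    # expected one-step reward when leaving A, B, C

avg_reward = np.full(3, 1 / 3) @ r   # uniform stationary distribution => r(pi) = 1

for gamma in (0.8, 0.9, 0.99):
    # Standard discounted values solve the Bellman system v = r + gamma * P v.
    v = np.linalg.solve(np.eye(3) - gamma * P, r)
    # Centered discounted values: subtract the constant offset r(pi) / (1 - gamma).
    v_centered = v - avg_reward / (1 - gamma)
    print(f"gamma={gamma}: v={np.round(v, 2)}, centered={np.round(v_centered, 2)}")

# As gamma -> 1 the standard values blow up like 1/(1 - gamma), while the centered
# values stay small and approach the differential values (here approximately [1, -1, 0]).
```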
Formally, the centered discounted values are the expected discounted sum of mean-centered rewards:
$$\tilde{v}_\pi^\gamma(s) \doteq \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \big(R_{t+1} - r(\pi)\big) \;\Big|\; S_t = s,\, A_{t:\infty} \sim \pi\Big] = \tilde{v}_\pi(s) + e_\pi^\gamma(s),$$
where $\gamma \in [0,1]$. When $\gamma = 1$, the centered discounted values are the same as the differential values, that is, $\tilde{v}_\pi^\gamma(s) = \tilde{v}_\pi(s), \forall s$. More generally, the centered discounted values are the differential values plus the error terms from the Laurent-series decomposition, as shown on the right above.
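The link between the centered and the standard discounted values is a one-line geometric-series step, worth spelling out since it is the identity used throughout this section (a routine derivation from the definitions above):
$$
\tilde{v}_\pi^{\gamma}(s)
= \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \;\Big|\; S_t = s,\, A_{t:\infty} \sim \pi\Big]
  \;-\; r(\pi) \sum_{t=0}^{\infty} \gamma^{t}
= v_\pi^{\gamma}(s) - \frac{r(\pi)}{1-\gamma},
$$
and substituting the Laurent-series decomposition (1) for $v_\pi^{\gamma}(s)$ recovers $\tilde{v}_\pi^{\gamma}(s) = \tilde{v}_\pi(s) + e_\pi^{\gamma}(s)$.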
Reward centering thus enables capturing all the information within the discounted value function via two components: (1) the constant average reward and (2) the centered discounted value function. Such a decomposition can be immensely valuable: (a) As $\gamma \rightarrow 1$, the discounted values tend to explode but the centered discounted values remain small and tractable. (b) If the problem's rewards are shifted by a constant $c$, then the magnitude of the discounted values increases by $c/(1-\gamma)$, but the centered discounted values are unchanged because the average reward increases by $c$. These effects are demonstrated in the following sections.
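Point (b) can be verified directly; a sketch of the argument, writing the shifted rewards as $R_{t+1} + c$:
$$
\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} \big(R_{t+1} + c\big) \;\Big|\; S_t = s,\, A_{t:\infty} \sim \pi\Big]
= v_\pi^{\gamma}(s) + \frac{c}{1-\gamma},
\qquad\text{while}\qquad
r(\pi) \;\to\; r(\pi) + c,
$$
so the added constant $c/(1-\gamma)$ appears in the uncentered values but cancels exactly when the new offset $\big(r(\pi)+c\big)/(1-\gamma)$ is subtracted, leaving the centered discounted values unchanged.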
Reward centering also enables the design of algorithms in which the discount factor (an algorithm parameter) can be changed within the lifetime of a learning agent. This is usually inefficient or ineffective with standard discounted algorithms because their uncentered values can change massively (Figure 2). In contrast, centered values may change little, and the changes become minuscule as the discount factor approaches 1. We discuss this exciting direction in the final section.
To obtain these potential benefits, we need to estimate the average reward from data. In the next section we show that even the simplest method can be quite effective.
2 Simple Reward Centering
The simplest way to estimate the average reward is to maintain a running average of the rewards observed so far. That is, if $\bar{R}_t \in \mathbb{R}$ denotes the estimate of the average reward after $t$ time steps, then $\bar{R}_t = \frac{1}{t} \sum_{k=1}^{t} R_k$. More generally, the estimate can be updated with a step-size parameter $\beta_t$:
$$\bar{R}_{t+1} \doteq \bar{R}_t + \beta_t \big(R_{t+1} - \bar{R}_t\big). \tag{4}$$
This update leads to an unbiased estimate of the average reward, $\bar{R}_t \approx r(\pi)$, for the policy $\pi$ generating the data, if the step sizes follow standard conditions (Robbins & Monro, 1951).
Simple centering (4) can be used with almost any reinforcement learning algorithm. For example, it can be combined with conventional temporal-difference (TD) learning (see Sutton, 1988a) to learn a state-value function estimate $\widetilde{V}^\gamma : \mathcal{S} \rightarrow \mathbb{R}$ by updating, on the transition from $t$ to $t+1$:
$$\widetilde{V}_{t+1}^\gamma(S_t) \doteq \widetilde{V}_t^\gamma(S_t) + \alpha_t \Big[R_{t+1} - \bar{R}_t + \gamma\, \widetilde{V}_t^\gamma(S_{t+1}) - \widetilde{V}_t^\gamma(S_t)\Big], \tag{5}$$
with $\widetilde{V}_{t+1}^\gamma(s) \doteq \widetilde{V}_t^\gamma(s), \forall s \neq S_t$, where $\alpha_t > 0$ is a step-size parameter.
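Below is a minimal sketch of simple reward centering combined with tabular TD-learning, implementing the running-average update (4) and the centered TD update (5). The function name and the `env_step` interface are illustrative conventions for this sketch, not names from the paper; `beta = eta * alpha` follows the parameterization used in the experiments described next.

```python
import numpy as np

def centered_td_prediction(env_step, n_states, s0, gamma=0.99,
                           alpha=0.1, eta=0.1, n_steps=50_000):
    """On-policy prediction with simple reward centering (a sketch).

    env_step(s) -> (next_state, reward) is assumed to sample one transition
    under the target policy.
    """
    V = np.zeros(n_states)   # centered value estimates, V-tilde^gamma
    r_bar = 0.0              # average-reward estimate, R-bar
    beta = eta * alpha       # step size for the average-reward estimate
    s = s0
    for _ in range(n_steps):
        s_next, reward = env_step(s)
        # Centered TD update (5): center the reward, then do a standard TD step.
        td_error = (reward - r_bar) + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
        # Simple reward centering (4): running average of the observed rewards.
        r_bar += beta * (reward - r_bar)
        s = s_next
    return V, r_bar
```

In this sketch, keeping `r_bar` fixed at zero recovers the uncentered baseline, and fixing it to the true $r(\pi)$ gives the oracle-centered variant discussed below.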
We used four algorithmic variations of (5) differing only in the definition of $\bar{R}_t$ in our first set of experiments. One algorithm used $\bar{R}_t = 0, \forall t$, and thus involves no reward centering. The second algorithm used the best possible estimate of the average reward: $\bar{R}_t = r(\pi), \forall t$; we call this oracle centering. The third algorithm used simple reward centering as in (4). The fourth algorithm used a more sophisticated kind of reward centering which we discuss in the next section.
The environment was an MDP with seven states in a row, with two actions in each state. The right action from the rightmost state leads to the middle state with a reward of +7, and the left action from the leftmost state leads to the middle state with a reward of +1; all other transitions have zero rewards. The target policy takes both actions in each state with equal probability, that is, $\pi(\text{left} \mid \cdot) = \pi(\text{right} \mid \cdot) = 0.5$. The average reward corresponding to this policy is $r(\pi) = 0.25$.
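For completeness, a sketch of this seven-state environment under the equiprobable random policy, assuming that non-boundary actions move to the adjacent state (the text specifies only the boundary transitions, but this reading matches the stated $r(\pi) = 0.25$); it plugs into the `env_step` interface of the prediction sketch above.

```python
import numpy as np

N_STATES, MIDDLE = 7, 3
rng = np.random.default_rng(0)

def env_step(s):
    """One transition of the seven-state MRP under the 0.5/0.5 policy.

    Assumed layout: states 0..6 in a row; actions move to the adjacent state,
    except that 'left' from state 0 jumps to the middle state with reward +1
    and 'right' from state 6 jumps to the middle state with reward +7.
    """
    go_right = rng.random() < 0.5
    if s == 0 and not go_right:
        return MIDDLE, 1.0
    if s == N_STATES - 1 and go_right:
        return MIDDLE, 7.0
    return (s + 1, 0.0) if go_right else (s - 1, 0.0)

# Example use with the sketch above (hypothetical call):
# V, r_bar = centered_td_prediction(env_step, N_STATES, s0=MIDDLE, gamma=0.9)
```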
Our first experiment applied the four algorithms to the seven-state MDP with two discount factors, $\gamma = 0.9$ and $0.99$. All algorithms were run with a range of values for the step-size parameter $\alpha$. The algorithms that learned to center were run with different values of $\eta$, where $\beta = \eta\alpha$ (without loss of generality). Each parameter setting for each algorithm was run for 50,000 time steps, and then repeated for 50 runs. The full experimental details are in Appendix C. As a measure of performance at time $t$, we used the root-mean-squared value error (RMSVE; see Sutton & Barto, 2018, Section 9.2) between $\widetilde{V}_t^\gamma$ and $\tilde{v}_\pi^\gamma$ for the centered algorithms, and between $\widetilde{V}_t^\gamma$ and $v_\pi^\gamma$ for the algorithm without centering. There was no separate training and testing period.
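As a reminder of the performance measure (following Sutton & Barto's Section 9.2 definition; the exact state weighting used in this particular experiment is not restated in this excerpt):
$$
\text{RMSVE}\big(\widetilde{V}_t^{\gamma}\big) \;=\; \sqrt{\sum_{s \in \mathcal{S}} \mu(s)\,\Big[\widetilde{V}_t^{\gamma}(s) - \tilde{v}_\pi^{\gamma}(s)\Big]^{2}},
$$
where $\mu$ is a weighting over states (typically the on-policy state distribution); for the uncentered algorithm, $\tilde{v}_\pi^{\gamma}$ is replaced by $v_\pi^{\gamma}$.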
Learning curves for this experiment and each value of $\gamma$ are shown in the first column of Figure 3. For all algorithms, we show only curves for the $\alpha$ value that was best for TD-learning without reward centering. For the centering methods, the curve shown is for the best choice of $\eta$ from a coarse search over a broad range. Each solid point represents the RMSVE averaged over the 50 independent runs; the shaded region shows one standard error.
First note that the learning curves start much lower when the rewards are centered by an oracle; for the other algorithms, the first error is of the order $r(\pi)/(1-\gamma)$. TD-learning without centering (blue) eventually reached the same error rate as the oracle-centered algorithm (orange), as expected. Learning the average reward and subtracting it (green) indeed helps reduce the RMSVE much faster compared to when there is no centering. However, the eventual error rate is slightly higher, which is expected because the average-reward estimate is changing over time, leading to more variance in the updates compared to the uncentered or oracle-centered version. Similar trends hold for the larger discount factor (lower left), with the uncentered approach appearing much slower in comparison (note the difference in the axes' scales). In both cases, we verified that the average-reward estimate across the runs was around 0.25.
These experiments show that the simple reward-centering technique can be quite effective in the on-policy setting, and the effect is more pronounced for larger discount factors.