Buffer Design for Exploration in Reinforcement Learning

Memory-Driven Trajectory-Conditioned Policies: Exploration and Imitation Learning under Sparse Rewards
The paper proposes a new memory-driven trajectory-conditioned policy that guides exploration by exploiting the diversity of past experiences. The policy processes trajectories stored in memory with an attention mechanism and generates new action sequences, which improves performance in sparse-reward environments. Experiments show that the method outperforms prior approaches on challenging tasks such as hard-exploration Atari games.

Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards

To some extent, this is an RL version of breadth-first search: the frontier is kept as a buffer, and the trajectories that reach the states in the frontier are used as the learning targets for the RL agent (via imitation learning).
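As a concrete illustration (not the paper's actual code), a minimal version of such a buffer could keep, for each discretized state, the best trajectory found so far that reaches it. The sketch below is plain MATLAB; the `stateKey` discretization and the "higher return, then shorter length" preference rule are assumptions made for illustration.

```matlab
% Minimal trajectory-buffer update: for every (discretized) state key, keep
% the best trajectory found so far that reaches it.
function buffer = updateBuffer(buffer, trajectory, finalState, totalReturn)
    key = stateKey(finalState);                   % discretized state id
    entry = struct('traj', {trajectory}, ...      % sequence of (state, action) pairs
                   'ret',  totalReturn, ...
                   'len',  numel(trajectory));
    if ~isKey(buffer, key)
        buffer(key) = entry;                      % novel state: always store
    else
        old = buffer(key);
        % prefer higher return, break ties with shorter trajectories
        if entry.ret > old.ret || (entry.ret == old.ret && entry.len < old.len)
            buffer(key) = entry;
        end
    end
end

function key = stateKey(state)
    % Illustrative coarse discretization of a numeric state vector
    key = mat2str(round(state(:)' * 10) / 10);
end
```

The buffer itself can be created as `buffer = containers.Map('KeyType','char','ValueType','any')`; at the start of each episode, one of the stored trajectories is sampled as the demonstration to condition on.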

In the sparse-reward setting, a positive reward can only be received after a long sequence of appropriate actions. Gradient-based parameter updates are incremental and slow and have a global impact on all parameters, which may cause catastrophic forgetting and performance degradation.

Many parametric approaches rely on recent samples and do not explore the state space systematically; they may forget positive-reward trajectories unless such good trajectories are collected frequently.

Recently, non-parametric memory of past experiences has been employed in DRL algorithms to improve policy learning and sample efficiency.

However, this can lead to myopic behaviors.

In an RL setting, we aim to generate new trajectories visiting novel states by editing or augmenting the trajectories stored in the memory from past experiences. We propose a novel trajectory-conditioned policy where a full sequence of states is given as the condition. Then a sequence-to-sequence model with an attention mechanism learns to ‘translate’ the demonstration trajectory to a sequence of actions and generate a new trajectory in the environment with stochasticity. The single policy could take diverse trajectories as the condition, imitate the demonstrations to reach diverse regions in the state space, and allow for flexibility in the action choices to discover novel states.
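As a rough sketch of how such an attention mechanism might look (an illustration, not the authors' implementation): the embedding of the current state attends over the embeddings of the demonstration states, and the resulting context vector is fed to the action head. The projection matrices `Wq` and `Wk` stand in for learned parameters, and the dimensions are assumed.

```matlab
% Illustrative attention step for a trajectory-conditioned policy.
%   demoEmb : d x T matrix of demonstration-state embeddings
%   curEmb  : d x 1 embedding of the current state
%   Wq, Wk  : d x d projection matrices (learned parameters in practice)
function [context, weights] = attendToDemo(demoEmb, curEmb, Wq, Wk)
    q = Wq * curEmb;                          % query from the current state
    K = Wk * demoEmb;                         % keys from the demonstration states
    scores = (K' * q) / sqrt(size(K, 1));     % scaled dot-product scores, T x 1
    scores = scores - max(scores);            % numerical stability
    weights = exp(scores) / sum(exp(scores)); % softmax over demonstration steps
    context = demoEmb * weights;              % weighted sum of demo embeddings
end
```

In a full model, the context vector would be combined with the current state embedding inside a recurrent policy head that outputs a distribution over actions; sampling from that distribution is what gives the policy the stochastic freedom to deviate from the demonstration and reach novel states.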

Our main contributions are summarized as follows. (1) We propose a novel architecture for a trajectory-conditioned policy that can flexibly imitate diverse demonstration trajectories. (2) We show the importance of exploiting diverse past experiences in the memory to indirectly drive exploration, by comparing with existing approaches on various sparse-reward RL tasks with stochasticity in the environments. (3) We achieve performance superior to the state-of-the-art under 5 billion frames on the hard-exploration Atari games Montezuma's Revenge and Pitfall.
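Contribution (2) hinges on how demonstrations are picked from the memory. One simple, hypothetical way to bias selection toward rarely revisited regions is count-based sampling over the stored end states; the weighting rule below is an assumption for illustration, not necessarily the one used in the paper.

```matlab
% Illustrative demonstration selection: favor stored trajectories whose end
% states have been chosen least often, to spread exploration effort.
function [demo, key] = sampleDemonstration(buffer, visitCounts)
    ks = keys(buffer);                    % all stored state keys
    n = numel(ks);
    w = zeros(n, 1);
    for i = 1:n
        c = 0;
        if isKey(visitCounts, ks{i})
            c = visitCounts(ks{i});
        end
        w(i) = 1 / sqrt(c + 1);           % rarely chosen keys get more weight
    end
    w = w / sum(w);
    idx = find(rand() <= cumsum(w), 1);   % categorical sample from the weights
    if isempty(idx)
        idx = n;                          % guard against floating-point round-off
    end
    key = ks{idx};
    entry = buffer(key);
    demo = entry.traj;
end
```

Pairing this selection rule with the buffer sketch above gives the basic loop: sample a stored trajectory, imitate it with the trajectory-conditioned policy, and store any improved or novel trajectory back into the buffer.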

### Controller Design Based on Deep Reinforcement Learning

#### 1. Deep reinforcement learning basics

Deep reinforcement learning (DRL) combines deep neural networks with reinforcement learning (RL) and optimizes decision making through trial and error. DRL can handle high-dimensional input spaces and automatically extract features, performing well in complex environments[^1].

#### 2. Choosing the controller architecture

For the control of a vertical take-off and landing (VTOL) system, an Actor-Critic structure can serve as the basic framework: the actor outputs an action given the current state, while the critic evaluates how good that action is, and the policy parameters are adjusted accordingly.

#### 3. Environment modeling and reward design

For the agent to learn a specific task, a suitable simulation environment and a reasonable reward mechanism must be defined. For a VTOL aircraft, physical quantities such as attitude angles and velocities should all be taken into account, and safety bounds should be considered to prevent loss of control.

#### 4. Data collection and preprocessing

In the early stage of training, the system usually explores randomly for a period of time to accumulate an initial experience dataset. These samples are stored in a data structure called a replay buffer for later sampling. Note that noise during real operation can bias some observations, so a certain amount of smoothing is useful to improve generalization.

#### 5. Training procedure overview

Training alternates between two main steps: updating the value-estimation model (the critic) from historical trajectories, and adjusting the behavior (the actor) according to the improved value estimates. The two sub-procedures are iterated alternately until convergence. Techniques such as soft updates of target networks can be introduced along the way to improve stability and efficiency.

```matlab
% Create the environment object
% ('QuadRotor' is a placeholder name here; substitute the environment you built)
env = rlPredefinedEnv('QuadRotor');

% Observation and action specifications
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numAct = actInfo.Dimension(1);

% Action bounds
lowLimit  = actInfo.LowerLimit;
highLimit = actInfo.UpperLimit;

% Build the DDPG agent
agentOpts = rlDDPGAgentOptions();
criticNet = createValueFunctionNetwork(numObs, numAct);  % user-defined helper
actorNet  = createPolicyFunctionNetwork(numObs, numAct); % user-defined helper
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'state'}, ...
    'Action', {'action'});
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation', {'state'}, ...
    'Action', {'action'});
agent = rlDDPGAgent(actor, critic, agentOpts);

% Train the agent (training options omitted)
trainOpts = rlTrainingOptions();
trainingStats = train(agent, env, trainOpts);
```
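To make the soft update mentioned in step 5 concrete, here is a toolbox-independent sketch of the Polyak update θ′ ← τ·θ + (1 − τ)·θ′; it assumes the parameters are stored as a cell array of numeric arrays, one entry per layer.

```matlab
% Soft (Polyak) update of target-network parameters:
%   targetParams <- tau * onlineParams + (1 - tau) * targetParams
% Both parameter sets are cell arrays of numeric arrays, one entry per layer.
function targetParams = softUpdate(onlineParams, targetParams, tau)
    for i = 1:numel(onlineParams)
        targetParams{i} = tau * onlineParams{i} + (1 - tau) * targetParams{i};
    end
end
```

With a small τ (for example 1e-3), the target networks track the online networks slowly, which stabilizes the bootstrapped critic targets during DDPG-style training.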