Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards
In a sense, this is an RL version of breadth-first search: the frontier serves as a buffer, and the trajectories that reach states on the frontier become the targets for RL learning (via imitation learning).
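A minimal Python sketch of this buffer idea is below (my own illustration, not the paper's code). It keeps one entry per embedded end state, stores the highest-return (then shortest) trajectory reaching that state, and samples demonstrations with a weight that decays with visit count so rarely reached frontier states are imitated more often. The `embed` function, the ranking rule, and the 1/√count weight are assumptions made for illustration.

```python
import random
from collections import namedtuple

# One buffer entry: the best trajectory found so far that ends in this state,
# its cumulative reward, and how often that state has been reached.
Entry = namedtuple("Entry", ["trajectory", "cumulative_reward", "visit_count"])

class TrajectoryBuffer:
    """Frontier-style memory: one entry per (embedded) state, holding the
    highest-return trajectory that reaches it."""

    def __init__(self, embed):
        self.embed = embed      # user-supplied state -> hashable key mapping (assumption)
        self.entries = {}

    def update(self, trajectory, cumulative_reward):
        """trajectory is a list of (state, action) pairs ending in the reached state."""
        key = self.embed(trajectory[-1][0])
        old = self.entries.get(key)
        if old is None:
            self.entries[key] = Entry(trajectory, cumulative_reward, 1)
            return
        # Prefer higher return; break ties with the shorter trajectory.
        better = (cumulative_reward > old.cumulative_reward or
                  (cumulative_reward == old.cumulative_reward and
                   len(trajectory) < len(old.trajectory)))
        self.entries[key] = Entry(
            trajectory if better else old.trajectory,
            max(cumulative_reward, old.cumulative_reward),
            old.visit_count + 1,
        )

    def sample_demonstration(self):
        """Prefer rarely visited states so the imitation targets stay diverse."""
        keys = list(self.entries)
        weights = [1.0 / (self.entries[k].visit_count ** 0.5) for k in keys]
        key = random.choices(keys, weights=weights, k=1)[0]
        return self.entries[key].trajectory
```

Sampling rarely visited entries more often is what keeps the imitation targets spread across the frontier instead of collapsing onto a few high-return trajectories.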
In the sparse-reward setting, a positive reward is received only after a long sequence of appropriate actions. Gradient-based parameter updates are incremental and slow, and they have a global effect on all parameters, which may cause catastrophic forgetting and performance degradation.
Many parametric approaches rely on recent samples and do not explore the state space systematically. They may forget positive-reward trajectories unless such good trajectories are collected frequently.
Recently, non-parametric memory of past experiences has been employed in DRL algorithms to improve policy learning and sample efficiency. However, these approaches can lead to myopic behaviors.
In an RL setting, we aim to generate new trajectories that visit novel states by editing or augmenting the trajectories stored in the memory of past experiences. We propose a novel trajectory-conditioned policy in which a full sequence of states is given as the condition. A sequence-to-sequence model with an attention mechanism then learns to ‘translate’ the demonstration trajectory into a sequence of actions and to generate a new trajectory in the stochastic environment. The single policy can take diverse trajectories as conditions, imitate the demonstrations to reach diverse regions of the state space, and allow flexibility in the action choices to discover novel states.
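To make the architecture concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's implementation): the demonstration state sequence is encoded with a GRU, and at each agent step the decoder state attends over the encoded demonstration to produce action logits. The state/action dimensions, hidden size, and single-layer GRUs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryConditionedPolicy(nn.Module):
    """Sketch of a seq2seq policy: encode the demonstration state sequence,
    then at every agent step attend over it to produce action logits."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(state_dim, hidden)
        self.attn_query = nn.Linear(hidden, hidden)
        self.policy_head = nn.Linear(2 * hidden, action_dim)

    def forward(self, demo_states, agent_state, h):
        # demo_states: (B, T, state_dim) demonstration to follow
        # agent_state: (B, state_dim) current observation; h: (B, hidden) decoder state
        demo_enc, _ = self.encoder(demo_states)            # (B, T, hidden)
        h = self.decoder(agent_state, h)                   # (B, hidden)
        query = self.attn_query(h).unsqueeze(2)            # (B, hidden, 1)
        scores = torch.bmm(demo_enc, query).squeeze(2)     # (B, T)
        attn = torch.softmax(scores, dim=1).unsqueeze(1)   # (B, 1, T)
        context = torch.bmm(attn, demo_enc).squeeze(1)     # (B, hidden)
        logits = self.policy_head(torch.cat([h, context], dim=1))
        return logits, h
```

Actions would then be sampled from a categorical distribution over `logits`; in the paper the policy is trained with an RL objective that, as I understand it, combines environment rewards with rewards for following the conditioning demonstration.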
Our main contributions are summarized as follows. (1) We propose a novel architecture for a trajectory-conditioned policy that can flexibly imitate diverse demonstration trajectories. (2) We show the importance of exploiting diverse past experiences in the memory to indirectly drive exploration, by comparing with existing approaches on various sparse-reward RL tasks with stochasticity in the environments. (3) We achieve performance superior to the state of the art within 5 billion frames on the hard-exploration Atari games Montezuma's Revenge and Pitfall.

In short, the paper proposes a new memory-driven trajectory-conditioned policy that exploits the diversity of past experiences to guide exploration. The policy processes stored memory trajectories with an attention mechanism and generates new action sequences, improving performance in sparse-reward environments. Experiments show that the method surpasses existing approaches on challenging tasks such as hard-exploration Atari games.