Buffer Design for Exploration in Reinforcement Learning

Memory-Driven Trajectory-Conditioned Policies: Exploration and Imitation Learning under Sparse Rewards
The paper proposes a new memory-driven trajectory-conditioned policy that guides exploration by exploiting the diversity of past experiences. The policy processes trajectories stored in memory with an attention mechanism and generates new action sequences, which improves performance in sparse-reward environments. Experiments show that the method outperforms prior approaches on challenging tasks such as hard-exploration Atari games.

Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards

To some extent, this is an RL version of breadth-first search: the frontier is kept as a buffer, and the trajectories that reach the states in the frontier are used as the learning targets for the RL agent (via imitation learning).
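As a concrete illustration (not the paper's actual code), a minimal version of such a buffer could keep, for each discretized state, the best trajectory found so far that reaches it. The sketch below is plain MATLAB; the `stateKey` discretization and the "higher return, then shorter length" preference rule are assumptions made for illustration.

```matlab
% Minimal trajectory-buffer update: for every (discretized) state key, keep
% the best trajectory found so far that reaches it.
function buffer = updateBuffer(buffer, trajectory, finalState, totalReturn)
    key = stateKey(finalState);                   % discretized state id
    entry = struct('traj', {trajectory}, ...      % sequence of (state, action) pairs
                   'ret',  totalReturn, ...
                   'len',  numel(trajectory));
    if ~isKey(buffer, key)
        buffer(key) = entry;                      % novel state: always store
    else
        old = buffer(key);
        % prefer higher return, break ties with shorter trajectories
        if entry.ret > old.ret || (entry.ret == old.ret && entry.len < old.len)
            buffer(key) = entry;
        end
    end
end

function key = stateKey(state)
    % Illustrative coarse discretization of a numeric state vector
    key = mat2str(round(state(:)' * 10) / 10);
end
```

The buffer itself can be created as `buffer = containers.Map('KeyType','char','ValueType','any')`; at the start of each episode, one of the stored trajectories is sampled as the demonstration to condition on.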

In the sparse-reward setting, a positive reward can only be received after a long sequence of appropriate actions. Gradient-based parameter updates are incremental and slow and have a global impact on all parameters, which may cause catastrophic forgetting and performance degradation.

Many parametric approaches rely on recent samples and do not explore the state space systematically; they may forget positive-reward trajectories unless such good trajectories are collected frequently.

Recently, non-parametric memory of past experiences has been employed in DRL algorithms to improve policy learning and sample efficiency.

However, this can lead to myopic behaviors.

In an RL setting, we aim to generate new trajectories visiting novel states by editing or augmenting the trajectories stored in the memory from past experiences. We propose a novel trajectory-conditioned policy where a full sequence of states is given as the condition. Then a sequence-to-sequence model with an attention mechanism learns to ‘translate’ the demonstration trajectory to a sequence of actions and generate a new trajectory in the environment with stochasticity. The single policy could take diverse trajectories as the condition, imitate the demonstrations to reach diverse regions in the state space, and allow for flexibility in the action choices to discover novel states.
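As a rough sketch of how such an attention mechanism might look (an illustration, not the authors' implementation): the embedding of the current state attends over the embeddings of the demonstration states, and the resulting context vector is fed to the action head. The projection matrices `Wq` and `Wk` stand in for learned parameters, and the dimensions are assumed.

```matlab
% Illustrative attention step for a trajectory-conditioned policy.
%   demoEmb : d x T matrix of demonstration-state embeddings
%   curEmb  : d x 1 embedding of the current state
%   Wq, Wk  : d x d projection matrices (learned parameters in practice)
function [context, weights] = attendToDemo(demoEmb, curEmb, Wq, Wk)
    q = Wq * curEmb;                          % query from the current state
    K = Wk * demoEmb;                         % keys from the demonstration states
    scores = (K' * q) / sqrt(size(K, 1));     % scaled dot-product scores, T x 1
    scores = scores - max(scores);            % numerical stability
    weights = exp(scores) / sum(exp(scores)); % softmax over demonstration steps
    context = demoEmb * weights;              % weighted sum of demo embeddings
end
```

In a full model, the context vector would be combined with the current state embedding inside a recurrent policy head that outputs a distribution over actions; sampling from that distribution is what gives the policy the stochastic freedom to deviate from the demonstration and reach novel states.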

Our main contributions are summarized as follows. (1) We propose a novel architecture for a trajectory-conditioned policy that can flexibly imitate diverse demonstration trajectories. (2) We show the importance of exploiting diverse past experiences in the memory to indirectly drive exploration, by comparing with existing approaches on various sparse-reward RL tasks with stochasticity in the environments. (3) We achieve performance superior to the state-of-the-art under 5 billion frames on the hard-exploration Atari games Montezuma's Revenge and Pitfall.
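Contribution (2) hinges on how demonstrations are picked from the memory. One simple, hypothetical way to bias selection toward rarely revisited regions is count-based sampling over the stored end states; the weighting rule below is an assumption for illustration, not necessarily the one used in the paper.

```matlab
% Illustrative demonstration selection: favor stored trajectories whose end
% states have been chosen least often, to spread exploration effort.
function [demo, key] = sampleDemonstration(buffer, visitCounts)
    ks = keys(buffer);                    % all stored state keys
    n = numel(ks);
    w = zeros(n, 1);
    for i = 1:n
        c = 0;
        if isKey(visitCounts, ks{i})
            c = visitCounts(ks{i});
        end
        w(i) = 1 / sqrt(c + 1);           % rarely chosen keys get more weight
    end
    w = w / sum(w);
    idx = find(rand() <= cumsum(w), 1);   % categorical sample from the weights
    if isempty(idx)
        idx = n;                          % guard against floating-point round-off
    end
    key = ks{idx};
    entry = buffer(key);
    demo = entry.traj;
end
```

Pairing this selection rule with the buffer sketch above gives the basic loop: sample a stored trajectory, imitate it with the trajectory-conditioned policy, and store any improved or novel trajectory back into the buffer.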

### Controller Design Based on Deep Reinforcement Learning

#### 1. Deep reinforcement learning basics

Deep reinforcement learning (DRL) combines deep neural networks with reinforcement learning (RL) and optimizes decision making through trial and error. DRL can handle high-dimensional input spaces and automatically extract features, performing well in complex environments[^1].

#### 2. Choosing the controller architecture

For the control of a vertical take-off and landing (VTOL) system, an Actor-Critic structure can serve as the basic framework: the actor outputs an action given the current state, while the critic evaluates how good that action is, and the policy parameters are adjusted accordingly.

#### 3. Environment modeling and reward design

For the agent to learn a specific task, a suitable simulation environment and a reasonable reward mechanism must be defined. For a VTOL aircraft, physical quantities such as attitude angles and velocities should all be taken into account, and safety bounds should be considered to prevent loss of control.

#### 4. Data collection and preprocessing

In the early stage of training, the system usually explores randomly for a period of time to accumulate an initial experience dataset. These samples are stored in a data structure called a replay buffer for later sampling. Note that noise during real operation can bias some observations, so a certain amount of smoothing is useful to improve generalization.

#### 5. Training procedure overview

Training alternates between two main steps: updating the value-estimation model (the critic) from historical trajectories, and adjusting the behavior (the actor) according to the improved value estimates. The two sub-procedures are iterated alternately until convergence. Techniques such as soft updates of target networks can be introduced along the way to improve stability and efficiency.

```matlab
% Create the environment object
% ('QuadRotor' is a placeholder name here; substitute the environment you built)
env = rlPredefinedEnv('QuadRotor');

% Observation and action specifications
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
numObs = obsInfo.Dimension(1);
numAct = actInfo.Dimension(1);

% Action bounds
lowLimit  = actInfo.LowerLimit;
highLimit = actInfo.UpperLimit;

% Build the DDPG agent
agentOpts = rlDDPGAgentOptions();
criticNet = createValueFunctionNetwork(numObs, numAct);  % user-defined helper
actorNet  = createPolicyFunctionNetwork(numObs, numAct); % user-defined helper
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'state'}, ...
    'Action', {'action'});
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation', {'state'}, ...
    'Action', {'action'});
agent = rlDDPGAgent(actor, critic, agentOpts);

% Train the agent (training options omitted)
trainOpts = rlTrainingOptions();
trainingStats = train(agent, env, trainOpts);
```
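To make the soft update mentioned in step 5 concrete, here is a toolbox-independent sketch of the Polyak update θ′ ← τ·θ + (1 − τ)·θ′; it assumes the parameters are stored as a cell array of numeric arrays, one entry per layer.

```matlab
% Soft (Polyak) update of target-network parameters:
%   targetParams <- tau * onlineParams + (1 - tau) * targetParams
% Both parameter sets are cell arrays of numeric arrays, one entry per layer.
function targetParams = softUpdate(onlineParams, targetParams, tau)
    for i = 1:numel(onlineParams)
        targetParams{i} = tau * onlineParams{i} + (1 - tau) * targetParams{i};
    end
end
```

With a small τ (for example 1e-3), the target networks track the online networks slowly, which stabilizes the bootstrapped critic targets during DDPG-style training.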