Key quotes on exploration in reinforcement learning


DIRECTED EXPLORATION FOR REINFORCEMENT LEARNING

Overall, this is quite similar to Go-Explore.

These uncertainty-based methods use a reward bonus approach, where they compute a measure of uncertainty and transform that into a bonus that is then added into the reward function. Unfortunately this reward bonus approach has some drawbacks. The main drawback is that reward bonuses may take many, many updates before they propagate and change agent behavior.

This is due to two main factors: the first is that function approximation itself needs many updates before converging; the second is that the reward bonuses are non-stationary and change as the agent explores, meaning the function approximator needs to update and converge to a new set of values every time the uncertainties change.
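As a rough illustration of this pattern (my own sketch, not code from the paper; the count-based bonus and the 0.1 coefficient are arbitrary choices):

import math
from collections import defaultdict

# Sketch of the reward-bonus pattern: uncertainty is turned into a bonus and
# added to the environment reward, so the value function has to re-converge
# whenever the bonus shifts.

visit_counts = defaultdict(int)   # crude uncertainty proxy: state visit counts
BONUS_SCALE = 0.1                 # arbitrary coefficient for this sketch

def shaped_reward(state, env_reward):
    visit_counts[state] += 1
    bonus = BONUS_SCALE / math.sqrt(visit_counts[state])  # shrinks as uncertainty drops
    return env_reward + bonus     # non-stationary: changes every time the counts change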

The uncertainties therefore must not be allowed to change too quickly; the function approximator needs enough time to catch up and propagate the older changes before it has to chase the newer ones.

So this is what the 0.25 update mask in RND (training the predictor on only about 25% of the experience) was actually for.
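A sketch of how that mask can be applied in an RND-style predictor update (assumed PyTorch code, not the official implementation; the network shapes and learning rate are arbitrary):

import torch
import torch.nn as nn

# Sketch of an RND-style predictor update: only a random ~25% of the samples
# contribute to the loss, so the prediction error (the intrinsic bonus)
# changes slowly enough for the value function to follow.

UPDATE_PROPORTION = 0.25  # fraction of samples used per predictor update

target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 32))
predictor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 32))
for p in target_net.parameters():
    p.requires_grad_(False)       # the random target network stays frozen

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def predictor_update(obs_batch):
    per_sample_loss = ((predictor(obs_batch) - target_net(obs_batch)) ** 2).mean(dim=1)
    mask = (torch.rand_like(per_sample_loss) < UPDATE_PROPORTION).float()
    loss = (per_sample_loss * mask).sum() / mask.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return per_sample_loss.detach()  # used as the intrinsic reward bonus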

If the reward bonuses change too quickly, or are too noisy, then it becomes possible for the function approximator to prematurely stop propagation of older changes and start trying to match the newer changes, resulting in missed exploration opportunities or even convergence to a suboptimal mixture of old and new uncertainties.

Non-stationarity has already been a difficult problem for RL when learning a Q-value function; the DQN algorithm tackles it by slowing down the propagation of changes through the use of a target network [Mnih et al., 2013]. These two factors together result in slow adaptation of reward bonuses and lead to less efficient exploration.
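For reference, a minimal sketch of that target-network trick (the sizes and the sync period are arbitrary choices for the sketch):

import copy
import torch
import torch.nn as nn

# Minimal sketch of the DQN target-network trick: bootstrap targets come from
# a slowly updated copy of the Q-network, which damps how fast value changes
# propagate when the targets themselves are non-stationary.

GAMMA, SYNC_EVERY = 0.99, 1000

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_q = copy.deepcopy(q_net)   # frozen copy used only for computing targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(step, obs, actions, rewards, next_obs, dones):
    with torch.no_grad():
        next_v = target_q(next_obs).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * next_v
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SYNC_EVERY == 0:
        target_q.load_state_dict(q_net.state_dict())  # slow propagation of changes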

The paper instead uses a goal-conditioned policy: "This results in an algorithm that is completely stationary, because the goal-conditioned policy is independent of the uncertainty."

The full algorithm:
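The paper's pseudocode is not reproduced here. Below is only a rough sketch of the overall idea as I understand it; all of the interfaces (most_uncertain_state, policy.act, the env API) are hypothetical placeholders rather than the paper's definitions:

# Rough sketch of the idea only, not the paper's algorithm: the uncertainty
# estimate is consulted only when choosing which goal to pursue, while the
# goal-conditioned policy pi(a | s, g) is trained on a stationary
# reach-the-goal objective that never shifts as the uncertainties change.

def directed_exploration_sketch(env, policy, uncertainty, num_episodes, horizon):
    for _ in range(num_episodes):
        goal = uncertainty.most_uncertain_state()   # hypothetical goal selector
        state = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy.act(state, goal)        # goal-conditioned policy
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state, goal))
            uncertainty.update(next_state)          # uncertainties may change freely here...
            state = next_state
            if done:
                break
        policy.train_on(trajectory)                 # ...but this objective stays stationary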
