Key quotes on exploration in reinforcement learning


DIRECTED EXPLORATION FOR REINFORCEMENT LEARNING

Overall, this is quite similar to Go-Explore.

These uncertainty-based methods use a reward bonus approach, where they compute a measure of uncertainty and transform that into a bonus that is then added into the reward function. Unfortunately this reward bonus approach has some drawbacks. The main drawback is that reward bonuses may take many, many updates before they propagate and change agent behavior.

This is due to two main factors: the first is that function approximation itself needs many updates before converging; the second is that the reward bonuses are non-stationary and change as the agent explores, meaning the function approximator needs to update and converge to a new set of values every time the uncertainties change.
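As a rough illustration of this pattern (my own sketch, not code from the paper; the count-based bonus and the 0.1 coefficient are arbitrary choices):

import math
from collections import defaultdict

# Sketch of the reward-bonus pattern: uncertainty is turned into a bonus and
# added to the environment reward, so the value function has to re-converge
# whenever the bonus shifts.

visit_counts = defaultdict(int)   # crude uncertainty proxy: state visit counts
BONUS_SCALE = 0.1                 # arbitrary coefficient for this sketch

def shaped_reward(state, env_reward):
    visit_counts[state] += 1
    bonus = BONUS_SCALE / math.sqrt(visit_counts[state])  # shrinks as uncertainty drops
    return env_reward + bonus     # non-stationary: changes every time the counts change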

The uncertainties therefore must not be allowed to change too quickly; the function approximator needs enough time to catch up and propagate the older changes before it has to chase the newer ones.

So this is what the 0.25 update mask in RND (training the predictor on only about 25% of the experience) was actually for.
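A sketch of how that mask can be applied in an RND-style predictor update (assumed PyTorch code, not the official implementation; the network shapes and learning rate are arbitrary):

import torch
import torch.nn as nn

# Sketch of an RND-style predictor update: only a random ~25% of the samples
# contribute to the loss, so the prediction error (the intrinsic bonus)
# changes slowly enough for the value function to follow.

UPDATE_PROPORTION = 0.25  # fraction of samples used per predictor update

target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 32))
predictor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 32))
for p in target_net.parameters():
    p.requires_grad_(False)       # the random target network stays frozen

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def predictor_update(obs_batch):
    per_sample_loss = ((predictor(obs_batch) - target_net(obs_batch)) ** 2).mean(dim=1)
    mask = (torch.rand_like(per_sample_loss) < UPDATE_PROPORTION).float()
    loss = (per_sample_loss * mask).sum() / mask.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return per_sample_loss.detach()  # used as the intrinsic reward bonus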

If the reward bonuses change too quickly, or are too noisy, then it becomes possible for the function approximator to prematurely stop propagation of older changes and start trying to match the newer changes, resulting in missed exploration opportunities or even convergence to a suboptimal mixture of old and new uncertainties.

Non-stationarity has already been a difficult problem for RL when learning a Q-value function; the DQN algorithm tackles it by slowing down the propagation of changes through the use of a target network [Mnih et al., 2013]. These two factors together result in slow adaptation of reward bonuses and lead to less efficient exploration.
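For reference, a minimal sketch of that target-network trick (the sizes and the sync period are arbitrary choices for the sketch):

import copy
import torch
import torch.nn as nn

# Minimal sketch of the DQN target-network trick: bootstrap targets come from
# a slowly updated copy of the Q-network, which damps how fast value changes
# propagate when the targets themselves are non-stationary.

GAMMA, SYNC_EVERY = 0.99, 1000

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_q = copy.deepcopy(q_net)   # frozen copy used only for computing targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(step, obs, actions, rewards, next_obs, dones):
    with torch.no_grad():
        next_v = target_q(next_obs).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * next_v
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % SYNC_EVERY == 0:
        target_q.load_state_dict(q_net.state_dict())  # slow propagation of changes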

The paper instead uses a goal-conditioned policy: "This results in an algorithm that is completely stationary, because the goal-conditioned policy is independent of the uncertainty."

The full algorithm:
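The paper's pseudocode is not reproduced here. Below is only a rough sketch of the overall idea as I understand it; all of the interfaces (most_uncertain_state, policy.act, the env API) are hypothetical placeholders rather than the paper's definitions:

# Rough sketch of the idea only, not the paper's algorithm: the uncertainty
# estimate is consulted only when choosing which goal to pursue, while the
# goal-conditioned policy pi(a | s, g) is trained on a stationary
# reach-the-goal objective that never shifts as the uncertainties change.

def directed_exploration_sketch(env, policy, uncertainty, num_episodes, horizon):
    for _ in range(num_episodes):
        goal = uncertainty.most_uncertain_state()   # hypothetical goal selector
        state = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy.act(state, goal)        # goal-conditioned policy
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state, goal))
            uncertainty.update(next_state)          # uncertainties may change freely here...
            state = next_state
            if done:
                break
        policy.train_on(trajectory)                 # ...but this objective stays stationary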
