Reinforcement Learning Classic Algorithms Review 3: TD Methods
The Maximization Bias Problem in Q-learning
1 The Essential Difference Between on-policy and off-policy
The key distinction is whether the policy used when updating the Q value is the policy actually being followed (on-policy) or a different policy (off-policy).
|                                     | SARSA | Q-learning |
| ----------------------------------- | ----- | ---------- |
| Policy that selects the next action a' | π     | π          |
| Policy used to update the Q value      | π     | μ          |
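To make the contrast concrete, here is a minimal sketch (not from the original post) of the two update rules. The names `Q`, `alpha`, `gamma`, and `actions` are assumptions for illustration: `Q` is a table mapping (state, action) pairs to values.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy (SARSA): the target bootstraps on the action a_next that the
    # behavior policy (pi in the table above) actually chose in s_next.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy (Q-learning): the target bootstraps on the greedy estimation
    # policy (mu in the table above), regardless of what the behavior policy
    # will actually do in s_next.
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```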
The original explanation from Reinforcement Learning: An Introduction:
We are now ready to present an example of the second class of learning control methods we consider in this book: off-policy methods. Recall that the distinguishing feature of on-policy methods is that they estimate the value of
a policy while using it for control. In off-policy methods these two functions are separated. The policy used to generate behavior, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the estimation policy. An advantage of this separation is that the estimation policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.
This passage describes off-policy methods: the behavior policy and the estimation policy are separated. The behavior policy is the one used to make decisions, i.e., to select the next action, while the estimation policy is the one used to update the value function. The advantage of this separation is that the estimation policy can be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.
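As a rough illustration of this separation (a sketch, not code from the book), the behavior policy below is ε-greedy and keeps exploring, while the estimation policy is deterministic greedy and only appears when forming the update target. `Q`, `actions`, and `epsilon` are assumed names.

```python
import random

def behavior_policy(Q, s, actions, epsilon=0.1):
    # Behavior policy: epsilon-greedy, so every action keeps being sampled.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def estimation_policy(Q, s, actions):
    # Estimation policy: deterministic greedy; used only to compute the TD target.
    return max(actions, key=lambda a: Q[(s, a)])
```

In Q-learning the agent acts with the behavior policy but computes the TD target as if it would follow the estimation policy; in SARSA the same (ε-greedy) policy plays both roles.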