Reinforcement Learning Classic Algorithms Review 3: TD Methods
The Maximization Bias Problem in Q-learning
1 The Essential Difference Between on-policy and off-policy
The key distinction is whether the policy used when updating the Q value is the policy actually being followed (on-policy) or a different policy (off-policy).
|                                     | SARSA | Q-learning |
| ----------------------------------- | ----- | ---------- |
| Policy that selects the next action a' | π     | π          |
| Policy used to update the Q value      | π     | μ          |
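To make the contrast concrete, here is a minimal sketch (not from the original post) of the two update rules. The names `Q`, `alpha`, `gamma`, and `actions` are assumptions for illustration: `Q` is a table mapping (state, action) pairs to values.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy (SARSA): the target bootstraps on the action a_next that the
    # behavior policy (pi in the table above) actually chose in s_next.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy (Q-learning): the target bootstraps on the greedy estimation
    # policy (mu in the table above), regardless of what the behavior policy
    # will actually do in s_next.
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```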
The original explanation from Reinforcement Learning: An Introduction:
We are now ready to present an example of the second class of learning control methods we consider in this book: off-policy methods. Recall that the distinguishing feature of on-policy methods is that they estimate the value of
a policy while using it for control. In off-policy methods these two functions are separated. The policy used to generate behavior, called the behavior policy, may in fact be unrelated to the policy that is evaluated and improved, called the estimation policy. An advantage of this separation is that the estimation policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.
This passage describes off-policy methods: the behavior policy and the estimation policy are separated. The behavior policy is the one used to make decisions, i.e., to select the next action, while the estimation policy is the one used to update the value function. The advantage of this separation is that the estimation policy can be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.
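As a rough illustration of this separation (a sketch, not code from the book), the behavior policy below is ε-greedy and keeps exploring, while the estimation policy is deterministic greedy and only appears when forming the update target. `Q`, `actions`, and `epsilon` are assumed names.

```python
import random

def behavior_policy(Q, s, actions, epsilon=0.1):
    # Behavior policy: epsilon-greedy, so every action keeps being sampled.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def estimation_policy(Q, s, actions):
    # Estimation policy: deterministic greedy; used only to compute the TD target.
    return max(actions, key=lambda a: Q[(s, a)])
```

In Q-learning the agent acts with the behavior policy but computes the TD target as if it would follow the estimation policy; in SARSA the same (ε-greedy) policy plays both roles.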