Thanks to Sutton and Barto for their great work, Reinforcement Learning: An Introduction.
Almost all off-policy reinforcement learning methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply it by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state $S_t$, the probability of the subsequent state-action trajectory $A_t, S_{t+1}, A_{t+1}, \dots, S_T$ occurring under any policy $\pi$ is

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),$$

where $p$ is the state-transition probability function. The relative probability of the trajectory under the target policy $\pi$ and the behavior policy $b$, i.e. the importance-sampling ratio, is therefore

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

The transition probabilities cancel, so the ratio depends only on the two policies and the observed trajectory, not on the (possibly unknown) dynamics of the MDP.
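As a minimal sketch of how the ratio is computed in practice (the function name and the policy callables are assumptions for illustration, not code from the book), one can simply multiply the per-step probability ratios along a recorded trajectory:

```python
def importance_sampling_ratio(states, actions, target_policy, behavior_policy):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k) for one trajectory.

    target_policy(s, a) and behavior_policy(s, a) are assumed to return the
    probability of taking action a in state s under pi and b, respectively.
    """
    ratio = 1.0
    for s, a in zip(states, actions):
        ratio *= target_policy(s, a) / behavior_policy(s, a)
    return ratio
```

Note that the environment's transition probabilities never enter the computation, which is exactly why off-policy Monte Carlo methods can use this ratio without a model of the environment.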
We define J(s) as the set of all time steps in which state s is visited (this is for an every-visit method; for a first-visit method, J(s) would include only time steps that were first visits to s within their episodes). Also, let T(t) denote the first time of termination following time t, and G_t the return after t up through T(t).
To estimate $v_\pi(s)$, we simply scale the returns by the ratios and average the results:

$$V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{|J(s)|}.$$
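As a minimal sketch (the function name and array inputs are illustrative assumptions, not pseudocode from the book), the ordinary estimate is just the average of the ratio-scaled returns collected for the visits in J(s):

```python
import numpy as np

def ordinary_importance_sampling(returns, ratios):
    """Ordinary importance-sampling estimate of v_pi(s).

    returns -- the returns G_t observed from s under the behavior policy b
    ratios  -- the matching importance-sampling ratios rho_{t:T(t)-1}
    """
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    # Simple average of the scaled returns: unbiased, but the ratios (and hence
    # the variance) can be arbitrarily large.
    return (ratios * returns).sum() / len(returns)
```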
When importance sampling is done as a simple average in this way it is called ordinary importance sampling. An important alternative is weighted importance sampling, which uses a weighted average:

$$V(s) = \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}},$$

where the estimate is defined to be zero when the denominator is zero.
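A matching sketch for the weighted estimator, under the same assumed inputs; the only change is that the denominator becomes the sum of the ratios, with the zero-denominator case defined as zero:

```python
import numpy as np

def weighted_importance_sampling(returns, ratios):
    """Weighted importance-sampling estimate of v_pi(s)."""
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    denom = ratios.sum()
    # Weighted average of the returns, each weighted by its ratio; no single
    # return receives a weight larger than one relative to the total.
    return (ratios * returns).sum() / denom if denom != 0 else 0.0
```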
The difference between the two kinds of importance sampling is expressed in their biases and variances. The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased (the bias converges asymptotically to zero). On the other hand, the variance of the ordinary importance-sampling estimator is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
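The contrast is easy to see in a toy experiment. The setup below is an illustrative assumption (a single one-step state with two actions, not an example from the book): the target policy always takes the action with return 1, the behavior policy picks uniformly, so $v_\pi(s) = 1$. Across many trials of 10 episodes each, the ordinary estimator averages to about 1 but with a much larger spread, while the weighted estimator is slightly biased toward 0 (when every sampled episode happens to have ratio 0) yet far less variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step task (illustrative assumption): one nonterminal state, action
# 'left' gives return 1, action 'right' gives return 0. The target policy pi
# always chooses 'left' (so v_pi(s) = 1); the behavior policy b is uniform.
PI_LEFT, B_LEFT = 1.0, 0.5

def run_trial(num_episodes):
    took_left = rng.random(num_episodes) < B_LEFT        # actions sampled from b
    returns = took_left.astype(float)                    # G_t = 1 for 'left', else 0
    ratios = np.where(took_left, PI_LEFT / B_LEFT, 0.0)  # pi(a|s) / b(a|s)
    ordinary = (ratios * returns).sum() / num_episodes
    denom = ratios.sum()
    weighted = (ratios * returns).sum() / denom if denom != 0 else 0.0
    return ordinary, weighted

estimates = np.array([run_trial(10) for _ in range(10_000)])
print("ordinary IS: mean %.3f, std %.3f" % (estimates[:, 0].mean(), estimates[:, 0].std()))
print("weighted IS: mean %.3f, std %.3f" % (estimates[:, 1].mean(), estimates[:, 1].std()))
```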
This post has discussed importance sampling and its key role in off-policy reinforcement learning. By comparing ordinary and weighted importance sampling, it explained how the two methods differ in bias and variance and gave a concrete example of their use.