A new student has joined our group, so I need to walk them through an introduction to RL, starting from Silver's course.
For myself, I am adding the requirement of carefully reading "Reinforcement Learning: An Introduction".
I didn't read it very carefully before, so this time I want to be more thorough and write a short summary of each knowledge point along the way.
K-armed bandit problem:
Consider the following learning problem. You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
In other words: within a fixed number of time steps, find the best action (each action has its own reward distribution, so pulling an action once does not reveal its true value). Finding the best ad-placement strategy under a fixed budget, or the best treatment plan under a fixed budget, can both be viewed approximately as this kind of problem. As the book puts it: "Another analogy is that of a doctor choosing between experimental treatments for a series of seriously ill patients. Each action selection is a treatment selection, and each reward is the survival or well-being of the patient."
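As a minimal sketch of such a stationary k-armed bandit testbed (the class name `Bandit`, k = 10, and the Gaussian reward assumption are illustrative choices, not mandated by the book):

```python
import numpy as np

class Bandit:
    """A stationary k-armed bandit: each arm has a fixed reward distribution."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q*(a), drawn once and then held fixed (stationary).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def step(self, action):
        # Reward is a noisy sample around q*(action); a single pull
        # does not reveal the true value of the action.
        return self.rng.normal(self.q_star[action], 1.0)
```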
Consider the action-value estimate (sample-average form):

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \,\mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$
Under the law of large numbers, this sample-average method of computing Q(a) is guaranteed to converge: as the denominator goes to infinity, Q_t(a) converges to q*(a).
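A sketch of maintaining this estimate, using the equivalent incremental form Q_{n+1} = Q_n + (R_n - Q_n)/n rather than storing all past rewards (variable names here are illustrative):

```python
import numpy as np

k = 10
Q = np.zeros(k)   # sample-average estimates Q(a)
N = np.zeros(k)   # visit counts N(a)

def update(action, reward):
    """Incremental sample-average update: Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```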
Exploration and exploitation: pure exploitation is generally not good; some exploration is needed.
ε-greedy Action Selection
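A sketch of ε-greedy selection over the current estimates Q(a) (ε = 0.1 is just a commonly used illustrative value):

```python
import numpy as np

def epsilon_greedy(Q, epsilon=0.1, rng=np.random.default_rng()):
    """With probability epsilon explore uniformly at random; otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: pick a random action
    return int(np.argmax(Q))               # exploit: pick the greedy action
```

Larger ε means more exploration and faster discovery of the best arm, at the cost of continuing to sample suboptimal arms in the long run.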
