Reinforcement Learning: Exploration and Exploitation

This post looks at strategies for balancing exploration and exploitation in online decision-making, and introduces some concepts from classic games. It first walks through algorithms for reducing total regret, including optimistic initialisation and the decaying ε-greedy algorithm, then touches on minimax search, self-play reinforcement learning, and reinforcement learning in imperfect-information games.



These last two lectures cover bandits and classic games respectively, mostly to round out the course. Much of the material is fairly involved, so only the main ideas are sketched here.


Lecture 9: Exploration and Exploitation

Online decision-making involves a fundamental choice:
Exploitation: make the best decision given current information
Exploration: gather more information
The best long-term strategy may involve short-term sacrifices: gather enough information to make the best overall decisions.

However, the problem is:

If an algorithm forever explores it will have linear total regret.
If an algorithm never explores it will have linear total regret.
Is it possible to achieve sublinear total regret?



Principles of exploration and exploitation:

Naive Exploration:
Add noise to the greedy policy (e.g. ε-greedy)  ==> greedy/ε-greedy has linear total regret
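A minimal sketch of ε-greedy on Bernoulli bandit arms (the arm probabilities and the value of ε below are illustrative, not from the lecture):

```python
import random

def eps_greedy(arm_probs, eps=0.1, steps=1000, q_init=0.0):
    """epsilon-greedy on a Bernoulli bandit: random arm with prob eps, else greedy."""
    n_arms = len(arm_probs)
    q = [q_init] * n_arms      # action-value estimates
    n = [0] * n_arms           # pull counts
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(n_arms)                     # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])       # exploit
        r = 1.0 if random.random() < arm_probs[a] else 0.0   # Bernoulli reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                            # incremental sample mean
        total += r
    return q, total

q, total_reward = eps_greedy([0.2, 0.5, 0.7])
```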

Optimistic Initialisation:
Assume the best until proven otherwise  ==> greedy/ε-greedy + optimistic initialisation still has linear total regret
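With the sketch above, optimistic initialisation is a one-line change: start every Q estimate high and act purely greedily, so each arm keeps looking attractive until it has actually been tried (the value 5.0 is an arbitrary optimistic guess):

```python
# greedy + optimistic initialisation: untried arms keep their optimistic value,
# so every arm gets pulled at least once before the real estimates take over.
q, total_reward = eps_greedy([0.2, 0.5, 0.7], eps=0.0, q_init=5.0)
```

With a pure sample-mean update the optimism is erased after a single pull of each arm; initialising the counts N(a) as well would make it decay more gradually.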


Decaying ε-Greedy Algorithm:

Gradually shrink ε, moving from heavy exploration early on to mostly picking the best-known action  ==> the decaying ε-greedy algorithm has logarithmic asymptotic total regret
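The schedule in the lecture, ε_t = min{1, c|A|/(d²t)}, uses the gap d between the best and second-best arm, which is not known in practice; a plain 1/t decay, as sketched below, already shows the mechanic (the constant c is illustrative):

```python
import random

def decaying_eps_greedy(arm_probs, c=1.0, steps=1000):
    """epsilon-greedy with eps_t shrinking like 1/t: explore a lot early, then exploit."""
    n_arms = len(arm_probs)
    q = [0.0] * n_arms
    n = [0] * n_arms
    for t in range(1, steps + 1):
        eps_t = min(1.0, c * n_arms / t)     # decaying exploration rate
        if random.random() < eps_t:
            a = random.randrange(n_arms)
        else:
            a = max(range(n_arms), key=lambda i: q[i])
        r = 1.0 if random.random() < arm_probs[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
    return q
```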

Lower bound on regret (Lai and Robbins): asymptotic total regret is at least logarithmic in the number of steps.
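Stated precisely, with Δ_a = V* − Q(a) the gap of arm a and R^a its reward distribution, the bound from the lecture is:

$$\lim_{t \to \infty} L_t \;\geq\; \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left(\mathcal{R}^a \,\|\, \mathcal{R}^{a^*}\right)}$$

The hard problems are those where the gaps Δ_a are small yet the reward distributions look similar, so the arms are difficult to tell apart.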


Optimism in the Face of Uncertainty:
Prefer actions with uncertain values.

The more uncertain we are about an action-value, the more important it is to explore that action: it could turn out to be the best action.

The reasoning: as an uncertain action keeps being tried, the density function over its value gradually sharpens, and it quickly becomes clear whether its reward is high or low.


After picking the blue action (in the figure from the lecture slides, not reproduced here), we are less uncertain about its value, and more likely to pick another action, until we home in on the best action.
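The canonical algorithm built on this principle is UCB1, which the lecture derives from Hoeffding's inequality: pick the action maximising Q(a) plus an uncertainty bonus √(2 log t / N(a)) that shrinks as the arm is pulled more. UCB1 achieves logarithmic asymptotic total regret. A minimal sketch on the same Bernoulli setup as above:

```python
import math
import random

def ucb1(arm_probs, steps=1000):
    """UCB1: optimism in the face of uncertainty via a sqrt(2 log t / N) bonus."""
    n_arms = len(arm_probs)
    q = [0.0] * n_arms
    n = [0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            a = t - 1                        # pull each arm once first
        else:
            a = max(range(n_arms),
                    key=lambda i: q[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = 1.0 if random.random() < arm_probs[a] else 0.0
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
    return q, n
```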



Probability Matching:
Select actions according to the probability that they are the best
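For Bernoulli bandits, probability matching has a simple sampling form: Thompson sampling, which the lecture notes achieves the Lai and Robbins lower bound. Keep a Beta posterior per arm, draw one sample from each, and play the argmax; an arm is then selected with exactly the posterior probability that it is best. A sketch:

```python
import random

def thompson_sampling(arm_probs, steps=1000):
    """Thompson sampling: Beta(1,1) priors, sample each posterior, play the argmax."""
    n_arms = len(arm_probs)
    wins = [1] * n_arms     # Beta alpha counts (successes + 1)
    losses = [1] * n_arms   # Beta beta counts  (failures + 1)
    for _ in range(steps):
        samples = [random.betavariate(wins[i], losses[i]) for i in range(n_arms)]
        a = max(range(n_arms), key=lambda i: samples[i])
        if random.random() < arm_probs[a]:
            wins[a] += 1
        else:
            losses[a] += 1
    return wins, losses
```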

Information State Search:
Lookahead search incorporating the value of information
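Concretely, the information state can be the posterior counts of each arm; planning in this augmented MDP prices in how much each pull will teach us. A tiny Bayes-adaptive sketch for Bernoulli bandits, solved exactly by finite-horizon recursion (the horizon is kept small because the information-state space grows quickly):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(state, horizon):
    """Bayes-optimal value of an information state for Bernoulli arms.

    state is a tuple of (wins, losses) Beta-posterior counts per arm;
    each pull updates the counts, so the lookahead accounts for the
    value of the information a pull would reveal."""
    if horizon == 0:
        return 0.0
    best = 0.0
    for a, (w, l) in enumerate(state):
        p = w / (w + l)   # posterior mean of arm a's success probability
        win = tuple((w + 1, l) if i == a else wl for i, wl in enumerate(state))
        lose = tuple((w, l + 1) if i == a else wl for i, wl in enumerate(state))
        q = p * (1.0 + value(win, horizon - 1)) + (1.0 - p) * value(lose, horizon - 1)
        best = max(best, q)
    return best

# uniform Beta(1,1) beliefs over two arms, planning 10 pulls ahead
print(value(((1, 1), (1, 1)), 10))
```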



Lecture 10: Classic Games 



Minimax Search 
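This first topic is the one that fits in a few lines: minimax search assumes both players play optimally, one maximising and one minimising the game value. A generic depth-limited sketch, where `moves`, `apply`, and `evaluate` are hypothetical callbacks a real game implementation would supply:

```python
def minimax(state, depth, maximizing, moves, apply, evaluate):
    """Depth-limited minimax: value of `state` assuming optimal play by both sides."""
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)   # leaf: fall back to a heuristic evaluation
    values = (minimax(apply(state, m), depth - 1, not maximizing,
                      moves, apply, evaluate) for m in legal)
    return max(values) if maximizing else min(values)
```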

Self-Play Reinforcement Learning

Combining Reinforcement Learning and Minimax Search

Reinforcement Learning in Imperfect-Information Games



