Reinforcement Learning: An Introduction, Chapter 2 Reading Notes

Part I: Tabular Solution Methods

In this part of the book we describe almost all of the core ideas of reinforcement learning. In these problems the state and action spaces are small enough that the estimated value functions can be represented as arrays or tables. In these cases the methods can often find exactly the optimal value function and the optimal policy. This contrasts with the next part, which finds only approximate solutions but applies to a much wider range of problems.

The first chapter of this part describes a special case of the reinforcement learning problem in which there is only a single situation, known as the bandit problem. The second chapter describes the general problem formulation used throughout the rest of the book, the finite Markov decision process, whose key ideas include the Bellman equation and value functions.
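As a reminder of that formulation (this is the standard form of the equation, not something specific to these notes), the Bellman equation for the state-value function of a policy π expresses the value of a state in terms of the values of its successor states:

$$v_\pi(s) \doteq \mathbb{E}_\pi\!\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right] = \sum_{a}\pi(a \mid s)\sum_{s',\,r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr]$$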

The next three chapters describe three fundamental classes of methods for solving finite Markov decision problems: dynamic programming, Monte Carlo methods, and temporal-difference learning. Each class of methods has its own strengths and weaknesses, summarized below.

 

Method and evaluation:
Dynamic programming: well developed mathematically, but requires a complete and accurate model of the environment.
Monte Carlo methods: require no model of the environment and are conceptually simple, but are not well suited to step-by-step incremental computation.
Temporal-difference learning: requires no model and is fully incremental, but is more complex to analyze.

The remaining two chapters of this part describe how these three classes of methods can be combined to obtain the best features of each. In the first of these chapters we describe how the strengths of Monte Carlo methods can be combined with the strengths of temporal-difference learning via multi-step bootstrapping methods. In the final chapter we show how temporal-difference learning can be combined with model learning and planning methods (such as dynamic programming) for a complete and unified solution to the tabular reinforcement learning problem.

Chapter 2 Multi-armed Bandits

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructing by giving the correct actions. Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken.

In this chapter we study the evaluative aspect of reinforcement learning in its simplest form, one that involves only a single situation. Studying this nonassociative setting simplifies the full reinforcement learning problem and makes clear how evaluative feedback differs from, and can be combined with, instructive feedback.

2.1 A k-armed Bandit Problem

Briefly: there are k different options (actions); each choice yields a reward, and the goal is to maximize the expected total reward. Today the term "bandit problem" generally refers to this class of problems.
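A minimal Python sketch of such a bandit environment, assuming the stationary testbed used later in the book (true action values drawn from a standard normal distribution, rewards drawn from a unit-variance normal around each true value); the class and method names here are my own:

```python
import numpy as np

class KArmedBandit:
    """Stationary k-armed bandit: each arm has a fixed true value q*(a),
    and pulling an arm returns a noisy reward centered on that value."""

    def __init__(self, k=10, seed=None):
        self.rng = np.random.default_rng(seed)
        self.k = k
        # True action values q*(a), drawn once from N(0, 1).
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        """Sample a reward from N(q*(action), 1)."""
        return self.rng.normal(self.q_star[action], 1.0)

    def optimal_action(self):
        """Arm with the highest true value (useful for measuring performance)."""
        return int(np.argmax(self.q_star))
```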

Each action has an expected reward given that the action is selected; this is called the value of the action.

A_t: the action selected at time step t
q*(a): the expected reward (true value) of action a
Q_t(a): the estimate at time step t of q*(a)
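In the book's notation, the true value of an action and the sample-average estimate of it (introduced a little later in the chapter) can be written as:

$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \qquad Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$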

The greedy action: at every step select the action with the largest estimated value, i.e. A_t = argmax_a Q_t(a); selecting a greedy action is exploiting current knowledge of the action values.

Explore: select one of the non-greedy actions, i.e. an action whose estimated value is not the largest; this allows improving the estimates of the non-greedy actions' values.
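A short sketch of ε-greedy action selection with incremental sample-average estimates, the simple balance of exploring and exploiting that this chapter builds toward; the choice of epsilon and the pairing with the KArmedBandit class sketched above are illustrative:

```python
import numpy as np

def run_epsilon_greedy(bandit, steps=1000, epsilon=0.1, seed=None):
    """Epsilon-greedy agent with incremental sample-average estimates Q_t(a)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(bandit.k)   # estimated action values Q_t(a)
    N = np.zeros(bandit.k)   # number of times each action has been selected
    rewards = []

    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(bandit.k))   # explore: random action
        else:
            action = int(np.argmax(Q))             # exploit: greedy action
        reward = bandit.pull(action)
        N[action] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N
        Q[action] += (reward - Q[action]) / N[action]
        rewards.append(reward)

    return Q, np.array(rewards)
```

For example, `run_epsilon_greedy(KArmedBandit(k=10, seed=0), steps=2000)` returns the learned estimates and the reward sequence; with epsilon = 0 the agent is purely greedy and never explores.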
