(David Silver Deep Reinforcement Learning) - Lecture 1: Introduction to RL

Published 2022-01-30 19:33:46

License: CC 4.0 BY-SA

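This article covers David Silver's 2019 deep reinforcement learning course. It explains the basic concepts of reinforcement learning (rewards, sequential decision making, state representations, and the environment) and the components inside an RL agent: the policy, the value function, and the model. It also discusses key problems in RL: learning versus planning, exploration versus exploitation, and prediction versus control.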

Notes on David Silver's 2019 deep reinforcement learning course, kept for documentation and discussion.

Lecture 1: Introduction

Outline

Ⅰ The RL Problem

1.Reward

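The reward $R_t$ is a scalar feedback signal.
It indicates how well the agent is doing at step $t$.
The agent's goal is to maximize cumulative reward.
The course states the reward hypothesis: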

All goals can be described by the maximization of expected cumulative reward.

2.Sequential Decision Making

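Goal: select actions that maximize total future reward.
Actions may have long-term consequences.
Reward may be delayed.
It may be better to sacrifice immediate reward to gain more long-term reward, e.g. a financial investment.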

3.Agent and Environment

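In the lecture's figure, the brain represents the agent, which contains the algorithms that choose its actions.
The globe represents the environment.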

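A student asked: how are multiple objectives handled?

Answer: give each objective a weight, and the agent chooses its actions according to those weights.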

4.State

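history: the sequence of observations, actions, and rewards
- i.e. all observable variables up to time $t$
- what happens next depends on the history (the agent selects actions; the environment selects observations/rewards)
  
  $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$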

The algorithm that RL has to implement is a mapping from the history to the next action.
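
The interaction described above can be sketched as a loop in which the environment emits observations and rewards, the agent maps the history so far to an action, and each step is appended to the history. This is only an illustrative sketch: `CoinFlipEnv`, `agent`, and `run` are hypothetical names, and the coin-guessing task is a made-up toy, not an example from the lecture.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        coin = self.rng.choice([0, 1])
        reward = 1 if action == coin else 0   # scalar feedback signal
        observation = coin                    # the agent sees the flip afterwards
        return observation, reward


def agent(history):
    """Maps the history to the next action: repeat the last observed flip."""
    if not history:
        return 0
    _, last_observation, _ = history[-1]      # entries are (action, observation, reward)
    return last_observation


def run(steps=5):
    env = CoinFlipEnv()
    history = []
    for _ in range(steps):
        action = agent(history)                    # agent selects the action
        observation, reward = env.step(action)     # environment selects O and R
        history.append((action, observation, reward))
    return history


history = run()
```

Here the agent only looks at the last entry of the history, which is exactly the kind of compression that the notion of state formalizes.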

state: the information used to determine the next action. The state is a function of the history:

$S_t=f(H_t)$

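1.environment state $S_t^e$: the environment's private representation, i.e. whatever data the environment uses to pick the next observation/reward. It is usually not visible to the agent, and even when it is visible it may contain irrelevant information.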

2.agent state $S_t^a$: lives inside the algorithm and holds whatever information the agent uses to pick the next action. It can be any function of the history, $S_t^a=f(H_t)$.

3.information state (a.k.a. Markov state): contains all the useful information from the history.

Markov state: a state $S_t$ is Markov if and only if $P[S_{t+1}|S_t]=P[S_{t+1}|S_1,...,S_t]$.
“The future is independent of the past given the present”
In other words, once the state is known, the history can be thrown away.

Rat Example

[Figure: the rat example from the lecture slides]

5.Environment

Fully Observable Environments: the agent directly observes the environment state
- $O_t=S_t^a=S_t^e$
- agent state = environment state = information state
- Formally, this is a Markov decision process (MDP)
Partially Observable Environments: the agent cannot observe everything about the environment
- agent state ≠ environment state
- Formally, this is a partially observable Markov decision process (POMDP)
- The agent must construct its own state representation $S_t^a$
  - e.g.1 the complete history: $S_t^a=H_t$
  - e.g.2 beliefs over the environment state: $S_t^a=(P[S_t^e=s^1],...,P[S_t^e=s^n])$
  - e.g.3 a recurrent neural network: $S_t^a=σ(S_{t-1}^aW_s+O_tW_o)$
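
The third option, the recurrent update $S_t^a=σ(S_{t-1}^aW_s+O_tW_o)$, can be sketched in a few lines of plain Python. The dimensions and weight values below are arbitrary assumptions chosen for illustration; in practice the weights would be learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(vec, mat):
    """Row vector times matrix, using plain lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def update_state(prev_state, observation, W_s, W_o):
    """One step of S_t = sigma(S_{t-1} W_s + O_t W_o), applied elementwise."""
    from_state = vec_mat(prev_state, W_s)
    from_obs = vec_mat(observation, W_o)
    return [sigmoid(x + y) for x, y in zip(from_state, from_obs)]


# Hypothetical sizes: a 2-dimensional agent state and 3-dimensional observations.
W_s = [[0.5, -0.2], [0.1, 0.3]]
W_o = [[0.2, 0.0], [-0.4, 0.1], [0.3, 0.3]]

state = [0.0, 0.0]
for obs in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    state = update_state(state, obs, W_s, W_o)
```

The agent state after the loop depends on the whole observation sequence, but the agent only ever stores the current fixed-size vector rather than the full history.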

Ⅱ Inside An RL Agent

1.Major Components of an RL Agent

policy, value function, model

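policy: the agent's behaviour function. It takes the state as input and outputs the next decision, and is used to select actions.
- A map from state to action, i.e. $a=π(s)$ (deterministic policy)
- The hope is that the actions chosen by the policy earn more reward
- Stochastic policy: $π(a|s)=P[A_t = a|S_t = s]$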
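
As a minimal sketch of the two kinds of policy, a deterministic policy can be stored as a lookup table and a stochastic policy as one action distribution per state. The state names, action names, and probabilities here are made up for illustration.

```python
import random

# Deterministic policy a = pi(s): a lookup table from state to action.
deterministic_policy = {"s0": "right", "s1": "right", "s2": "up"}

# Stochastic policy pi(a|s): a probability distribution over actions per state.
stochastic_policy = {
    "s0": {"left": 0.1, "right": 0.9},
    "s1": {"left": 0.5, "right": 0.5},
}


def sample_action(policy, state, rng):
    """Draw an action a with probability pi(a|state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [sample_action(stochastic_policy, "s0", rng) for _ in range(1000)]
right_fraction = samples.count("right") / len(samples)
```

With enough samples, `right_fraction` should be close to the 0.9 assigned to `right` in state `s0`.
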
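value function: indicates how good a state and/or action is, i.e. how much reward to expect. It is used to evaluate the agent's behaviour in a given situation and so to choose between actions.
- The value function is a prediction of future reward (how much total reward we expect in the future), used to evaluate how good or bad a state is
- As a function of the state under policy $π$: $v_π(s)=E_π[R_{t+1}+γR_{t+2}+γ^2R_{t+3}+...|S_t=s]$
where $γ \in [0,1]$ is the discount factor. It down-weights rewards the further in the future they arrive, so that sufficiently distant rewards become negligible.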
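
The quantity inside the expectation is the discounted return of one trajectory. A small sketch of computing it for a given reward sequence follows; the reward sequences are invented for illustration, and $v_π(s)$ itself would be the average of such returns over many trajectories that start in $s$ and follow $π$.

```python
def discounted_return(rewards, gamma):
    """R_1 + gamma * R_2 + gamma^2 * R_3 + ... for one reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g


g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75

# A delayed large reward can beat a small immediate one.
patient = discounted_return([0.0, 0.0, 10.0], gamma=0.9)
greedy = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
```

With `gamma=0` only the first reward counts, and as `gamma` approaches 1 later rewards count almost as much as immediate ones.
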
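model: the agent's representation of the environment, built from what it perceives. The agent uses the model to anticipate how the environment will change.
- It predicts what the environment will do next
- Transition model $P$: predicts the next state (the dynamics), i.e. the probability of the environment's next state given the current state and action, $P_{ss'}^a=P[S_{t+1}=s'|S_t=s,A_t=a]$
- Reward model $R$: predicts the next (immediate) reward, i.e. the expected reward given the current state and action, $R_s^a=E[R_{t+1}|S_t=s,A_t=a]$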
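
For a small discrete problem, both parts of the model can be written as tables keyed by (state, action). The sketch below uses a made-up two-state problem; the names `P`, `R`, and `simulate` are illustrative only.

```python
import random

# Transition model: P[(s, a)] is a distribution over next states s'.
P = {
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "go"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
}

# Reward model: R[(s, a)] is the expected immediate reward.
R = {
    ("s0", "go"): -1.0,
    ("s0", "stay"): -1.0,
    ("s1", "go"): 0.0,
    ("s1", "stay"): 0.0,
}


def simulate(state, action, rng):
    """Use the model (not the real environment) to predict one step."""
    dist = P[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    next_state = rng.choices(next_states, weights=weights, k=1)[0]
    return next_state, R[(state, action)]


rng = random.Random(0)
predicted = simulate("s0", "go", rng)
```

An agent holding such tables can roll out imagined trajectories with `simulate` without touching the real environment.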

Note: a model is not always required.

2.Maze Example

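In the maze example, the reward is -1 per time step (so the agent has to find the fastest route out of the maze), the actions are the four compass directions (N, E, S, W), and the state is the agent's location.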

policy: the policy $π(s)$ for each state

value: the value $v_π(s)$ of each state
Model:
- The grid layout represents the transition model
- The number in each cell represents the immediate reward
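
To make the maze quantities concrete, the sketch below builds a tiny maze, computes each cell's value as minus the number of steps to the goal (which is what a reward of -1 per step gives under a shortest-path policy), and then reads a policy off the values by moving to the best neighbouring cell. The 3x4 layout is an assumption made up for this sketch, not the maze from the lecture.

```python
from collections import deque

# Hypothetical 3x4 maze: '#' is a wall, 'G' is the goal, '.' is a free cell.
MAZE = ["..#G",
        ".##.",
        "...."]
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def free(r, c):
    return 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#"


def values():
    """v(s) = -(steps from s to the goal), found by breadth-first search."""
    goal = next((r, c) for r, row in enumerate(MAZE)
                for c, ch in enumerate(row) if ch == "G")
    v = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nxt = (r + dr, c + dc)
            if free(*nxt) and nxt not in v:
                v[nxt] = v[(r, c)] - 1
                queue.append(nxt)
    return v


def greedy_policy(v):
    """In each non-goal cell, step to the neighbouring cell with the highest value."""
    policy = {}
    for (r, c), val in v.items():
        if val == 0:
            continue  # the goal needs no action
        legal = [a for a, (dr, dc) in MOVES.items() if (r + dr, c + dc) in v]
        policy[(r, c)] = max(legal, key=lambda a: v[(r + MOVES[a][0], c + MOVES[a][1])])
    return policy


v = values()
pi = greedy_policy(v)
```

This is the sense in which a value function already implies a policy: no separate policy table is needed once the values are known.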

3.Categorizing RL Agents

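First categorization
- Value Based: No Policy (Implicit); Value Function
- Policy Based: Policy; No Value Function
- Actor Critic: Policy + Value Function
Note: in the first two categories, "No Policy" means the agent can choose its actions from the value function alone. In 2.Maze Example, for instance, the value function figure is already enough to determine the next action. "No Value Function" is analogous: the agent acts on its policy alone.
Second categorization
- Model Free: Policy and/or Value Function; No Model
- Model Based: Policy and/or Value Function + Model
Model Free means the agent does not try to understand the environment: it builds no dynamics model to decide how to act, and instead works directly from experience through its policy and/or value function.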

Ⅲ Problems within RL

The following are some of the problems within RL.

1.Learning and Planning

There are two fundamental problems in sequential decision making.

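Reinforcement Learning
- The environment is initially unknown (e.g. in Atari games, the agent does not know the rules of the game)
- The agent interacts with the environment and receives observations and rewards from it
- The agent improves its policy based on those observations and rewards
Planning
- A model of the environment is known (e.g. in Atari games, the agent knows the rules of the game)
- The agent performs computations with the model (without any external interaction, i.e. no real interaction with the environment is needed)
- The agent improves its policy
Summary: RL learns by interacting with the environment, whereas planning computes with a model.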
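
As one illustration of planning, the sketch below runs value iteration against a known model and never interacts with a real environment. Value iteration is one standard planning method and is used here as an example, not as the method the lecture prescribes; the 4-state chain and the names `model`, `value_iteration`, and `plan` are assumptions made for this sketch.

```python
# Hypothetical known model: a chain of states 0..3, where state 3 is terminal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GAMMA = 1.0


def model(s, a):
    """Known deterministic model: returns (next_state, reward)."""
    if s == 3:
        return 3, 0.0                      # terminal state: nothing more happens
    nxt = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return nxt, -1.0                       # every move costs -1


def backup(v, s, a):
    """One-step lookahead through the model."""
    nxt, reward = model(s, a)
    return reward + GAMMA * v[nxt]


def value_iteration(sweeps=50):
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: max(backup(v, s, a) for a in ACTIONS) for s in STATES}
    return v


def plan():
    """Improve the policy by acting greedily with respect to the planned values."""
    v = value_iteration()
    return {s: max(ACTIONS, key=lambda a: backup(v, s, a)) for s in STATES if s != 3}


v = value_iteration()
policy = plan()
```

A learning agent would have to discover the same values by acting in the environment and observing rewards, because it would not have `model` available.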

2.Exploration and Exploitation

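RL is trial-and-error learning: the agent keeps trying actions to discover which are good and which are bad, and loses some reward along the way.
The agent's goal is to find an optimal policy from its experience without losing too much reward.
This requires balancing exploration and exploitation:
- Exploration: give up some reward to find out more about the environment
- Exploitation: exploit known information to maximize reward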
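
One simple way to balance the two is epsilon-greedy action selection on a multi-armed bandit: with probability epsilon the agent explores a random arm, otherwise it exploits the arm with the best current estimate. Epsilon-greedy is a common baseline chosen here for illustration; the arm means, noise level, and step count are assumptions.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # how often each arm was pulled
    estimates = [0.0] * n_arms     # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)                   # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
        total += reward
    return estimates, counts, total


estimates, counts, total = epsilon_greedy_bandit(
    [0.0, 1.0, 3.0], epsilon=0.1, steps=5000
)
```

With `epsilon=0` the agent may lock onto whichever arm looked good first, while a large `epsilon` wastes many pulls on arms already known to be poor.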

3.Prediction and Control

Prediction: evaluate the future, i.e. estimate how much reward a given policy will obtain
Control: optimize the future, i.e. find the best policy

In RL, we typically have to solve the prediction problem in order to solve the control problem: policies are evaluated so that the best one can be found.

Gridworld Example
- Prediction: under the uniform random policy, compute the value of every state; the lecture's figure shows the resulting table of values.

- Control: improve on the existing policy to find the optimal policy, i.e. the one that obtains the most reward.
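
The prediction half of this example can be sketched as iterative policy evaluation under the uniform random policy. The figures are not reproduced in these notes, so the layout below is an assumption: it follows the well-known 5x5 gridworld from Sutton and Barto's textbook (two special cells A and B that teleport to A' and B' with rewards +10 and +5, a reward of -1 for bumping into a wall, and a discount factor of 0.9).

```python
N = 5
GAMMA = 0.9
# Assumed special cells: any action in A or B teleports the agent and pays a bonus.
SPECIAL = {(0, 1): ((4, 1), 10.0),   # A -> A'
           (0, 3): ((2, 3), 5.0)}    # B -> B'
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step(state, move):
    """Known dynamics: returns (next_state, reward)."""
    if state in SPECIAL:
        return SPECIAL[state]
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0               # bumped into the wall: stay put, pay -1


def evaluate_random_policy(sweeps=1000):
    """Repeatedly apply the Bellman expectation backup for the uniform random policy."""
    v = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(sweeps):
        new_v = {}
        for s in v:
            total = 0.0
            for m in MOVES:
                nxt, reward = step(s, m)
                total += 0.25 * (reward + GAMMA * v[nxt])
            new_v[s] = total
        v = new_v
    return v


v = evaluate_random_policy()
```

Control would go one step further and search for the policy whose values are highest in every state, for example by repeatedly acting greedily with respect to values like these and re-evaluating.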

References

https://www.bilibili.com/video/BV1kb411i7KG?from=search&seid=5362588914313557002

https://blog.youkuaiyun.com/u013745804/category_7216686.html

http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

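(Compiled and reposted from online sources for technical summary purposes only. If this infringes any rights, contact me and it will be removed.)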