Chapter 5 Monte Carlo Methods

Unlike previous chapters, here we do not assume complete knowledge of the environment.

Monte Carlo methods require no perfect model, only experience, organized into episodes: an episode is a complete sequence of states, actions, and rewards from start to termination. Their defining feature is that they use the entire sequence; for example, the return is only available once an episode has finished.
Monte Carlo methods therefore update episode-by-episode, not step-by-step (online).
Here, "Monte Carlo" means methods based on averaging complete returns, and the problem faced is also nonstationary: the return from a state depends on the actions taken in later states of the same episode, which change as the policy is learned.

5.1 Monte Carlo Prediction

First consider Monte Carlo methods for learning the state-value function for a given policy, which is the same problem as Policy Evaluation (Prediction). The underlying principle is the law of large numbers, which is the foundation of all Monte Carlo methods.

  • First-Visit Monte Carlo Policy Evaluation: estimate v_π(s) as the average of the returns following first visits to s.
    • To evaluate state s
    • The first time step t at which state s is visited in an episode:
    • Increment counter: N(s) ← N(s) + 1
    • Increment total return: S(s) ← S(s) + G_t
    • Value is estimated by the mean return: V(s) = S(s) / N(s)
    • By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
  • Every-Visit Monte Carlo Policy Evaluation: estimate v_π(s) as the average of the returns following every visit to s.
    • To evaluate state s
    • Every time step t at which state s is visited in an episode:
    • Increment counter: N(s) ← N(s) + 1
    • Increment total return: S(s) ← S(s) + G_t
    • Value is estimated by the mean return: V(s) = S(s) / N(s)
    • Again, V(s) → v_π(s) as N(s) → ∞

A "visit" to s here means an occurrence of state s within an episode.

First-visit MC prediction
Both first-visit MC and every-visit MC converge to v_π(s) as the number of visits to s goes to infinity.
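
As a concrete illustration, here is a minimal Python sketch of first-visit MC prediction following the N(s)/S(s) bookkeeping above. The `generate_episode(policy)` helper is a hypothetical function that runs one episode under the policy and returns a list of (state, action, reward) tuples, where the reward stored at step t is R_{t+1}.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging the returns following first visits to s."""
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return S(s)
    V = defaultdict(float)  # value estimate V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = generate_episode(policy)          # [(state, action, reward), ...]
        first_visit = {}                            # earliest step at which each state appears
        for t, (state, _, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        G = 0.0
        for t in reversed(range(len(episode))):     # walk backwards through the episode
            state, _, reward = episode[t]
            G = gamma * G + reward                  # G_t = R_{t+1} + gamma * G_{t+1}
            if first_visit[state] == t:             # update only on the first visit
                N[state] += 1
                S[state] += G
                V[state] = S[state] / N[state]
    return V
```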

5.3 Monte Carlo Control

Policy improvement is done by making the policy greedy with respect to the current action-value function:

$$\pi(s) \doteq \arg\max_a q(s, a).$$

Applying the policy improvement theorem to π_k and π_{k+1}:

$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\!\left(s, \arg\max_a q_{\pi_k}(s, a)\right) = \max_a q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) \ge v_{\pi_k}(s),$$

so each π_{k+1} is uniformly as good as, or better than, π_k.

Monte Carlo ES
Exploring starts means that each episode begins at a state–action pair chosen so that every pair has a nonzero probability of being selected as the start, which guarantees that all state–action pairs are visited.
5.4 Monte Carlo Control without Exploring Starts

There are two approaches to avoiding the need for exploring starts:

  • On-policy learning
    • "Learn on the job"
    • Learn about policy π from experience sampled from π
    • On-policy: the policy being updated is the same as the policy that generates the samples
  • Off-policy learning
    • "Look over someone's shoulder"
    • Learn about policy π from experience sampled from μ
    • Off-policy: the policy being updated differs from the policy that generates the samples

The definitions of on-policy and off-policy, and the relationship between them, are central to the approximation methods that come later.

On-policy first-visit MC control
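
A sketch of on-policy first-visit MC control with an ε-greedy (hence ε-soft) policy. The environment interface is an assumption: `env.reset()` returns the initial state and `env.step(action)` returns (next_state, reward, done); `num_actions` and discrete, hashable states are also assumptions.

```python
import random
from collections import defaultdict

def on_policy_first_visit_mc_control(env, num_actions, num_episodes,
                                     gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control for epsilon-soft policies."""
    Q = defaultdict(lambda: [0.0] * num_actions)   # action-value estimates
    N = defaultdict(lambda: [0] * num_actions)     # visit counts per (s, a)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.randrange(num_actions)
        q = Q[state]
        return q.index(max(q))

    for _ in range(num_episodes):
        # Generate one episode following the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)   # assumed interface
            episode.append((state, action, reward))
            state = next_state

        # Average returns from first visits to each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                N[s][a] += 1
                Q[s][a] += (G - Q[s][a]) / N[s][a]   # incremental mean
    return Q
```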

5.5 Off-policy Prediction via Importance Sampling

In general, off-policy methods have greater variance and converge more slowly.

The on-policy approach is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
The off-policy approach is more straightforward: it uses two policies, one that is learned about and becomes the optimal policy, and another, more exploratory one, that is used to generate behavior.

The policy being learned about is called the target policy, here π; the policy used to generate behavior is called the behavior policy, here b.
In this case we say that learning is from data "off" the target policy, and the overall process is termed off-policy learning.

Because the behavior policy must be more stochastic and more exploratory, it can be, for example, an ε-greedy policy.

Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

Given a starting state S_t, the probability of the subsequent state–action trajectory occurring under any policy π is:

$$\begin{aligned}
\Pr\{A_t, S_{t+1}, A_{t+1}, \dots, S_T \mid S_t, A_{t:T-1} \sim \pi\}
&= \pi(A_t|S_t)\, p(S_{t+1}|S_t, A_t)\, \pi(A_{t+1}|S_{t+1}) \cdots p(S_T|S_{T-1}, A_{T-1}) \\
&= \prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k, A_k),
\end{aligned}$$

Note the notion of a trajectory; Monte Carlo tree search will use this concept later.

The importance-sampling ratio is then:

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_k|S_k)\, p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k|S_k)\, p(S_{k+1}|S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

Applying the importance-sampling ratio: given only returns G_t obtained under the behavior policy, we want the expected return (value) under the target policy:

$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s)$$
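
Because the transition probabilities p cancel, the ratio depends only on the two policies. A minimal sketch, assuming `target_prob(a, s)` and `behavior_prob(a, s)` are callables returning π(a|s) and b(a|s), and `episode` is a list of (state, action, reward) tuples:

```python
def importance_sampling_ratio(episode, t, target_prob, behavior_prob):
    """rho_{t:T-1} = prod_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for state, action, _ in episode[t:]:   # steps t, ..., T-1
        rho *= target_prob(action, state) / behavior_prob(action, state)
    return rho
```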

In particular, we can define the set of all time steps at which state s is visited, denoted J(s). This is for an every-visit method; for a first-visit method, J(s) would only include time steps that were first visits to s within their episodes.

Ordinary importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1}\, G_t}{|J(s)|}$$

Weighted importance sampling:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in J(s)} \rho_{t:T(t)-1}}$$
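
The two estimators side by side, as a sketch; `returns_and_ratios` is an assumed list of (G_t, ρ_{t:T(t)-1}) pairs collected over the time steps in J(s) for a single state s. Ordinary importance sampling is unbiased but can have extreme variance; weighted importance sampling is biased (the bias vanishes asymptotically) but has far lower variance.

```python
def ordinary_is_estimate(returns_and_ratios):
    """V(s) = sum(rho * G) / |J(s)| -- unbiased, possibly huge variance."""
    if not returns_and_ratios:
        return 0.0
    return sum(rho * G for G, rho in returns_and_ratios) / len(returns_and_ratios)

def weighted_is_estimate(returns_and_ratios):
    """V(s) = sum(rho * G) / sum(rho) -- biased but much lower variance."""
    total_weight = sum(rho for _, rho in returns_and_ratios)
    if total_weight == 0.0:
        return 0.0
    return sum(rho * G for G, rho in returns_and_ratios) / total_weight
```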
5.6 Incremental Implementation

Suppose we have a sequence of returns G_1, G_2, ..., G_{n-1}, all starting in the same state, each with a corresponding weight W_k (e.g., W_k = ρ_{t_k:T(t_k)-1}). The weighted estimate is then

$$V_n \doteq \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}, \qquad n \ge 2.$$

Writing the weighted update above as an incremental implementation:

$$V_{n+1} \doteq V_n + \frac{W_n}{C_n}\left[G_n - V_n\right], \qquad n \ge 1,$$

$$C_{n+1} \doteq C_n + W_{n+1},$$

where C_0 ≐ 0 and C_n is the cumulative sum of the weights.

Off-policy MC prediction
This is just the incremental implementation of weighted importance sampling above; it makes explicit the relationship between the incremental implementation and importance sampling (see the sketch below).
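
A sketch of the off-policy MC prediction algorithm, i.e., the incremental weighted-importance-sampling update applied to action values (Q ≈ q_π; a state-value version is analogous). `target_prob(a, s)` and `behavior_prob(a, s)` are assumed callables as before, and `episodes` are lists of (state, action, reward) tuples generated under the behavior policy b.

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, target_prob, behavior_prob, gamma=1.0):
    """Incremental weighted importance sampling for Q ~= q_pi (every-visit)."""
    Q = defaultdict(float)   # value estimates Q(s, a)
    C = defaultdict(float)   # cumulative weights C(s, a)

    for episode in episodes:                       # episodes generated under b
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[(state, action)] += W
            # V_{n+1} = V_n + (W_n / C_n) * (G_n - V_n)
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            W *= target_prob(action, state) / behavior_prob(action, state)
            if W == 0.0:                           # ratio stays zero for earlier steps
                break
    return Q
```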

5.7 Off-policy Monte Carlo Control

Off-policy MC control
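
A sketch of off-policy MC control with weighted importance sampling: the target policy is greedy with respect to Q, and the behavior policy is assumed here to be ε-greedy with respect to the current Q (any soft policy would do). The environment interface (`env.reset()`, `env.step(action)` returning (next_state, reward, done)) and `num_actions` are assumptions.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_actions, num_episodes, gamma=1.0, epsilon=0.1):
    """Off-policy MC control; the target policy is greedy w.r.t. Q."""
    Q = defaultdict(lambda: [0.0] * num_actions)
    C = defaultdict(lambda: [0.0] * num_actions)
    greedy = {}                                      # target policy pi(s)

    for _ in range(num_episodes):
        # Generate an episode with the epsilon-greedy behavior policy.
        episode, state, done = [], env.reset(), False
        while not done:
            greedy_a = Q[state].index(max(Q[state]))
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = greedy_a
            # Probability b(action|state) under epsilon-greedy, recorded at generation time.
            if action == greedy_a:
                b_prob = 1 - epsilon + epsilon / num_actions
            else:
                b_prob = epsilon / num_actions
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, b_prob))
            state = next_state

        # Backward pass with incremental weighted importance sampling.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            greedy[state] = Q[state].index(max(Q[state]))
            if action != greedy[state]:   # pi(action|state) = 0, so the ratio is 0
                break
            W *= 1.0 / b_prob             # pi(action|state) = 1 for the greedy target
    return Q, greedy
```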

5.8 *Discounting-aware Importance Sampling

Take the internal structure of the return, as a sum of discounted rewards, into account; this can reduce variance.

The essence of the idea is to think of discounting as determining a probability of termination or, equivalently, a degree of partial termination.

Define the flat partial returns ("flat" meaning undiscounted):

$$\bar{G}_{t:h} \doteq R_{t+1} + R_{t+2} + \cdots + R_h, \qquad 0 \le t < h \le T,$$

The conventional full return G_t can then be viewed as a sum of flat partial returns:

$$\begin{aligned}
G_t &\doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \\
&= (1-\gamma) R_{t+1} \\
&\quad + (1-\gamma)\gamma \left(R_{t+1} + R_{t+2}\right) \\
&\quad + (1-\gamma)\gamma^2 \left(R_{t+1} + R_{t+2} + R_{t+3}\right) \\
&\qquad \vdots \\
&\quad + (1-\gamma)\gamma^{T-t-2} \left(R_{t+1} + R_{t+2} + \cdots + R_{T-1}\right) \\
&\quad + \gamma^{T-t-1} \left(R_{t+1} + R_{t+2} + \cdots + R_T\right) \\
&= (1-\gamma) \sum_{h=t+1}^{T-1} \gamma^{h-t-1} \bar{G}_{t:h} + \gamma^{T-t-1} \bar{G}_{t:T}
\end{aligned}$$

This gives the discounting-aware ordinary importance-sampling estimator:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{|J(s)|}$$

and the discounting-aware weighted importance-sampling estimator:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} \bar{G}_{t:h} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \bar{G}_{t:T(t)} \right)}{\sum_{t \in J(s)} \left( (1-\gamma) \sum_{h=t+1}^{T(t)-1} \gamma^{h-t-1} \rho_{t:h-1} + \gamma^{T(t)-t-1} \rho_{t:T(t)-1} \right)}$$
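
A sketch of the discounting-aware ordinary estimator for a single state s. Each element of the assumed `samples` list holds, for one visit to s, the subsequent rewards R_{t+1..T} and the per-step ratios π(A_k|S_k)/b(A_k|S_k); this preprocessing is hypothetical.

```python
def discounting_aware_ordinary_is(samples, gamma):
    """Ordinary discounting-aware IS estimate of v_pi(s) for one state s.

    Each sample is (rewards, ratios) with
      rewards[i] = R_{t+1+i},  ratios[i] = pi(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i}).
    """
    total = 0.0
    for rewards, ratios in samples:
        horizon = len(rewards)        # T - t
        flat = 0.0                    # flat partial return G_bar_{t:h}
        rho = 1.0                     # ratio rho_{t:h-1}
        estimate = 0.0
        for i in range(horizon):
            rho *= ratios[i]          # now rho = rho_{t:t+i} = rho_{t:h-1}
            flat += rewards[i]        # now flat = G_bar_{t:t+i+1} = G_bar_{t:h}
            h = i + 1                 # h - t
            if h < horizon:
                estimate += (1 - gamma) * gamma ** (h - 1) * rho * flat
            else:                     # final term: gamma^{T-t-1} rho_{t:T-1} G_bar_{t:T}
                estimate += gamma ** (horizon - 1) * rho * flat
        total += estimate
    return total / len(samples) if samples else 0.0
```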
5.9 *Per-decision Importance Sampling

There is a second way in which the structure of the return, as a sum of rewards, can be taken into account in off-policy importance sampling; it too can reduce variance.

$$\rho_{t:T-1} G_t = \rho_{t:T-1}\left(R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T\right) = \rho_{t:T-1} R_{t+1} + \gamma \rho_{t:T-1} R_{t+2} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

The first sub-term above can be written as:
$$\rho_{t:T-1} R_{t+1} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})} \frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})} \cdots \frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})} R_{t+1}$$

Of all the factors above, only the first and the last (the reward) are correlated; the other factors are independent random variables whose expected value is 1:

$$\mathbb{E}\left[\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\right] \doteq \sum_a b(a|S_k)\, \frac{\pi(a|S_k)}{b(a|S_k)} = \sum_a \pi(a|S_k) = 1$$

Of all the ratio factors, only the first survives in expectation, so:

$$\mathbb{E}[\rho_{t:T-1} R_{t+1}] = \mathbb{E}[\rho_{t:t} R_{t+1}]$$

Repeating this analysis for each of the sub-terms yields:

$$\mathbb{E}[\rho_{t:T-1} G_t] = \mathbb{E}[\bar{G}_t]$$

where

$$\bar{G}_t = \rho_{t:t} R_{t+1} + \gamma \rho_{t:t+1} R_{t+2} + \gamma^2 \rho_{t:t+2} R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1} R_T$$

We call this idea per-decision importance sampling.

An ordinary-importance-sampling estimator using G̅_t is:

$$V(s) \doteq \frac{\sum_{t \in J(s)} \bar{G}_t}{|J(s)|}$$
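
A sketch of the per-decision return G̅_t for one episode suffix, with `rewards[i] = R_{t+1+i}` and `ratios[i] = π(A_{t+i}|S_{t+i}) / b(A_{t+i}|S_{t+i})` assumed as inputs; averaging these returns over J(s) gives the estimator above.

```python
def per_decision_return(rewards, ratios, gamma):
    """G_bar_t = rho_{t:t} R_{t+1} + gamma rho_{t:t+1} R_{t+2} + ...
               + gamma^{T-t-1} rho_{t:T-1} R_T."""
    g, rho, discount = 0.0, 1.0, 1.0
    for reward, ratio in zip(rewards, ratios):
        rho *= ratio                  # include ratios only up to this reward's action
        g += discount * rho * reward
        discount *= gamma
    return g
```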