Basic Concepts of Reinforcement Learning
MP (Markov Process)
The state at the next time step depends only on the current state, not on any earlier states (the Markov property).
Policy $\pi$
A policy $\pi$ can be a function, a lookup table, etc. that maps states to actions; it gives the probability of taking action $a$ in state $s$:
$$\pi(a|s)=p[A_t=a|S_t=s]\tag{1}$$
Reward
The return $G_t$ is the discounted sum of future rewards:
$$G_t = R_{t+1}+\gamma R_{t+2}+\cdots=\sum_{k=0}^{\infty}\gamma^{k} R_{t+k+1} \tag{2}$$
State Value function
The state-value function of state $s$ under policy $\pi$:
$$v_\pi(s)=E_\pi\Big[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\,\Big|\,S_t=s\Big]\tag{3}$$
State Value Bellman equation
$$\begin{aligned} v(s) &= E[G_t|S_t=s]\\ &=E[R_{t+1}+\gamma R_{t+2}+\cdots|S_t=s]\\ &=E[R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3}+\cdots)|S_t=s]\\ &=E[R_{t+1}+\gamma G_{t+1}|S_t=s] \\ &=E[R_{t+1}+\gamma v(S_{t+1})|S_t=s] \end{aligned}\tag{4}$$
State-Action Value function
The action-value function for taking action $a$ in state $s$ under policy $\pi$:
$$q_\pi(s,a)=E_\pi\Big[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\,\Big|\,S_t=s,A_t=a\Big]\tag{5}$$
State-Action Value Bellman equation
$$q_\pi(s,a)=E_\pi[R_{t+1}+\gamma q(S_{t+1},A_{t+1})|S_t=s,A_t=a]\tag{6}$$
Look ahead
$$v_\pi(s)=\sum_{a\in A}\pi(a|s)q_\pi(s,a)\tag{7}$$
$$q_\pi(s,a)=R^a_s+\gamma \sum_{s'}P^a_{ss'}v_\pi(s')\tag{8}$$
where $P^a_{ss'}$ is the state-transition probability.
Substituting (8) into (7) gives:
$$v_\pi(s)=\sum_{a\in A}\pi(a|s)\Big(R^a_s+\gamma \sum_{s'\in S}P^a_{ss'}v_\pi(s')\Big)\tag{9}$$
$$v_\pi(s')=\sum_{a'\in A}\pi(a'|s')q_\pi(s',a')\tag{10}$$
Substituting (10) into (8) gives the state-action value:
$$q_\pi(s,a)=R^a_s +\gamma \sum_{s'\in S}P^a_{ss'}\sum_{a'\in A}\pi(a'|s')q_\pi(s',a')\tag{11}$$
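To make equations (7)-(9) concrete, here is a minimal iterative policy evaluation sketch in Python (not part of the original post); the arrays `pi`, `R`, and `P` hold $\pi(a|s)$, $R^a_s$, and $P^a_{ss'}$ for a small tabular MDP, and all names and numbers are illustrative assumptions.

```python
import numpy as np

def policy_evaluation(pi, R, P, gamma=0.9, tol=1e-8):
    """Iterate equation (9) until the state values stop changing.

    pi[s, a]     -- pi(a|s)
    R[s, a]      -- expected immediate reward R^a_s
    P[s, a, s']  -- transition probability P^a_{ss'}
    """
    v = np.zeros(pi.shape[0])
    while True:
        q = R + gamma * P @ v            # q_pi(s, a), equation (8)
        v_new = (pi * q).sum(axis=1)     # v_pi(s), equation (7)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny 2-state, 2-action MDP with made-up numbers.
pi = np.array([[0.5, 0.5], [0.5, 0.5]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
print(policy_evaluation(pi, R, P))
```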
Optimal value functions
$v^\star(s)=\max_\pi v_\pi(s)$ is the largest state value achievable over all policies, and likewise $q^\star(s,a)=\max_\pi q_\pi(s,a)$.
Greedy
$$\pi^\star(a|s)=\begin{cases} 1, & \text{if } a=\arg\max_{a\in A} q^\star(s,a)\\ 0, & \text{otherwise} \end{cases}$$
$\epsilon$-greedy
$$\pi(a|s)=\begin{cases} 1-\epsilon+\frac{\epsilon}{|A(s)|}, & \text{if } a=\arg\max_a Q(s,a)\\ \frac{\epsilon}{|A(s)|}, & \text{if } a \neq \arg\max_a Q(s,a) \end{cases}$$
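A minimal sketch of sampling an action from this $\epsilon$-greedy distribution; the function name `epsilon_greedy` and the tabular `Q` array are my own illustrative choices. The greedy action ends up with probability $1-\epsilon+\epsilon/|A(s)|$ because the uniform exploration branch can also select it.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """Sample an action for state s from the epsilon-greedy policy over Q[s, a]."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform over A(s)
    return int(np.argmax(Q[s]))               # exploit: greedy action

# Example: a random 5-state, 3-action Q table.
Q = np.random.default_rng(0).normal(size=(5, 3))
print(epsilon_greedy(Q, s=2, epsilon=0.1))
```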
softmax
See RL: An Intro, p. 37 for details.
$$\pi(a|s,\theta)=\frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}}$$
where $H_1(a)=1$, and the preference-update rule is
$$H_{t+1}(A_t)=H_t(A_t)+\alpha(R_t-\bar R_t)(1-\pi_t(A_t)),$$
where $\alpha$ is the step-size parameter, e.g. 0.1, 0.2, 0.3, 0.4, ...
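A rough sketch of the softmax (gradient-bandit) preference update. The post only states the update for the chosen action $A_t$; the sketch below also applies the usual companion update for the non-chosen actions, which is my addition, and all names are illustrative.

```python
import numpy as np

def softmax(H):
    """pi(a) = exp(H(a)) / sum_b exp(H(b)), computed in a numerically stable way."""
    z = np.exp(H - H.max())
    return z / z.sum()

def gradient_bandit_step(H, a, reward, baseline, alpha=0.1):
    """One preference update after taking action a and observing `reward`.

    The chosen action's preference rises by alpha*(R - baseline)*(1 - pi(a));
    the other preferences fall by alpha*(R - baseline)*pi(b) -- the second
    half is the standard companion rule, added here for completeness.
    """
    pi = softmax(H)
    H = H - alpha * (reward - baseline) * pi      # down-weight all actions
    H[a] += alpha * (reward - baseline)           # net +alpha*(R-baseline)*(1-pi[a]) for a
    return H

H = np.ones(4)                                    # H_1(a) = 1 for all a, as above
H = gradient_bandit_step(H, a=2, reward=1.0, baseline=0.5)
print(softmax(H))
```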
Reinforcement Learning Algorithms
Model-based
Dynamic programming
Model-free
Value-function-based methods
MC (Monte Carlo method)
One episode produces a trajectory $\tau=\{S_1,A_1,R_1,S_2,A_2,R_2,\cdots,S_T,R_T\}$.
First-visit MC
In a single episode, the discounted return at the first visit to state $s$ is
$$G_t(s)=R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{T-t-1}R_T.$$
Running $N$ episodes and averaging these returns gives the estimate
$$v(s)=\frac{G_{t_1}(s)+G_{t_2}(s)+\cdots+G_{t_N}(s)}{N}$$
Every-visit MC
In a single episode, the discounted return of each visit to state $s$ is
$$G^i_t(s)=R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{T-t-1}R_T,$$
where the superscript $i$ denotes the $i$-th episode. If state $s$ is visited $h$ times in the $i$-th episode, the corresponding returns are $G^i_{m_1},G^i_{m_2},G^i_{m_3},\cdots,G^i_{m_h}$. Running many episodes, the expectation of $G(s)$ is estimated by averaging over all visits:
$$v(s)=\frac{G^1_{m_1}(s)+G^1_{m_2}(s)+\cdots+G^1_{m_h}(s)+G^2_{t_1}(s)+\cdots+G^2_{t_n}(s)+\cdots}{N(G)}$$
where $N(G)$ is the total number of collected returns.
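A minimal Monte Carlo prediction sketch covering both variants via a `first_visit` flag; the episode format (a list of `(state, reward)` pairs) and all names are assumptions for illustration.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.99, first_visit=True):
    """Monte Carlo state-value estimation from a list of episodes.

    Each episode is a list of (state, reward) pairs, where the reward is the
    one received after leaving that state. Returns a dict state -> V(s).
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit and any(episode[k][0] == s for k in range(t)):
                continue  # s appears earlier, so this is not its first visit
            returns_sum[s] += G
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

# Two toy episodes over states 'A' and 'B'.
episodes = [[('A', 1.0), ('B', 0.0), ('A', 2.0)],
            [('B', 1.0), ('A', 0.0)]]
print(mc_prediction(episodes, first_visit=True))
print(mc_prediction(episodes, first_visit=False))
```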
On-policy
The policy used to generate samples (the behavior policy) and the target policy are the same.
Off-policy & Importance Sampling
The policy used to generate samples (the behavior policy) and the target policy are not the same.
Importance sampling is a general technique for estimating an expectation under one distribution using samples drawn from another. Each trajectory $\tau$ is weighted by the relative probability of it occurring under the target policy $\pi$ versus the behavior policy $b$; this is how importance sampling is applied to the off-policy setting. Given a starting state $S_t$, the probability of the subsequent state-action trajectory $\tau$ under policy $\pi$ is:
$$\begin{aligned} P\{A_t,S_{t+1},&A_{t+1},\cdots,S_T \mid S_t,A_{t:T-1} \sim \pi \}\\ &= \pi(A_t|S_t)p(S_{t+1}|S_t,A_t)\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1}) \\ &= \prod^{T-1}_{k=t} \pi(A_k|S_k)p(S_{k+1}|S_k,A_k) \end{aligned}$$
The importance-sampling ratio:
$$\rho_{t:T-1} \doteq \frac{\prod^{T-1}_{k=t} \pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod^{T-1}_{k=t} b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod^{T-1}_{k=t}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$
Ordinary importance sampling
$$V(s) \doteq \frac{\sum_{t\in \mathscr T(s)} \rho_{t:T(t)-1}G_t}{|\mathscr T(s)|}$$
where $t$ ranges over the time steps at which state $s$ is visited, $T(t)$ is the termination time of the episode containing $t$, and $\mathscr T(s)$ is the set of all time steps at which $s$ occurs.
Weighted importance sampling
$$V(s) \doteq \frac{\sum_{t\in \mathscr T(s)} \rho_{t:T(t)-1}G_t}{\sum_{t\in \mathscr T(s)}\rho_{t:T(t)-1}}$$
- Suppose we have a sequence of returns $G_1,G_2,\dots,G_{n-1}$, all starting in the same state, each with a corresponding random weight $W_i$ (e.g. $W_i=\rho_{t_i:T(t_i)-1}$). We wish to form the estimate
$$V_n\doteq \frac{\sum^{n-1}_{k=1}W_kG_k}{\sum^{n-1}_{k=1}W_k}, \quad n \geq 2$$
and keep it up to date as each new return $G_n$ is obtained. To track $V_n$, we must also maintain for each state the cumulative sum $C_n$ of the weights given to the first $n$ returns. The update rule for $V_n$ is
$$V_{n+1}\doteq V_n +\frac{W_n}{C_n}\big[G_n-V_n\big],\quad n\geq1$$
and
$$C_{n+1} \doteq C_n+W_{n+1},$$
where $C_0\doteq 0$, and $V_1$ is arbitrary and therefore need not be specified.
pseudocode:
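The pseudocode figure is not reproduced here; as a rough stand-in, a minimal sketch of just the incremental weighted-importance-sampling update above (the function name and example numbers are mine):

```python
def weighted_is_update(V, C, G, W):
    """One incremental weighted-importance-sampling update for a single state.

    V -- current estimate; C -- cumulative sum of weights seen so far;
    G -- the newly observed return; W -- its importance-sampling weight.
    """
    C = C + W                       # fold the new weight into the running sum
    if C > 0:
        V = V + (W / C) * (G - V)   # V_{n+1} = V_n + (W_n / C_n) [G_n - V_n]
    return V, C

# Start from C_0 = 0 and an arbitrary V_1, then feed in (return, weight) pairs.
V, C = 0.0, 0.0
for G, W in [(5.0, 1.0), (3.0, 0.5), (4.0, 2.0)]:
    V, C = weighted_is_update(V, C, G, W)
print(V, C)   # matches (1*5 + 0.5*3 + 2*4) / 3.5
```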
TD (Temporal-Difference) Method
Sarsa: On-policy TD Control
Action-values update
$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]$$
If $S_{t+1}$ is terminal, then $Q(S_{t+1},A_{t+1})$ is defined to be 0.
The TD error is
$$\delta_t=R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)$$
pseudocode:
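The Sarsa pseudocode figure is likewise not reproduced; below is a hedged sketch of tabular Sarsa, assuming a classic Gym-style environment (`env.reset()` returns a state index, `env.step(a)` returns `(state, reward, done, info)`, and both spaces are discrete). The interface and hyperparameters are assumptions, not from the post.

```python
import numpy as np

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Sarsa with an epsilon-greedy behavior/target policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.action_space.n))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = policy(s2)
            # Q(S', A') is taken as 0 when S' is terminal, as noted above.
            target = r if done else r + gamma * Q[s2, a2]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```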
Q-learning: Off-policy TD Control
Action-Values update
$$Q(S_t,A_t)\leftarrow Q(S_t,A_t)+\alpha[R_{t+1}+\gamma \max_a Q(S_{t+1},a)-Q(S_t,A_t)]$$
Here the learned action-value function $Q$ directly approximates the optimal action-value function $q^*$, independent of the policy being followed.
pseudocode:
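A sketch of the corresponding Q-learning update step (names are mine); the only difference from Sarsa is that the target bootstraps from $\max_a Q(S_{t+1},a)$ rather than from the action actually taken next.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s2, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update on Q[s, a]."""
    # Off-policy target: bootstrap from the best next action, not the one taken.
    target = r if done else r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example with a tiny 4-state, 2-action table and a made-up transition.
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s2=3, done=False)
print(Q[0, 1])
```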
TD($\lambda$)
- Recall that in the Monte Carlo update of the estimate of $v_\pi(S_t)$, we use the complete return as the target:
$$G_t\doteq R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots+\gamma^{T-t-1}R_T$$
We call this quantity the target of the update.
- In a one-step update, the target is:
$$G_{t:t+1}\doteq R_{t+1}+\gamma V_t(S_{t+1})$$
where $V_t:\mathcal S\rightarrow \mathbb R$ is the estimate of $v_\pi$ at time $t$. The subscript of $G_{t:t+1}$ indicates that it is the return truncated at $t+1$; the discounted estimate $\gamma V_t(S_{t+1})$ takes the place of $\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots+\gamma^{T-t-1}R_T$.
- In a two-step update, the target is:
$$G_{t:t+2}\doteq R_{t+1}+ \gamma R_{t+2}+ \gamma^2 V_{t+1}(S_{t+2})$$
- $\cdots$
- Similarly, in an n-step update the target is:
$$G_{t:t+n}\doteq R_{t+1}+ \gamma R_{t+2}+\cdots+ \gamma^{n-1}R_{t+n}+\gamma^{n} V_{t+n-1}(S_{t+n})$$
for all $n,t$ such that $n\ge 1$ and $0\le t< T-n$.
Naturally, the n-step update of the state-value function is then:
$$V_{t+n}(S_t)\doteq V_{t+n-1}(S_t)+\alpha\big[G_{t:t+n}-V_{t+n-1}(S_{t})\big],\quad 0\le t<T.$$
pseudocode:
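Since the pseudocode figure is missing, here is a simplified n-step TD prediction sketch that processes pre-collected episodes and bootstraps from the current estimate $V$ rather than the time-indexed $V_{t+n-1}$; the episode format and all names are illustrative assumptions.

```python
import numpy as np

def n_step_td_value(episodes, n=3, alpha=0.1, gamma=0.99, n_states=10):
    """n-step TD prediction from pre-collected episodes.

    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state; states are integers below n_states.
    """
    V = np.zeros(n_states)
    for episode in episodes:
        T = len(episode)
        for t in range(T):
            end = min(t + n, T)
            # Truncated n-step return G_{t:t+n}
            G = sum(gamma ** (k - t) * episode[k][1] for k in range(t, end))
            if t + n < T:
                G += gamma ** n * V[episode[t + n][0]]   # bootstrap from V(S_{t+n})
            s = episode[t][0]
            V[s] += alpha * (G - V[s])
    return V

# Tiny example: two episodes over integer-labelled states.
episodes = [[(0, 0.0), (1, 1.0), (2, 0.0), (3, 5.0)],
            [(0, 0.0), (2, 1.0), (3, 2.0)]]
print(n_step_td_value(episodes, n=2, n_states=4))
```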
n-step Sarsa
Action-values update
Redefine the n-step return (the update target) in terms of estimated action values:
$$G_{t:t+n}\doteq R_{t+1}+ \gamma R_{t+2}+\cdots+ \gamma^{n-1}R_{t+n}+\gamma^{n} Q_{t+n-1}(S_{t+n},A_{t+n}),\quad n\ge1,\ 0\le t<T-n$$
where $G_{t:t+n}\doteq G_t$ if $t+n\ge T$. The update is:
$$Q_{t+n}(S_t,A_t)\doteq Q_{t+n-1}(S_t,A_t)+\alpha\big[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\big],\quad 0\le t<T$$
pseudocode
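A sketch of the core n-step Sarsa update for a single time $\tau$, assuming the states, actions, and rewards of the episode have been buffered; the buffer layout and names are my assumptions.

```python
import numpy as np

def n_step_sarsa_update(Q, states, actions, rewards, tau, T, n, alpha=0.1, gamma=0.99):
    """Apply the n-step Sarsa update for time tau from buffered trajectories.

    states[t], actions[t] are S_t, A_t and rewards[t] holds R_{t+1};
    T is the episode length and Q is the tabular array Q[s, a].
    """
    end = min(tau + n, T)
    # G_{tau:tau+n}: n discounted rewards, plus a bootstrap term if not at the end
    G = sum(gamma ** (k - tau) * rewards[k] for k in range(tau, end))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    s, a = states[tau], actions[tau]
    Q[s, a] += alpha * (G - Q[s, a])
    return Q

# Example with a short buffered episode (T = 3) and a 4-state, 2-action table.
Q = np.zeros((4, 2))
Q = n_step_sarsa_update(Q, states=[0, 1, 2, 3], actions=[1, 0, 1, 0],
                        rewards=[0.0, 1.0, 2.0], tau=0, T=3, n=2)
print(Q[0, 1])
```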
n-step Off-policy Learning
In n-step TD methods, returns are built from n steps, so we only need the relative probabilities of those n actions. For example, in the off-policy n-step version, the update for time $t$ (actually made at time $t+n$) can simply be weighted by $\rho_{t:t+n-1}$:
$$V_{t+n}(S_t)\doteq V_{t+n-1}(S_t)+\alpha \rho_{t:t+n-1}\big[G_{t:t+n}-V_{t+n-1}(S_t)\big],\quad 0\le t<T$$
where $\rho_{t:t+n-1}$ is the importance-sampling ratio, the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:
$$\rho_{t:h}\doteq \prod^{\min(h,T-1)}_{k=t}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$
Action-values update
Analogous to the n-step Sarsa update above, adding an importance-sampling ratio gives a simple off-policy form:
$$Q_{t+n}(S_t,A_t)\doteq Q_{t+n-1}(S_t,A_t)+\alpha \rho_{t+1:t+n}\big[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)\big],\quad 0\le t<T$$
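A minimal sketch of the weighted update and of computing the ratio (names are assumptions); it simply scales the usual n-step error by $\rho_{t+1:t+n}$.

```python
import numpy as np

def importance_ratio(pi_probs, b_probs):
    """rho = prod_k pi(A_k|S_k) / b(A_k|S_k) over the relevant action steps."""
    return float(np.prod(np.asarray(pi_probs) / np.asarray(b_probs)))

def off_policy_n_step_update(Q, s, a, G, rho, alpha=0.1):
    """Off-policy n-step update: the n-step error for Q[s, a], scaled by rho."""
    Q[s, a] += alpha * rho * (G - Q[s, a])
    return Q

# Example: two steps where the target policy is greedier than the behavior policy.
Q = np.zeros((3, 2))
rho = importance_ratio(pi_probs=[1.0, 1.0], b_probs=[0.6, 0.8])
Q = off_policy_n_step_update(Q, s=0, a=1, G=2.0, rho=rho)
print(rho, Q[0, 1])
```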
Policy-based methods
See https://blog.youkuaiyun.com/anny0001/article/details/103696709