Reinforcement Learning Principles in Python, Part 02: The Bellman Equation
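This chapter follows the "State Values and Bellman Equation" chapter of Prof. Shiyu Zhao's textbook Mathematical Foundations of Reinforcement Learning throughout, so please read the two side by side. This series focuses only on implementing the mathematical concepts in code.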
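Concepts
Introducing state values through bootstrapping
Bootstrapping
Let $v_i$ denote the return obtained when starting from state $s_i$ ($i = 1, \dots, 4$), where the agent moves in the cycle $s_1 \to s_2 \to s_3 \to s_4 \to s_1$ and collects reward $r_i$ on leaving $s_i$: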
$$
\begin{aligned}
v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \dots = r_1 + \gamma v_2,\\
v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \dots = r_2 + \gamma v_3,\\
v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \dots = r_3 + \gamma v_4,\\
v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \dots = r_4 + \gamma v_1.
\end{aligned}
$$
In matrix form:
$$
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}
+ \gamma
\begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
$$
This can be written compactly, and then solved for $\pmb v$, as
$$
\pmb v = \pmb r + \gamma \pmb P \pmb v
\quad\Longrightarrow\quad
\pmb v = (\pmb I - \gamma \pmb P)^{-1} \pmb r
$$
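To make the matrix form concrete, here is a minimal NumPy sketch. The reward vector `r` and the discount factor `gamma = 0.9` are made-up values for illustration only; the matrix `P` is the cyclic one from the equation above. It computes the closed-form solution and compares it with repeatedly applying the bootstrapping update $\pmb v \leftarrow \pmb r + \gamma \pmb P \pmb v$.

```python
import numpy as np

gamma = 0.9                          # discount factor (assumed value)
r = np.array([0.0, 1.0, 1.0, 1.0])   # r_1..r_4 (hypothetical rewards)

# Cyclic transitions s1 -> s2 -> s3 -> s4 -> s1, as in the matrix above
P = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# Closed form: solve (I - gamma * P) v = r instead of inverting explicitly
v_closed = np.linalg.solve(np.eye(4) - gamma * P, r)

# Bootstrapping as a fixed-point iteration: v <- r + gamma * P v
v_iter = np.zeros(4)
for _ in range(1000):
    v_iter = r + gamma * P @ v_iter

print("closed form:", v_closed)
print("iteration:  ", v_iter)
```

Using `np.linalg.solve` rather than forming the inverse is a numerical convenience, not something the derivation requires; both routes should agree for $\gamma < 1$.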
State value
$$
S_t \xrightarrow{A_t} S_{t+1},\ R_{t+1}
$$
This denotes taking action $A_t$ in state $S_t$, moving to state $S_{t+1}$, and receiving reward $R_{t+1}$. Starting from time $t$, we obtain a trajectory:
$$
S_t \xrightarrow{A_t} S_{t+1},\ R_{t+1} \xrightarrow{A_{t+1}} S_{t+2},\ R_{t+2} \xrightarrow{A_{t+2}} S_{t+3},\ R_{t+3} \dots
$$
The discounted return along this trajectory is
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots, \qquad \gamma \in (0, 1)
$$
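As a small illustration of the definition, the sketch below computes the discounted return of a finite reward sequence. The function name `discounted_return`, the reward list, and `gamma = 0.5` are hypothetical choices, and truncating to a finite list is an assumption (the definition itself is an infinite sum).

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite list [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [0.0, 1.0, 1.0, 1.0]   # hypothetical R_{t+1}, ..., R_{t+4}
gamma = 0.5                      # assumed discount factor
g_t = discounted_return(rewards, gamma)
print(g_t)  # 0 + 0.5*1 + 0.25*1 + 0.125*1 = 0.875
```

The backward loop is the same recursion as bootstrapping: the return from one step equals the immediate reward plus $\gamma$ times the return from the next step.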
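The state value is defined as the expected discounted return when starting from state $s$ and following policy $\pi$: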
$$
v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]
$$
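Because the state value is an expectation over returns, it can be approximated by averaging sampled returns. The sketch below does this for a made-up two-state chain: the transition matrix `P` (standing in for the transitions induced by some fixed policy), the per-state rewards `r`, `gamma`, the horizon, and the number of episodes are all assumptions for illustration. The sampled estimate is compared against the closed-form solution $(\pmb I - \gamma \pmb P)^{-1} \pmb r$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
gamma = 0.9                      # assumed discount factor

# Hypothetical two-state chain under a fixed policy
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])       # P[s, s'] = probability of moving s -> s'
r = np.array([0.0, 1.0])         # reward received on leaving state s


def mc_state_value(s0, n_episodes=2000, horizon=100):
    """Average truncated discounted return over sampled trajectories from s0."""
    s = np.full(n_episodes, s0)  # all episodes start in s0
    g = np.zeros(n_episodes)
    discount = 1.0
    for _ in range(horizon):
        g += discount * r[s]
        discount *= gamma
        # Two-state chain: next state is 1 with probability P[s, 1], else 0
        s = (rng.random(n_episodes) < P[s, 1]).astype(int)
    return g.mean()


v_mc = np.array([mc_state_value(s0) for s0 in range(2)])
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

print("Monte Carlo estimate:", v_mc)
print("closed form:         ", v_exact)
```

The two results should be close but not identical: the estimate carries sampling noise, plus a small bias from cutting each trajectory off at `horizon` steps.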