Painless Reinforcement Learning Notes, Lesson 1

Reference books:
1. Wiering M., van Otterlo M. (eds.). Reinforcement Learning: State-of-the-Art. Springer, 2012.
2. Sutton R. S., Barto A. G. Reinforcement Learning: An Introduction. MIT Press.

Next post:
https://blog.youkuaiyun.com/Jinyindao243052/article/details/107051507

Markov Decision Process

Definition

Suppose that there is an interactive system as follows which consists of an environment and an agent.
(Figure: the agent–environment interaction loop)
Its running process is as follows. At the beginning of step $t=0,1,\dots$, the agent knows the environment state $s_t$. It then chooses an action $a_t\in\mathcal{A}$ that is available in state $s_t$ and applies it. After action $a_t$ is applied, the environment transitions to a new state $s_{t+1}$ according to a probability distribution $T(s_t,a_t)$ over $\mathcal{S}$, generates a reward $r_t=R(s_t,a_t,s_{t+1})$, and reports the new state and the reward to the agent.

Here $\mathcal{S}$ is the set of possible environment states, called the state space, and $\mathcal{A}$ is the set of possible agent actions, called the action space. $T:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{P}(\mathcal{S})$ is called the transition function, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over $\mathcal{S}$. The probability of ending up in state $s'$ after applying an applicable action $a$ in state $s$ is denoted $T(s,a,s')$. $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$ is called the reward function; the reward for ending up in state $s'$ after doing action $a$ in state $s$ is $R(s,a,s')$.

Note that the probability distribution of the next state $s_{t+1}$ is determined only by $s_t$ and $a_t$, and is not directly influenced by earlier states or actions. Likewise, the reward $r_t$ is determined by $s_t$, $a_t$, and $s_{t+1}$, and is not directly influenced by earlier states, actions, or rewards. This property is called the Markov property.

In the theoretical analysis, we assume the process never terminates.

The discrete-time running process of the foregoing system is called a Markov Decision Process (MDP).

Related Concepts

Discounted Cumulative Reward: The discounted cumulative reward of the process from step $t$ is defined as $G_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, where $\gamma\in(0,1)$ is a parameter called the discount rate.

An MDP is usually represented by a tuple $\langle\mathcal{S},\mathcal{A},T,R,\gamma\rangle$, where $\mathcal{S}$ is its state space, $\mathcal{A}$ its action space, $T$ its transition function, $R$ its reward function, and $\gamma$ its discount rate.
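As a concrete illustration, here is a minimal sketch of how such a tuple can be stored in code. The two-state, two-action MDP below (and all names such as `states`, `T`, `R`, `gamma`) is a made-up example for illustration, not something from the text.

```python
# A minimal sketch of an MDP tuple <S, A, T, R, gamma> as plain Python data.
# The two-state, two-action MDP below is a made-up example.

states = ["s0", "s1"]          # state space S
actions = ["stay", "move"]     # action space A
gamma = 0.9                    # discount rate

# Transition function T(s, a, s'): probability of landing in s' after doing a in s.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# Reward function R(s, a, s'): reward received when that transition happens.
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s0"): 0.0, ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): 0.0, ("s1", "move", "s1"): 2.0,
}

# Each conditional distribution T(s, a) should sum to one over S.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```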

History: The sequence $(s_0,a_0,s_1,a_1,\dots)$ is called the history of the MDP process (denoted $\tau$). The probability of $(s_t,a_t,\dots,s_{t'})$ given $s_t$ and $a_t,\dots,a_{t'-1}$ is $\prod_{i=t}^{t'-1}T(s_i,a_i,s_{i+1})$.

Trajectory: The sequence $(s_0,a_0,r_0,s_1,a_1,r_1,\dots)$ is called the trajectory of the MDP process (denoted $\mathcal{E}$).
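For a quick sanity check of the definition of $G_t$, the sketch below computes the discounted cumulative reward of a (truncated) reward sequence taken from a trajectory; the reward values are made up.

```python
# Discounted cumulative reward G_t = sum_k gamma^k * r_{t+k},
# computed over a finite, truncated reward sequence from a trajectory.

def discounted_return(rewards, gamma, t=0):
    """Return G_t for the rewards r_t, r_{t+1}, ... contained in rewards[t:]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

rewards = [1.0, 0.0, 2.0, 1.0]                     # made-up rewards r_0 ... r_3
print(discounted_return(rewards, gamma=0.9))       # G_0
print(discounted_return(rewards, gamma=0.9, t=2))  # G_2
```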

Policy for MDP Agent

Definition


Policy: A deterministic policy $\pi$ is a function $\pi:\mathcal{S}\rightarrow\mathcal{A}$, which tells the agent the action to take in a particular state.

A stochastic policy $\pi$ is a function $\pi:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the set of probability distributions over $\mathcal{A}$. It tells the agent the probability of taking each action in a particular state. Denote the probability of action $a$ under $\pi(s)$ by $\pi(a|s)$.

Adopting a deterministic policy $\pi$ is equivalent to adopting the stochastic policy $\tilde{\pi}$ given by

$$\tilde{\pi}(a|s)=\begin{cases}1, & a=\pi(s)\\ 0, & a\neq\pi(s)\end{cases},\quad s\in\mathcal{S}$$

Therefore, we only discuss stochastic policies in what follows.
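A small sketch of how a stochastic policy $\pi(a|s)$ might be represented and sampled, and how a deterministic policy is wrapped as its degenerate stochastic equivalent (state and action names are illustrative):

```python
import random

# A stochastic policy: for each state, a probability distribution over actions.
pi = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 1.0, "move": 0.0},   # degenerate distribution = deterministic in s1
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a|s)."""
    acts, probs = zip(*pi[s].items())
    return random.choices(acts, weights=probs, k=1)[0]

def as_stochastic(det_pi, actions):
    """Wrap a deterministic policy (state -> action) as the equivalent stochastic one."""
    return {s: {a: 1.0 if a == det_pi[s] else 0.0 for a in actions} for s in det_pi}

print(sample_action(pi, "s0"))
print(as_stochastic({"s0": "move", "s1": "stay"}, ["stay", "move"]))
```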

Judgement for Policy

Value Functions

State Value Function (V function): The state value function of a policy $\pi$ is defined as $V^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$,

$$V^{\pi}(s)=\mathbb{E}_{\tau|s_0=s,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau|s_0=s,\pi}\{G_0\},\quad s\in\mathcal{S}$$

where $\tau=(s_0,a_0,s_1,a_1,\dots)$ is the sequence of state–action transitions. $V^{\pi}(s)$ is the expected discounted cumulative reward of the process when $s_0=s$ and the agent adopts policy $\pi$.

For any sequence $[S_0,A_0,S_1,\dots]$ with $S_0=s$, $S_i\in\mathcal{S}$, $A_i\in\mathcal{A}$:

$$p\big(\tau|_{s_0=s,\pi}=[S_0,A_0,S_1,\dots]\big)=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$

$$p\big(\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]\big)=\prod_{i=0}^{\infty}T(S_i,A_i,S_{i+1})\,\pi(A_i|S_i)$$

Therefore the probability distributions of the sequences $\tau|_{s_0=s,\pi}$ and $\tau_t|_{s_t=s,\pi}$ are the same. Moreover, when $\tau|_{s_0=s,\pi}=\tau_t|_{s_t=s,\pi}=[S_0,A_0,S_1,\dots]$, we have $G_0=G_t=\sum_{i=0}^{\infty}\gamma^{i}R(S_i,A_i,S_{i+1})$.

Therefore $\mathbb{E}_{\tau|s_0=s,\pi}\{G_0\}=\mathbb{E}_{\tau_t|s_t=s,\pi}\{G_t\}$, and hence $V^{\pi}(s)=\mathbb{E}_{\tau_t|s_t=s,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ conditioned on $s_t=s$ and following $\pi$ afterwards.
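The definition of $V^{\pi}$ can be checked empirically. The sketch below, assuming the same made-up two-state MDP used earlier and truncating each trajectory at a finite horizon (so the neglected tail is of order $\gamma^{H}$), estimates $V^{\pi}(s)$ as the average of sampled discounted returns.

```python
import random

states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "move"): {"s0": 0.7, "s1": 0.3}}
R = lambda s, a, s2: 1.0 if s2 == "s1" else 0.0          # made-up reward function
pi = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 1.0, "move": 0.0}}

def sample(dist):
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def mc_value(s, n_episodes=2000, horizon=100):
    """Monte Carlo estimate of V^pi(s): average of truncated discounted returns."""
    total = 0.0
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):                         # truncate the infinite process
            a = sample(pi[state])
            s_next = sample(T[(state, a)])
            g += discount * R(state, a, s_next)
            discount *= gamma
            state = s_next
        total += g
    return total / n_episodes

print({s: round(mc_value(s), 3) for s in states})
```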

State-Action Value Function (Q function): The state-action value function is defined as $Q^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$,

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau|s_0=s,a_0=a,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}=\mathbb{E}_{\tau|s_0=s,a_0=a,\pi}\{G_0\},\quad s\in\mathcal{S},\ a\in\mathcal{A}$$

$Q^{\pi}(s,a)$ is the expected discounted cumulative reward of the process from step $0$ when $s_0=s$, $a_0=a$, and the agent adopts policy $\pi$. Similarly, $Q^{\pi}(s,a)=\mathbb{E}_{\tau_t|s_t=s,a_t=a,\pi}[G_t]$, which is the expected discounted cumulative reward of the process from step $t$ conditioned on $s_t=s$, $a_t=a$ and following $\pi$ afterwards.

TD formula for value functions

• The TD($t$) ($t=1,2,\dots$) formula for the V function is $V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\big]$, $s\in\mathcal{S}$.

Proof.

$$
\begin{aligned}
V^{\pi}(s)&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\
&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s,\pi}[G_{k+t}]\\
&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s,\pi}\,\mathbb{E}_{(a_{k+t},\dots)|(s_k,\dots,s_{k+t}),s_k=s,\pi}[G_{k+t}]\\
&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t})|s_k=s,\pi}\big[V^{\pi}(s_{k+t})\big]\\
&=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big]
\end{aligned}
$$

• The TD($t$) ($t=1,2,\dots$) formula for the Q function is $Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\big]$, $s\in\mathcal{S}$, $a\in\mathcal{A}$.

Proof.

$$
\begin{aligned}
Q^{\pi}(s,a)&=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}G_{k+t}\Big]\\
&=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}[G_{k+t}]\\
&=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})|s_k=s,a_k=a,\pi}\,\mathbb{E}_{(s_{k+t+1},\dots)|(s_k,\dots,s_{k+t},a_{k+t}),s_k=s,a_k=a,\pi}[G_{k+t}]\\
&=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}\Big]+\gamma^{t}\,\mathbb{E}_{(s_k,\dots,s_{k+t},a_{k+t})|s_k=s,a_k=a,\pi}\big[Q^{\pi}(s_{k+t},a_{k+t})\big]\\
&=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\Big]
\end{aligned}
$$

Corollary.

(1)
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big]$$
This is called the Bellman equation for the V function.

The proof follows easily from the TD(1) formula for V and is omitted.

(2)
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\Big[R(s,a,s')+\gamma\sum_{a'}\pi(a'|s')\,Q^{\pi}(s',a')\Big]$$
This is called the Bellman equation for the Q function.

The proof follows easily from the TD(1) formula for Q and is omitted.
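For a finite MDP the Bellman equation for $V^{\pi}$ is a linear system in the $|\mathcal{S}|$ unknowns $V^{\pi}(s)$, so it can be solved exactly. Below is a minimal numpy sketch on the made-up two-state MDP: it assembles $V=r_{\pi}+\gamma P_{\pi}V$ and solves $(I-\gamma P_{\pi})V=r_{\pi}$.

```python
import numpy as np

states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "move"): {"s0": 0.7, "s1": 0.3}}
R = lambda s, a, s2: 1.0 if s2 == "s1" else 0.0
pi = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 1.0, "move": 0.0}}

n = len(states)
idx = {s: i for i, s in enumerate(states)}
P_pi = np.zeros((n, n))   # P_pi[s, s'] = sum_a pi(a|s) T(s, a, s')
r_pi = np.zeros(n)        # r_pi[s]     = sum_a pi(a|s) sum_s' T(s, a, s') R(s, a, s')
for s in states:
    for a, pa in pi[s].items():
        for s2, p in T[(s, a)].items():
            P_pi[idx[s], idx[s2]] += pa * p
            r_pi[idx[s]] += pa * p * R(s, a, s2)

# Bellman equation in matrix form: V = r_pi + gamma * P_pi V  =>  (I - gamma P_pi) V = r_pi
V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(dict(zip(states, V.round(3))))
```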

Relation of V and Q

The relation between the $V$ and $Q$ functions is given by the following theorems.

Theorem 1.
$$V^{\pi}(s)=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a),\quad s\in\mathcal{S}$$
Proof.
$$
\begin{aligned}
V^{\pi}(s)&=\mathbb{E}_{\tau|s_0=s,\pi}\Big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big\}\\
&=\mathbb{E}_{a_0|s_0=s,\pi}\,\mathbb{E}_{\tau|s_0=s,a_0,\pi}\{G_0\}\\
&=\mathbb{E}_{a_0|s_0=s,\pi}\{Q^{\pi}(s,a_0)\}\\
&=\sum_{a}\pi(a|s)\,Q^{\pi}(s,a)
\end{aligned}
$$
Theorem 2. For $t=1,2,\dots$, it holds that
$$V^{\pi}(s)=\mathbb{E}_{\tau_k|s_k=s,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}Q^{\pi}(s_{k+t},a_{k+t})\Big],\quad s\in\mathcal{S}$$

$$Q^{\pi}(s,a)=\mathbb{E}_{\tau_k|s_k=s,a_k=a,\pi}\Big[\sum_{i=0}^{t-1}\gamma^{i}r_{k+i}+\gamma^{t}V^{\pi}(s_{k+t})\Big],\quad s\in\mathcal{S},\ a\in\mathcal{A}$$
The proof is similar to the foregoing proofs and is omitted.

Corollary.

$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big[R(s,a,s')+\gamma V^{\pi}(s')\big],\quad s\in\mathcal{S},\ a\in\mathcal{A}$$
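This corollary translates directly into code: given $V^{\pi}$, one sweep over transitions yields $Q^{\pi}$. The helper below is a sketch with illustrative names; the $V$ values plugged in are placeholders, not a real $V^{\pi}$.

```python
def q_from_v(V, T, R, gamma, states, actions):
    """Q^pi(s,a) = sum_s' T(s,a,s') * (R(s,a,s') + gamma * V^pi(s'))."""
    return {(s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
            for s in states for a in actions}

# Tiny usage with the made-up two-state MDP and placeholder V values:
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "move"): {"s0": 0.7, "s1": 0.3}}
R = lambda s, a, s2: 1.0 if s2 == "s1" else 0.0
V = {"s0": 8.0, "s1": 9.0}   # placeholder values for illustration only
print(q_from_v(V, T, R, gamma, states, actions))
```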

Policy Criterion

A policy $\pi$ is said to be not worse than a policy $\pi'$ (written $\pi\geq\pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in\mathcal{S}$.

A policy $\pi$ is said to be strictly better than a policy $\pi'$ (written $\pi>\pi'$) if $V^{\pi}(s)\geq V^{\pi'}(s)$ for all $s\in\mathcal{S}$ and there exists a state $s_0$ such that $V^{\pi}(s_0)>V^{\pi'}(s_0)$.

Policy Improvement Theorem

Theorem. For any policies $\pi$ and $\tilde\pi$ satisfying
$$V^{\pi}(s)\leq\sum_{a}\tilde\pi(a|s)\,Q^{\pi}(s,a),\quad s\in\mathcal{S},$$
it holds that $\tilde\pi\geq\pi$.

Proof.

$$V^{\pi}(s)=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\big[V^{\pi}(s_0)\big]$$

$$
\begin{aligned}
\mathbb{E}_{s_0|s_0=s,\tilde\pi}\big[V^{\pi}(s_0)\big]&=\mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\pi(a|s_0)\,Q^{\pi}(s_0,a)\Big]\\
&\leq\mathbb{E}_{s_0|s_0=s,\tilde\pi}\Big[\sum_{a}\tilde\pi(a|s_0)\,Q^{\pi}(s_0,a)\Big]\\
&=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\big[Q^{\pi}(s_0,a_0)\big]\\
&=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\Big[\sum_{s'}T(s_0,a_0,s')\big(R(s_0,a_0,s')+\gamma V^{\pi}(s')\big)\Big]\\
&=\mathbb{E}_{(s_0,a_0)|s_0=s,\tilde\pi}\,\mathbb{E}_{s_1|(s_0,a_0),s_0=s,\tilde\pi}\big[r_0+\gamma V^{\pi}(s_1)\big]\\
&=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}[r_0]+\gamma\,\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\tilde\pi}\big[V^{\pi}(s_1)\big]
\end{aligned}
$$
For i = 0 , 1 , . . . i=0,1,... i=0,1,..., it can be derived in a similar way that

$$\mathbb{E}_{(s_0,\dots,s_i)|s_0=s,\tilde\pi}\big[V^{\pi}(s_i)\big]\leq\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}[r_i]+\gamma\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]$$

Then it is easily derived that for $i=0,1,\dots$,
$$
\begin{aligned}
V^{\pi}(s)&\leq\sum_{l=0}^{i}\mathbb{E}_{(s_0,\dots,s_{l+1})|s_0=s,\tilde\pi}\big[\gamma^{l}r_{l}\big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]\\
&=\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\Big[\sum_{l=0}^{i}\gamma^{l}r_{l}\Big]+\gamma^{i+1}\,\mathbb{E}_{(s_0,\dots,s_{i+1})|s_0=s,\tilde\pi}\big[V^{\pi}(s_{i+1})\big]
\end{aligned}
$$
Letting $i\rightarrow\infty$ gives $V^{\pi}(s)\leq\mathbb{E}_{\tau|s_0=s,\tilde\pi}\big[\sum_{l=0}^{\infty}\gamma^{l}r_{l}\big]=V^{\tilde\pi}(s)$.

Therefore, $\tilde\pi\geq\pi$.

Corollary.

For any policy $\pi$, let $\tilde\pi$ be the greedy policy with respect to $Q^{\pi}$:
$$\tilde\pi(a|s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\\ 0, & a\neq\mathop{\mathrm{argmax}}\limits_{a'}Q^{\pi}(s,a')\end{cases},\quad s\in\mathcal{S}.$$
Then it holds that $\tilde\pi\geq\pi$.
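This corollary is exactly the greedy improvement step used later in policy iteration. A minimal sketch, assuming $Q^{\pi}$ is available as a dictionary keyed by (state, action):

```python
def greedy_policy(Q, states, actions):
    """Deterministic policy that picks argmax_a Q(s, a) in every state."""
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}

# Tiny usage with made-up Q values:
states, actions = ["s0", "s1"], ["stay", "move"]
Q = {("s0", "stay"): 1.0, ("s0", "move"): 2.0,
     ("s1", "stay"): 3.0, ("s1", "move"): 0.5}
print(greedy_policy(Q, states, actions))   # {'s0': 'move', 's1': 'stay'}
```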

Optimal Policy

A policy $\pi^{*}$ is called an optimal policy if $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $s\in\mathcal{S}$ and every policy $\pi$. (In the following content, $V^{\pi^{*}}(s)$ and $Q^{\pi^{*}}(s,a)$ are written $V^{*}(s)$ and $Q^{*}(s,a)$.)

The relation between $V^{*}$ and $Q^{*}$ is given as follows:
$$V^{*}(s)=\max_{a\in\mathcal{A}}Q^{*}(s,a),\quad s\in\mathcal{S}$$

An optimal policy is given by:
$$\pi^{*}(a|s)=\begin{cases}1, & a=\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\\ 0, & a\neq\mathop{\mathrm{argmax}}\limits_{a'}Q^{*}(s,a')\end{cases},\quad s\in\mathcal{S}$$
Proof.

If this did not hold, there would exist a strictly better policy by the corollary of the policy improvement theorem, a contradiction.

Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state.

Reinforcement Learning

Reinforcement Learning Algorithm: Algorithms that seek the optimal policy for MDPs are collectively known as reinforcement learning algorithms. We will introduce several reinforcement learning algorithms in the following content.

Fundamental RL Algorithms: Policy Iteration

We now introduce an RL algorithm called policy iteration that, in each iteration, produces a policy at least as good as the previous one and ultimately converges to an optimal policy. It applies to finite MDPs (MDPs with finite state and action spaces).

Recall that for a deterministic policy $\pi$,
$$Q^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma Q^{\pi}(s',\pi(s'))\big)$$
This gives an iterative way to update an estimator of the $Q$ function:
$$\hat{Q}^{\pi}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{Q}^{\pi}(s',\pi(s'))\big)$$

The Whole Procedure

Require: initial policy $\pi$
Construct a function $\hat{Q}^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ arbitrarily as an estimator of $Q^{\pi}$.
repeat // policy evaluation
$\quad$ $\Delta=0$
$\quad$ for $s\in\mathcal{S}$, $a\in\mathcal{A}$ do
$\quad\quad$ $q=\hat{Q}^{\pi}(s,a)$
$\quad\quad$ $\hat{Q}^{\pi}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{Q}^{\pi}(s',\pi(s'))\big)$
$\quad\quad$ $\Delta=\max(\Delta,\,|q-\hat{Q}^{\pi}(s,a)|)$
until $\Delta<\sigma$
policy-stable = true
for $s\in\mathcal{S}$ do // policy improvement
$\quad$ $b=\pi(s)$
$\quad$ $\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}\hat{Q}^{\pi}(s,a)$
$\quad$ if $b\neq\pi(s)$ then policy-stable = false
if policy-stable == false then go to policy evaluation else return $\pi$.
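Below is a runnable sketch of the above procedure on the made-up two-state MDP. It is dictionary-based for readability; `sigma` plays the role of the tolerance $\sigma$, and the initial policy and $\hat{Q}$ are arbitrary, as required.

```python
def policy_iteration(states, actions, T, R, gamma, sigma=1e-8):
    """Policy iteration with Q-based evaluation, following the procedure above."""
    pi = {s: actions[0] for s in states}                 # arbitrary initial policy
    Q = {(s, a): 0.0 for s in states for a in actions}   # arbitrary initial estimator
    while True:
        # Policy evaluation: iterate the Bellman equation for Q^pi until convergence.
        while True:
            delta = 0.0
            for s in states:
                for a in actions:
                    q_old = Q[(s, a)]
                    Q[(s, a)] = sum(p * (R(s, a, s2) + gamma * Q[(s2, pi[s2])])
                                    for s2, p in T[(s, a)].items())
                    delta = max(delta, abs(q_old - Q[(s, a)]))
            if delta < sigma:
                break
        # Policy improvement: act greedily with respect to the evaluated Q.
        policy_stable = True
        for s in states:
            b = pi[s]
            pi[s] = max(actions, key=lambda a: Q[(s, a)])
            if b != pi[s]:
                policy_stable = False
        if policy_stable:
            return pi, Q

# Usage with the made-up two-state MDP:
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "move"): {"s0": 0.7, "s1": 0.3}}
R = lambda s, a, s2: 1.0 if s2 == "s1" else 0.0
pi_star, Q_hat = policy_iteration(states, actions, T, R, gamma)
print(pi_star)
```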

Policy Evaluation
Update $\hat{Q}^{\pi}(s,a)$ in the aforementioned iterative way based on the Bellman equation.

Policy Improvement

Update $\pi(s)$ greedily:
$$\pi(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{\pi}(s,a)$$

*Note: policy-stable means that the greedy policy no longer changes from one iteration to the next, i.e., $\pi_{k+1}$ is as good as $\pi_{k}$.

Reference: Wiering M., van Otterlo M. (eds.). Reinforcement Learning: State of the Art. Springer, 2012, p. 22.

Convergence Analysis

It can be proved that policy iteration generates a monotonically improving sequence of policies. Let $\pi'$ denote the policy obtained from $\pi$ by the greedy improvement step; then
$$
\begin{aligned}
V^{\pi}(s)=Q^{\pi}(s,\pi(s))&\leq Q^{\pi}(s,\pi'(s))\\
&=\sum_{s'}T(s,\pi'(s),s')\big[R(s,\pi'(s),s')+\gamma V^{\pi}(s')\big]\\
&=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\pi'}\{r_0+\gamma V^{\pi}(s_1)\}\\
&\leq\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\pi'}\{r_0+\gamma Q^{\pi}(s_1,\pi'(s_1))\}\\
&=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\pi'}\Big\{r_0+\gamma\sum_{s'}T(s_1,\pi'(s_1),s')\big[R(s_1,\pi'(s_1),s')+\gamma V^{\pi}(s')\big]\Big\}\\
&=\mathbb{E}_{(s_0,a_0,s_1)|s_0=s,\pi'}\big\{r_0+\gamma\,\mathbb{E}_{(a_1,s_2)|s_1,\pi'}\{r_1+\gamma V^{\pi}(s_2)\}\big\}\\
&=\mathbb{E}_{(s_0,a_0,s_1,a_1,s_2)|s_0=s,\pi'}\{r_0+\gamma r_1+\gamma^{2}V^{\pi}(s_2)\}\\
&\leq\dots\\
&\leq\mathbb{E}_{(s_0,a_0,\dots,s_{n-1},a_{n-1},s_n)|s_0=s,\pi'}\Big\{\sum_{t=0}^{n-1}\gamma^{t}r_{t}+\gamma^{n}V^{\pi}(s_n)\Big\}\\
&\leq\dots
\end{aligned}
$$
Therefore, $V^{\pi}(s)\leq\mathbb{E}_{\tau|s_0=s,\pi'}\big\{\sum_{t=0}^{\infty}\gamma^{t}r_{t}\big\}=V^{\pi'}(s)$ for all $s\in\mathcal{S}$.

Because a finite MDP has only a finite number of deterministic policies, $\pi$ must converge in a finite number of iterations.

Suppose $\pi$ converges to $\pi^{*}$. Then it follows easily from the stopping criterion that
$$\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma V^{*}(s')\big),\quad\forall s\in\mathcal{S}$$
Then the Bellman optimality equation holds.

• Each policy $\pi_{k+1}$ is strictly better than $\pi_{k}$, until the algorithm stops.
  • The policy iteration algorithm completely separates the evaluation and improvement phases.
• Although policy iteration computes the optimal policy for a given MDP in finite time, it is relatively inefficient. In particular the first step, policy evaluation, is computationally expensive.

Fundamental RL Algorithms: Value Iteration

From the previous sections we know that $\pi^{*}(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$. Therefore, once $Q^{*}(s,a)$ is known, $\pi^{*}(s)$ is known as well. In this section we present an algorithm that computes $Q^{*}(s,a)$ iteratively: Value Iteration.

Iterative ways to derive $Q^{*}$

From the relation between $Q^{*}$ and $V^{*}$,
$$Q^{*}(s,a)=\sum_{s'\in\mathcal{S}}T(s,a,s')\big[R(s,a,s')+\gamma V^{*}(s')\big]$$
two iterative schemes for updating an estimator of $Q^{*}$ can be obtained:

(1)
$$\hat{Q}^{*}(s,a)\leftarrow\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{V}^{*}(s')\big)$$

$$\hat{V}^{*}(s)\leftarrow\max_{a}\hat{Q}^{*}(s,a)$$
(2)
$$\hat{Q}^{*}(s,a)\leftarrow\sum_{s'}T(s,a,s')\Big(R(s,a,s')+\gamma\max_{a'}\hat{Q}^{*}(s',a')\Big)$$

Given state $s$, action $a$, reward $r$ and next state $s'$, it is possible to approximate $Q^{*}(s,a)$ by iteratively solving the Bellman recurrence equation
$$Q_{i+1}(s,a)=\mathbb{E}_{s'}\big[r+\gamma\max_{a'}Q_{i}(s',a')\big]$$
Quoted from: Multiagent Cooperation and Competition with Deep Reinforcement Learning.

The value iteration algorithm computes $Q^{*}$ using these two iterative schemes.

The Whole Procedure

Version 1:

Construct a function $\hat{Q}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ arbitrarily as the estimate of $Q^{*}$. Construct a function $\hat{V}:\mathcal{S}\rightarrow\mathbb{R}$ arbitrarily as the estimate of $V^{*}$.
repeat
$\quad$ $\Delta=0$
$\quad$ for $s\in\mathcal{S}$ do
$\quad\quad$ $v=\hat{V}(s)$
$\quad\quad$ for $a\in\mathcal{A}$ do
$\quad\quad\quad$ $\hat{Q}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\hat{V}(s')\big)$ // update $\hat{Q}(s,a)$ using scheme (1)
$\quad\quad$ $\hat{V}(s)=\max_{a}\hat{Q}(s,a)$
$\quad\quad$ $\Delta=\max(\Delta,\,|v-\hat{V}(s)|)$
until $\Delta<\sigma$

Version 2:

Construct a function $\hat{Q}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ arbitrarily as the estimate of $Q^{*}$.
repeat
$\quad$ $\Delta=0$
$\quad$ for $s\in\mathcal{S}$, $a\in\mathcal{A}$ do
$\quad\quad$ $q=\hat{Q}(s,a)$
$\quad\quad$ $\hat{Q}(s,a)=\sum_{s'}T(s,a,s')\big(R(s,a,s')+\gamma\max_{a'}\hat{Q}(s',a')\big)$ // update $\hat{Q}(s,a)$ using scheme (2)
$\quad\quad$ $\Delta=\max(\Delta,\,|q-\hat{Q}(s,a)|)$
until $\Delta<\sigma$
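A runnable sketch of Version 2 (the $Q$-only update), again on the made-up two-state MDP; after $\hat{Q}$ converges, the greedy policy $\mathrm{argmax}_a\hat{Q}(s,a)$ is read off as $\pi^{*}$.

```python
def value_iteration(states, actions, T, R, gamma, sigma=1e-8):
    """Value iteration on Q: Q(s,a) <- sum_s' T(s,a,s') (R(s,a,s') + gamma max_a' Q(s',a'))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                q_old = Q[(s, a)]
                Q[(s, a)] = sum(p * (R(s, a, s2) + gamma * max(Q[(s2, a2)] for a2 in actions))
                                for s2, p in T[(s, a)].items())
                delta = max(delta, abs(q_old - Q[(s, a)]))
        if delta < sigma:
            break
    # Extract the greedy (optimal) policy pi*(s) = argmax_a Q*(s, a).
    pi = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return pi, Q

# Usage with the made-up two-state MDP:
states, actions, gamma = ["s0", "s1"], ["stay", "move"], 0.9
T = {("s0", "stay"): {"s0": 0.9, "s1": 0.1}, ("s0", "move"): {"s0": 0.2, "s1": 0.8},
     ("s1", "stay"): {"s0": 0.1, "s1": 0.9}, ("s1", "move"): {"s0": 0.7, "s1": 0.3}}
R = lambda s, a, s2: 1.0 if s2 == "s1" else 0.0
print(value_iteration(states, actions, T, R, gamma)[0])
```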

Convergence Analysis

$V_0\rightarrow Q_1\rightarrow V_1\rightarrow Q_2\rightarrow V_2\rightarrow Q_3\rightarrow V_3\rightarrow\dots\rightarrow V^{*}$. Value iteration is guaranteed to converge in the limit to $V^{*}$, and then $\pi^{*}(s)=\pi_{\mathrm{greedy}}(V^{*})(s)=\mathop{\mathrm{argmax}}\limits_{a}Q^{*}(s,a)$.

Reference: Wiering M., van Otterlo M. (eds.). Reinforcement Learning: State of the Art. Springer, 2012, p. 22.

Last revised: March 3, 2025.
