All content comes from: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
S--->a---r--->S'--->a'---r'--->S'' (the reward is only received after the action is taken; note the order)
There are also two key equations: the Bellman Expectation Equation and the Bellman Optimality Equation.
Classical RL takes the MDP as its object of study. The underlying assumption is that the environment is fully observable (the current state completely characterizes the process).
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
The Markov property:
A state S_t is Markov if and only if: P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
Markov process (Markov chain): A Markov Process (or Markov Chain) is a tuple <S, P>
S is a (finite) set of states
P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]
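As a concrete illustration, here is a minimal sketch of sampling a trajectory from such a chain; the two-state transition matrix below is a made-up toy example, not from the lecture:

```python
import numpy as np

# Toy two-state Markov chain (assumed for illustration).
# P[s, s'] = P[S_{t+1} = s' | S_t = s]; each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def sample_chain(P, s0, steps, seed=0):
    """Sample a state trajectory of the given length from P."""
    rng = np.random.default_rng(seed)
    traj = [s0]
    for _ in range(steps):
        traj.append(rng.choice(len(P), p=P[traj[-1]]))  # next state ~ P[s, .]
    return traj

print(sample_chain(P, s0=0, steps=10))
```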
A Markov reward process is a Markov chain with values.
A Markov Reward Process is a tuple <S, P, R, γ>
S is a finite set of states
P is a state transition probability matrix, P_{ss'} = P[S_{t+1} = s' | S_t = s]
R is a reward function, R_s = E[R_{t+1} | S_t = s]
γ is a discount factor, γ ∈ [0, 1]
The return G_t is the total discounted reward from time-step t:
G_t = R_{t+1} + γR_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
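For example, a short sketch of computing the return from a sampled reward sequence (the reward list is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ..., folded from the end of the episode."""
    g = 0.0
    for r in reversed(rewards):  # G = r + gamma * (return of the tail)
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```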
The state-value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[G_t | S_t = s]
Bellman Equation for MRPs :
The value function can be decomposed into two parts:
immediate reward R_{t+1}
discounted value of successor state γv(S_{t+1})
v(s) = E[R_{t+1} + γv(S_{t+1}) | S_t = s]
[Commonly used in computation]
From a computational perspective, the value function can be written in matrix form as v = R + γPv, which gives the direct solution v = (I − γP)^{-1}R.
Computational complexity is O(n³) for n states, so the direct solution is only possible for small MRPs.
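A minimal sketch of that direct solution, reusing a toy P with made-up rewards; np.linalg.solve does the O(n³) work without forming the inverse explicitly:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # toy transition matrix (made up)
R = np.array([1.0, -1.0])    # toy reward vector R_s (made up)
gamma = 0.9

# v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)
```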
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
(Dynamic programming, Monte-Carlo evaluation, and temporal-difference learning are the common iterative methods for computing the value function of a Markov reward process.)
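As a sketch of the dynamic-programming alternative for large MRPs, one can iterate the Bellman equation as a fixed-point update (a contraction for γ < 1) until it converges:

```python
import numpy as np

def evaluate_mrp(P, R, gamma, tol=1e-8):
    """Iterate v <- R + gamma * P @ v until the change falls below tol."""
    v = np.zeros(len(P))
    while True:
        v_new = R + gamma * P @ v   # one Bellman backup over all states
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```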
A Markov decision process (MDP) is a Markov reward process with decisions (actions). It is an environment in which all states are Markov.
A policy π is a distribution over actions given states:
π(a|s) = P[A_t = a | S_t = s]
MDP policies depend on the current state (not the history); policies are stationary (time-independent).
[Commonly used in computation]
For the MRP induced by a policy π, the value function can still be computed from a linear system: averaging over π gives P^π_{ss'} = Σ_a π(a|s) P^a_{ss'} and R^π_s = Σ_a π(a|s) R^a_s, so that
v_π = R^π + γP^π v_π, i.e. v_π = (I − γP^π)^{-1} R^π
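A sketch of that reduction, under assumed array shapes P[a, s, s'] for the transition tensor, R[s, a] for rewards, and pi[s, a] for the policy (these shapes are my convention, not the lecture's):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Average dynamics and rewards over pi to get the induced MRP."""
    P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    R_pi = (pi * R).sum(axis=1)             # R^pi_s = sum_a pi(a|s) R^a_s
    return P_pi, R_pi

def evaluate_policy(P, R, pi, gamma):
    """Solve v_pi = (I - gamma * P_pi)^{-1} R_pi for the induced MRP."""
    P_pi, R_pi = mdp_to_mrp(P, R, pi)
    n = R.shape[0]  # number of states
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
```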
Bellman Expectation Equation:
[Commonly used in computation]
v_π(s) = Σ_a π(a|s) ( R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') )
q_π(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s', a')
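These two equations relate v_π and q_π to one another; for instance, a small sketch (same assumed array shapes as above) of recovering q_π from v_π:

```python
import numpy as np

def q_from_v(P, R, v, gamma):
    """q_pi(s,a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')."""
    return R + gamma * np.einsum('ast,t->sa', P, v)
```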
When we know q∗(s, a), we immediately have an optimal policy. An optimal policy can be found by maximising over q∗(s, a):
π∗(a|s) = 1 if a = argmax_{a∈A} q∗(s, a), and 0 otherwise
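A tiny sketch of that greedy extraction from a q∗ table of shape [num_states, num_actions] (shape assumed):

```python
import numpy as np

def greedy_policy(q_star):
    """pi*(s) = argmax_a q*(s, a); deterministic, one action per state."""
    return np.argmax(q_star, axis=1)
```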
There is always a deterministic optimal policy for any MDP
Bellman Optimality Equation:
v∗(s) = max_a ( R^a_s + γ Σ_{s'} P^a_{ss'} v∗(s') )
q∗(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q∗(s', a')
Solving the Bellman Optimality Equation:
The Bellman Optimality Equation is non-linear and has no closed-form solution (in general), but many iterative solution methods exist, e.g.
Value Iteration (see the sketch after this list)
Policy Iteration
Q-learning
Sarsa
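As an illustration of the first item, here is a minimal value-iteration sketch under the same assumed array shapes (P[a, s, s'], R[s, a]); it repeatedly applies the Bellman optimality backup:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v <- max_a ( R^a_s + gamma * sum_{s'} P^a_{ss'} v(s') )."""
    v = np.zeros(R.shape[0])
    while True:
        q = R + gamma * np.einsum('ast,t->sa', P, v)  # q(s,a) backup
        v_new = q.max(axis=1)                          # max over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, np.argmax(q, axis=1)         # v* and a greedy policy
        v = v_new
```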