Reinforcement Learning: Markov Decision Processes

All content comes from: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html



S--->a---r--->S'--->a'---r'--->S''  (the reward is received only after the action is taken; note the ordering)

There are also two key equations: the Bellman Expectation Equation and the Bellman Optimality Equation.





Classical RL takes the MDP as its object of study. The underlying assumption is that the environment is fully observable: the current state completely characterises the process.

Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs



The Markov property:

A state S_t is Markov if and only if: P[S_t+1 | S_t] = P[S_t+1 | S_1, ..., S_t]






A Markov Process (or Markov Chain) is a tuple <S, P>
S is a (finite) set of states
P is a state transition probability matrix, P_ss' = P[S_t+1 = s' | S_t = s]
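
As a concrete illustration of the Markov property, here is a minimal sketch of sampling a trajectory from a small Markov chain; the three states and the transition matrix P are made up for this example.

```python
import numpy as np

# Hypothetical 3-state Markov chain <S, P> for illustration only.
states = ["Sunny", "Rainy", "Cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # transition probabilities from "Sunny"
    [0.3, 0.5, 0.2],   # from "Rainy"
    [0.4, 0.3, 0.3],   # from "Cloudy"
])

rng = np.random.default_rng(0)
s = 0  # start in "Sunny"
trajectory = [states[s]]
for _ in range(5):
    # The next state depends only on the current state s (Markov property).
    s = rng.choice(len(states), p=P[s])
    trajectory.append(states[s])
print(" -> ".join(trajectory))
```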


A Markov reward process is a Markov chain with values. 

A Markov Reward Process is a tuple <S, P, R, γ>
S is a finite set of states
P is a state transition probability matrix, P_ss' = P[S_t+1 = s' | S_t = s]
R is a reward function, R_s = E[R_t+1 | S_t = s]
γ is a discount factor, γ ∈ [0, 1]



The return G_t is the total discounted reward from time-step t:

G_t = R_t+1 + γ R_t+2 + γ^2 R_t+3 + ... = Σ_{k=0..∞} γ^k R_t+k+1
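
A minimal numeric sketch of how the return is accumulated, assuming a short made-up reward sequence and γ = 0.9:

```python
# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... and discount factor,
# chosen only to illustrate the return G_t = sum_k gamma^k * R_{t+k+1}.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]

G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*(-1.0) = 1.891
```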





The state value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[G_t | S_t = s]


Bellman Equation for MRPs:

The value function can be decomposed into two parts:
the immediate reward R_t+1
the discounted value of the successor state, γ v(S_t+1)

v(s) = E[R_t+1 + γ v(S_t+1) | S_t = s] = R_s + γ Σ_s' P_ss' v(s')

[Frequently used in computation]







From a computational point of view, the value function of an MRP can be written in matrix form:

v = R + γPv, which can be solved directly: v = (I - γP)^(-1) R




Computational complexity is O(n^3) for n states, so the direct solution is only feasible for small MRPs (a small numeric sketch follows the list below).
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning

(Dynamic programming, Monte-Carlo evaluation, and temporal-difference learning are the standard methods for computing the value function of a Markov reward process.)
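
Here is a minimal sketch of the direct solution v = (I - γP)^(-1) R for a small MRP; the 3-state transition matrix P, the reward vector R, and γ are made up for illustration.

```python
import numpy as np

# Hypothetical 3-state MRP <S, P, R, gamma> for illustration only.
gamma = 0.9
P = np.array([
    [0.5, 0.5, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.0, 1.0],   # absorbing state
])
R = np.array([1.0, -0.5, 0.0])

# Solve (I - gamma*P) v = R; this linear solve is the O(n^3) direct method.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```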



A Markov Decision Process (MDP) is a Markov reward process with decisions (actions). It is an environment in which all states are Markov.

A Markov Decision Process is a tuple <S, A, P, R, γ>, where A is a finite set of actions, P^a_ss' = P[S_t+1 = s' | S_t = s, A_t = a], and R^a_s = E[R_t+1 | S_t = s, A_t = a].






A policy π is a distribution over actions given states:
π(a|s) = P[A_t = a | S_t = s]
MDP policies depend only on the current state (not the history); policies are stationary (time-independent).

[Frequently used in computation]

For the MRP induced by fixing a policy π, the value function can still be computed with a linear system:

P^π_ss' = Σ_a π(a|s) P^a_ss'   and   R^π_s = Σ_a π(a|s) R^a_s
===> v_π = R^π + γ P^π v_π, so v_π = (I - γP^π)^(-1) R^π
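
A minimal sketch of this reduction, assuming a small made-up 2-state, 2-action MDP (P_a, R_a) and a made-up policy pi; all names and numbers here are hypothetical.

```python
import numpy as np

# Hypothetical MDP dynamics: P_a[a, s, s'] = P[s' | s, a], R_a[s, a] = E[R | s, a].
gamma = 0.9
P_a = np.array([
    [[0.9, 0.1], [0.0, 1.0]],   # transitions under action 0
    [[0.2, 0.8], [0.5, 0.5]],   # transitions under action 1
])
R_a = np.array([[1.0,  0.0],
                [2.0, -1.0]])
pi = np.array([[0.5, 0.5],      # pi(a|s) for state 0
               [0.1, 0.9]])     # pi(a|s) for state 1

# Average the dynamics over the policy to obtain the induced MRP <S, P_pi, R_pi, gamma>.
P_pi = np.einsum("sa,ast->st", pi, P_a)   # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = np.einsum("sa,sa->s", pi, R_a)     # R^pi_s    = sum_a pi(a|s) R^a_s

# Solve the linear system v_pi = (I - gamma*P_pi)^(-1) R_pi.
v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(v_pi)
```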





Bellman Expectation Equation:

The state-value function can again be decomposed into the immediate reward plus the discounted value of the successor state:
v_π(s) = E_π[R_t+1 + γ v_π(S_t+1) | S_t = s]

The action-value function can be decomposed in the same way:
q_π(s, a) = E_π[R_t+1 + γ q_π(S_t+1, A_t+1) | S_t = s, A_t = a]

[Frequently used in computation]
v_π(s) = Σ_a π(a|s) q_π(s, a)

[Frequently used in computation]
q_π(s, a) = R^a_s + γ Σ_s' P^a_ss' v_π(s')
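
As a sketch of how these backups are used in practice, the function below performs iterative policy evaluation (the dynamic-programming approach) by repeatedly applying the Bellman expectation backup; the P_a / R_a / pi array layout is the same hypothetical one as in the previous sketch.

```python
import numpy as np

def policy_evaluation(P_a, R_a, pi, gamma, tol=1e-8):
    """Iteratively apply v(s) <- sum_a pi(a|s) [R^a_s + gamma * sum_s' P^a_{ss'} v(s')]."""
    v = np.zeros(R_a.shape[0])
    while True:
        # q(s, a) = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        q = R_a + gamma * np.einsum("ast,t->sa", P_a, v)
        # v(s) = sum_a pi(a|s) q(s, a)
        v_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Example: policy_evaluation(P_a, R_a, pi, gamma=0.9) converges to the same v_pi
# as the direct linear solve above.
```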















The optimal action-value function is q*(s, a) = max_π q_π(s, a). Once we know q*(s, a), we immediately have an optimal policy; an optimal policy can be found by maximising over q*(s, a):

π*(a|s) = 1 if a = argmax_{a ∈ A} q*(s, a), and 0 otherwise


There is always a deterministic optimal policy for any MDP
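
A minimal sketch of reading off such a deterministic policy from a made-up q* table:

```python
import numpy as np

# Hypothetical q*(s, a) table: 2 states x 2 actions, values made up for illustration.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1]])

# A deterministic optimal policy picks the greedy action in every state.
pi_star = np.argmax(q_star, axis=1)
print(pi_star)  # [1 0]
```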






Bellman Optimality Equation:

v*(s) = max_a q*(s, a)
q*(s, a) = R^a_s + γ Σ_s' P^a_ss' v*(s')
===> v*(s) = max_a [ R^a_s + γ Σ_s' P^a_ss' v*(s') ]









Solving the Bellman Optimality Equation:

The Bellman Optimality Equation is non-linear and has no closed-form solution in general, so iterative solution methods are used (a value-iteration sketch follows this list), e.g.
Value Iteration
Policy Iteration
Q-learning
Sarsa
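
A minimal sketch of one such method, value iteration, assuming the same hypothetical P_a / R_a layout as in the earlier sketches; it repeatedly applies the Bellman optimality backup until the value function stops changing.

```python
import numpy as np

def value_iteration(P_a, R_a, gamma, tol=1e-8):
    """Repeatedly apply v(s) <- max_a [R^a_s + gamma * sum_s' P^a_{ss'} v(s')]."""
    v = np.zeros(R_a.shape[0])
    while True:
        # q(s, a) = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
        q = R_a + gamma * np.einsum("ast,t->sa", P_a, v)
        v_new = q.max(axis=1)                  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)     # approximate v* and a greedy policy
        v = v_new

# Example: v_star, pi_star = value_iteration(P_a, R_a, gamma=0.9)
```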


