Chapter 3 Markov Decision Processes (MDP)

This post covers the basic concepts of reinforcement learning, including Markov processes, Markov decision processes, value functions, and optimal policies, and explains how reinforcement learning lets an agent maximize its long-term return.


This chapter draws on《Reinforcement Learning: An Introduction》and
David Silver's public reinforcement learning course.
Most of the material comes from David Silver's slides; I recommend reading the slides directly. Here I only point out the places where it is easy to make mistakes.


Markov processes are the foundation of reinforcement learning.


Finite Markov Decision Processes

Markov property

A state $S_t$ is Markov if and only if

$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$
  • The state captures all relevant information from the history
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
Markov Process

A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$

  • S is a (finite) set of states
  • P is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1}=s' \mid S_t=s]$

A Markov reward process is a Markov chain with values.
Markov Reward Process

A Markov Reward Process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • S is a (finite) set of states
  • P is a state transition probability matrix, $\mathcal{P}_{ss'} = P[S_{t+1}=s' \mid S_t=s]$
  • R is a reward function, $\mathcal{R}_s = E[R_{t+1} \mid S_t=s]$
  • γ is a discount factor, $\gamma \in [0,1]$

Note the definition of $\mathcal{P}_{ss'}$ here: it is the probability of going from state $s$ to state $s'$.

Later on it is easy to forget the following definition because of its name (return); it is not the same thing as the single reward $R$ above.
Return

The return $G_t$ is the total discounted reward from time-step t.

$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
  • The discount $\gamma \in [0,1]$ is the present value of future rewards
  • The value of receiving reward R after k+1 time-steps is $\gamma^k R$
    • $\gamma$ close to 0 leads to "myopic" evaluation
    • $\gamma$ close to 1 leads to "far-sighted" evaluation
      Many of the methods introduced later are of the far-sighted kind (a small sketch of computing the return follows this list)
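To make the discounting concrete, here is a minimal sketch of computing $G_t$ for a finite episode (the reward sequence and $\gamma$ are made-up values, not from the slides):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    # Work backwards so each step is one multiply-add: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards received after time t, with a fairly far-sighted discount.
print(discounted_return([1.0, 0.0, 2.0, -1.0], gamma=0.9))  # 1.0 + 0.9*0 + 0.81*2 - 0.729*1 = 1.891
```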

Value Function

The state value function v(s) of an MRP is the expected return starting from state s

$v(s) = E[G_t \mid S_t = s]$

It is worth looking at the Bellman equation of the MRP and comparing it with the MDP later. An MRP does not involve actions at all. Since the MDP is the real protagonist of reinforcement learning, I skip the MRP example in David Silver's slides here; dwelling on it can easily cause confusion when you get to the MDP.
A quick look at the Bellman equation:

$v(s) = E[G_t \mid S_t = s] = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$

State transitions in an MRP are not influenced by any action; we will consider the influence of actions later in the MDP.
MRP state transfer
$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$

Looking at the equation above, this is exactly a dynamic-programming computation; once you note that the Bellman equation is also called the dynamic programming equation, the calculation is easy to understand.
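Because the MRP Bellman equation is linear, $v = \mathcal{R} + \gamma \mathcal{P} v$ can also be solved in closed form as $v = (I - \gamma \mathcal{P})^{-1}\mathcal{R}$. A minimal sketch, with a made-up 3-state MRP rather than the student example from the slides:

```python
import numpy as np

# Made-up 3-state MRP: P[s, s'] = P[S_{t+1}=s' | S_t=s]; state 2 is absorbing.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# v = R + gamma * P v  <=>  (I - gamma * P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```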

A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Markov Decision Process

A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$

  • S is a (finite) set of states
  • A is a finite set of actions
  • P is a state transition probability matrix, $\mathcal{P}^{a}_{ss'} = P[S_{t+1}=s' \mid S_t=s, A_t=a]$
  • R is a reward function, $\mathcal{R}^{a}_{s} = E[R_{t+1} \mid S_t=s, A_t=a]$
  • γ is a discount factor, $\gamma \in [0,1]$

Student example for MDP
Note the difference from the MRP above: each black dot in the figure is the intermediate point reached after taking an action, which is later captured by $q(s, a)$; the probability of moving from a black dot to a subsequent state $s'$ is exactly the $\mathcal{P}^{a}_{ss'} = P[S_{t+1}=s' \mid S_t=s, A_t=a]$ defined for the MDP above.

Policy

A policy $\pi$ is a distribution over actions given states,

$\pi(a \mid s) = P[A_t = a \mid S_t = s]$
  • A policy fully defines the behaviour of an agent
  • MDP policies depend on the current state (not the history)
  • i.e. Policies are stationary (time-independent), $A_t \sim \pi(\cdot \mid S_t), \forall t > 0$
  • Given an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ and a policy $\pi$
  • The state sequence $S_1, S_2, \dots$ is a Markov process $\langle \mathcal{S}, \mathcal{P}^{\pi} \rangle$
  • The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $\langle \mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma \rangle$
  • where
    $\mathcal{P}^{\pi}_{s,s'} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{P}^{a}_{ss'}$ and $\mathcal{R}^{\pi}_{s} = \sum_{a \in \mathcal{A}} \pi(a \mid s) \mathcal{R}^{a}_{s}$ (see the sketch below)

Pay special attention to the definition of a policy as a distribution, because in the off-policy methods discussed later, the policy that generates the samples is different from the target policy.
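As a quick illustration of $\mathcal{P}^{\pi}$ and $\mathcal{R}^{\pi}$, the sketch below averages made-up MDP dynamics over a policy; the array shapes P[a, s, s'] and R[s, a] are my own convention, not anything from the slides:

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Made-up MDP dynamics (random, just for shape): P[a, s, s'] and R[s, a].
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)        # rows sum to 1: valid transition probabilities
R = rng.random((n_states, n_actions))    # R^a_s = E[R_{t+1} | S_t=s, A_t=a]
pi = np.full((n_states, n_actions), 0.5) # uniform random policy pi(a | s)

# P^pi_{s,s'} = sum_a pi(a|s) P^a_{s,s'}   and   R^pi_s = sum_a pi(a|s) R^a_s
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

print(P_pi.sum(axis=1))  # each row of the induced MRP still sums to 1
print(R_pi)
```

The resulting $\langle \mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma \rangle$ can then be solved with the same linear-algebra trick as the MRP sketch above.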

Value Function (this one is for the MDP)

The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$

$v_{\pi}(s) = E_{\pi}[G_t \mid S_t = s]$

The action-value function $q_{\pi}(s,a)$ is the expected return
starting from state $s$, taking action $a$, and then following policy $\pi$

$q_{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]$

Bellman Expectation Equation for $V^{\pi}$

$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a)$

Bellman Expectation Equation for $Q^{\pi}$
$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} v_{\pi}(s')$

Bellman Expectation Equation for $v_{\pi}$ (2)
$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\Big(\mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} v_{\pi}(s')\Big)$

Bellman Expectation Equation for $q_{\pi}$ (2)
$q_{\pi}(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_{\pi}(s', a')$
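Turning the combined equation into a fixed-point iteration gives iterative policy evaluation, which is properly introduced in the dynamic programming chapter. The sketch below is only meant to show what the formula computes; the 2-state, 2-action MDP is made up, and the P[a, s, s'] / R[s, a] / pi[s, a] shapes follow the same convention as the earlier sketch:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on the Bellman expectation equation for v_pi.

    P[a, s, s'] = P[S_{t+1}=s' | S_t=s, A_t=a]
    R[s, a]     = E[R_{t+1} | S_t=s, A_t=a]
    pi[s, a]    = pi(a | s)
    """
    v = np.zeros(R.shape[0])
    while True:
        # q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s')
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        # v_pi(s) = sum_a pi(a|s) q_pi(s, a)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny made-up 2-state, 2-action MDP evaluated under a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(policy_evaluation(P, R, pi))
```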

Optimal Value Function

The optimal state-value function $v_*(s)$ is the maximum value function over all policies

$v_*(s) = \max_{\pi} v_{\pi}(s)$

The optimal action-value function $q_*(s,a)$ is the maximum action-value function over all policies

$q_*(s,a) = \max_{\pi} q_{\pi}(s,a)$

Once $q_*$ is known, the problem is essentially solved, which is more convenient than knowing $v_*$. Also note that the maximum above is taken over all policies $\pi$: we pick the $\pi$ that makes $q$ largest. This only defines the optimal policy conceptually; there is no direct way to compute it, and the following chapters introduce various methods that approximate it.

Optimal Policy
Define a partial ordering over policies

$\pi \geq \pi'$ if $v_{\pi}(s) \geq v_{\pi'}(s), \forall s$

Finding an Optimal Policy
An optimal policy can be found by maximising over $q_*(s,a)$,

$$
\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \operatorname*{argmax}_{a \in \mathcal{A}} q_*(s, a) \\
0 & \text{otherwise}
\end{cases}
$$

If we know $q_*(s,a)$, we can immediately read off the optimal policy.
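A minimal sketch of reading off that greedy, deterministic policy from a $q_*(s, a)$ table; the numbers in the table are made up purely for illustration:

```python
import numpy as np

# Made-up q*(s, a) table: rows are states, columns are actions.
q_star = np.array([[1.0,  2.5, 0.3],
                   [0.0, -1.0, 4.0],
                   [2.2,  2.1, 2.0]])

greedy_actions = q_star.argmax(axis=1)   # argmax_a q*(s, a) for each state

# pi*(a|s) as a one-hot distribution: 1 for the greedy action, 0 otherwise.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), greedy_actions] = 1.0
print(greedy_actions)   # [1 2 0]
print(pi_star)
```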

Bellman Expectation Equation for $v_{\pi}$ (Sutton & Barto notation)

$$
\begin{aligned}
v_{\pi}(s) &\doteq E_{\pi}[G_t \mid S_t = s] = E_{\pi}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big] \\
&= \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\big[r + \gamma E_{\pi}[G_{t+1} \mid S_{t+1} = s']\big] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big], \quad \text{for all } s \in \mathcal{S}
\end{aligned}
$$
3.1 The Agent-Environment Interface
  • The learner and decision maker is called the agent.
  • The thing it interacts with, comprising everything outside the agent, is called the environment.

The MDP and the agent together generate a sequence, or trajectory:

$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$

The following function defines the dynamics of the MDP: the agent is in some state $s$, takes action $a$ there, then ends up in state $s'$ and receives reward $r$. This formula is the key to the MDP; everything else can be derived from this four-argument function.

$p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$

The agent-environment interaction in a Markov decision process
for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, and $a \in \mathcal{A}(s)$

which satisfies

$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s)$
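To back up the claim that everything can be derived from the four-argument function, the sketch below stores a made-up $p(s', r \mid s, a)$ as a plain dictionary, checks the normalization above, and recovers the state-transition probabilities $p(s' \mid s, a)$ and the expected reward $r(s, a)$ from it:

```python
from collections import defaultdict

# Made-up dynamics: p[(s, a)] is a list of (s_next, reward, probability) triples.
p = {
    (0, 'stay'): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 'go'):   [(1, 1.0, 0.8), (0, -1.0, 0.2)],
    (1, 'stay'): [(1, 0.0, 1.0)],
    (1, 'go'):   [(0, 5.0, 1.0)],
}

# Normalization: sum over s', r of p(s', r | s, a) must be 1 for every (s, a).
for (s, a), outcomes in p.items():
    assert abs(sum(prob for _, _, prob in outcomes) - 1.0) < 1e-12

def transition_probs(s, a):
    """p(s' | s, a) = sum_r p(s', r | s, a)."""
    probs = defaultdict(float)
    for s_next, _, prob in p[(s, a)]:
        probs[s_next] += prob
    return dict(probs)

def expected_reward(s, a):
    """r(s, a) = sum_{s', r} r * p(s', r | s, a)."""
    return sum(r * prob for _, r, prob in p[(s, a)])

print(transition_probs(0, 'go'))   # {1: 0.8, 0: 0.2}
print(expected_reward(0, 'go'))    # 0.8*1.0 + 0.2*(-1.0) = 0.6
```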
3.2 Goals and Rewards

The goal of the agent is to maximize the total reward it receives.

3.5 Policies and Value Functions

state-value function for policy $\pi$

$$
\begin{aligned}
v_{\pi}(s) &\doteq E_{\pi}[G_t \mid S_t = s] = E_{\pi}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big] \\
&= \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\big[r + \gamma E_{\pi}[G_{t+1} \mid S_{t+1} = s']\big] \\
&= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_{\pi}(s')\big], \quad \text{for all } s \in \mathcal{S}
\end{aligned}
$$

action-value function for policy $\pi$

$q_{\pi}(s, a) \doteq E_{\pi}[G_t \mid S_t = s, A_t = a] = E_{\pi}\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\Big]$

For any policy $\pi$ and any state $s$, the consistency condition derived above holds between the value of $s$ and the values of its possible successor states.

3.6 Optimal Policies and Optimal Value Functions

optimal state-value function

$v_*(s) \doteq \max_{\pi} v_{\pi}(s)$

optimal action-value function

$q_*(s,a) \doteq \max_{\pi} q_{\pi}(s,a)$

Writing $q_*$ in terms of $v_*$:

$q_*(s,a) = E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a]$

Bellman optimality equation

$v_*(s) = \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a)$

Bellman Optimality Equation for $V^*$

$$
\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_{a} E_{\pi_*}[G_t \mid S_t = s, A_t = a] \\
&= \max_{a} E_{\pi_*}[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \max_{a} E[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]
\end{aligned}
$$

Bellman Optimality Equation for $Q^*$

$$
\begin{aligned}
q_*(s, a) &= E\big[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\big|\, S_t = s, A_t = a\big] \\
&= \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q_*(s', a')\big]
\end{aligned}
$$
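The max makes the Bellman optimality equation non-linear, so there is no closed-form solve like the MRP case, but applying it repeatedly as an update rule converges to $v_*$ (this is value iteration, covered later under dynamic programming). A minimal sketch, using the same made-up four-argument dynamics format as the earlier sketch:

```python
import numpy as np

# Made-up MDP in the four-argument form:
# dynamics[(s, a)] is a list of (s_next, reward, probability) triples.
dynamics = {
    (0, 0): [(0, 0.0, 0.7), (1, 1.0, 0.3)],
    (0, 1): [(1, 0.0, 1.0)],
    (1, 0): [(1, 0.0, 0.6), (0, 2.0, 0.4)],
    (1, 1): [(0, -1.0, 1.0)],
}
n_states, n_actions, gamma = 2, 2, 0.9

v = np.zeros(n_states)
for _ in range(1000):
    v_new = np.empty(n_states)
    for s in range(n_states):
        # v*(s) = max_a sum_{s', r} p(s', r | s, a) [r + gamma * v*(s')]
        v_new[s] = max(
            sum(prob * (r + gamma * v[s_next])
                for s_next, r, prob in dynamics[(s, a)])
            for a in range(n_actions)
        )
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print(v)  # approximate v*(s) for each state
```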