Reference link:
Sutton & Barto Book: Reinforcement Learning: An Introduction
Finite Markov Decision Process
Definition
The agent observes the situation of the environment, for example through data streams or sensors (cameras, lidars, and others); this observation is called the state. We must emphasize that we presume the agent always knows enough information about the situation to make its decisions. So at time step $t$ the agent observes the state $S_t \in \mathcal{S}$.
After the agent knows the current state, it has a finite set of actions to choose from ($A_t \in \mathcal{A}(s)$). After taking an action, the agent obtains the reward for this step ($R_{t+1} \in \mathcal{R}$) and moves into the next state ($S_{t+1}$); in this process, the agent can know the environment's dynamics ($p(s', r \mid s, a)$). The agent then continues dealing with each new state until the scenario ends.
The environment's dynamics are not decided by us, but the policy of which action to take depends on the agent's judgment. In every state $s$, we want to choose the actions that give us more total reward, not just in the short run but also in the long run. Therefore, the policy for choosing actions in each state is the core of reinforcement learning. We use $\pi(a \mid s)$ to describe the probability of taking each action in the current state.
Therefore, a finite Markov decision process is a process in which the agent knows the current state $s$, the actions available to choose from, and even the probability of each next state $s'$ and reward $r$ for each action ($p(s', r \mid s, a)$), and obtains the expected return under different policies.
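
To make this concrete, here is a minimal Python sketch of such a finite MDP. The two states, two actions, and all numbers are hypothetical, chosen only to show how the dynamics $p(s', r \mid s, a)$ and the policy $\pi(a \mid s)$ fit together as tables:

```python
# A hypothetical 2-state, 2-action finite MDP (toy numbers, not from the book).
# The dynamics p(s', r | s, a) are stored as a table:
# each (state, action) maps to a list of (next_state, reward, probability).
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

# A policy pi(a | s): the probability of taking each action in each state.
pi = {
    "s0": {"stay": 0.5, "go": 0.5},
    "s1": {"stay": 0.9, "go": 0.1},
}

# Sanity check: for every (s, a), the outcome probabilities sum to 1.
for sa, outcomes in P.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-12, sa
```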

Formula (Bellman equation)
Mathematically, we can calculate the value function $v_\pi(s)$, which satisfies the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big]$$
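
As a hedged illustration of how this equation is used, here is a minimal sketch of iterative policy evaluation on the hypothetical toy MDP above, with an assumed discount factor $\gamma = 0.9$: applying the Bellman equation repeatedly as an update rule converges to $v_\pi$ as its fixed point.

```python
# Iterative policy evaluation: apply the Bellman equation as an update rule
# until v stops changing. Toy tables repeated so the block runs on its own.
GAMMA = 0.9
P = {("s0", "stay"): [("s0", 0.0, 1.0)],
     ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
     ("s1", "stay"): [("s1", 2.0, 1.0)],
     ("s1", "go"):   [("s0", 0.0, 1.0)]}
pi = {"s0": {"stay": 0.5, "go": 0.5},
      "s1": {"stay": 0.9, "go": 0.1}}

v = {s: 0.0 for s in pi}  # initial guess v(s) = 0
for _ in range(500):
    # v(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
    v = {s: sum(pa * sum(p * (r + GAMMA * v[s2]) for s2, r, p in P[(s, a)])
                for a, pa in pi[s].items())
         for s in pi}
print(v)  # approximate v_pi for each state
```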
Consideration
For every scenario, we know the dynamics of the environment $p(s', r \mid s, a)$, the state set $\mathcal{S}$, and the corresponding action sets $\mathcal{A}(s)$. For every policy we set, we know $\pi(a \mid s)$. So we obtain $N$ equations, one per state, for the $N$ unknowns $v_\pi(s)$.
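
Because these $N$ equations are linear in the $N$ unknowns $v_\pi(s)$, a small MDP can be solved exactly. A minimal sketch with NumPy, again on the assumed toy MDP: build the policy-averaged transition matrix $P_\pi$ and expected reward vector $r_\pi$, then solve $(I - \gamma P_\pi)\, v = r_\pi$.

```python
import numpy as np

GAMMA = 0.9
P = {("s0", "stay"): [("s0", 0.0, 1.0)],
     ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
     ("s1", "stay"): [("s1", 2.0, 1.0)],
     ("s1", "go"):   [("s0", 0.0, 1.0)]}
pi = {"s0": {"stay": 0.5, "go": 0.5},
      "s1": {"stay": 0.9, "go": 0.1}}

states = sorted(pi)
idx = {s: i for i, s in enumerate(states)}
n = len(states)
P_pi = np.zeros((n, n))  # P_pi[s, s'] = sum_a pi(a|s) * p(s'|s, a)
r_pi = np.zeros(n)       # r_pi[s]    = expected one-step reward under pi
for s in states:
    for a, pa in pi[s].items():
        for s2, r, p in P[(s, a)]:
            P_pi[idx[s], idx[s2]] += pa * p
            r_pi[idx[s]] += pa * p * r

# Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
v = np.linalg.solve(np.eye(n) - GAMMA * P_pi, r_pi)
print(dict(zip(states, v)))  # exact v_pi, matching the iterative result above
```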
Limitation
- Often, we cannot know the dynamics of the environment.
- Often, as in backgammon, there are too many states, so we have no capacity to compute the problem this way (by solving the equations).
- The problems must have the Markov property, which means $S_{t+1}$ and $R_{t+1}$ depend only on the current state $s$ and action $a$. In other words, $s$ and $a$ determine all possible $s'$ and $r$.
Optimal policies
Definition
For policy $\pi$ and policy $\pi'$, if for each state $s$ the inequality $v_\pi(s) \ge v_{\pi'}(s)$ is fulfilled, then we can say $\pi$ is better than (or equal to) $\pi'$.
Therefore, there is always at least one policy that is better than or equal to all the others; this is the optimal policy $\pi_*$ (and there may be more than one).
Meanwhile, the value function of the optimal policy also satisfies the Bellman equation at every state.
Bellman optimality equation
For $v_*(s) = \max_\pi v_\pi(s)$, where the maximum is taken over all $\pi$ in the total policy set, the Bellman optimality equation is:

$$v_*(s) = \max_a q_*(s, a) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_*(s') \,\big]$$
For a specific case, the environment's dynamics are fixed; we can only change the apportionment of $\pi(a \mid s)$. To obtain the maximum $v(s)$, we should apportion probability 1 to the action with the maximum $q(s, a)$.
Therefore, the optimal policy is actually a greedy policy, without any exploration.
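
A minimal sketch of this greedy structure, still on the assumed toy MDP: value iteration applies the Bellman optimality backup $v(s) \leftarrow \max_a q(s, a)$ until convergence, then the optimal policy is read off by putting probability 1 on the argmax action.

```python
GAMMA = 0.9
P = {("s0", "stay"): [("s0", 0.0, 1.0)],
     ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
     ("s1", "stay"): [("s1", 2.0, 1.0)],
     ("s1", "go"):   [("s0", 0.0, 1.0)]}
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"]}

def q(s, a, v):
    # q(s, a) = sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))
    return sum(p * (r + GAMMA * v[s2]) for s2, r, p in P[(s, a)])

v = {s: 0.0 for s in actions}
for _ in range(500):
    # Bellman optimality backup: v(s) = max_a q(s, a)
    v = {s: max(q(s, a, v) for a in actions[s]) for s in actions}

# Greedy policy: probability 1 on the action with the largest q(s, a).
pi_star = {s: max(actions[s], key=lambda a: q(s, a, v)) for s in actions}
print(v, pi_star)
```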
A finite Markov decision process (FMDP) is a decision-theoretic model that describes how an agent learns an optimal policy in an environment by observing states, choosing actions, and receiving rewards. The process involves the Bellman equation, used to compute the value function under different policies. In practice, however, we may face problems such as unknown environment dynamics or too many states to enumerate. An optimal policy is one that maximizes the total reward in the long run and satisfies the Bellman optimality equation. Finding and realizing such an optimal policy is the core challenge of reinforcement learning.