1. Terminologies
Reinforcement learning uses many technical terms. To get started with reinforcement learning, you first need to understand them.
1] state and action
· state $s$ (this frame)
· action $a \in \{\text{left}, \text{right}, \text{up}\}$
The entity that performs the action is the agent.
2] policy
policy $\pi$: based on the observed state, the policy makes decisions and controls the agent's motion.
· Policy function $\pi: (s, a) \rightarrow [0, 1]$:
  $\pi(a|s) = P(A=a \mid S=s)$.
· It is the probability of taking action $A=a$ given state $s$, e.g.,
  · $\pi(\text{left} \mid s) = 0.2$,
  · $\pi(\text{right} \mid s) = 0.1$,
  · $\pi(\text{up} \mid s) = 0.7$.
· Upon observing state $S=s$, the agent's action $A$ can be random.
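To make this concrete, here is a minimal Python sketch of a policy as a function that maps a state to a probability distribution over actions, using the example action set and probabilities above (the function names are made up for illustration):

```python
import random

# Hypothetical action set from the example above.
ACTIONS = ["left", "right", "up"]

def policy(state):
    """A toy policy pi(a|s): returns P(A=a | S=s) for each action.

    The probabilities are fixed to the example values (0.2, 0.1, 0.7)
    regardless of the state; a real policy would depend on `state`.
    """
    return {"left": 0.2, "right": 0.1, "up": 0.7}

def sample_action(state):
    """Sample A ~ pi(. | s)."""
    probs = policy(state)
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

# e.g. sample_action("this frame") returns "up" about 70% of the time.
```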
3] reward
reward $R$
· Collect a coin: $R = +1$
· Win the game: $R = +10000$
· Touch a Goomba: $R = -10000$ (game over)
· Nothing happens: $R = 0$
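As a small illustration (the event names are made up; the reward values mirror the examples above), the reward signal can be thought of as a mapping from game events to numbers:

```python
# Hypothetical event-to-reward mapping, mirroring the examples above.
REWARDS = {
    "collect_coin": +1,
    "win_game": +10000,
    "touch_goomba": -10000,   # game over
    "nothing": 0,
}

def reward(event):
    """Return the reward R for a given game event."""
    return REWARDS[event]
```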
4] state transition
old state $\;\overset{\text{action}}{\longrightarrow}\;$ new state
· E.g., the "up" action leads to a new state.
· State transition can be random.
· The randomness comes from the environment.
· $p(s'|s,a) = P(S'=s' \mid S=s, A=a)$.
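A minimal sketch of sampling a new state from a state-transition function $p(s'|s,a)$; the toy states and the 80/20 probabilities below are made up purely for illustration:

```python
import random

def transition_probs(state, action):
    """A toy p(s'|s,a): distribution over next states given (s, a).

    The 80/20 split is made up: the action usually has its intended effect,
    but the environment may inject randomness (e.g. enemy movement).
    """
    return {f"{state}-{action}-succeeded": 0.8,
            f"{state}-{action}-perturbed": 0.2}

def sample_next_state(state, action):
    """Sample S' ~ p(. | s, a); the randomness comes from the environment."""
    probs = transition_probs(state, action)
    next_states = list(probs)
    return random.choices(next_states, weights=[probs[s] for s in next_states])[0]
```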
5] agent-environment interaction
The environment here is the game program, and the agent is Mario. The state $s_t$ is what the environment tells us; in Super Mario, we can take the current frame as the state $s_t$. Upon seeing the state $s_t$, we make an action $a_t$, which can be left, right, or up.

After making the action $a_t$, we get a new state $s_{t+1}$ and a reward $r_t$.
2. Randomness in Reinforcement Learning
1] Actions have randomness
· Given state $s$, the action can be random, e.g.,
  · $\pi(\text{"left"} \mid s) = 0.2$
  · $\pi(\text{"right"} \mid s) = 0.1$
  · $\pi(\text{"up"} \mid s) = 0.7$
Actions are sampled from the policy function.
2] State transitions have randomness
· Given state $S=s$ and action $A=a$, the environment randomly generates a new state $S'$.
The new state is sampled from the state-transition function.
3. Play the game using AI
· Observe a frame (state $s_1$)
· $\Rightarrow$ Make action $a_1$ (left, right, or up)
· $\Rightarrow$ Observe a new frame (state $s_2$) and reward $r_1$
· $\Rightarrow$ Make action $a_2$
· $\Rightarrow$ ...
· (state, action, reward) trajectory:
  $s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$.
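The loop above can be sketched as follows. Here `env` is a hypothetical environment object (not any particular library's API) whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward, done)`, and `sample_action` is the policy-sampling helper sketched earlier:

```python
def play_one_episode(env, sample_action, max_steps=1000):
    """Collect one (state, action, reward) trajectory: s1, a1, r1, s2, a2, r2, ...

    Assumes a hypothetical environment interface:
    reset() -> state, step(action) -> (next_state, reward, done).
    """
    trajectory = []
    state = env.reset()                               # observe s1
    for _ in range(max_steps):
        action = sample_action(state)                 # a_t ~ pi(. | s_t)
        next_state, reward, done = env.step(action)   # environment gives s_{t+1}, r_t
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # e.g. win, or touch a Goomba
            break
    return trajectory
```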
4. Rewards and Returns (important)
4.1 Return
Definition: Return (cumulative future reward)
· $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \ldots$
Question: Are $R_t$ and $R_{t+1}$ equally important?
· Which of the following would you prefer?
  · I give you $100 right now.
  · I will give you $100 one year later.
· Future reward is less valuable than present reward.
· $R_{t+1}$ should be given less weight than $R_t$.
Definition: Discounted return (cumulative discounted future reward)
· $\gamma$: discount rate (a tuning hyper-parameter).
· $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$
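A direct implementation of this definition, assuming the rewards $R_t, R_{t+1}, \ldots$ are already collected in a list (the value $\gamma = 0.9$ is just an example):

```python
def discounted_return(rewards, gamma=0.9):
    """U_t = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ...

    `rewards` holds R_t, R_{t+1}, ...; gamma is the discount rate
    (a tuning hyper-parameter).
    """
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

# e.g. discounted_return([1, 0, 1], gamma=0.9) -> 1 + 0.9*0 + 0.9**2 * 1 ≈ 1.81
```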
4.2 Randomness in Returns
Definition: Discounted return (at time step $t$)
· $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$
At time step $t$, the return $U_t$ is random.
· Two sources of randomness:
  1. Action can be random: $P[A=a \mid S=s] = \pi(a|s)$.
  2. The new state can be random: $P[S'=s' \mid S=s, A=a] = p(s'|s,a)$.
· For any $i \geq t$, the reward $R_i$ depends on $S_i$ and $A_i$.
· Thus, given $s_t$, the return $U_t$ depends on the random variables:
  · $A_t, A_{t+1}, A_{t+2}, \ldots$ and $S_{t+1}, S_{t+2}, \ldots$
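The randomness of $U_t$ can be seen by rolling out the same toy policy in a toy random environment several times from the same starting state: each rollout yields a different return. Everything in this sketch (the environment, the reward rule, the numbers) is made up purely to illustrate the effect:

```python
import random

def rollout_return(gamma=0.9, n_steps=10):
    """Simulate one toy rollout and return its discounted return U_t.

    Actions are sampled from a fixed toy policy and rewards come from a
    toy random environment, so U_t differs from rollout to rollout.
    """
    u, discount = 0.0, 1.0
    for _ in range(n_steps):
        action = random.choices(["left", "right", "up"], weights=[0.2, 0.1, 0.7])[0]
        # Toy environment: "up" tends to earn a coin, other moves earn nothing.
        reward = random.choices([1, 0], weights=[0.8, 0.2])[0] if action == "up" else 0
        u += discount * reward
        discount *= gamma
    return u

# Same starting state, same policy, different returns:
print([round(rollout_return(), 3) for _ in range(5)])
```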
5. Value Function
5.1 Action-Value Function $Q(s,a)$
Definition: Return (cumulative future reward)
· $U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \ldots$
Definition: Action-value function for policy $\pi$
· $Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$
· Return $U_t$ (random variable) depends on actions $A_t, A_{t+1}, A_{t+2}, \ldots$ and states $S_t, S_{t+1}, S_{t+2}, \ldots$
· Actions are random: $P[A=a \mid S=s] = \pi(a|s)$. (Policy function)
· States are random: $P[S'=s' \mid S=s, A=a] = p(s'|s,a)$. (State transition)
The action-value function tells us: if we use the policy function $\pi$, how good or bad it is to take action $a_t$ in state $s_t$. Knowing the policy function $\pi$, we can score every action $a$ in the current state.
Definition: Optimal action-value function
· $Q^*(s_t, a_t) = \underset{\pi}{\max}\, Q_\pi(s_t, a_t)$
It evaluates each action $a$ and tells us which action is best.
5.2 State-Value Function $V(s)$
Definition: State-value function
· $V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \sum_a \pi(a|s_t) \cdot Q_\pi(s_t, a)$. (Actions are discrete.)
· $V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \int \pi(a|s_t) \cdot Q_\pi(s_t, a)\, da$. (Actions are continuous.)
$V_\pi(s_t)$ judges the current situation: it tells us, for example, whether we are likely to win or lose.
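For discrete actions, this is just a policy-weighted average of the action values. A minimal sketch, with $\pi(a|s)$ taken from the earlier example and made-up $Q_\pi$ values:

```python
def state_value(policy_probs, q_values):
    """V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for discrete actions."""
    return sum(policy_probs[a] * q_values[a] for a in policy_probs)

# Hypothetical numbers: pi(a|s) from the earlier example, Q values made up.
pi_s = {"left": 0.2, "right": 0.1, "up": 0.7}
q_s = {"left": 1.0, "right": -2.0, "up": 5.0}
print(state_value(pi_s, q_s))   # 0.2*1.0 + 0.1*(-2.0) + 0.7*5.0 = 3.5
```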
5.3 Understanding the Value Functions
· Action-value function: $Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$.
· For policy $\pi$, $Q_\pi(s,a)$ evaluates how good it is for an agent to pick action $a$ while being in state $s$.
· State-value function: $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$.
· For fixed policy $\pi$, $V_\pi(s)$ evaluates how good the situation is in state $s$.
· $E_S[V_\pi(S)]$ evaluates how good the policy $\pi$ is.
6. How does AI control the agent?
Suppose we have a good policy $\pi(a|s)$.
· Upon observing the state $s_t$,
· random sampling: $a_t \sim \pi(\cdot \mid s_t)$.
Suppose we know the optimal action-value function $Q^*(s,a)$.
· Upon observing the state $s_t$,
· choose the action that maximizes the value: $a_t = \underset{a}{\operatorname{argmax}}\, Q^*(s_t, a)$.
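The two ways of controlling the agent can be sketched side by side. Here `policy` and `q_star` are hypothetical stand-ins for a learned policy function and a learned optimal action-value function:

```python
import random

ACTIONS = ["left", "right", "up"]

def act_with_policy(policy, state):
    """Policy-based control: sample a_t ~ pi(. | s_t)."""
    probs = policy(state)  # dict: action -> probability
    return random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

def act_with_q(q_star, state):
    """Value-based control: a_t = argmax_a Q*(s_t, a)."""
    return max(ACTIONS, key=lambda a: q_star(state, a))
```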
7. Summary
Agent, Environment, State $s$, Action $a$, Reward $r$, Policy $\pi(a|s)$, State transition $p(s'|s,a)$.
Return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$
Action-value function: $Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$.
Optimal action-value function: $Q^*(s_t, a_t) = \underset{\pi}{\max}\, Q_\pi(s_t, a_t)$.
State-value function: $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$.