Reinforcement Learning - An Introduction memo

This memo covers the basic concepts of Markov decision processes (MDPs), including finite state and action spaces and transition probabilities, and explains the value functions (state-value function and action-value function) and how they are computed. It also summarizes the core reinforcement learning algorithms of policy evaluation, policy improvement, policy iteration, and value iteration.

1. MDP (Markov Decision Processes)

finite MDP: finite state space & finite action space

transition probabilities: $p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}$

expected reward: $r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$
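The notes do not fix a concrete data structure for $p$ and $r$. As a minimal sketch (the names P, transition_prob, expected_reward and the toy two-state MDP are illustrative assumptions, not from the original), a finite MDP can be stored as a nested dict P[s][a] = [(prob, s', reward), ...], similar to the layout of Gym's toy-text environments:

# A tiny 2-state, 2-action finite MDP: P[s][a] = [(prob, next_state, reward), ...]
# For each (s, a) the probabilities over next states must sum to 1.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "s0", 0.5)],
    },
}

def transition_prob(s, a, s_next):
    """p(s' | s, a): probability of landing in s_next after taking a in s."""
    return sum(p for p, nxt, _ in P[s][a] if nxt == s_next)

def expected_reward(s, a, s_next):
    """r(s, a, s'): expected reward given that the transition ends in s_next."""
    mass = transition_prob(s, a, s_next)
    if mass == 0.0:
        return 0.0
    return sum(p * r for p, nxt, r in P[s][a] if nxt == s_next) / mass

print(transition_prob("s0", "go", "s1"))   # 0.8
print(expected_reward("s0", "go", "s1"))   # 1.0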

2. Value Functions

state-value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$

action-value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
$G_t$: return (cumulative discounted reward) following time t
$R_t$: reward at time t, dependent, like $S_t$, on $A_{t-1}$ and $S_{t-1}$
$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
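As a quick sanity check of the return definition, the sketch below (the helper name discounted_return is hypothetical) accumulates $\sum_k \gamma^k R_{t+k+1}$ over a finite reward sequence, together with the equivalent backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$:

def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_backward(rewards, gamma=0.9):
    """Same value via the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))           # 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return_backward([1.0, 0.0, 2.0]))  # 2.62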

relation between $v_\pi$ and $q_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
$q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_\pi(s')\,]$
$\pi(a \mid s)$: probability of taking action a when in state s

Bellman equation for $v_\pi$: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_\pi(s')\,]$
Solving this Bellman equation gives $v_\pi$.

Bellman equation for $q_\pi$: $q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\,]$
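The identities above translate directly into one-step computations. A minimal sketch, assuming the P[s][a] = [(prob, s', reward), ...] layout from the earlier MDP sketch, a policy stored as pi[s] = {a: pi(a|s)}, and an action-value table Q keyed by (s, a) (all of these names are assumptions):

def q_from_v(P, V, s, a, gamma=0.9):
    """q_pi(s, a) = sum_{s'} p(s'|s,a) * [r(s,a,s') + gamma * v_pi(s')]."""
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])

def v_from_q(pi, Q, s):
    """v_pi(s) = sum_a pi(a|s) * q_pi(s, a)."""
    return sum(prob_a * Q[(s, a)] for a, prob_a in pi[s].items())

def bellman_backup_v(P, pi, V, s, gamma=0.9):
    """One Bellman expectation backup for v_pi at state s."""
    return sum(prob_a * q_from_v(P, V, s, a, gamma) for a, prob_a in pi[s].items())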

3. Policy Evaluation

policy evaluation: compute $v_\pi$ for a given policy $\pi$
Iterative policy evaluation:
1. For every state s, the initial value $v_0(s)$ is chosen arbitrarily (terminal states, if any, must be 0).
2. Successive approximations are obtained by using the Bellman equation as an update rule:
$v_{k+1}(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_k(s')\,]$
code:
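Filling in the placeholder above, here is a sketch of iterative policy evaluation under the same assumed P[s][a] layout, with the policy stored as pi[s] = {a: pi(a|s)}; the stopping threshold theta and all names are assumptions:

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r + gamma * v_k(s')]
    over all states until the largest change is below theta."""
    V = {s: 0.0 for s in P}   # v_0(s) = 0; a terminal state modeled as a zero-reward self-loop stays 0
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                prob_a * sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a, prob_a in pi[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new   # in-place (Gauss-Seidel style) update
        if delta < theta:
            return V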

4. Policy Improvement

policy improvement: use the value function of the current policy to find a better policy

greedy policy $\pi'$: $\pi'(s) = \arg\max_a q_\pi(s, a)$
The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to $v_\pi$.
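As a sketch, the greedy policy can be read off from $v_\pi$ with a one-step lookahead over the assumed P[s][a] model (the helper name greedy_policy is hypothetical):

def greedy_policy(P, V, gamma=0.9):
    """pi'(s) = argmax_a sum_{s'} p(s'|s,a) [r(s,a,s') + gamma * v_pi(s')]."""
    pi_greedy = {}
    for s in P:
        q = {a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
             for a in P[s]}
        pi_greedy[s] = max(q, key=q.get)   # deterministic: pick the highest-valued action
    return pi_greedy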

5. Policy Iteration & Value Iteration

policy iteration: alternate policy evaluation and policy improvement until the policy is stable; each cycle yields a policy at least as good as the previous one

code:
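Filling in the placeholder: a sketch of policy iteration that alternates the policy_evaluation and greedy_policy sketches above until the policy stops changing (the starting uniform policy and all names are assumptions):

def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start from the uniform random policy, stored as pi[s] = {a: pi(a|s)}.
    pi = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, pi, gamma)          # evaluate the current policy
        improved = greedy_policy(P, V, gamma)        # one-step-lookahead improvement
        # Re-express the deterministic greedy policy in the same dict-of-probabilities form.
        new_pi = {s: {a: (1.0 if a == improved[s] else 0.0) for a in P[s]} for s in P}
        if new_pi == pi:                             # policy stable -> stop
            return new_pi, V
        pi = new_pi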

an example:

value iteration

Value iteration can be written as a particularly simple backup operation that combines the policy improvement step and a truncated policy evaluation step in a single update, so policy evaluation does not have to be run to convergence at every iteration.

$v_{k+1}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_k(s')\,]$

code:
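Filling in the placeholder: a sketch of value iteration under the same assumed P[s][a] layout; it applies the max-backup above until the values stop changing and then extracts the greedy policy:

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Sweep v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) [r + gamma * v_k(s')] until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Read off the greedy (optimal) policy from the converged values.
    pi = {}
    for s in P:
        q = {a: sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a]) for a in P[s]}
        pi[s] = max(q, key=q.get)
    return V, pi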

6. Q-Learning

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
Evaluation:

The Q-table is updated at every step: Q(s, a) is the current entry in the table, and $\max_{a'} Q(s', a')$ is the current estimate of the best value attainable from s'.

decision policy:

ε-greedy: in the convention used here, ε = 0.9 means the agent chooses the action with the highest Q-value 90% of the time and a random action the remaining 10% (in the more common convention, ε is the exploration probability, so the same behaviour corresponds to ε = 0.1).
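A minimal tabular Q-learning sketch matching the pseudocode above. It is written against the standard Gymnasium API (env.reset(), env.step(a)); the choice of FrozenLake-v1, the hyperparameters, and the use of the common convention where ε is the exploration probability are all assumptions, not from the original notes:

import random
import gymnasium as gym   # assumption: the Gymnasium toy-text environments are available

env = gym.make("FrozenLake-v1")                   # any discrete-state, discrete-action env works
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = [[0.0] * n_actions for _ in range(n_states)]  # Q-table initialized to zeros
alpha, gamma, epsilon = 0.1, 0.99, 0.1            # here epsilon is the exploration probability

def epsilon_greedy(s):
    """With probability epsilon explore randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return env.action_space.sample()
    return max(range(n_actions), key=lambda a: Q[s][a])

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, terminated, truncated, _ = env.step(a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        td_target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (td_target - Q[s][a])
        s = s_next
        done = terminated or truncated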

Reinforcement learning is a machine learning approach that aims to teach an agent to make optimal decisions in a dynamic environment. The agent learns by continually interacting with the environment and adjusting its behaviour according to the feedback it receives. A core concept is the reward, the environment's evaluation of the agent's behaviour; the agent's goal is to learn a behaviour policy that maximizes the cumulative long-term reward, gradually improving its decisions through trial and error.

Reinforcement learning involves several basic elements: states, actions, policies, and value functions. A state represents the information describing the environment, an action is a choice the agent can execute, a policy is the agent's way of selecting an action given the current state, and a value function evaluates how good each state or action is. These elements interact and are updated and improved by learning algorithms so that the agent makes better decisions.

There are many reinforcement learning algorithms; among the best known are Q-learning and the Deep Q-Network (DQN). Q-learning is a value-based method that optimizes the policy by repeatedly updating the values of state-action pairs, while DQN builds on Q-learning by introducing a deep neural network, which lets the agent handle more complex environments and tasks. In short, reinforcement learning teaches an agent to make optimal decisions through interactive learning, and it is widely applied in areas such as artificial intelligence, autonomous driving, and game AI.