Introduction to Deep Learning (9) - Reinforcement Learning

Reinforcement Learning

An agent performs actions in an environment and receives rewards.

Goal: learn how to take actions that maximize reward.
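As a concrete picture of this loop, here is a minimal sketch of the agent-environment interaction. It assumes a Gym-style environment API (`gymnasium`) and uses a random placeholder policy; both choices are illustrative, not part of the original notes.

```python
import gymnasium as gym  # assumption: Gym-style environment API

env = gym.make("CartPole-v1")      # any environment works; CartPole is just an example
state, _ = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder policy: act at random
    # the environment returns the reward and the next state
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Episode return:", total_reward)
```

A learning agent would replace `env.action_space.sample()` with actions drawn from its current policy and use the observed rewards to improve that policy.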

Stochasticity: Rewards and state transitions may be random

Credit assignment: Reward $r_t$ may not directly depend on action $a_t$

Nondifferentiable: Can’t backprop through the world

Nonstationary: What the agent experiences depends on how it acts

Markov Decision Process (MDP)

Mathematical formalization of the RL problem: a tuple $(S, A, R, P, \gamma)$

$S$: Set of possible states

$A$: Set of possible actions

$R$: Distribution of reward given a (state, action) pair

$P$: Transition probability: distribution over the next state given a (state, action) pair

$\gamma$: Discount factor (trade-off between future and present rewards)

Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.

The agent executes a policy $\pi$ giving a distribution over actions conditioned on states.

Goal: Find the best policy that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$


Because rewards and state transitions are random, we maximize the expected sum of rewards rather than the outcome of a single rollout.
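To make the objective concrete, here is a tiny sketch of computing the discounted return $\sum_t \gamma^t r_t$ for one trajectory; the reward sequence is made up purely for illustration.

```python
# Discounted return of one (hypothetical) trajectory.
gamma = 0.99
rewards = [1.0, 0.0, 2.0, 1.0]  # made-up rewards r_0, r_1, r_2, r_3

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 + 0.99**3 * 1.0 ≈ 3.93
```

The RL objective is the expectation of this quantity over trajectories generated by the policy and the environment.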

Value function $V^{\pi}(s)$: expected cumulative reward from following policy $\pi$ starting in state $s$

Q function $Q^{\pi}(s, a)$: expected cumulative reward from taking action $a$ in state $s$ and then following policy $\pi$
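Written out with the notation above, the two definitions are the standard ones:

$$
V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s,\, \pi\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a,\, \pi\right]
$$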

Bellman Equation

After taking action $a$ in state $s$, we get reward $r$ and move to a new state $s'$. After that, the maximum possible future reward is $\max_{a'} Q^*(s', a')$. This gives the Bellman equation for the optimal Q function: $Q^*(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \right]$

Idea: if we find a function that satisfies the Bellman equation, then it must be the optimal $Q^*$.

Start with a random $Q$, and repeatedly use the Bellman equation as an update rule.
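A minimal sketch of this update rule on a toy tabular MDP; the transition table below is invented for illustration, and in practice the update runs until $Q$ stops changing.

```python
import numpy as np

# Toy MDP (made up for illustration): 2 states, 2 actions.
# P[(s, a)] is a list of (probability, next_state, reward) outcomes.
P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(1.0, 1, 2.0)],
}
n_states, n_actions, gamma = 2, 2, 0.9

Q = np.zeros((n_states, n_actions))  # start from an arbitrary (here: zero) Q
for _ in range(200):                 # repeatedly apply the Bellman update
    Q_new = np.zeros_like(Q)
    for (s, a), outcomes in P.items():
        # Bellman backup: E[ r + gamma * max_a' Q(s', a') ]
        Q_new[s, a] = sum(p * (r + gamma * Q[s2].max()) for p, s2, r in outcomes)
    Q = Q_new

print(Q)  # converges to the optimal Q*
```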


But if the state space is large or infinite, we cannot enumerate all states and iterate over them.

Instead, approximate $Q(s, a)$ with a neural network and use the Bellman equation to define the loss function.

-> Deep Q-Learning
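A minimal PyTorch sketch of that idea; the state/action dimensions and the batch of transitions are made up, and real deep Q-learning additionally uses a replay buffer and a target network, which are omitted here.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99  # assumed sizes, for illustration only

# Q-network: maps a state to one Q-value per action
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Fake batch of transitions (s, a, r, s', done), purely to show the loss
s    = torch.randn(32, state_dim)
a    = torch.randint(0, n_actions, (32,))
r    = torch.randn(32)
s2   = torch.randn(32, state_dim)
done = torch.zeros(32)

# Bellman target: r + gamma * max_a' Q(s', a'); no gradient flows through it
with torch.no_grad():
    target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values

# Loss: squared Bellman error between Q(s, a) and the target
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```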

Policy Gradients

Train a network $\pi_{\theta}(a, s)$ that takes the state as input and gives a distribution over which action to take

Objective function: expected future rewards when following policy $\pi_{\theta}$

Use gradient ascent. We can't backprop through the environment, so we use the log-derivative (REINFORCE) trick to get an estimate of the gradient of the expected reward.
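A minimal sketch of that trick in PyTorch, with made-up dimensions and random data standing in for real policy rollouts.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

state_dim, n_actions = 4, 2  # assumed sizes, for illustration only

# Policy network: state -> logits over actions
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Pretend these came from rollouts of the current policy
states  = torch.randn(32, state_dim)
actions = torch.randint(0, n_actions, (32,))
returns = torch.randn(32)  # discounted return observed after each step

# Log-derivative trick: grad J ≈ E[ grad log pi_theta(a|s) * return ]
dist = Categorical(logits=policy(states))
log_probs = dist.log_prob(actions)
loss = -(log_probs * returns).mean()  # negate so that minimizing = gradient ascent on J

optimizer.zero_grad()
loss.backward()
optimizer.step()
```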


Other approaches:

Actor-Critic

Model-Based

Imitation Learning

Inverse Reinforcement Learning

Adversarial Learning

Stochastic computation graphs
