Learn to play Pong with PG from scratch and pixels

This post describes how to learn to play Pong from raw pixels using the Policy Gradients (PG) algorithm. PG is the default choice for attacking reinforcement learning problems because it maintains an explicit policy and directly optimizes the expected reward. Pong is a special case of a Markov Decision Process (MDP), where the goal is to compute the optimal way of acting in any state to maximize reward. The training protocol does a positive update with the actions taken in winning games and a negative update with those taken in losing games. The loss function can be written as $\sum_i A_i \log p(y_i \mid x_i)$, where $A_i$ is the advantage and $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted total reward.

http://karpathy.github.io/2016/05/31/rl/

Policy Gradients (PG) is the default choice for attacking RL problems.

DQN popularized Q-Learning by combining it with deep networks.

PG is still preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward.
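To make "directly optimizes the expected reward" concrete, the standard policy gradient (REINFORCE) identity says the gradient of the expected reward can be estimated from sampled actions:

$$\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[R] \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, R\big]$$

Replacing $R$ with the advantage $A_i$ gives the $\sum_i A_i \log p(y_i \mid x_i)$ objective from the summary above.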

Pong is a special case of a Markov Decision Process (MDP): a graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.
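As a minimal sketch (the states, probabilities, and rewards below are invented for illustration, not extracted from Pong), such an MDP graph can be encoded as nested dictionaries mapping each state and action to its probabilistic outgoing edges:

import random

# Hypothetical toy MDP, not taken from Pong: each node (state) maps each
# action to a list of (probability, next_state, reward) edges.
mdp = {
    'rally': {
        'UP':   [(0.7, 'rally', 0.0), (0.2, 'win', +1.0), (0.1, 'lose', -1.0)],
        'DOWN': [(0.7, 'rally', 0.0), (0.1, 'win', +1.0), (0.2, 'lose', -1.0)],
    },
}

def step(state, action):
    """Sample one probabilistic transition (edge) and its reward."""
    edges = mdp[state][action]
    probs = [p for p, _, _ in edges]
    _, next_state, reward = random.choices(edges, weights=probs, k=1)[0]
    return next_state, reward

next_state, reward = step('rally', 'UP')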

Policy network as below:
[Figure: 2-layer policy network taking raw pixels and outputting the probability of moving UP]
Input: raw image pixels.

2-layer neural network

Output: the probability of moving UP. This is a stochastic policy: the network produces only a single probability p, and DOWN is taken with probability 1 - p.

Every iteration we will sample from this distribution to get the actual move.

Policy network forward pass in Python/numpy:

import numpy as np

def policy_forward(x):
    # W1, W2 are the network's weight matrices, initialized elsewhere
    h = np.dot(W1, x)               # compute hidden layer neuron activations
    h[h < 0] = 0                    # ReLU nonlinearity: threshold at zero
    logp = np.dot(W2, h)            # compute the logit (log-odds) of going up
    p = 1.0 / (1.0 + np.exp(-logp)) # sigmoid function (gives probability of going up)
    return p, h                     # return probability of taking action 2 (UP), and hidden state
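To illustrate the per-iteration sampling mentioned above, here is a usage sketch building on policy_forward. The 200 hidden units and the flattened 80x80 preprocessed frame follow Karpathy's post; the random input frame is just a placeholder:

H = 200                                  # number of hidden layer neurons (as in the post)
D = 80 * 80                              # input dimensionality: flattened 80x80 preprocessed frame
W1 = np.random.randn(H, D) / np.sqrt(D)  # "Xavier"-style random initialization
W2 = np.random.randn(H) / np.sqrt(H)

x = np.random.randn(D)                   # placeholder for a real preprocessed frame
aprob, h = policy_forward(x)
action = 2 if np.random.uniform() < aprob else 3  # 2 = UP, 3 = DOWN (ALE action codes)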

Training protocol: Initialize the policy network with W1, W2 and play 100 games of Pong. Assume each game is made up of 200 frames, i.e. 200 decisions per game. Suppose we won 12 games and lost 88. We'll take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and a parameter update encouraging the actions we picked in those states). The 200*88 = 17600 decisions from the losing games get the same treatment with a negative update, discouraging whatever we did.
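A minimal sketch of one such update, assuming the variables from the forward pass above; this is a simplified per-decision version (the post batches these over whole episodes), and learning_rate and the function name are illustrative:

learning_rate = 1e-3  # illustrative step size

def policy_update(x, h, p, action, advantage):
    """One per-decision update: advantage = +1.0 (winning game) or -1.0 (losing game)."""
    global W1, W2
    y = 1.0 if action == 2 else 0.0  # fake label: 1 if the sampled action was UP
    dlogit = (y - p) * advantage     # gradient of advantage * log p(y|x) w.r.t. the logit
    dW2 = dlogit * h                 # backprop into W2
    dh = dlogit * W2                 # backprop into hidden activations
    dh[h <= 0] = 0                   # backprop through the ReLU
    dW1 = np.outer(dh, x)            # backprop into W1
    W1 += learning_rate * dW1        # gradient ascent on expected reward
    W2 += learning_rate * dW2

Applied to the 2400 winning-game decisions with advantage = +1.0 and the 17600 losing-game decisions with advantage = -1.0, this implements the protocol above.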
