Learn to play Pong with PG from scratch and pixels
http://karpathy.github.io/2016/05/31/rl/
Policy Gradients (PG) is the default choice for attacking RL problems.
DQN made Q-Learning famous on Atari games, but most people now prefer PG.
PG is preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward.
Pong is a special case of a Markov Decision Process (MDP): a graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.
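As a rough illustration (a toy example of my own, not from the original post), such an MDP can be written down as a table mapping (state, action) pairs to possible outcomes; the states and rewards below are purely hypothetical:

    # Toy MDP sketch: (state, action) -> list of (transition probability, next state, reward).
    toy_mdp = {
        ('ball_left', 'UP'):   [(0.9, 'ball_center', 0.0), (0.1, 'miss', -1.0)],
        ('ball_left', 'DOWN'): [(0.3, 'ball_center', 0.0), (0.7, 'miss', -1.0)],
        ('ball_center', 'UP'): [(1.0, 'return_shot', +1.0)],
    }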
The policy network is set up as follows:

Input: raw image pixels.
2-layer neural network
Output: move UP or DOWN. Stochastic policy: only produce a probability of moving UP.
Every iteration we will sample from this distribution to get the actual move.
Policy network forward pass in Python/numpy:
import numpy as np

def policy_forward(x):
  h = np.dot(W1, x)                # compute hidden layer neuron activations
  h[h < 0] = 0                     # ReLU nonlinearity: threshold at zero
  logp = np.dot(W2, h)             # log probability (logit) of going UP
  p = 1.0 / (1.0 + np.exp(-logp))  # sigmoid function (gives probability of going UP)
  return p, h                      # return probability of taking action 2 (UP), and hidden state
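For completeness, a minimal usage sketch: the 80x80 preprocessed input, 200 hidden units, and actions 2/3 for UP/DOWN follow the original post, while the random input x here is just a stand-in for a preprocessed frame difference:

    D = 80 * 80   # input dimensionality: 80x80 grid of preprocessed pixels
    H = 200       # number of hidden layer neurons
    W1 = np.random.randn(H, D) / np.sqrt(D)   # "Xavier"-style initialization
    W2 = np.random.randn(H) / np.sqrt(H)

    x = np.random.randn(D)                    # stand-in for a preprocessed frame difference
    aprob, h = policy_forward(x)
    action = 2 if np.random.uniform() < aprob else 3   # sample the move: 2 = UP, 3 = DOWN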
Training protocol: initialize the policy network with W1, W2 and play 100 games. Assume each game is made up of 200 frames, so we make 200 decisions per game. Suppose we won 12 games and lost 88. We take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and a parameter update encouraging the actions we picked in those states). We take the 200*88 = 17600 decisions made in the losing games and do a negative update, discouraging whatever we did in those states.
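A minimal sketch of that update, assuming we stored each frame's input, hidden state, and the quantity (y - aprob), where y = 1.0 if the sampled action was UP and 0.0 otherwise; the array names eps_xs, eps_hs, eps_dlogps are my own, and the "advantage" is just +1.0 for a won game and -1.0 for a lost one as described above:

    # Sketch: modulate the per-frame gradient signal by the game result and backprop it.
    def policy_gradient(eps_xs, eps_hs, eps_dlogps, advantage):
        """eps_xs: (T, D) inputs, eps_hs: (T, H) hidden states,
        eps_dlogps: (T,) per-frame (y - aprob) values, advantage: +1.0 or -1.0."""
        eps_dlogps = eps_dlogps * advantage    # flip the sign for losing games
        dW2 = np.dot(eps_hs.T, eps_dlogps)     # gradient w.r.t. W2, shape (H,)
        dh = np.outer(eps_dlogps, W2)          # backprop into hidden states
        dh[eps_hs <= 0] = 0                    # backprop through the ReLU
        dW1 = np.dot(dh.T, eps_xs)             # gradient w.r.t. W1, shape (H, D)
        return dW1, dW2

W1 and W2 would then be nudged in the direction of these gradients, e.g. with a plain gradient-ascent step or RMSProp as in the original post.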

Summary: this article describes how to learn to play Pong from raw pixels with the Policy Gradients (PG) algorithm. PG is the preferred method for reinforcement learning problems because it provides an explicit policy and directly optimizes the expected reward. Pong is a special case of a Markov Decision Process (MDP), where the goal is to compute the way of acting in any state that maximizes reward. The training protocol performs a positive update on the actions taken in winning games and a negative update on the actions taken in losing games. The loss can be written as $\sum_i A_i \log p(y_i \mid x_i)$, where $A_i$ is the advantage and $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted total reward.
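As a side note on that discounted return, a minimal sketch of turning one episode's per-frame rewards into $R_t$, using the standard backward recurrence; gamma = 0.99 is an assumed value matching typical settings:

    # Sketch: compute discounted returns R_t = sum_k gamma^k * r_{t+k},
    # sweeping backwards through one episode's rewards.
    def discount_rewards(rewards, gamma=0.99):
        discounted = np.zeros(len(rewards), dtype=np.float64)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            discounted[t] = running
        return discounted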