Learn to play Pong with PG from scratch and pixels

This post summarizes how to learn to play Pong from raw pixels with Policy Gradients (PG). PG is the default choice for attacking reinforcement learning problems because it gives an explicit policy and directly optimizes the expected reward. Pong is a special case of a Markov Decision Process (MDP), and the goal is to compute the optimal way of acting in any state so as to maximize reward. The training protocol does a positive update on the actions taken in winning games and a negative update on the actions taken in losing games. The loss can be written as ∑_i A_i log p(y_i | x_i), where A_i is the advantage and R_t = ∑_{k=0}^∞ γ^k r_{t+k} is the discounted total reward.

http://karpathy.github.io/2016/05/31/rl/

Policy Gradients (PG) is the default choice for attacking RL problems.

DQN is a modified version of Q-Learning (Q-Learning with a deep neural network as the function approximator).

PG is preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward.

Pong is a special case of a Markov Decision Process (MDP): a graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.
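As a toy illustration (not from the original post; the states, actions, and rewards below are made up), such a graph can be written down as a table of probabilistic transitions and sampled from:

import random

# Hypothetical toy MDP: state -> action -> list of (probability, next_state, reward).
# In real Pong the "state" is the raw pixel frame, so this table is only a sketch.
mdp = {
    "rally":      {"UP":   [(0.7, "rally", 0.0), (0.3, "we_score", +1.0)],
                   "DOWN": [(0.6, "rally", 0.0), (0.4, "they_score", -1.0)]},
    "we_score":   {"UP": [(1.0, "rally", 0.0)], "DOWN": [(1.0, "rally", 0.0)]},
    "they_score": {"UP": [(1.0, "rally", 0.0)], "DOWN": [(1.0, "rally", 0.0)]},
}

def step(state, action):
    """Sample one probabilistic transition (edge) of the MDP graph."""
    transitions = mdp[state][action]
    weights = [p for p, _, _ in transitions]
    _, next_state, reward = random.choices(transitions, weights=weights, k=1)[0]
    return next_state, reward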

Policy network as below:
[figure: policy network diagram, see the original post]
Input: raw image pixels.

2-layer neural network

Output: whether to move UP or DOWN. The policy is stochastic: the network only produces a probability of moving UP.

Every iteration we will sample from this distribution to get the actual move.
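A minimal sketch of that sampling step, assuming the ATARI action encoding 2 = UP, 3 = DOWN (the probability here is hard-coded; in practice it comes from the forward pass below):

import numpy as np

prob_up = 0.7                                         # example value; normally output by the policy network
action = 2 if np.random.uniform() < prob_up else 3    # flip a biased coin: 2 = UP, 3 = DOWN
y = 1 if action == 2 else 0                           # "fake label" for the action we ended up sampling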

Policy network forward pass in Python/numpy:

import numpy as np

def policy_forward(x):
    h = np.dot(W1, x) # compute hidden layer neuron activations
    h[h<0] = 0 # ReLU nonlinearity: threshold at zero
    logp = np.dot(W2, h) # compute log probability (log-odds) of going up
    p = 1.0 / (1.0 + np.exp(-logp)) # sigmoid function (gives probability of going up)
    return p, h # return probability of taking action 2 (UP), and hidden state
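A minimal usage sketch: the hidden size H = 200 and the 80x80 preprocessed input follow the original post, while the random input frame here is just a stand-in:

H = 200                                  # number of hidden layer neurons
D = 80 * 80                              # input dimensionality: 80x80 grid of preprocessed pixels
W1 = np.random.randn(H, D) / np.sqrt(D)  # "Xavier"-style random initialization
W2 = np.random.randn(H) / np.sqrt(H)

x = np.random.randn(D)                   # stand-in for a preprocessed (difference) frame
prob_up, h = policy_forward(x)           # probability of moving UP, plus hidden activations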

Training protocol: Initialize the policy network with W1, W2 and play 100 games. Assume each game consists of 200 frames, so we make 200 decisions per game. Suppose we won 12 games and lost 88. We take all 200*12 = 2400 decisions made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and making a parameter update that encourages the actions we picked in all those states). We take the other 200*88 = 17600 decisions made in the losing games and do a negative update. Then we play another 100 games and repeat. One such update is sketched below.
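A rough sketch of one such update, continuing from the initialization above (the rollout lists here are dummy stand-ins for data collected while playing the 100 games; the original post wraps this bookkeeping and the backward pass in helper functions):

# Pretend rollout data: 4 decisions, the first two from a game we won (+1),
# the last two from a game we lost (-1). Real code fills these lists while playing.
xs = [np.random.randn(D) for _ in range(4)]            # inputs we saw
hs = [np.maximum(0, np.dot(W1, x)) for x in xs]        # hidden activations (ReLU)
ps = [1.0 / (1.0 + np.exp(-np.dot(W2, h))) for h in hs]
ys = [1, 0, 1, 0]                                      # sampled actions (1 = UP, 0 = DOWN)
advantages = [+1.0, +1.0, -1.0, -1.0]                  # won, won, lost, lost

epx = np.vstack(xs)                                    # (N, D)
eph = np.vstack(hs)                                    # (N, H)
epdlogp = np.vstack([y - p for y, p in zip(ys, ps)])   # (N, 1): grad of log p(sampled action) wrt logit
epdlogp *= np.vstack(advantages)                       # modulate: encourage winners, discourage losers

# Backprop through the 2-layer network by hand.
dW2 = np.dot(eph.T, epdlogp).ravel()                   # gradient for W2, shape (H,)
dh = np.outer(epdlogp, W2)                             # backprop into hidden layer, shape (N, H)
dh[eph <= 0] = 0                                       # backprop through the ReLU
dW1 = np.dot(dh.T, epx)                                # gradient for W1, shape (H, D)

learning_rate = 1e-3
W1 += learning_rate * dW1                              # gradient *ascent* on expected reward
W2 += learning_rate * dW2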

In summary the loss looks like ∑_i A_i log p(y_i | x_i), where y_i is the action we sampled in state x_i and A_i is the advantage: for example +1.0 for decisions made in games we eventually won and -1.0 for games we lost, or more generally the discounted return R_t = ∑_{k=0}^∞ γ^k r_{t+k}. We do gradient ascent on this objective.
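When rewards arrive during the episode (in Pong a point is scored whenever the ball gets past a player), the advantage is usually the discounted return R_t rather than a bare +1/-1. A minimal sketch of computing it (the standardization at the end is a common variance-reduction trick, not part of the formula itself):

import numpy as np

def discount_rewards(r, gamma=0.99):
    """Compute R_t = sum_{k>=0} gamma^k * r_{t+k} for a 1-D array of per-step rewards."""
    discounted = np.zeros_like(r, dtype=float)
    running = 0.0
    for t in reversed(range(len(r))):
        running = r[t] + gamma * running
        discounted[t] = running
    return discounted

# Example: no reward until the final timestep, where we win a point (+1).
rewards = np.array([0, 0, 0, 0, 1], dtype=float)
advantages = discount_rewards(rewards)     # approx. [0.96, 0.97, 0.98, 0.99, 1.0]
advantages -= advantages.mean()            # standardize the advantages...
advantages /= (advantages.std() + 1e-8)    # ...to reduce gradient variance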
