Policy Gradient Algorithms
1. What is Reinforcement Learning?
Reinforcement Learning is a field of Machine Learning that has produced many important AI breakthroughs, such as AlphaGo and OpenAI Five. The game of Go was widely considered too difficult for computers to learn and play at the level of professional human players. AlphaGo is significant for being the first machine to surpass the best human Go players. Importantly, both AlphaGo and OpenAI Five use Reinforcement Learning algorithms to learn to play their respective games. One of the main goals of Reinforcement Learning is to create software agents that learn to maximize their reward in certain environments. Currently, these environments tend to be video games, where the reward (e.g. score) is easy to obtain. These virtual environments can be contrasted with the real world, where rewards are less easily defined. OpenAI Gym is one of the most popular environments for students and researchers to learn and explore Reinforcement Learning. In this blog post, we’ll quickly review some basic concepts common to many RL problems.
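To make the agent-environment loop concrete, here is a minimal sketch using OpenAI Gym. It assumes the classic Gym API (reset returns an observation; step returns four values), and the CartPole-v1 environment and the random action choice are placeholders rather than anything from the original post:

```python
import gym  # assumes the classic OpenAI Gym API (pre-0.26)

# A minimal agent-environment loop: the agent observes a state, picks an
# action, and the environment returns the next state and a reward.
env = gym.make("CartPole-v1")   # placeholder environment
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # a random policy stands in for a learned one
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```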
2. What is the Exploration-Exploitation Tradeoff?
In the field of RL, the exploration-exploitation tradeoff is a tradeoff that agents make when they choose either to explore new actions and states or to exploit known actions and states to maximize their reward. An agent that only “explores” the world is able to learn about its world but never uses this knowledge to maximize its reward. In contrast, an agent that only “exploits” the world may be able to reach a local maximum but fail to reach the global maximum. An agent that only exploits its environment doesn’t try to learn anything new about the world, so it’s unable to achieve the global maximum of potential reward. In both artificial environments and the real world, the optimal tradeoff for both software agents and humans is somewhere in between.
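One common way to make this tradeoff concrete is epsilon-greedy action selection, which the post does not cover explicitly; the toy multi-armed bandit below is a rough sketch with made-up reward means:

```python
import random

# Toy 3-armed bandit: epsilon controls the exploration-exploitation tradeoff.
# epsilon = 1.0 means pure exploration; epsilon = 0.0 means pure exploitation.
true_means = [0.2, 0.5, 0.8]   # hypothetical expected reward of each arm
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                  # explore 10% of the time

for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_means))                       # explore
    else:
        arm = max(range(len(estimates)), key=lambda i: estimates[i])  # exploit
    reward = random.gauss(true_means[arm], 1.0)                       # noisy reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]         # running mean

print("estimated arm values:", estimates)
```

With epsilon = 0 the agent can get stuck on whichever arm happened to pay off early (a local maximum); a small amount of exploration lets it eventually find the best arm.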
3. Policy Gradient Algorithms
The vanilla Policy Gradient algorithm (REINFORCE) was introduced in the 1992 paper “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. The basic idea of this algorithm is to run the policy for a while and see which actions lead to high rewards and which actions lead to low rewards. Next, the algorithm increases the probabilities of the actions that led to higher rewards. A policy is a mapping from a state to a probability distribution over actions. Policy Gradients are on-policy methods, meaning that the agent only learns from actions that its current policy chooses to take. Policy Gradients are often compared to value-based methods such as Q-learning. Q-learning is an off-policy method, meaning that it can update its parameters using stored information from previously taken actions.
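To show what “increasing the probabilities of actions that led to higher reward” looks like in practice, here is a minimal REINFORCE update sketched in PyTorch. The state and action sizes are made up, episode collection is omitted, and this is an illustration of the idea rather than the paper’s exact procedure:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 4-dimensional observations, 2 discrete actions.
obs_dim, n_actions, gamma, lr = 4, 2, 0.99, 1e-2

# The policy network maps a state to (unnormalized) action probabilities.
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

def reinforce_update(states, actions, rewards):
    """One policy-gradient step from a single episode's trajectory."""
    # Discounted return G_t for every time step, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions)
    log_probs = torch.log_softmax(policy(states), dim=-1)
    chosen = log_probs[torch.arange(len(actions)), actions]

    # Gradient ascent on expected return = descent on -sum(log pi(a|s) * G_t),
    # which raises the probability of actions followed by high returns.
    loss = -(chosen * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```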
4. Policy Network
In a Policy Gradient algorithm, the policy is a neural network that is trained in a similar fashion to Supervised Learning. In the classical Supervised Learning problem, we generally need to create a labeled dataset of positive and negative examples. For example, a dog-versus-cat classifier might be trained on 100 examples of dogs and 100 examples of cats. In an RL environment, the ground-truth labels come from the final reward in the environment. Because a game often has many time steps, these labels are noisy and will often not be correct. The credit assignment problem (minimizing this noise) is the problem of determining which actions actually led to the reward. However, after many episodes, the policy network learns to favor the actions that lead to higher reward and avoid the actions that lead to lower reward.
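To see how a single terminal reward gets spread over the actions that preceded it, here is a small worked example using the same discounted-return computation as the REINFORCE sketch above (the 5-step episode and the discount factor are made up):

```python
# One (noisy) answer to the credit assignment problem: discounting spreads a
# terminal reward back over earlier actions, giving more credit to the
# actions closest to the reward.
def discounted_returns(rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# A 5-step episode where only the final action is rewarded:
print(discounted_returns([0, 0, 0, 0, 1.0]))
# -> [0.9606, 0.9703, 0.9801, 0.99, 1.0] (approximately): every step shares the credit.
```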
5. What is Curiosity Driven Learning?
As you can imagine, exploring the world or environment is important for RL agents. For problems where rewards are sparse, agents can struggle to explore the world in a meaningful way. The Curiosity Driven Learning algorithm builds on top of the Policy Gradient algorithm by adding a curiosity mechanism that encourages exploration.
The intrinsic reward function is a measure of prediction error: given the state at time t, how well can the agent predict the state at time t + 1? The error is measured using the L2 norm of the difference between the predicted feature vector and the actual feature vector of the next state.
The Intrinsic Curiosity Module consists of a forward model and an inverse model. The forward model predicts the feature representation of the next state given the current state and the action taken. The inverse model predicts the action taken given the current state and the next state.
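Below is a rough sketch of what these two models might look like, assuming made-up feature and action sizes and a simple linear encoder; it is meant to illustrate the structure described above, not reproduce the original Intrinsic Curiosity Module implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes: 4-dimensional observations, 16-dimensional features, 2 actions.
obs_dim, feat_dim, n_actions = 4, 16, 2

phi = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())  # state encoder
forward_model = nn.Linear(feat_dim + n_actions, feat_dim)     # (features, action) -> next features
inverse_model = nn.Linear(2 * feat_dim, n_actions)            # (features, next features) -> action

def intrinsic_reward(state, action, next_state):
    """Forward-model prediction error (squared L2 norm) used as the curiosity bonus."""
    f, f_next = phi(state), phi(next_state)
    a_onehot = F.one_hot(action, n_actions).float()
    f_pred = forward_model(torch.cat([f, a_onehot], dim=-1))
    return 0.5 * (f_pred - f_next).pow(2).sum(dim=-1)   # higher error -> more curiosity

def inverse_loss(state, action, next_state):
    """Cross-entropy loss for predicting which action caused the transition."""
    logits = inverse_model(torch.cat([phi(state), phi(next_state)], dim=-1))
    return F.cross_entropy(logits, action)
```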
By training its models to minimize these prediction errors, and by rewarding itself for visiting states where the error is still high, the agent learns to explore the world using a curiosity mechanism, which tends to be important for RL agents to do well in their environments.
6. Takeaways
Reinforcement Learning is an increasingly important field that studies the problem of teaching software agents to maximize their potential reward in an environment. The Policy Gradient algorithm is an important building block of many more advanced RL algorithms. The exploration-exploitation tradeoff is also an important factor for many algorithms to consider. Curiosity Driven Learning is one mechanism that many algorithms find useful as a way to help agents explore their environments successfully. Overall, we’ve reviewed some important basic concepts in RL, including the Policy Gradient algorithm and the exploration-exploitation tradeoff.
Resources
Policy Gradient Algorithms