Reinforcement Learning frameworks

RLlib: a scalable Reinforcement Learning framework in Python
This article introduces RLlib, an open-source Reinforcement Learning library built on top of Ray. It offers high scalability and a unified API, with support for both TensorFlow and PyTorch. Through its simple primitives, RLlib simplifies distributed and parallel Reinforcement Learning, addressing the engineering challenges of large-scale RL algorithms while still allowing users to customize their own algorithms. The article also shows how to solve the CartPole environment from OpenAI Gym using RLlib's PPO algorithm.

DEEP REINFORCEMENT LEARNING EXPLAINED — 20

This is post number 20 in the Deep Reinforcement Learning Explained series, devoted to Reinforcement Learning frameworks.

So far, in previous posts, we have been looking at a basic, representative set of RL algorithms (although we have skipped several) that are relatively easy to program. But from now on, we need to consider both the scale and complexity of the RL algorithms. In this scenario, programming a Reinforcement Learning implementation from scratch can become tedious work with a high risk of programming errors.

To address this, the RL community began to build frameworks and libraries that simplify the development of RL algorithms, both by providing new building blocks and, especially, by making it easy to combine existing algorithmic components. In this post, we will give a general presentation of those frameworks and solve the CartPole problem from the previous posts using the PPO algorithm with RLlib, an open-source library in Python based on the Ray framework.
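
As a preview of what this looks like in practice, the following is a minimal sketch of training PPO on CartPole with RLlib. It assumes the classic trainer API that RLlib exposed around the time of writing (ray.rllib.agents.ppo.PPOTrainer); newer Ray releases have moved to a builder-style configuration API, so the exact imports and config keys may differ in your version.

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()

    # Configure PPO for the CartPole environment from OpenAI Gym.
    # "framework": "torch" selects PyTorch as the deep learning backend.
    trainer = PPOTrainer(env="CartPole-v0", config={"framework": "torch"})

    # Each call to train() runs one iteration of experience collection plus
    # optimization and returns a dictionary of metrics.
    for i in range(10):
        result = trainer.train()
        print(i, result["episode_reward_mean"])

    ray.shutdown()

Note how little of this code is about PPO itself: the algorithm, the neural network, and the sampling loop are all provided by the library, and we only select the environment and a handful of options.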

Beyond REINFORCE

But before continuing, as a motivational example, let's remember that in the previous post we presented REINFORCE, a Monte Carlo variant of a policy gradient algorithm in Reinforcement Learning. The method collects samples of an episode using its current policy and directly updates the policy parameters. Since one full trajectory must be completed to construct a sample space, it is updated in an on-policy way.
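
To make this recap concrete, here is a minimal sketch of that update loop for CartPole, written in PyTorch against the old Gym step API (observation, reward, done, info). The two-layer policy network and the hyperparameters are illustrative choices for this sketch, not the exact implementation from the previous post.

    import gym
    import torch
    import torch.nn as nn

    # Illustrative policy for CartPole: 4 observations in, 2 action logits out.
    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
    env = gym.make("CartPole-v0")
    gamma = 0.99

    for episode in range(500):
        obs = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            # Sample an action from the current policy and keep its log-probability.
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, done, _ = env.step(action.item())
            rewards.append(reward)

        # Monte Carlo returns: discount the rewards of the single collected trajectory.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        returns = torch.as_tensor(returns, dtype=torch.float32)

        # REINFORCE update: every action is reinforced in proportion to the return
        # that followed it, and the trajectory is then thrown away.
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()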

However, there are some limitations associated with the REINFORCE algorithm. Although we cannot go into more detail here, we can highlight three of the main issues:

  1. The update process is very inefficient. We run the policy once, update once, and then throw away the trajectory.

  2. The gradient estimate is very noisy. There is a possibility that the collected trajectory may not be representative of the policy.

  3. There is no clear credit assignment. A trajectory may contain many good/bad actions and whether or not these actions are reinforced depends only on the final total output.

As we already anticipated in the previous post, a proposal that addresses these limitations is the PPO algorithm, introduced in the paper "Proximal Policy Optimization Algorithms" by John Schulman et al. (2017) at OpenAI. But understanding the PPO algorithm requires a more complex mathematical treatment, and its programming becomes more convoluted than that of REINFORCE. And this is going to be the case with all the algorithms that we will present from now on in this series.

But actually, although we cannot avoid having to understand a specific algorithm well in order to judge its suitability as a solution to a specific problem, its programming can be greatly simplified with the new Reinforcement Learning frameworks and libraries that the research community is creating and sharing.

Reinforcement Learning frameworks

Before presenting these RL frameworks, let's look briefly at their context.

Learning from interactions instead of examples

In the last several years, the pattern-recognition side has been the focus of much of the work and discussion in the Deep Learning community. We are using powerful supercomputers to process large labeled data sets (with expert-provided outputs for the training set) and applying gradient-based methods that find patterns in those data sets, patterns that can be used to make predictions or to try to find structure inside the data.

This contrasts with the fact that an important part of our knowledge of the world is acquired through interaction, without an external teacher telling us what the outcomes of every single action we take will be. Humans are able to discover solutions to new problems from interaction and experience, acquiring knowledge about the world by actively exploring it.

For this reason, current approaches study the problem of learning from interaction with simulated environments through the lens of Deep Reinforcement Learning (DRL), a computational approach to goal-directed learning from interaction that does not rely on expert supervision. That is, a Reinforcement Learning Agent must interact with an Environment to generate its own training data.

This motivates interacting with multiple instances of an Environment in parallel, to generate more experience to learn from, faster. This has led to the widespread use of increasingly large-scale distributed and parallel systems in RL training, which introduces numerous engineering and algorithmic challenges that can be addressed by the frameworks we are talking about.
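
In RLlib, for example, spinning up parallel copies of the Environment is a configuration change rather than new infrastructure code. Here is a sketch, again assuming the classic config-dictionary API, in which "num_workers" and "num_envs_per_worker" control parallel sampling:

    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init()

    # Each rollout worker is a separate Ray actor with its own copy of the
    # environment, collecting experience in parallel with the others.
    trainer = PPOTrainer(
        env="CartPole-v0",
        config={
            "num_workers": 4,          # 4 parallel sampling actors
            "num_envs_per_worker": 2,  # each worker steps 2 environment copies
        },
    )

The learner then consumes the experience gathered by all the workers; that distributed plumbing is exactly the kind of engineering these frameworks take off our hands.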

Open source to the rescue

In recent years, frameworks such as TensorFlow or PyTorch (we have spoken extensively about both in this blog) have arisen to help turn pattern recognition into a commodity, making deep learning easier for practitioners to try and use.

A similar pattern is beginning to play out in the Reinforcement Learning arena. We are beginning to see the emergence of many open-source libraries and tools that address this, both by helping to create new pieces (instead of writing them from scratch) and, above all, by making it easy to combine various prebuilt algorithmic components. As a result, these Reinforcement Learning frameworks help engineers by creating higher-level abstractions of the core components of an RL algorithm. In summary, this makes code easier to develop, more comfortable to read, and improves efficiency.

In this post, I provide some notes about the most popular RL frameworks available. I think readers will benefit from using code from an already-established framework or library. At the time of writing this post, I
