The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games
https://arxiv.org/abs/2103.01955
Abstract
Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO’s empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.
1 Introduction
Recent advances in reinforcement learning (RL) and multi-agent reinforcement learning (MARL) have led to a great deal of progress in creating artificial agents which can cooperate to solve tasks: DeepMind’s AlphaStar surpassed professional-level performance in StarCraft II [35], OpenAI Five defeated the world champion in Dota 2 [4], and OpenAI demonstrated the emergence of human-like tool-use agent behaviors via multi-agent learning [2]. These notable successes were driven largely by on-policy RL algorithms such as IMPALA [10] and PPO [30, 4], which were often coupled with distributed training systems to utilize massive amounts of parallelism and compute. In the aforementioned works, tens of thousands of CPU cores and hundreds of GPUs were utilized to collect and train on an extraordinary volume of training samples. This is in contrast to recent academic progress and literature in MARL, which has largely focused on developing off-policy learning frameworks such as MADDPG [22] and value-decomposed Q-learning [32, 27]; methods in these frameworks have yielded state-of-the-art results on a wide range of multi-agent benchmarks [36, 37].
In this work, we revisit the use of Proximal Policy Optimization (PPO) – an on-policy algorithm popular in single-agent RL but under-utilized in recent MARL literature – in multi-agent settings. We hypothesize that the relative lack of PPO in multi-agent settings can be attributed to two related factors: first, the belief that PPO is less sample-efficient than off-policy methods and is correspondingly less useful in resource-constrained settings, and second, the fact that common implementation and hyperparameter tuning practices when using PPO in single-agent settings often do not yield strong performance when transferred to multi-agent settings.
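For reference, the objective being revisited is the standard PPO clipped surrogate loss of Schulman et al. [30]. The sketch below restates it in a few lines of PyTorch; the function and variable names are illustrative only and are not taken from the paper's released code.

```python
# Minimal sketch of the standard PPO clipped surrogate loss (Schulman et al. [30]).
# Names are illustrative; this is not the paper's released implementation.
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss, averaged over a batch of transitions.

    log_probs_new: log pi_theta(a_t | o_t) under the current policy
    log_probs_old: log pi_theta_old(a_t | o_t) recorded when the data was collected
    advantages:    advantage estimates A_t (e.g., from GAE)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                            # negate to minimize

# Toy usage with random tensors
if __name__ == "__main__":
    lp_old = torch.randn(64)
    lp_new = lp_old + 0.01 * torch.randn(64)
    adv = torch.randn(64)
    print(ppo_clip_loss(lp_new, lp_old, adv))
```

In the multi-agent variants studied here, each agent's actions are scored with this same surrogate, while the value function that produces the advantages can condition on more than the agent's local observation; the input representation to the value function is one of the factors examined later in the paper.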
We conduct a comprehensive empirical study to examine the performance of PPO on four popular cooperative multi-agent benchmarks: the multi-agent particle-world environments (MPE) [22], the StarCraft multi-agent challenge (SMAC) [28], Google Research Football (GRF) [19], and the Hanabi challenge [3]. We first show that, when compared to off-policy baselines, PPO achieves strong task performance and competitive sample efficiency. We then identify five implementation factors and hyperparameters which are particularly important for PPO’s performance, offer concrete suggestions for configuring these factors, and provide intuition as to why these suggestions hold.
Our aim in this work is not to propose a novel MARL algorithm, but instead to empirically demonstrate that with simple modifications, PPO can achieve strong performance in a wide variety of cooperative multi-agent settings. We additionally believe that our suggestions will assist practitioners in achieving competitive results with PPO.
Our contributions are summarized as follows:
- We demonstrate that PPO, without any domain-specific algorithmic changes or architectures and with minimal tuning, achieves final performance competitive with off-policy methods on four multi-agent cooperative benchmarks.
- We demonstrate that PPO obtains these strong results while using a comparable number of samples to many off-policy methods.
- We identify and analyze five implementation and hyperparameter factors that govern the practical performance of PPO in these settings, and offer concrete suggestions as to best practices regarding these factors.
2 Related Works
MARL algorithms generally fall between two frameworks: centralized and decentralized learning. Centralized methods [6] directly learn a single policy to produce the joint actions of all agents. In decentralized learning [21], each agent optimizes its reward independently; these methods can tackle general-sum games but may suffer from instability even in simple matrix games [12]. Centralized training and decentralized execution (CTDE) algorithms fall in between these two frameworks. Several past CTDE methods [22, 11] adopt actor-critic structures and learn a centralized critic which takes global information as input. Value-decomposition (VD) methods are another class of CTDE algorithms which represent the joint Q-function as a function of agents’ local Q-functions [32, 27, 31] and have established state-of-the-art results in popular MARL benchmarks [37, 36].
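For concreteness, the value-decomposition constraint mentioned above takes different forms in the cited works: VDN [32] sums the agents' local Q-functions, while QMIX [27] combines them through a state-conditioned mixing network restricted to be monotone in each local Q-value. The equations below restate these forms from the original papers; the notation is chosen here rather than taken from this paper.

```latex
% Additive decomposition used by VDN [32]:
\[
  Q_{\mathrm{tot}}(\mathbf{o}, \mathbf{a}) = \sum_{i=1}^{N} Q_i(o_i, a_i)
\]
% Monotonic mixing used by QMIX [27], with a mixing network f_mix conditioned on the global state s:
\[
  Q_{\mathrm{tot}}(\mathbf{o}, \mathbf{a}) = f_{\mathrm{mix}}\!\big(Q_1(o_1, a_1), \ldots, Q_N(o_N, a_N);\, s\big),
  \qquad \frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \ge 0 \quad \text{for all } i.
\]
```

Either constraint ensures that greedy actions chosen independently by each agent with respect to its local Q_i also maximize the joint Q_tot, which is what permits decentralized execution after centralized training.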
In single-agent continuous control tasks [8], advances in off-policy methods such as SAC [13] led to a consensus that despite their early success, policy gradient (PG) algorithms such as PPO are less sample efficient than off-policy methods. Similar conclusions have been drawn in multi-agent domains: [25] report that multi-agent PG methods such as COMA are outperformed by MADDPG and QMix [27] by a clear margin in the particle-world environment [23] and the StarCraft multi-agent challenge [28].
The use of PPO in multi-agent domains is studied by several concurrent works. [7