Paper: Nash Q-learning for general-sum stochastic games
Link: http://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf
Abstract:
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
In short: this paper extends Q-learning to a noncooperative multiagent setting (the relationships between agents can be competitive, cooperative, or a mix of both, e.g., a supermarket owner and its customers), built on the framework of general-sum stochastic games. Each agent maintains Q-functions over joint actions and updates them by assuming Nash equilibrium behavior over the current Q-values; under certain restrictions on the stage games that arise during learning, this protocol provably converges. The paper compares two grid games: learning consistently converges in the first, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In offline learning experiments on both games, agents are more likely to reach a joint optimal path with Nash Q-learning than with single-agent Q-learning. The paper also implements an online version of Nash Q-learning that balances exploration with exploitation.
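For reference, the update that agent $i$ performs after observing the joint action and rewards can be written roughly as follows (a paraphrase in standard notation: $\alpha_t$ is the learning rate, $\beta$ the discount factor, $r^i_t$ agent $i$'s reward, and $\mathrm{NashQ}^i_t(s')$ is agent $i$'s payoff in a Nash equilibrium of the stage game defined by all agents' current Q-values at the next state $s'$):

$$Q^i_{t+1}(s, a^1, \ldots, a^n) = (1-\alpha_t)\,Q^i_t(s, a^1, \ldots, a^n) + \alpha_t\big[r^i_t + \beta\,\mathrm{NashQ}^i_t(s')\big]$$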
Keywords: Reinforcement Learning, Q-learning, Multiagent Learning
1. Introduction
Reinforcement learning allows agents to act while they learn, i.e., to learn and work at the same time, without a model of the environment being known in advance. In a typical multiagent system, each agent lacks complete information about the other agents, so the multiagent environment keeps changing as the agents learn about one another and adapt their behaviors accordingly.
There are two reasons why single-agent Q-learning cannot simply be applied to each agent independently: (1) an environment containing multiple learning agents is no longer stationary, so the convergence theory for single-agent Q-learning no longer applies; (2) the non-stationarity of the environment is not produced by an arbitrary stochastic process, but by other agents that are regular in certain important ways. Multiagent reinforcement learning is therefore not just a matter of running single-agent Q-learning for every agent; a new formulation is required.
To extend Q-learning to the multiagent setting, the paper adopts the framework of general-sum stochastic games. In a stochastic game, each agent's reward depends on the joint action of all agents and the current state, and state transitions obey the Markov property. In a purely competitive relationship the agents' rewards are negatively correlated, whereas in a cooperative relationship they are positively correlated.
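To make the joint-action Q representation and the Nash-based update concrete, here is a minimal two-player sketch in Python (my own illustration, not the paper's code; the names `pure_nash` and `nash_q_update` are made up, and for brevity only pure-strategy stage-game equilibria are searched, whereas the paper's algorithm computes mixed-strategy equilibria):

```python
import numpy as np

def pure_nash(Q1, Q2):
    """Find a pure-strategy Nash equilibrium of the stage game (Q1, Q2),
    where Qk[a1, a2] is agent k's Q-value for the joint action (a1, a2).
    Returns a joint action, or None if no pure equilibrium exists
    (the paper handles mixed equilibria in that case)."""
    n1, n2 = Q1.shape
    for a1 in range(n1):
        for a2 in range(n2):
            best1 = Q1[a1, a2] >= Q1[:, a2].max()   # a1 is a best response to a2
            best2 = Q2[a1, a2] >= Q2[a1, :].max()   # a2 is a best response to a1
            if best1 and best2:
                return a1, a2
    return None

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.5, beta=0.99):
    """One Nash-Q update for both agents after observing the joint action
    (a1, a2), rewards (r1, r2), and the next state s_next.
    Qk maps each state to an (n1 x n2) matrix over joint actions."""
    eq = pure_nash(Q1[s_next], Q2[s_next])
    if eq is None:
        eq = (0, 0)   # simplifying fallback used only in this sketch
    nash_q1, nash_q2 = Q1[s_next][eq], Q2[s_next][eq]
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * nash_q1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * nash_q2)
```

Note that the update requires the Q-values of both agents at the next state: in the paper, each agent observes the other agents' actions and rewards and therefore maintains its own copy of every agent's Q-function, which is what makes the NashQ term computable.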
The basic solution concept for general-sum stochastic games is the Nash equilibrium: each player effectively holds correct expectations about the other players' behaviors, and acts rationally with respect to these expectations.
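Formally (in notation close to the paper's), a joint strategy $(\pi^1_*, \ldots, \pi^n_*)$ is a Nash equilibrium of the stochastic game if no agent can improve its own discounted value by deviating unilaterally, i.e., for every state $s$ and every agent $i$,

$$v^i(s, \pi^1_*, \ldots, \pi^n_*) \;\ge\; v^i(s, \pi^1_*, \ldots, \pi^{i-1}_*, \pi^i, \pi^{i+1}_*, \ldots, \pi^n_*) \quad \text{for all } \pi^i \in \Pi^i,$$

where $v^i$ denotes agent $i$'s expected discounted value and $\Pi^i$ its strategy space.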