Paper: Nash Q-learning for general-sum stochastic games
Link: http://www.jmlr.org/papers/volume4/hu03a/hu03a.pdf
Abstract:
We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
In short: this paper extends Q-learning to a noncooperative multiagent setting (the relationships between agents can be competitive, cooperative, or a mix of both, e.g., a supermarket owner and its customers), built on the framework of general-sum stochastic games. Each agent maintains Q-functions over joint actions and updates them by assuming Nash equilibrium behavior over the current Q-values; under certain restrictions on the stage games that arise during learning, this protocol provably converges. The paper compares two grid games: learning consistently converges in the first, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In offline learning experiments on both games, agents are more likely to reach a joint optimal path with Nash Q-learning than with single-agent Q-learning. The paper also implements an online version of Nash Q-learning that balances exploration with exploitation.
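For reference, the update that agent $i$ performs after observing the joint action and rewards can be written roughly as follows (a paraphrase in standard notation: $\alpha_t$ is the learning rate, $\beta$ the discount factor, $r^i_t$ agent $i$'s reward, and $\mathrm{NashQ}^i_t(s')$ is agent $i$'s payoff in a Nash equilibrium of the stage game defined by all agents' current Q-values at the next state $s'$):

$$Q^i_{t+1}(s, a^1, \ldots, a^n) = (1-\alpha_t)\,Q^i_t(s, a^1, \ldots, a^n) + \alpha_t\big[r^i_t + \beta\,\mathrm{NashQ}^i_t(s')\big]$$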
Keywords: Reinforcement Learning, Q-learning, Multiagent Learning
1. Introduction
Reinforcement learning allows agents to act while they learn, i.e., to learn and work at the same time, without a model of the environment being known in advance. In a typical multiagent system, each agent lacks complete information about the other agents, so the multiagent environment keeps changing as the agents learn about one another and adapt their behaviors accordingly.
There are two reasons why single-agent Q-learning cannot simply be applied to each agent independently: (1) an environment containing multiple learning agents is no longer stationary, so the convergence theory for single-agent Q-learning no longer applies; (2) the non-stationarity of the environment is not produced by an arbitrary stochastic process, but by other agents that are regular in certain important ways. Multiagent reinforcement learning is therefore not just a matter of running single-agent Q-learning for every agent; a new formulation is required.
To extend Q-learning to the multiagent setting, the paper adopts the framework of general-sum stochastic games. In a stochastic game, each agent's reward depends on the joint action of all agents and the current state, and state transitions obey the Markov property. In a purely competitive relationship the agents' rewards are negatively correlated, whereas in a cooperative relationship they are positively correlated.
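To make the joint-action Q representation and the Nash-based update concrete, here is a minimal two-player sketch in Python (my own illustration, not the paper's code; the names `pure_nash` and `nash_q_update` are made up, and for brevity only pure-strategy stage-game equilibria are searched, whereas the paper's algorithm computes mixed-strategy equilibria):

```python
import numpy as np

def pure_nash(Q1, Q2):
    """Find a pure-strategy Nash equilibrium of the stage game (Q1, Q2),
    where Qk[a1, a2] is agent k's Q-value for the joint action (a1, a2).
    Returns a joint action, or None if no pure equilibrium exists
    (the paper handles mixed equilibria in that case)."""
    n1, n2 = Q1.shape
    for a1 in range(n1):
        for a2 in range(n2):
            best1 = Q1[a1, a2] >= Q1[:, a2].max()   # a1 is a best response to a2
            best2 = Q2[a1, a2] >= Q2[a1, :].max()   # a2 is a best response to a1
            if best1 and best2:
                return a1, a2
    return None

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.5, beta=0.99):
    """One Nash-Q update for both agents after observing the joint action
    (a1, a2), rewards (r1, r2), and the next state s_next.
    Qk maps each state to an (n1 x n2) matrix over joint actions."""
    eq = pure_nash(Q1[s_next], Q2[s_next])
    if eq is None:
        eq = (0, 0)   # simplifying fallback used only in this sketch
    nash_q1, nash_q2 = Q1[s_next][eq], Q2[s_next][eq]
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * nash_q1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * nash_q2)
```

Note that the update requires the Q-values of both agents at the next state: in the paper, each agent observes the other agents' actions and rewards and therefore maintains its own copy of every agent's Q-function, which is what makes the NashQ term computable.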
The basic solution concept for general-sum stochastic games is the Nash equilibrium: each player effectively holds correct expectations about the other players' behaviors, and acts rationally with respect to these expectations.
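Formally (in notation close to the paper's), a joint strategy $(\pi^1_*, \ldots, \pi^n_*)$ is a Nash equilibrium of the stochastic game if no agent can improve its own discounted value by deviating unilaterally, i.e., for every state $s$ and every agent $i$,

$$v^i(s, \pi^1_*, \ldots, \pi^n_*) \;\ge\; v^i(s, \pi^1_*, \ldots, \pi^{i-1}_*, \pi^i, \pi^{i+1}_*, \ldots, \pi^n_*) \quad \text{for all } \pi^i \in \Pi^i,$$

where $v^i$ denotes agent $i$'s expected discounted value and $\Pi^i$ its strategy space.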