[RL 12] Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

This post surveys the theories and algorithms of multi-agent reinforcement learning, covering learning methods for homogeneous and networked agents in the cooperative setting, and learning strategies under partial observability.


To be continued
paper: Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

4 MARL Algorithms with Theory

Notation

  1. Markov Game = Stochastic Game
  2. Multi-Agent MDP = Markov Teams

4.1 Cooperative Setting

4.1.1 Homogeneous Agents

Multi-Agent MDP & Markov Teams

R1 = R2 = … = R

  1. Szepesvári and Littman (1999); Littman (2001)
    1. Perform the standard Q-learning update (2.1) at each agent, but take the max over the joint action space (see the tabular sketch after this list).
    2. Convergence to the optimal/equilibrium Q-function has been established in Szepesvári and Littman (1999); Littman (2001) when both state and action spaces are finite.
    3. Convergence to the NE policy is only guaranteed if either the equilibrium is assumed to be unique (Littman, 2001), or the agents are coordinated for equilibrium selection. This is because, when the equilibrium policies are non-unique and the agents fail to agree on which one to select, an arbitrary combination of equilibrium policies extracted at each agent may not itself constitute an equilibrium policy.
  2. equilibrium selection
    1. cooperative repeated games (stateless) setting (Claus and Boutilier, 1998)
      1. convergence to the equilibrium point is claimed in Claus and Boutilier (1998), but without a formal proof
    2. equilibrium selection in Markov teams is explored in Wang and Sandholm (2003)
      1. optimal adaptive learning (OAL), the first MARL algorithm with provable convergence to the equilibrium policy
  3. address the scalability issue
    1. independent Q-learning may fail to converge (Tan, 1993)
    2. Lauer and Riedmiller (2000), which advocates a distributed Q-learning algorithm
      1. converges for deterministic finite MMDPs.
      2. Q or policy?
    3. heuristics
      1. regarding either reward or value function factorization have been proposed to mitigate the scalability issue (Guestrin et al., 2002a,b; Kok and Vlassis, 2004; Sunehag et al., 2018, VDN; Rashid et al., 2018, QMIX)
    4. Son et al. (2019) QTRAN
      1. provides a rigorous characterization of conditions that justify this value factorization idea.
    5. Qu and Li (2019)
      1. imposes a special dependence structure, i.e., a one-directional tree, so that the (near-)optimal policy of the overall MMDP can be provably well-approximated by local policies.
    6. Yongacoglu et al. (2019) decentralized RL algorithm
      1. guaranteed to converge to team-optimal equilibrium policies, not just any equilibrium policies
    7. Perolat et al. (2018)
      1. to date, the only convergence guarantee for MMDPs using policy-based methods
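As a concrete illustration of item 1 above (a minimal sketch, not taken from the paper: the `env.reset`/`env.step` interface and all hyperparameters are assumptions), here is tabular Q-learning for a Markov team with a shared reward, where the max is taken over the joint action space:

```python
import itertools
import random
from collections import defaultdict

def joint_action_q_learning(env, n_agents, n_actions,
                            alpha=0.1, gamma=0.95, eps=0.1, episodes=1000):
    """Tabular Q-learning for a Markov team (shared reward R1 = ... = RN).

    A single Q-table over (state, joint_action) is kept; since every agent
    observes the same reward, each agent could maintain an identical copy.
    `env` is assumed to expose reset() -> state and
    step(joint_action) -> (next_state, reward, done).
    """
    joint_actions = list(itertools.product(range(n_actions), repeat=n_agents))
    Q = defaultdict(float)  # Q[(state, joint_action)], defaults to 0

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the *joint* action space
            if random.random() < eps:
                a = random.choice(joint_actions)
            else:
                a = max(joint_actions, key=lambda ja: Q[(s, ja)])
            s_next, r, done = env.step(a)
            # standard Q-learning update, with the max taken over joint actions
            target = r + gamma * max(Q[(s_next, ja)] for ja in joint_actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```

Even when all agents compute the same Q-table this way, each agent still extracts its own component of a greedy joint action locally; if the maximizer is non-unique, agents may break ties differently, which is exactly the equilibrium-selection issue discussed above.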
Markov Potential Games
  1. conception
    1. potential games
      1. if any agent changes its policy unilaterally, the change in its reward equals (or is proportional to) the change in the potential function (see the formula after this list)
      2. stateless
    2. Markov potential games (MPGs)
      1. stateful
    3. relationship between MPGs and MMDPs/Markov teams
      1. MMDPs/Markov teams constitute a particular case of MPGs, with the potential function being the common reward
      2. MPGs can also be viewed as being strategically equivalent to Markov teams
  2. Valcarcel Macua et al. (2018)
    1. provides verifiable conditions for a Markov game to be an MPG
    2. and shows the equivalence between finding closed-loop NE in MPGs and solving a single-agent optimal control problem
    3. Hence, single-agent RL algorithms can then be applied to solve this MARL problem.
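To pin down the unilateral-deviation condition referenced above (a minimal statement for the stateless case, with notation assumed here rather than copied from the paper):

```latex
% Exact potential property: for every agent i, any two of its policies
% \pi_i, \pi_i', and any fixed policy profile \pi_{-i} of the other agents,
R_i(\pi_i, \pi_{-i}) - R_i(\pi_i', \pi_{-i})
  \;=\; \Phi(\pi_i, \pi_{-i}) - \Phi(\pi_i', \pi_{-i}).
```

Taking Φ = R, the common reward, satisfies this identity trivially, which is why MMDPs/Markov teams form a particular case of MPGs; in the Markov (stateful) version, roughly the same identity is imposed on the state-dependent value functions.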
Mean-Field Regime
  1. conception
    1. aims at tackling the scalability issue when there is an extremely large number of homogeneous agents
    2. each agent's effect on the overall multi-agent system thus becomes infinitesimal, so that all agents are interchangeable/indistinguishable
  2. mean-field
    1. mean-field quantity
      1. e.g., the average state, or the empirical distribution of states.
      2. Each agent only needs to find the best response to the mean-field, which considerably simplifies the analysis.
    2. mean-field views of multi-agent systems (not RL per se)
      1. mean-field games (MFGs) model
      2. team model with mean-field sharing
      3. the game model with mean-field actions
  3. MARL x mean-field
    1. Subramanian et al. (2018) studies RL for Markov teams with mean-field sharing
      1. common reward function depending only on the local state and the mean-field
      2. Based on the dynamic programming decomposition for the specified model (Arabneydi and Mahajan, 2014), several popular RL algorithms are easily translated to address this setting
    2. approach the problem from a mean-field control (MFC) model
      1. Policy gradient methods are proved to converge for linear quadratic MFCs in Carmona et al. (2019a),
      2. and mean-field Q-learning is then shown to converge for general MFCs (Carmona et al., 2019b).
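To make the mean-field idea concrete, here is a minimal tabular sketch (not the algorithm of Carmona et al. (2019b); names and signatures are illustrative): the mean-field quantity is the empirical distribution of agents' states, and each agent's Q-function conditions only on its local state and that distribution.

```python
from collections import Counter, defaultdict

def empirical_state_distribution(states):
    """Mean-field quantity: empirical distribution of all agents' states,
    returned as a hashable, order-independent tuple of (state, frequency)."""
    counts = Counter(states)
    n = len(states)
    return tuple(sorted((s, c / n) for s, c in counts.items()))

def mean_field_q_update(Q, state, mf, action, reward, next_state, next_mf,
                        n_actions, alpha=0.1, gamma=0.95):
    """One tabular update of Q(s, mu, a): the agent best-responds to the
    mean-field mu, treating it as part of the (frozen) environment."""
    best_next = max(Q[(next_state, next_mf, a)] for a in range(n_actions))
    key = (state, mf, action)
    Q[key] += alpha * (reward + gamma * best_next - Q[key])

# usage: Q = defaultdict(float); because agents are interchangeable, a single
# table can be shared by all of them.
```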

4.1.2 Decentralized Paradigm with Networked Agents

Settings:

  1. team-average reward
    1. multi-agent systems are not always homogeneous: agents may have different reward functions, while the collective goal is to maximize the long-term team-average reward
    2. This setting finds broad applications in engineering systems …
    3. With a central controller, most MARL algorithms reviewed in §4.1.1 directly apply
      1. since the controller can collect and average the rewards, and distribute the information to all agents.
      2. Nonetheless, such a controller may not exist in most aforementioned applications, due to either cost, scalability, or robustness concerns
  2. decentralized/distributed paradigm
    1. agents may be able to share/exchange information with their neighbors over a possibly time-varying and sparse communication network
    2. this decentralized paradigm is relatively less investigated; most prior work studies static/one-stage distributed optimization problems (see the averaging sketch below)
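The reason a central controller can be dispensed with is that a team-average quantity (e.g., the average reward) can be approximated by repeated local averaging with neighbors over the communication network. A minimal sketch, assuming a doubly-stochastic weight matrix W consistent with a sparse graph:

```python
import numpy as np

def consensus_average(local_values, W, num_rounds=50):
    """Each agent repeatedly replaces its value with a weighted average of its
    neighbors' values. If W is doubly stochastic and the graph is connected,
    every entry converges to the team average, with no central controller."""
    x = np.array(local_values, dtype=float)
    for _ in range(num_rounds):
        x = W @ x  # one round of neighbor-weighted averaging
    return x

# toy example: 4 agents on a ring, each communicating only with its 2 neighbors
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
rewards = [1.0, 0.0, 2.0, 3.0]
print(consensus_average(rewards, W))  # every entry approaches the average 1.5
```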
Learning Optimal Policy
  1. QD-learning algorithm Kar et al. (2013)
    1. the first provably convergent MARL algorithm under this setting
    2. incorporates the idea of consensus + innovation to the standard Q-learning algorithm
    3. guaranteed to converge to the optimal Q-function in the tabular setting (a simplified sketch of the consensus + innovation update follows after this list)
  2. Zhang et al. (2018) actor-critic algorithms
    1. ideas
      1. local PG actor step
      2. local TD critic step with a neighbor-weighted consensus update
    2. an advantage actor-critic variant is also provided
    3. almost sure convergence is established when linear functions are used for value function approximation
  3. Zhang et al. (2018a) continuous spaces
    1. on policy
      1. in the multi-agent setting, as the policies of other agents are unknown, the common off-policy approach for DPG (Silver et al., 2014, §4.2) does not apply.
    2. expected policy gradient (EPG)
      1. (EPG) method (Ciosek and Whiteson, 2018) which unifies stochastic PG (SPG) and DPG
    3. Convergence of the algorithm is then also guaranteed when linear function approximation is used
  4. off-policy
    1. Suttle et al. (2019) considers the extension of Zhang et al. (2018) to an off-policy setting
      1. building upon the emphatic temporal differences (ETD) method for the critic
    2. Zhang and Zavlanos (2019) off-policy
      1. a local critic and a consensus actor
  5. multi-task
    1. conception
      1. a simplified version of the multi-agent setting: each agent faces an independent MDP that is not affected by the other agents, while the goal is still to learn the joint policy that maximizes the average reward of all agents
    2. Pennesi and Paschalidis (2010)
      1. a local TD-based critic step, followed by a consensus-based actor step
      2. The gradient of the average return is proved to converge to zero as the number of iterations goes to infinity
    3. Zhang et al. (2018)
      1. Diff-DAC, another distributed actor-critic algorithm for this setting, from duality theory
      2. instance of the dual ascent method for solving a linear program.
  6. finite-sample analyses in this setting with more general function approximation
    1. limitations
      1. all the aforementioned convergence guarantees are asymptotic, i.e., the algorithms converge only as the number of iterations goes to infinity,
      2. and are restricted to the case of linear function approximation.
    2. Zhang et al. (2018b)
      1. considers batch RL algorithms (Lange et al., 2012)
        1. i.e., decentralized variants of fitted Q-iteration (FQI) (Riedmiller, 2005; Antos et al., 2008), in both the
        2. cooperative setting with networked agents,
          1. global Q-function estimate, obtained by fitting nonlinear least squares
          2. all agents cooperate to find a common Q-function estimate by solving Eq. (4.3); its global optimal solution can be achieved by the algorithms therein if F makes … convex for each i, which is indeed the case if F is a linear function class
          3. with only a finite number of iterations of the distributed optimization algorithms (common in practice), agents may not reach exact consensus, leaving each agent's Q-function estimate with an error away from the actual optimum of (4.3)
          4. error propagation analysis
            1. to establish the finite-sample performance of the proposed algorithms
        3. and the competitive setting with two teams of such networked agents
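To illustrate the consensus + innovation structure of QD-learning mentioned at the top of this list (a heavily simplified sketch with constant step sizes rather than the decaying weight sequences analyzed by Kar et al. (2013); names and shapes are illustrative): each agent mixes its Q-estimate toward its neighbors' (consensus) and adds a local TD term built from its own reward (innovation).

```python
import numpy as np

def qd_learning_step(Q, transitions, neighbors, beta=0.05, alpha=0.1, gamma=0.95):
    """One synchronous update for all agents.

    Q           : array of shape (N, S, A); Q[i] is agent i's estimate
    transitions : list of (s, a, r_i, s_next), one per agent, for the same
                  global state and joint action but agent-specific rewards r_i
    neighbors   : dict mapping agent i to the list of its neighbors
    """
    N = Q.shape[0]
    Q_new = Q.copy()
    for i in range(N):
        s, a, r_i, s_next = transitions[i]
        # consensus: pull agent i's estimate toward its neighbors' estimates
        consensus = sum(Q[j, s, a] - Q[i, s, a] for j in neighbors[i])
        # innovation: local TD error computed with agent i's own reward
        innovation = r_i + gamma * Q[i, s_next].max() - Q[i, s, a]
        Q_new[i, s, a] = Q[i, s, a] + beta * consensus + alpha * innovation
    return Q_new
```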
Policy Evaluation
  1. The use of the mean squared projected Bellman error (MSPBE) as the objective is standard in multi-agent policy evaluation (see the formula below)
  2. the idea of saddle-point reformulation has been adopted in …
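For reference, with linear value-function approximation V_θ = Φθ, stationary-distribution matrix D, Bellman operator T^π, and Π the projection onto the span of the features Φ (notation assumed here, not copied from the paper), the objective has the standard form:

```latex
\mathrm{MSPBE}(\theta)
  \;=\; \bigl\lVert \Pi \bigl( T^{\pi} \Phi\theta - \Phi\theta \bigr) \bigr\rVert_{D}^{2},
\qquad
\Pi \;=\; \Phi \,(\Phi^{\top} D \,\Phi)^{-1} \Phi^{\top} D .
```

Rewriting the squared weighted norm via its Fenchel conjugate introduces a dual variable per agent, which is how decentralized MSPBE minimization becomes the convex-concave saddle-point problems referred to in item 2.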
Other Learning Goals
  1. optimal consensus problem
    1. Zhang et al. (2016)
      1. each agent over the network tracks the states of its neighbors as well as those of a leader, so that the consensus error is minimized by the joint policy
    2. Zhang et al. (2018)
      1. under the name of cooperative multi-agent graphical games
      2. centralized-critic, decentralized-actor
      3. off-policy
  2. Communication efficiency
    1. Chen et al. (2018)
      1. distributed PG algorithm that reduces the communication rounds between the agents and a central controller
    2. Ren and Haupt (2019)
      1. addresses the same policy evaluation problem as Wai et al. (2018), and develops a hierarchical distributed algorithm
    3. Lin et al. (2019)
      1. each agent transmits only one scaled entry of its state vector at each iteration

4.1.3 Partially Observed

  1. Settings
    1. modeled by a decentralized POMDP (Dec-POMDP)
    2. shares almost all elements, such as the reward function and the transition model, with the MMDP/Markov team model in §2.2.1, except that each agent now only has local observations of the system state s
    3. most of the algorithms are based on a centralized-learning-decentralized-execution scheme
  2. concepts
    1. Finite-state controllers (FSCs)
      1. which map local observation histories to actions (see the toy controller sketch after this list)
  3. computational efficiency
    1. Monte-Carlo sampling
    2. Monte-Carlo tree search
    3. policy gradient based algorithm
  4. decentralized learning
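As a toy illustration of a finite-state controller (all names and observations below are made up for the example): the agent keeps a small internal node that summarizes its local observation history, acts based on the node, and updates the node from each new local observation.

```python
class FiniteStateController:
    """A finite-state controller (FSC) for one agent in a Dec-POMDP: instead of
    mapping the full observation history to an action, the agent keeps a finite
    internal node that summarizes the history."""

    def __init__(self, action_rule, transition_rule, start_node=0):
        self.action_rule = action_rule          # node -> action
        self.transition_rule = transition_rule  # (node, observation) -> node
        self.node = start_node

    def act(self):
        return self.action_rule[self.node]

    def observe(self, observation):
        self.node = self.transition_rule[(self.node, observation)]

# toy 2-node controller: wait until a "clear" observation has been seen
fsc = FiniteStateController(
    action_rule={0: "wait", 1: "go"},
    transition_rule={(0, "blocked"): 0, (0, "clear"): 1,
                     (1, "blocked"): 0, (1, "clear"): 1},
)
fsc.observe("clear")
print(fsc.act())  # -> "go"
```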