PyMARL框架学习（二）——main.py & run.py & learners & modules

原创已于 2024-11-13 14:02:04 修改 · 2k 阅读

28 ·

CC 4.0 BY-SA版权

文章标签：

#学习 #python #深度学习

于 2024-09-24 14:46:16 首次发布

PyMARL框架学习系列目录

第一章 PyMARL框架学习 runners & envs & controllers & utils & components以及环境部署和训练问题
第二章 PyMARL框架学习 main.py & run.py & learners & modules

一、main.py

main.py是PyMARL框架的项目入口点，负责设置实验配置并启动实验流程。
其主要功能包括：

初始化配置和日志记录：通过sacred库管理实验流程，创建日志记录器，并定义结果保存路径。该文件会捕获和管理实验输出，确保标准输出和日志信息正确记录。
读取并合并配置文件：从多个配置文件中加载算法和环境的配置，使用 _get_config函数读取YAML文件中的内容，并通过recursive_dict_update合并默认配置、环境配置和算法配置。
设置随机种子：为了保证实验的可重复性，main.py中设置了NumPy和PyTorch的随机种子。
启动实验：调用ex.run_commandline(params)启动实验流程，调用run()函数执行主要的MARL训练过程。

1.引入库

import numpy as np
import os
import collections
from os.path import dirname, abspath
from copy import deepcopy
from sacred import Experiment, SETTINGS
from sacred.observers import FileStorageObserver
from sacred.utils import apply_backspaces_and_linefeeds
import sys
import torch as th
from utils.logging import get_logger
import yaml
from run import run

numpy(np)：numpy是经典的用于科学计算的库，尤其在处理多维数组和矩阵。在main.py中主要用于生成随机数np.random.seed()和其他数值计算相关的操作。
os：用于操作系统相关的功能，例如路径操作、文件管理等。在main.py中，os.path.join用于构建文件路径，os.path.dirname和os.path.abspath分别用于获取文件的目录名和绝对路径。
collections：包含高效的数据结构。在main.py的recursive_dict_update()方法中，collections.Mapping被用来判断字典的嵌套层次，用于递归地更新配置。
copy：用于深拷贝复杂的对象，包括嵌套的列表和字典，确保拷贝后的对象和原对象之间没有引用关系。在main.py中，用于复制配置字典（config_copy）。
sacred：sacred库函数不会在本博客详细介绍，感兴趣的朋友可以访问以下链接，参考大佬的介绍：Sacred 教程；同样，有关Python装饰器的内容也不在此论述，参考大佬的介绍：Python 装饰器
。我们只需要简单知道Sacred是一个Python库, 可以帮助我们配置、组织、记录和复制实验。其作用是保存实验中的一系列重要信息与结果。Experiment用于定义实验对象，SETTINGS用于配置捕获实验输出的模式。
sacred.observers.FileStorageObserver：是sacred的一个观察者类，用于将实验结果和日志保存到磁盘，确保实验过程中的数据可以被持久化。
sacred.utils.apply_backspaces_and_linefeeds：是sacred中处理日志的类。处理日志中的回车和换行符，确保输出日志格式整洁，避免字符错乱。
sys：提供与Python解释器交互的接口，常用于处理命令行参数。在main.py中，它通过sys.argv获取命令行参数列表，并传递给实验运行函数。
torch：torch是PyTorch的核心库，用于深度学习和张量计算。在main.py中，th.manual_seed()被用来设置PyTorch的随机种子，确保实验结果的可重复性。
utils.logging.get_logger：从自定义的utils.logging模块中导入get_logger函数，用于初始化日志记录器。它负责输出实验中的日志信息。
yaml：yaml库用于解析YAML格式的配置文件。在main.py中，用于加载实验的默认配置、算法配置和环境配置

2.创建sacred对象

SETTINGS['CAPTURE_MODE'] = "fd"                # "fd" 表示通过文件描述符来捕获标准输出（stdout）和标准错误（stderr）。
                                               # 这样，所有的打印输出和错误信息将被写入文件，而不是直接显示在控制台。如果设置为 "no"，则标准输出和标准错误将直接显示在控制台。
logger = get_logger()                          # 创建一个日志记录器实例，get_logger() 是一个自定义的函数，用于获取配置好的日志记录器。这个记录器负责在实验过程中记录重要的日志信息。记录器的配置通常包括日志级别、日志格式、日志输出位置（例如控制台或文件），以及其他日志相关的设置
ex = Experiment("pymarl")                      # 创建一个 Sacred 实验对象。Experiment 类用于定义和管理实验。"pymarl"是实验的名称，用于标识实验。
ex.logger = logger                             # 将之前的将之前创建的日志记录器关联到Sacred实验对象
ex.captured_out_filter = apply_backspaces_and_linefeeds # 设置输出过滤器。apply_backspaces_and_linefeeds是一个工具函数，用于处理标准输出中的回车和换行符。它确保日志输出的格式在文件中不会因为控制字符而混乱，使得输出更加清晰可读。
results_path = os.path.join(dirname(dirname(abspath(__file__))), "results") # 定义实验结果的存储路径，

3.my_main函数

@ex.main                           # @ex.main是sacred框架的装饰器，表明这是实验的主函数，实验开始时会调用此函数
def my_main(_run, _config, _log):  # my_main是使用sacred框架装饰的主要实验函数。当实验开始时，这个函数被执行，负责设置随机种子并启动实验流程。
    config = config_copy(_config)  # 这个语句将实验的配置_config深拷贝到config。使用config_copy的目的是确保修改config时不会影响原始配置对象_config，因为_config可能会在其他地方被使用。
    np.random.seed(config["seed"]) # 设置numpy的随机种子，使得整个实验中的随机数生成是可控且可重复的。config["seed"]是从配置中获取的随机种子
    th.manual_seed(config["seed"]) # 同样，设置PyTorch的随机种子。
    config['env_args']['seed'] = config["seed"] # 为环境参数设置相同的随机种子，确保环境初始化时使用的随机数也是可控的。
    run(_run, config, _log)        # 运行实验框架

4._get_config函数

def _get_config(params, arg_name, subfolder):  # _get_config函数用于从命令行参数中提取特定的配置文件，并将其加载为字典形式返回。
    config_name = None                         # 初始化config_name为None，用于存储命令行参数中指定的配置文件名。
    for _i, _v in enumerate(params):           # 循环遍历params列表，逐个检查参数。
        if _v.split("=")[0] == arg_name:       # _v.split("=")用于将参数按等号拆分，检查参数是否符合arg_name。
            config_name = _v.split("=")[1]     # 若参数的名字与arg_name相符，则提取等号右边的值作为config_name，即配置文件名。
            del params[_i]                     # 删除已经处理的参数，避免在后续处理中重复使用。
            break

    if config_name is not None:                # 如果找到了有效的配置文件名，则继续读取配置文件
        with open(os.path.join(os.path.dirname(__file__), "config", subfolder, "{}.yaml".format(config_name)), "r") as f:
            try:
                config_dict = yaml.load(f)     # 打开指定路径的YAML配置文件，路径由config_name和subfolder共同组成；使用yaml库解析YAML配置文件，返回一个字典格式的配置。
            except yaml.YAMLError as exc:
                assert False, "{}.yaml error: {}".format(config_name, exc) # 如果读取YAML文件出错，则通过assert抛出错误并终止程序运行，输出相应的错误信息。
        return config_dict

5.recursive_dict_update函数

def recursive_dict_update(d, u):                           # 递归地合并两个字典d和u，如果遇到嵌套字典会继续递归更新。
    for k, v in u.items():                                 # 遍历字典u中的键值对，k是键，v是对应的值。
        if isinstance(v, collections.Mapping):             # 检查值v是否是一个字典（即是否是映射类型）。collections.Mapping是Python中的抽象基类，表示字典类型。
            d[k] = recursive_dict_update(d.get(k, {}), v)  # 如果v是字典，对字典d的相应键k执行递归更新。如果d中不存在键k，则返回一个空字典{}。这一步递归地合并字典d和u。
        else: 
            d[k] = v                                       # 如果v不是字典，直接将v赋值给d[k]，覆盖或新增键k的值。
    return d                                               # 返回合并后的字典d。

6.config_copy函数

def config_copy(config):                                       # 递归深拷贝配置对象，确保字典或列表中的每个元素都被独立复制，不共享原对象的引用
    if isinstance(config, dict):                               # 检查config是否为字典类型
        return {k: config_copy(v) for k, v in config.items()}  # 如果config是字典，递归地对每个键值对调用config_copy，创建一个新的字典，其中的值是递归复制的
    elif isinstance(config, list):                             # 如果config是列表，归地对列表中的每个元素调用config_copy，创建一个新的列表
        return [config_copy(v) for v in config]
    else:
        return deepcopy(config)                                # 如果config既不是字典也不是列表，则直接返回config的深拷贝

6.主程序_main_

if __name__ == '__main__':
    params = deepcopy(sys.argv)                                                             # 深拷贝命令行参数列表sys.argv，params是一个复制的列表，防止对命令行参数的修改影响原始参数
    with open(os.path.join(os.path.dirname(__file__), "config", "default.yaml"), "r") as f: # 打开default.yaml配置文件，路径为当前文件所在目录下的config文件夹中的default.yaml文件
        try:
            config_dict = yaml.load(f)                                                      # 使用yaml库加载default.yaml文件，将其解析为字典config_dict
        except yaml.YAMLError as exc:                                                       # 如果解析yaml文件时发生错误，捕获异常并终止程序，报告错误信息
            assert False, "default.yaml error: {}".format(exc)

    env_config = _get_config(params, "--env-config", "envs")         # 通过_get_config函数从命令行参数params中提取环境配置--env-config，加载位于config/envs/文件夹中的.yaml配置文件，返回为字典。
    alg_config = _get_config(params, "--config", "algs")             # 从命令行参数中提取算法配置--config，加载位于config/algs/文件夹中的.yaml文件
    config_dict = recursive_dict_update(config_dict, env_config)     # 递归地将env_config中的配置合并到config_dict中
    config_dict = recursive_dict_update(config_dict, alg_config)     # 接着将alg_config的配置合并到config_dict

    ex.add_config(config_dict)                                       # 将合并后的配置config_dict添加到sacred的实验对象ex中。

    logger.info("Saving to FileStorageObserver in results/sacred.")  # 记录日志信息，表示实验结果将保存在results/sacred/目录中
    file_obs_path = os.path.join(results_path, "sacred")             # 构建保存路径，实验结果将存储在results/sacred目录中
    ex.observers.append(FileStorageObserver.create(file_obs_path))   # 将FileStorageObserver添加到sacred的实验观察者中，用于将实验数据和日志保存到磁盘
    ex.run_commandline(params)                                       # 运行 acred实验，并传递命令行参数params，启动整个实验流程

二、run.py

1.引入库

import datetime
import os
import pprint
import time
import threading
import torch as th
from types import SimpleNamespace as SN
from utils.logging import Logger
from utils.timehelper import time_left, time_str
from os.path import dirname, abspath

from learners import REGISTRY as le_REGISTRY
from runners import REGISTRY as r_REGISTRY
from controllers import REGISTRY as mac_REGISTRY
from components.episode_buffer import ReplayBuffer
from components.transforms import OneHot

datetime：用于处理日期和时间，提供了日期和时间的类，如datetime.datetime、datetime.date等，便于进行时间计算、格式转换等操作。
pprint：用于操作系统相关的功能，例如路径操作、文件管理等。在main.py中，os.path.join用于构建文件路径，os.path.dirname和os.path.abspath分别用于获取文件的目录名和绝对路径。
time：提供时间相关的功能，如获取当前时间、测量程序运行时间、暂停执行等。
threading：用于创建和管理线程，实现多线程并发执行。
SimpleNamespace：SimpleNamespace是types模块中的一个简单类，用于创建具有动态属性的对象。用于存储和访问配置参数或其他需要动态属性的对象，提供类似于JavaScript对象的属性访问方式
Logger：从PyMARL项目中的utils.logging模块中导入Logger类。
time_left & time_str：从utils.timehelper模块中导入time_left和 time_str函数。time_left函数用于估算剩余的时间。它根据已经经过的时间和当前的进度，来推测剩余的任务所需时间。time_str函数将时间（秒数）转换为易读的字符串格式。
REGISTRY as le_REGISTRY：从learners模块中引入注册表REGISTRY。管理和存储不同类型的学习器（learners）。
REGISTRY as r_REGISTRY：从 runners 模块中引入注册表 REGISTRY。管理和存储不同类型的运行器（runners）。
REGISTRY as mac_REGISTRY：从 controllers 模块中引入注册表 REGISTRY。管理和存储不同的控制器（controllers）。
ReplayBuffer：从episode_buffer模块中引入ReplayBuffer。用于存储环境与智能体交互经验的缓冲区。
OneHot：从transforms模块中引入OneHot变换。OneHot 是用于将分类变量转换为独热编码的工具，通常用于处理离散动作或状态。

2.run函数

def run(_run, _config, _log):
    _config = args_sanity_check(_config, _log)               # 对配置参数进行合理性检查。通过args_sanity_check函数验证_config的各项参数是否设置正确，以避免运行时发生错误
    args = SN(**_config)                                     # 将_config转换为SimpleNamespace对象，允许通过点操作符访问配置参数。                    
    args.device = "cuda" if args.use_cuda else "cpu"         # 根据配置中的use_cuda参数，决定是否使用GPU（cuda）还是CPU

    logger = Logger(_log)                                    # 创建一个Logger对象
    _log.info("Experiment Parameters:")                      # 记录实验的配置信息
    experiment_params = pprint.pformat(_config, indent=4, width=1) # 美化配置信息，使其在日志输出中更加易读
    _log.info("\n\n" + experiment_params + "\n")             # 将配置信息记录到日志中

    unique_token = "{}__{}".format(args.name, datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")) # 生成一个唯一的实验标识符unique_token，由实验名称和当前时间组成
    args.unique_token = unique_token                         # 将其保存到args中，用于标识实验
    if args.use_tensorboard:                                 # 如果配置中开启了Tensorboard(args.use_tensorboard)，则
        tb_logs_direc = os.path.join(dirname(dirname(abspath(__file__))), "results", "tb_logs") # 定义Tensorboard日志的存储路径tb_logs_direc和具体实验目录tb_exp_direc
        tb_exp_direc = os.path.join(tb_logs_direc, "{}").format(unique_token)              
        logger.setup_tb(tb_exp_direc)                                                           # 使用logger.setup_tb(tb_exp_direc)初始化Tensorboard日志记录器

    logger.setup_sacred(_run)                                # 配置Sacred框架，使用logger.setup_sacred(_run)，让Sacred记录日志等实验信息

    run_sequential(args=args, logger=logger)                 # 调用run_sequential函数，传入配置参数args和日志记录器logger

    print("Exiting Main")                                    # 训练结束后的清理操作。首先打印一条信息表示主程序即将退出。
    print("Stopping all threads")                            # 遍历所有正在运行的线程，确保除主线程外的其他线程都安全地关闭。对于每个非主线程，使用join方法等待其结束，避免程序强制退出时导致线程不安全退出
    for t in threading.enumerate():
        if t.name != "MainThread":
            print("Thread {} is alive! Is daemon: {}".format(t.name, t.daemon))
            t.join(timeout=1)
            print("Thread joined") 

    print("Exiting script")                                  # 打印脚本即将退出的提示。

    os._exit(os.EX_OK)                                       # 使用os._exit(os.EX_OK)进行强制退出，确保框架和所有线程都彻底结束

3.evaluate_sequential函数

def evaluate_sequential(args, runner):
    for _ in range(args.test_nepisode):                      # 使用for循环，遍历args.test_nepisode次（即进行指定数量的测试episode）
        runner.run(test_mode=True)                           # 调用runner.run()方法，每次调用runner.run()都是在测试模式下进行的。test_mode=True通常意味着这次运行不会影响模型的训练过程，比如不会更新参数或进行探索性动作，只会用于评估模型的表现
    if args.save_replay:                                     # 检查args.save_replay是否为True，是否需要保存测试期间的回放
        runner.save_replay()                                 # 如果save_replay为True，调用runner.save_replay()方法，保存测试的回放数据（例如动作、状态、奖励等）
    runner.close_env()                                       # 在所有测试episode完成后，调用runner.close_env()关闭环境，释放资源

4.run_sequential函数

def run_sequential(args, logger):
    runner = r_REGISTRY[args.runner](args=args, logger=logger) # 初始化runner。r_REGISTRY是一个注册表，存储了不同的runner类型，根据args.runner指定的类型创建合适的runner对象
    env_info = runner.get_env_info()                            
    args.n_agents = env_info["n_agents"]                        
    args.n_actions = env_info["n_actions"]                      
    args.state_shape = env_info["state_shape"]                 # 获取环境的相关信息env_info，如智能体数量、动作空间大小、状态空间大小等，并将这些信息存储在args中供后续使用

    # Default/Base scheme
    scheme = {
        "state": {"vshape": env_info["state_shape"]},
        "obs": {"vshape": env_info["obs_shape"], "group": "agents"},
        "actions": {"vshape": (1,), "group": "agents", "dtype": th.long},
        "avail_actions": {"vshape": (env_info["n_actions"],), "group": "agents", "dtype": th.int},
        "reward": {"vshape": (1,)},
        "terminated": {"vshape": (1,), "dtype": th.uint8},
    }
    groups = {
        "agents": args.n_agents
    }
    preprocess = {
        "actions": ("actions_onehot", [OneHot(out_dim=args.n_actions)])
    }
    # 定义scheme，用于描述不同数据的形状（例如状态、观测、动作、奖励等）和数据类型。groups用于定义智能体组（即多智能体的数量），preprocess负责对动作进行预处理（例如将动作进行one-hot编码）

    buffer = ReplayBuffer(scheme, groups, args.buffer_size, env_info["episode_limit"] + 1,
                          preprocess=preprocess,
                          device="cpu" if args.buffer_cpu_only else args.device)
    # 初始化ReplayBuffer，用于存储智能体与环境交互的数据。这个缓冲区将记录每个episode的状态、动作、奖励等，并支持后续的批量采样以进行训练

    mac = mac_REGISTRY[args.mac](buffer.scheme, groups, args)
    # 初始化多智能体控制器mac，从注册表mac_REGISTRY中根据args.mac加载控制器。控制器负责为各个智能体选择动作，并根据策略进行决策

    runner.setup(scheme=scheme, groups=groups, preprocess=preprocess, mac=mac)
    # 将数据的scheme和预处理步骤传递给runner

    learner = le_REGISTRY[args.learner](mac, buffer.scheme, logger, args)
    if args.use_cuda:
        learner.cuda()
    # 始化learner，用于实际执行模型的训练过程。它从注册表le_REGISTRY中根据args.learner创建训练组件。如果使用CUDA，则将learner转移到GPU上

    if args.checkpoint_path != "": # 检查是否有加载模型的路径
        timesteps = []
        timestep_to_load = 0
        if not os.path.isdir(args.checkpoint_path):
            logger.console_logger.info("Checkpoint directiory {} doesn't exist".format(args.checkpoint_path))
            return
        # 如果检查点目录不存在，记录错误信息并退出函数

        for name in os.listdir(args.checkpoint_path):
            full_name = os.path.join(args.checkpoint_path, name)
            # Check if they are dirs the names of which are numbers
            if os.path.isdir(full_name) and name.isdigit():
                timesteps.append(int(name))
        # 遍历检查点目录，找出保存的模型文件夹，并记录这些文件夹代表的时间步数

        if args.load_step == 0:
            # choose the max timestep
            timestep_to_load = max(timesteps)
        else:
            # choose the timestep closest to load_step
            timestep_to_load = min(timesteps, key=lambda x: abs(x - args.load_step))
        # 根据args.load_step决定加载哪个时间步的模型。如果load_step == 0，则加载最新的模型，否则加载离指定步数最近的模型。

        model_path = os.path.join(args.checkpoint_path, str(timestep_to_load))
        logger.console_logger.info("Loading model from {}".format(model_path))
        learner.load_models(model_path)
        runner.t_env = timestep_to_load

        if args.evaluate or args.save_replay:
            evaluate_sequential(args, runner)
            return
        # 加载指定时间步的模型并开始评估或保存回放

    # start training
    episode = 0
    last_test_T = -args.test_interval - 1
    last_log_T = 0
    model_save_time = 0

    start_time = time.time()
    last_time = start_time

    logger.console_logger.info("Beginning training for {} timesteps".format(args.t_max))
    # 初始化一些变量以控制训练过程，包括当前episode编号、上次测试和日志的时间步数、模型保存的时间步数等，并记录训练开始时间


    while runner.t_env <= args.t_max: # 开始训练循环，直到达到最大时间步 t_max
        episode_batch = runner.run(test_mode=False)
        buffer.insert_episode_batch(episode_batch)
        # 运行一个完整的episode，并将episode数据插入缓冲区

        if buffer.can_sample(args.batch_size):
            episode_sample = buffer.sample(args.batch_size)

            # Truncate batch to only filled timesteps
            max_ep_t = episode_sample.max_t_filled()
            episode_sample = episode_sample[:, :max_ep_t]

            if episode_sample.device != args.device:
                episode_sample.to(args.device)

            learner.train(episode_sample, runner.t_env, episode)
            # 如果缓冲区中有足够的数据进行采样，则从中采样并进行训练，更新模型的参数

        # Execute test runs once in a while
        n_test_runs = max(1, args.test_nepisode // runner.batch_size)
        if (runner.t_env - last_test_T) / args.test_interval >= 1.0: # 定期执行测试运行，并记录日志
            logger.console_logger.info("t_env: {} / {}".format(runner.t_env, args.t_max))
            logger.console_logger.info("Estimated time left: {}. Time passed: {}".format(
                time_left(last_time, last_test_T, runner.t_env, args.t_max), time_str(time.time() - start_time)))
            last_time = time.time()
            last_test_T = runner.t_env
            for _ in range(n_test_runs):
                runner.run(test_mode=True)

        if args.save_model and (runner.t_env - model_save_time >= args.save_model_interval or model_save_time == 0):
            model_save_time = runner.t_env
            save_path = os.path.join(args.local_results_path, "models", args.unique_token, str(runner.t_env))
            #"results/models/{}".format(unique_token)
            os.makedirs(save_path, exist_ok=True)
            logger.console_logger.info("Saving models to {}".format(save_path))
            # 定期保存模型到指定路径

            # learner should handle saving/loading -- delegate actor save/load to mac,
            # use appropriate filenames to do critics, optimizer states
            learner.save_models(save_path)

        episode += args.batch_size_run

        if (runner.t_env - last_log_T) >= args.log_interval:
            logger.log_stat("episode", episode, runner.t_env)
            logger.print_recent_stats()
            last_log_T = runner.t_env

    runner.close_env()
    logger.console_logger.info("Finished Training")
    # 训练完成后关闭环境并记录日志

5.args_sanity_check函数

def args_sanity_check(config, _log):
    if config["use_cuda"] and not th.cuda.is_available():        # 检查是否需要使用CUDA（即config["use_cuda"]为True）
        config["use_cuda"] = False # 如果没有可用的CUDA设备，但config["use_cuda"]设置为True，则自动将其改为False，以确保程序不会尝试在没有CUDA设备的情况下使用GPU。
        _log.warning("CUDA flag use_cuda was switched OFF automatically because no CUDA devices are available!") #如果CUDA被关闭，则通过日志记录器_log.warning()生成一条警告信息，提醒用户CUDA已自动关闭，因为没有可用的GPU

    if config["test_nepisode"] < config["batch_size_run"]:       # 检查config["test_nepisode"]（测试集中的 episode 数量）是否小于config["batch_size_run"]
        config["test_nepisode"] = config["batch_size_run"]       # 如果test_nepisode小于batch_size_run，则将test_nepisode设置为等于batch_size_run，确保在测试期间至少有足够的episode用于一个批次的处理
    else:
        config["test_nepisode"] = (config["test_nepisode"]//config["batch_size_run"]) * config["batch_size_run"]
    # 如果test_nepisode大于或等于batch_size_run，则将test_nepisode调整为batch_size_run的整数倍。这样做是为了确保test_nepisode可以整除batch_size_run，从而保证每个批次都有相同数量的episode进行处理
    return config  # 返回经过合理性检查和可能调整后的

三、learners模块

Learners模块中的类，如QLearner和COMALearner，负责执行强化学习算法的核心训练逻辑。通过利用经验数据如来更新智能体的策略和价值函数。Learners模块是PyMARL框架中的关键部分，其将经验数据转化为更新策略和价值函数的信号，决定了智能体如何从环境中学习并优化相关决策。

1.coma_learner.py

import copy
from components.episode_buffer import EpisodeBatch
from modules.critics.coma import COMACritic
from utils.rl_utils import build_td_lambda_targets
import torch as th
from torch.optim import RMSprop

class COMALearner:
    def __init__(self, mac, scheme, logger, args):
        self.args = args
        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.mac = mac
        self.logger = logger

        self.last_target_update_step = 0                        # 上次更新目标网络的步骤，用于控制目标网络的更新频率
        self.critic_training_steps = 0                          # 追踪训练评论网络的次数。
        self.log_stats_t = -self.args.learner_log_interval - 1

        self.critic = COMACritic(scheme, args)                  # COMA的评论网络，主要用于估计动作值函数Q
        self.target_critic = copy.deepcopy(self.critic)         # 目标评论网络，用于计算稳定的目标值，通过deepcopy self.critic来初始化。

        self.agent_params = list(mac.parameters())              # 分别保存了智能体网络和评论网络的参数
        self.critic_params = list(self.critic.parameters())     
        self.params = self.agent_params + self.critic_params    # 包含了智能体和评论网络的所有参数，用于优化器

        self.agent_optimiser = RMSprop(params=self.agent_params, lr=args.lr, alpha=args.optim_alpha, eps=args.optim_eps)
        self.critic_optimiser = RMSprop(params=self.critic_params, lr=args.critic_lr, alpha=args.optim_alpha, eps=args.optim_eps)
        # RMSprop：一种优化算法，分别用于智能体网络和评论网络。每个优化器有各自的学习率（lr）、微小量eps来避免除零错误

    def train(self, batch: EpisodeBatch, t_env: int, episode_num: int):
        # Get the relevant quantities
        bs = batch.batch_size
        max_t = batch.max_seq_length
        rewards = batch["reward"][:, :-1]
        actions = batch["actions"][:, :]
        terminated = batch["terminated"][:, :-1].float()
        mask = batch["filled"][:, :-1].float()               # 掩码，用于屏蔽填充数据（padded data）
        mask[:, 1:] = mask[:, 1:] * (1 - terminated[:, :-1]) # 确保后续步骤屏蔽终止状态后的无效数据
        avail_actions = batch["avail_actions"][:, :-1]
        critic_mask = mask.clone()                           # 评论网络使用的掩码，表示哪些数据有效
        mask = mask.repeat(1, 1, self.n_agents).view(-1)

        q_vals, critic_train_stats = self._train_critic(batch, rewards, terminated, actions, avail_actions,
                                                        critic_mask, bs, max_t)
        # 训练评论网络，返回Q值和训练过程中的统计信息

        actions = actions[:,:-1]

        mac_out = []
        self.mac.init_hidden(batch.batch_size)
        for t in range(batch.max_seq_length - 1):
            agent_outs = self.mac.forward(batch, t=t)
            mac_out.append(agent_outs)
        mac_out = th.stack(mac_out, dim=1)  # Concat over time
        # 为每个时间步执行前向传播，得到智能体的输出，并将每个时间步的输出堆叠起来

        # 对不可用的动作，将对应概率设为0，并重新归一化概率分布
        mac_out[avail_actions == 0] = 0
        mac_out = mac_out/mac_out.sum(dim=-1, keepdim=True)
        mac_out[avail_actions == 0] = 0
        #在强化学习中的多智能体系统中，每个智能体在某一时刻可能并不能执行所有可能的动作。
        # 此时，某些动作是不可用的，可能是因为环境的限制或者当前状态下这些动作无效。为了避免选择不可用的动作，
        # 需要用掩码（mask）将这些动作屏蔽掉，即将不可用的动作的概率置为0

        # 重新归一化，类似于动作选择中的归一化：当屏蔽了不可用的动作后，剩下的动作的概率和可能不再为1。
        # 为了保证这些动作的概率依然是一个有效的分布（即概率和为1），需要将剩下的动作概率重新进行归一化（renormalisation）。
        # 这一步通过将剩余的可用动作的概率重新调整，使得它们的总和为1，这样模型才能继续根据这个概率分布来进行动作选择

        # Calculated baseline
        # 通过策略pi和Q值计算基线，用于减少策略梯度的方差
        q_vals = q_vals.reshape(-1, self.n_actions)
        pi = mac_out.view(-1, self.n_actions)
        baseline = (pi * q_vals).sum(-1).detach()

        # Calculate policy grad with mask
        # 提取智能体采取的动作对应的Q值和动作概率
        q_taken = th.gather(q_vals, dim=1, index=actions.reshape(-1, 1)).squeeze(1)
        pi_taken = th.gather(pi, dim=1, index=actions.reshape(-1, 1)).squeeze(1)

        pi_taken[mask == 0] = 1.0                   # 对填充的动作，将其概率设置为1，避免log(0)错误。
        log_pi_taken = th.log(pi_taken)             # 计算采取动作的对数概率

        advantages = (q_taken - baseline).detach()  # 计算优势函数，用于指导策略更新

        coma_loss = - ((advantages * log_pi_taken) * mask).sum() / mask.sum() # 通过优势函数和对数概率计算COMA算法的损失函数

        # Optimise agents
        # 优化智能体的策略，通过反向传播更新网络参数
        self.agent_optimiser.zero_grad()
        coma_loss.backward()
        grad_norm = th.nn.utils.clip_grad_norm_(self.agent_params, self.args.grad_norm_clip)
        self.agent_optimiser.step()

        if (self.critic_training_steps - self.last_target_update_step) / self.args.target_update_interval >= 1.0:
            self._update_targets()
            self.last_target_update_step = self.critic_training_steps

        if t_env - self.log_stats_t >= self.args.learner_log_interval:
            ts_logged = len(critic_train_stats["critic_loss"])
            for key in ["critic_loss", "critic_grad_norm", "td_error_abs", "q_taken_mean", "target_mean"]:
                self.logger.log_stat(key, sum(critic_train_stats[key])/ts_logged, t_env)

            self.logger.log_stat("advantage_mean", (advantages * mask).sum().item() / mask.sum().item(), t_env)
            self.logger.log_stat("coma_loss", coma_loss.item(), t_env)
            self.logger.log_stat("agent_grad_norm", grad_norm, t_env)
            self.logger.log_stat("pi_max", (pi.max(dim=1)[0] * mask).sum().item() / mask.sum().item(), t_env)
            self.log_stats_t = t_env

    def _train_critic(self, batch, rewards, terminated, actions, avail_actions, mask, bs, max_t):
        # 从目标评论器（target_critic）中获取所有时间步的Q值预测target_q_vals。目标网络用于计算稳定的目标Q值，从而减少训练中的波动
        target_q_vals = self.target_critic(batch)[:, :]
        # 使用gather函数，从 target_q_vals 中提取智能体在每个时间步选择的动作对应的Q值。dim=3表示根据动作索引提取Q值。
        # .squeeze(3) 将结果从3维张量变为2维
        targets_taken = th.gather(target_q_vals, dim=3, index=actions).squeeze(3)

        # 调用build_td_lambda_targets计算TD-Lambda目标Q值（包含折扣因子gamma和TD-Lambda算法的参数lambda）
        targets = build_td_lambda_targets(rewards, terminated, mask, targets_taken, self.n_agents, self.args.gamma, self.args.td_lambda)

        # 初始化q_vals张量，用来存储当前评论器在各个时间步计算出的Q值。注意这里排除了最后一个时间步的数据（[:-1]）
        q_vals = th.zeros_like(target_q_vals)[:, :-1]
        running_log = {
            "critic_loss": [],
            "critic_grad_norm": [],
            "td_error_abs": [],
            "target_mean": [],
            "q_taken_mean": [],
        }

        # 从最后一个时间步开始向前迭代，逆向遍历时间步，计算每个时间步的损失并更新评论器参数。逆序处理是为了计算TD-Lambda目标和更新Q值
        for t in reversed(range(rewards.size(1))):
            # 对每个时间步t，扩展mask以覆盖所有智能体。如果该时间步没有有效数据（所有值都是0），则跳过该时间步的计算
            mask_t = mask[:, t].expand(-1, self.n_agents)
            if mask_t.sum() == 0:
                continue
            
            # 调用critic网络获取当前时间步t的Q值预测，并将其存储到q_vals中。
            # q_vals[:, t]表示在当前时间步t存储的Q值
            q_t = self.critic(batch, t)
            q_vals[:, t] = q_t.view(bs, self.n_agents, self.n_actions)

            # 使用gather函数提取当前时间步选择的动作对应的Q值，并将其与targets_t（目标Q值）进行比较
            q_taken = th.gather(q_t, dim=3, index=actions[:, t:t+1]).squeeze(3).squeeze(1)
            targets_t = targets[:, t]

            # 计算时序差分误差（TD error），即评论器预测的Q值与目标Q值之间的差异
            td_error = (q_taken - targets_t.detach())

            # 在深度学习中的序列数据（如多时间步的批量训练）处理中，经常会遇到长度不一致的序列。
            # 为了处理这些不等长的序列，通常会使用“填充”（padding）技术，将较短的序列用特殊的填充值（如0）扩展到与最长的序列相同的长度
            # 然而，这些填充值并不是真实数据的一部分，它们只是为了使批量处理中的序列长度一致而引入的“伪数据”。
            # 在训练过程中，应该忽略这些填充的数据，以避免它们对模型的学习过程产生负面影响。
            # 因此，在处理这些填充序列时，会使用掩码（mask）来标记哪些时间步是真实数据，哪些是填充数据
            # 0-out意味着将这些填充数据对应的目标值设置为0，以避免它们影响误差计算和梯度更新。
            # 在下面的代码中，这一操作确保了模型只会对真实数据进行训练，而不会受到填充数据的干扰
            # mask_t是掩码，它的作用是屏蔽填充数据。通过将时序差分误差td_error与掩码相乘，填充数据对应的误差会被置为0（即“0-out”），从而不参与后续的损失计算。
            masked_td_error = td_error * mask_t        # 使用mask_t屏蔽掉填充的无效时间步数据，确保只有有效时间步的误差会参与计算

            # Normal L2 loss, take mean over actual data
            # 计算L2损失（均方误差，MSE），即TD error的平方和，取均值
            loss = (masked_td_error ** 2).sum() / mask_t.sum()
            # 清空优化器的梯度，将损失反向传播，计算梯度并应用梯度裁剪（clip），然后更新评论器网络的参数。
            self.critic_optimiser.zero_grad()
            loss.backward()
            grad_norm = th.nn.utils.clip_grad_norm_(self.critic_params, self.args.grad_norm_clip)
            self.critic_optimiser.step()
            self.critic_training_steps += 1

            # 将当前时间步的损失和梯度范数记录到running_log
            running_log["critic_loss"].append(loss.item())
            running_log["critic_grad_norm"].append(grad_norm)

            mask_elems = mask_t.sum().item()    # 计算并记录时序差分误差的绝对值均值、Q值和目标值的均值
            running_log["td_error_abs"].append((masked_td_error.abs().sum().item() / mask_elems))
            running_log["q_taken_mean"].append((q_taken * mask_t).sum().item() / mask_elems)
            running_log["target_mean"].append((targets_t * mask_t).sum().item() / mask_elems)

        return q_vals, running_log   # 返回训练后的Q值和记录的日志数据

    def _update_targets(self):       # 更新目标网络的参数，将主评论器网络的参数拷贝到目标评论器网络。
        self.target_critic.load_state_dict(self.critic.state_dict())
        self.logger.console_logger.info("Updated target network")

    def cuda(self):                  # 将所有的模型（智能体控制器、评论器和目标评论器）移动到GPU上
        self.mac.cuda()
        self.critic.cuda()
        self.target_critic.cuda()

    def save_models(self, path)      # 保存模型的状态。将智能体控制器的模型、评论器网络的权重、智能体优化器和评论器优化器的状态保存到指定路径path，以便后续恢复模型
        self.mac.save_models(path)
        th.save(self.critic.state_dict(), "{}/critic.th".format(path))
        th.save(self.agent_optimiser.state_dict(), "{}/agent_opt.th".format(path))
        th.save(self.critic_optimiser.state_dict(), "{}/critic_opt.th".format(path))

    def load_models(self, path):
        self.mac.load_models(path)                  # 通过调用mac对象的load_models方法，加载与智能体相关的模型参数
        self.critic.load_state_dict(th.load("{}/critic.th".format(path), map_location=lambda storage, loc: storage))
        # self.critic是当前智能体使用的评论器网络，用来估计状态-动作值（Q值）
        # 这里通过调用critic.load_state_dict()加载之前保存的评论器模型参数
        # th.load("{}/critic.th".format(path), map_location=lambda storage, loc: storage)使用PyTorch的th.load函数从路径{}/critic.th加载评论器的参数文件
        # map_location=lambda storage, loc: storage 用于映射设备的参数，确保在加载时，如果没有GPU可用，模型可以正确加载到CPU上。会将所有的张量映射到与保存时相同的设备上
        # 虽然目标评论器网络 (target_critic) 在训练过程中是独立存在的，但开发者选择不单独保存目标网络的参数
        self.target_critic.load_state_dict(self.critic.state_dict())
        # 于是直接将主评论器网络的状态赋值给目标评论器网络
        # 这意味着目标网络会拥有与当前评论器网络相同的参数。用于避免保存多个冗余的模型文件，并确保目标网络与主网络同步
        self.agent_optimiser.load_state_dict(th.load("{}/agent_opt.th".format(path), map_location=lambda storage, loc: storage))
        # self.agent_optimiser是控制智能体策略的优化器。此代码使用th.load("{}/agent_opt.th".format(path)) 来加载优化器的状态文件
        # 加载之后，使用load_state_dict()方法将优化器的状态恢复到训练中断时的状态。 
        self.critic_optimiser.load_state_dict(th.load("{}/critic_opt.th".format(path), map_location=lambda storage, loc: storage))
        # self.critic_optimiser是用于训练评论器网络的优化器。与之前相同，使用th.load加载存储在{}/critic_opt.th文件中的优化器状态。
        # 然后，使用load_state_dict()恢复评论器优化器的状态，从而可以继续上次训练中断时的优化状态

2.q_learner.py

class QLearner:
    def __init__(self, mac, scheme, logger, args):
        self.args = args
        self.mac = mac
        self.logger = logger

        # Parameter initialization: Gets the parameters of the action selector and stores them in self.params
        self.params = list(mac.parameters())

        # Target network update counter: initializes last_target_update_episode to record the number of episodes that were last updated on the target network
        self.last_target_update_episode = 0

        # Select the mixer to use according to the input parameter args. 
        # If the specified mixer is "vdn" or "qmix", the corresponding mixer is instantiated, otherwise an error is thrown
        self.mixer = None
        if args.mixer is not None:
            if args.mixer == "vdn":
                self.mixer = VDNMixer()
            elif args.mixer == "qmix":
                self.mixer = QMixer(args)
            else:
                raise ValueError("Mixer {} not recognised.".format(args.mixer))
            self.params += list(self.mixer.parameters())
            self.target_mixer = copy.deepcopy(self.mixer)

        # Use the RMSprop optimizer to initialize the model's parameters, set the learning rate and other parameters
        self.optimiser = RMSprop(params=self.params, lr=args.lr, alpha=args.optim_alpha, eps=args.optim_eps)

        # a little wasteful to deepcopy (e.g. duplicates action selector), but should work for any MAC
        self.target_mac = copy.deepcopy(mac)

        self.log_stats_t = -self.args.learner_log_interval - 1

    # Define a training method that receives a batch of experiential replay data (batch), the current environment time (t_env), and the current number of episodes (episode_num)    
    def train(self, batch: EpisodeBatch, t_env: int, episode_num: int):
        # Get the relevant quantities
        # Extract rewards, actions, termination signals, and fill masks from the batch. The [:-1] operation ignores data from the last time step
        rewards = batch["reward"][:, :-1]
        actions = batch["actions"][:, :-1]
        terminated = batch["terminated"][:, :-1].float()
        mask = batch["filled"][:, :-1].float()
        # Update the mask to exclude terminated agents
        mask[:, 1:] = mask[:, 1:] * (1 - terminated[:, :-1])
        avail_actions = batch["avail_actions"]

        # Calculate estimated Q-Values
        mac_out = []
        self.mac.init_hidden(batch.batch_size)
        for t in range(batch.max_seq_length):
            agent_outs = self.mac.forward(batch, t=t)
            mac_out.append(agent_outs)
        mac_out = th.stack(mac_out, dim=1)  # Concat over time

        # Pick the Q-Values for the actions taken by each agent from mac_out based on the selected action
        chosen_action_qvals = th.gather(mac_out[:, :-1], dim=3, index=actions).squeeze(3)  # Remove the last dim

        # Calculate the Q-Values necessary for the target
        # Compute the target Q value: Similarly loop through the duration steps, compute the output of the target action selector, and store it in the target_mac_out list
        target_mac_out = []
        self.target_mac.init_hidden(batch.batch_size)
        for t in range(batch.max_seq_length):
            target_agent_outs = self.target_mac.forward(batch, t=t)
            target_mac_out.append(target_agent_outs)

        # Merge the target output into a tensor via th.stack, ignoring the first time step
        # We don't need the first timesteps Q-Value estimate for calculating targets
        target_mac_out = th.stack(target_mac_out[1:], dim=1)  # Concat across time

        # Mask out unavailable actions, setting them to a small value (-9999999) to ensure that they are not selected
        target_mac_out[avail_actions[:, 1:] == 0] = -9999999

        # Max over target Q-Values
        # If double Q-learning is enabled, the maximum action is selected using the current Q value, and the corresponding Q value is extracted from the target network according to the selected action. 
        # Otherwise, the maximum Q value is directly obtained from the target network
        if self.args.double_q:
            # Get actions that maximise live Q (for double q-learning)
            mac_out_detach = mac_out.clone().detach()
            mac_out_detach[avail_actions == 0] = -9999999
            cur_max_actions = mac_out_detach[:, 1:].max(dim=3, keepdim=True)[1]
            target_max_qvals = th.gather(target_mac_out, 3, cur_max_actions).squeeze(3)
        else:
            target_max_qvals = target_mac_out.max(dim=3)[0]

        # Mix, if there is a mixer, enter the Q value of the selected action and the target maximum Q value into the mixer for processing
        if self.mixer is not None:
            chosen_action_qvals = self.mixer(chosen_action_qvals, batch["state"][:, :-1])
            target_max_qvals = self.target_mixer(target_max_qvals, batch["state"][:, 1:])

        # Calculate 1-step Q-Learning targets, the target Q value is calculated using the target update formula of Q learning
        targets = rewards + self.args.gamma * (1 - terminated) * target_max_qvals

        # Td-error, the time difference (TD) error is calculated using the difference between the current Q value and the target Q value
        td_error = (chosen_action_qvals - targets.detach())

        # Extend the mask to the same shape as the TD error
        mask = mask.expand_as(td_error)

        # 0-out the targets that came from padded data, apply the TD error according to the mask
        masked_td_error = td_error * mask

        # Normal L2 loss, take mean over actual data
        loss = (masked_td_error ** 2).sum() / mask.sum()

        # Optimise，zero the gradient, calculate the gradient loss, perform gradient clipping to prevent explosion, and update the model parameters
        self.optimiser.zero_grad()
        loss.backward()
        grad_norm = th.nn.utils.clip_grad_norm_(self.params, self.args.grad_norm_clip)
        self.optimiser.step()

        if (episode_num - self.last_target_update_episode) / self.args.target_update_interval >= 1.0: # Update the target network according to the set interval
            self._update_targets()
            self.last_target_update_episode = episode_num

        if t_env - self.log_stats_t >= self.args.learner_log_interval: # Logging involves loss, gradient norm, absolute TD error, and the average Q value of the selected action
            self.logger.log_stat("loss", loss.item(), t_env)
            self.logger.log_stat("grad_norm", grad_norm, t_env)
            mask_elems = mask.sum().item()
            self.logger.log_stat("td_error_abs", (masked_td_error.abs().sum().item()/mask_elems), t_env)
            self.logger.log_stat("q_taken_mean", (chosen_action_qvals * mask).sum().item()/(mask_elems * self.args.n_agents), t_env)
            self.logger.log_stat("target_mean", (targets * mask).sum().item()/(mask_elems * self.args.n_agents), t_env)
            self.log_stats_t = t_env

    def _update_targets(self):
        self.target_mac.load_state(self.mac)
        if self.mixer is not None:
            self.target_mixer.load_state_dict(self.mixer.state_dict())
        self.logger.console_logger.info("Updated target network")

    def cuda(self):
        self.mac.cuda()
        self.target_mac.cuda()
        if self.mixer is not None:
            self.mixer.cuda()
            self.target_mixer.cuda()

    def save_models(self, path):
        self.mac.save_models(path)
        if self.mixer is not None:
            th.save(self.mixer.state_dict(), "{}/mixer.th".format(path))
        th.save(self.optimiser.state_dict(), "{}/opt.th".format(path))

    def load_models(self, path):
        self.mac.load_models(path)
        # Not quite right but I don't want to save target networks
        self.target_mac.load_models(path)
        if self.mixer is not None:
            self.mixer.load_state_dict(th.load("{}/mixer.th".format(path), map_location=lambda storage, loc: storage))
        self.optimiser.load_state_dict(th.load("{}/opt.th".format(path), map_location=lambda storage, loc: storage))

3.qtran_learner.py

class QLearner:
    def __init__(self, mac, scheme, logger, args):
        self.args = args
        self.mac = mac
        self.logger = logger

        self.params = list(mac.parameters())

        self.last_target_update_episode = 0

        self.mixer = None
        if args.mixer == "qtran_base":
            self.mixer = QTranBase(args)
        elif args.mixer == "qtran_alt":
            raise Exception("Not implemented here!")

        self.params += list(self.mixer.parameters())
        self.target_mixer = copy.deepcopy(self.mixer)

        self.optimiser = RMSprop(params=self.params, lr=args.lr, alpha=args.optim_alpha, eps=args.optim_eps)

        # a little wasteful to deepcopy (e.g. duplicates action selector), but should work for any MAC
        self.target_mac = copy.deepcopy(mac)

        self.log_stats_t = -self.args.learner_log_interval - 1

    def train(self, batch: EpisodeBatch, t_env: int, episode_num: int):
        # Get the relevant quantities
        rewards = batch["reward"][:, :-1]
        actions = batch["actions"][:, :-1]
        terminated = batch["terminated"][:, :-1].float()
        mask = batch["filled"][:, :-1].float()
        mask[:, 1:] = mask[:, 1:] * (1 - terminated[:, :-1])
        avail_actions = batch["avail_actions"]

        # Calculate estimated Q-Values
        mac_out = []
        mac_hidden_states = []
        self.mac.init_hidden(batch.batch_size)
        for t in range(batch.max_seq_length):
            agent_outs = self.mac.forward(batch, t=t) # Forward propagation yields proxy output
            mac_out.append(agent_outs)
            mac_hidden_states.append(self.mac.hidden_states)
        mac_out = th.stack(mac_out, dim=1)  # Concat over time
        mac_hidden_states = th.stack(mac_hidden_states, dim=1)
        mac_hidden_states = mac_hidden_states.reshape(batch.batch_size, self.args.n_agents, batch.max_seq_length, -1).transpose(1,2) #btav

        # Pick the Q-Values for the actions taken by each agent
        chosen_action_qvals = th.gather(mac_out[:, :-1], dim=3, index=actions).squeeze(3)  # Remove the last dim

        # Calculate the Q-Values necessary for the target
        target_mac_out = []
        target_mac_hidden_states = []
        self.target_mac.init_hidden(batch.batch_size)
        for t in range(batch.max_seq_length):
            target_agent_outs = self.target_mac.forward(batch, t=t) # Forward propagation using the target network
            target_mac_out.append(target_agent_outs)
            target_mac_hidden_states.append(self.target_mac.hidden_states)

        # We don't need the first timesteps Q-Value estimate for calculating targets
        target_mac_out = th.stack(target_mac_out[:], dim=1)  # Concat across time
        target_mac_hidden_states = th.stack(target_mac_hidden_states, dim=1)
        target_mac_hidden_states = target_mac_hidden_states.reshape(batch.batch_size, self.args.n_agents, batch.max_seq_length, -1).transpose(1,2) #btav

        # Mask out unavailable actions
        target_mac_out[avail_actions[:, :] == 0] = -9999999  # From OG deepmarl
        mac_out_maxs = mac_out.clone()
        mac_out_maxs[avail_actions == 0] = -9999999

        # Best joint action computed by target agents
        target_max_actions = target_mac_out.max(dim=3, keepdim=True)[1]
        # Best joint-action computed by regular agents
        max_actions_qvals, max_actions_current = mac_out_maxs[:, :].max(dim=3, keepdim=True)

        if self.args.mixer == "qtran_base":
            # -- TD Loss --
            # Joint-action Q-Value estimates
            joint_qs, vs = self.mixer(batch[:, :-1], mac_hidden_states[:,:-1])

            # Need to argmax across the target agents' actions to compute target joint-action Q-Values
            if self.args.double_q:
                max_actions_current_ = th.zeros(size=(batch.batch_size, batch.max_seq_length, self.args.n_agents, self.args.n_actions), device=batch.device)
                max_actions_current_onehot = max_actions_current_.scatter(3, max_actions_current[:, :], 1)
                max_actions_onehot = max_actions_current_onehot
            else:
                max_actions = th.zeros(size=(batch.batch_size, batch.max_seq_length, self.args.n_agents, self.args.n_actions), device=batch.device)
                max_actions_onehot = max_actions.scatter(3, target_max_actions[:, :], 1)
            target_joint_qs, target_vs = self.target_mixer(batch[:, 1:], hidden_states=target_mac_hidden_states[:,1:], actions=max_actions_onehot[:,1:])

            # Td loss targets
            td_targets = rewards.reshape(-1,1) + self.args.gamma * (1 - terminated.reshape(-1, 1)) * target_joint_qs
            td_error = (joint_qs - td_targets.detach())          # TD error calculation
            masked_td_error = td_error * mask.reshape(-1, 1)     # Only the filled part is counted
            td_loss = (masked_td_error ** 2).sum() / mask.sum()  # TD loss calculation
            # -- TD Loss --

            # -- Opt Loss --
            # Argmax across the current agents' actions
            if not self.args.double_q: # Already computed if we're doing double Q-Learning
                max_actions_current_ = th.zeros(size=(batch.batch_size, batch.max_seq_length, self.args.n_agents, self.args.n_actions), device=batch.device )
                max_actions_current_onehot = max_actions_current_.scatter(3, max_actions_current[:, :], 1)
            max_joint_qs, _ = self.mixer(batch[:, :-1], mac_hidden_states[:,:-1], actions=max_actions_current_onehot[:,:-1]) # Don't use the target network and target agent max actions as per author's email

            # max_actions_qvals = th.gather(mac_out[:, :-1], dim=3, index=max_actions_current[:,:-1])
            opt_error = max_actions_qvals[:,:-1].sum(dim=2).reshape(-1, 1) - max_joint_qs.detach() + vs
            masked_opt_error = opt_error * mask.reshape(-1, 1)
            opt_loss = (masked_opt_error ** 2).sum() / mask.sum()       # Optimization loss calculation
            # -- Opt Loss --

            # -- Nopt Loss --
            # target_joint_qs, _ = self.target_mixer(batch[:, :-1])
            nopt_values = chosen_action_qvals.sum(dim=2).reshape(-1, 1) - joint_qs.detach() + vs # Don't use target networks here either
            nopt_error = nopt_values.clamp(max=0)                    # Limit the Nopt error to a maximum of 0
            masked_nopt_error = nopt_error * mask.reshape(-1, 1)
            nopt_loss = (masked_nopt_error ** 2).sum() / mask.sum()  # Nopt loss calculation
            # -- Nopt loss --

        elif self.args.mixer == "qtran_alt":
            raise Exception("Not supported yet.")

        # total loss calculation
        loss = td_loss + self.args.opt_loss * opt_loss + self.args.nopt_min_loss * nopt_loss

        # Optimise
        self.optimiser.zero_grad()
        loss.backward()
        grad_norm = th.nn.utils.clip_grad_norm_(self.params, self.args.grad_norm_clip)
        self.optimiser.step()

        if (episode_num - self.last_target_update_episode) / self.args.target_update_interval >= 1.0:
            self._update_targets()
            self.last_target_update_episode = episode_num

        if t_env - self.log_stats_t >= self.args.learner_log_interval:
            self.logger.log_stat("loss", loss.item(), t_env)
            self.logger.log_stat("td_loss", td_loss.item(), t_env)
            self.logger.log_stat("opt_loss", opt_loss.item(), t_env)
            self.logger.log_stat("nopt_loss", nopt_loss.item(), t_env)
            self.logger.log_stat("grad_norm", grad_norm, t_env)
            if self.args.mixer == "qtran_base":
                mask_elems = mask.sum().item()
                self.logger.log_stat("td_error_abs", (masked_td_error.abs().sum().item()/mask_elems), t_env)
                self.logger.log_stat("td_targets", ((masked_td_error).sum().item()/mask_elems), t_env)
                self.logger.log_stat("td_chosen_qs", (joint_qs.sum().item()/mask_elems), t_env)
                self.logger.log_stat("v_mean", (vs.sum().item()/mask_elems), t_env)
                self.logger.log_stat("agent_indiv_qs", ((chosen_action_qvals * mask).sum().item()/(mask_elems * self.args.n_agents)), t_env)
            self.log_stats_t = t_env

    def _update_targets(self):
        self.target_mac.load_state(self.mac)
        if self.mixer is not None:
            self.target_mixer.load_state_dict(self.mixer.state_dict())
        self.logger.console_logger.info("Updated target network")

    def cuda(self):
        self.mac.cuda()
        self.target_mac.cuda()
        if self.mixer is not None:
            self.mixer.cuda()
            self.target_mixer.cuda()

    def save_models(self, path):
        self.mac.save_models(path)
        if self.mixer is not None:
            th.save(self.mixer.state_dict(), "{}/mixer.th".format(path))
        th.save(self.optimiser.state_dict(), "{}/opt.th".format(path))

    def load_models(self, path):
        self.mac.load_models(path)
        # Not quite right but I don't want to save target networks
        self.target_mac.load_models(path)
        if self.mixer is not None:
            self.mixer.load_state_dict(th.load("{}/mixer.th".format(path), map_location=lambda storage, loc: storage))
        self.optimiser.load_state_dict(th.load("{}/opt.th".format(path), map_location=lambda storage, loc: storage))

四、modules模块

1.agents——rnn_agent.py

import torch.nn as nn
import torch.nn.functional as F


class RNNAgent(nn.Module):
    def __init__(self, input_shape, args):
        super(RNNAgent, self).__init__()
        self.args = args
        # Define the first full connection layer, fc1, converting the input shape from input_shape to the RNN's hidden state dimension, args.rnn_hidden_dim
        self.fc1 = nn.Linear(input_shape, args.rnn_hidden_dim)
        # Define a GRU (gated recursive unit) unit rnn with input and output dimensions of args.rnn_hidden_dim. 
        # A GRU is a recurrent neural network for processing sequence data, capable of capturing contextual information in time series
        self.rnn = nn.GRUCell(args.rnn_hidden_dim, args.rnn_hidden_dim)
        # Define a second fully connected layer, fc2, that converts the output shape of the GRU from args.rnn_hidden_dim to the number of actions, 
        # args.n_actions, to output the Q value or probability distribution for each action
        self.fc2 = nn.Linear(args.rnn_hidden_dim, args.n_actions)

    def init_hidden(self):
        # make hidden states on same device as model
        # Returns an all-zero tensor of shape (1, args.rnn_hidden_dim) to provide the RNN with an initial hidden state at the beginning
        return self.fc1.weight.new(1, self.args.rnn_hidden_dim).zero_()

    def forward(self, inputs, hidden_state):
        # The input is processed through the fc1 layer and the hidden layer output x is obtained using the ReLU activation function
        x = F.relu(self.fc1(inputs))
        h_in = hidden_state.reshape(-1, self.args.rnn_hidden_dim)
        # Input x and the reshaped hidden state h_in into the GRU unit to get a new hidden state h
        h = self.rnn(x, h_in)
        # Enter the new hidden state h into the second fully connected layer fc2 and get the Q value q for each action
        q = self.fc2(h)
        return q, h

2.critics——coma.py

import torch as th
import torch.nn as nn
import torch.nn.functional as F
# nn is a neural network module and F is a functional API that provides common activation and loss functions

class COMACritic(nn.Module):
    def __init__(self, scheme, args):
        super(COMACritic, self).__init__()

        self.args = args
        self.n_actions = args.n_actions
        self.n_agents = args.n_agents

        # Call the _get_input_shape method to calculate the shape of the input data based on scheme and store it in input_shape
        input_shape = self._get_input_shape(scheme)
        self.output_type = "q"

        # Set up network layers
        # Define a Layer 3 fully connected network:
        # fc1: Converts the input shape from input_shape to 128.
        # fc2: Keeps the shape of the hidden layer output at 128.
        # fc3: Converts the output shape to the number of actions self.n_actions to output the Q value for each action
        self.fc1 = nn.Linear(input_shape, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, self.n_actions)

    def forward(self, batch, t=None):
        inputs = self._build_inputs(batch, t=t)
        # The output of fc1 and fc2 is activated using the ReLU activation function to generate the output x of the hidden layer.
        # Input the final hidden layer output x to the fc3 layer with the output Q value q.
        x = F.relu(self.fc1(inputs))
        x = F.relu(self.fc2(x))
        q = self.fc3(x)
        return q

    def _build_inputs(self, batch, t=None):
        bs = batch.batch_size
        max_t = batch.max_seq_length if t is None else 1
        ts = slice(None) if t is None else slice(t, t+1)
        inputs = []
        # state
        inputs.append(batch["state"][:, ts].unsqueeze(2).repeat(1, 1, self.n_agents, 1))

        # observation
        inputs.append(batch["obs"][:, ts])

        # actions (masked out by agent)
        actions = batch["actions_onehot"][:, ts].view(bs, max_t, 1, -1).repeat(1, 1, self.n_agents, 1)
        agent_mask = (1 - th.eye(self.n_agents, device=batch.device))
        agent_mask = agent_mask.view(-1, 1).repeat(1, self.n_actions).view(self.n_agents, -1)
        inputs.append(actions * agent_mask.unsqueeze(0).unsqueeze(0))

        # last actions
        if t == 0:
            inputs.append(th.zeros_like(batch["actions_onehot"][:, 0:1]).view(bs, max_t, 1, -1).repeat(1, 1, self.n_agents, 1))
        elif isinstance(t, int):
            inputs.append(batch["actions_onehot"][:, slice(t-1, t)].view(bs, max_t, 1, -1).repeat(1, 1, self.n_agents, 1))
        else:
            last_actions = th.cat([th.zeros_like(batch["actions_onehot"][:, 0:1]), batch["actions_onehot"][:, :-1]], dim=1)
            last_actions = last_actions.view(bs, max_t, 1, -1).repeat(1, 1, self.n_agents, 1)
            inputs.append(last_actions)

        inputs.append(th.eye(self.n_agents, device=batch.device).unsqueeze(0).unsqueeze(0).expand(bs, max_t, -1, -1))

        inputs = th.cat([x.reshape(bs, max_t, self.n_agents, -1) for x in inputs], dim=-1)
        return inputs

    def _get_input_shape(self, scheme):
        # state
        input_shape = scheme["state"]["vshape"]
        # observation
        input_shape += scheme["obs"]["vshape"]
        # actions and last actions
        # The shape of the action and the last action is added to the input shape, taking into account the action of each agent
        input_shape += scheme["actions_onehot"]["vshape"][0] * self.n_agents * 2
        # agent id
        input_shape += self.n_agents
        return input_shape

3.mixers——qmix.py

class QMixer(nn.Module):
    def __init__(self, args):
        super(QMixer, self).__init__()

        self.args = args
        self.n_agents = args.n_agents
        self.state_dim = int(np.prod(args.state_shape))

        # self.embed_dim saves the size of the embedded dimension (i.e. the number of hidden units in the middle layer of the network) 
        self.embed_dim = args.mixing_embed_dim

        # If the value of hypernet_layers in args is 1, the weights are generated using linear layers
        if getattr(args, "hypernet_layers", 1) == 1:
            # self.hyper_w_1 is a linear layer that maps the state dimension state_dim to a weight tensor of the size embed_dim * n_agents
            # which will later be used for the weight calculation of the first layer
            self.hyper_w_1 = nn.Linear(self.state_dim, self.embed_dim * self.n_agents)
            # self.hyper_w_final is another linear layer that is used to generate the weights for the second layer
            self.hyper_w_final = nn.Linear(self.state_dim, self.embed_dim)

        # If hypernet_layers is 2, the weights are generated using a two-layer neural network
        elif getattr(args, "hypernet_layers", 1) == 2:
            hypernet_embed = self.args.hypernet_embed
            # self.hyper_w_1 is a sequential network that maps state to the hypernet_embed size via a linear layer and ReLU activation function
            # and to the embed_dim * n_agents size via another linear layer
            self.hyper_w_1 = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                           nn.ReLU(),
                                           nn.Linear(hypernet_embed, self.embed_dim * self.n_agents))
            self.hyper_w_final = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                           nn.ReLU(),
                                           nn.Linear(hypernet_embed, self.embed_dim))
        # Exception handling
        elif getattr(args, "hypernet_layers", 1) > 2:
            raise Exception("Sorry >2 hypernet layers is not implemented!")
        else:
            raise Exception("Error setting number of hypernet layers.")

        # State dependent bias for hidden layer
        # self.hyper_b_1 is a linear layer that generates the offset items of the first layer
        self.hyper_b_1 = nn.Linear(self.state_dim, self.embed_dim)

        # V(s) instead of a bias for the last layers
        # self.V is a two-layer sequential network that generates a state-based scalar value V(s) that corresponds to the offset term of the second layer.
        # V(s) replaces the traditional linear bias and is used to adjust the final output
        self.V = nn.Sequential(nn.Linear(self.state_dim, self.embed_dim),
                               nn.ReLU(),
                               nn.Linear(self.embed_dim, 1))

    def forward(self, agent_qs, states):
        bs = agent_qs.size(0)
        states = states.reshape(-1, self.state_dim)
        agent_qs = agent_qs.view(-1, 1, self.n_agents)
        # First layer
        w1 = th.abs(self.hyper_w_1(states))
        b1 = self.hyper_b_1(states)
        w1 = w1.view(-1, self.n_agents, self.embed_dim)
        b1 = b1.view(-1, 1, self.embed_dim)
        # Multiply agent_qs with w1 using batch matrix multiplication (th.bmm), plus bias b1
        # and then get the output hidden of the first layer through the ELU activation function
        hidden = F.elu(th.bmm(agent_qs, w1) + b1)
        # Second layer
        # Calculate the weight w_final for the second layer and reshape it to a shape suitable for matrix multiplication (batch_size, embed_dim, 1)
        w_final = th.abs(self.hyper_w_final(states))
        w_final = w_final.view(-1, self.embed_dim, 1)
        # State-dependent bias
        v = self.V(states).view(-1, 1, 1)
        # Compute final output
        y = th.bmm(hidden, w_final) + v
        # Reshape and return
        q_tot = y.view(bs, -1, 1)
        return q_tot

4.mixers——qtran.py

import torch as th
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class QTranBase(nn.Module):
    def __init__(self, args):
        super(QTranBase, self).__init__()

        self.args = args

        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.state_dim = int(np.prod(args.state_shape))
        self.arch = self.args.qtran_arch # QTran architecture

        self.embed_dim = args.mixing_embed_dim

        # Q(s,u)
        # According to qtran_arch architecture, define the input size of Q network q_input_size:
        # If use the coma_critic architecture, the inputs to the Q value network are state and action.
        # If the qtran_paper schema is used, the inputs are state, hidden state, and action
        if self.arch == "coma_critic":
            # Q takes [state, u] as input
            q_input_size = self.state_dim + (self.n_agents * self.n_actions)
        elif self.arch == "qtran_paper":
            # Q takes [state, agent_action_observation_encodings]
            q_input_size = self.state_dim + self.args.rnn_hidden_dim + self.n_actions
        else:
            raise Exception("{} is not a valid QTran architecture".format(self.arch))

        # Q networks and V networks are defined for small network architectures:
        # self.Q is a network of Q-valued functions where the inputs are states and actions and the outputs are Q values.
        # self.V is a V-valued function network where the input is the state and the output is the V value.
        # self.action_encoding is used to handle action encoding, which is entered as hidden state and action.
        # For large network architectures, the defined network is deeper (one more layer of ReLU activation function), and the code structure is the same
        if self.args.network_size == "small":
            self.Q = nn.Sequential(nn.Linear(q_input_size, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, 1))

            # V(s)
            self.V = nn.Sequential(nn.Linear(self.state_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, 1))
            ae_input = self.args.rnn_hidden_dim + self.n_actions
            self.action_encoding = nn.Sequential(nn.Linear(ae_input, ae_input),
                                                 nn.ReLU(),
                                                 nn.Linear(ae_input, ae_input))
        elif self.args.network_size == "big":
            self.Q = nn.Sequential(nn.Linear(q_input_size, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, 1))
            # V(s)
            self.V = nn.Sequential(nn.Linear(self.state_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, self.embed_dim),
                                   nn.ReLU(),
                                   nn.Linear(self.embed_dim, 1))
            ae_input = self.args.rnn_hidden_dim + self.n_actions
            self.action_encoding = nn.Sequential(nn.Linear(ae_input, ae_input),
                                                 nn.ReLU(),
                                                 nn.Linear(ae_input, ae_input))
        else:
            assert False

    def forward(self, batch, hidden_states, actions=None):
        bs = batch.batch_size
        ts = batch.max_seq_length

        states = batch["state"].reshape(bs * ts, self.state_dim)
        # If actions is empty, use the unique heat encoding action data in the batch. The states and actions are then spliced into a model
        if self.arch == "coma_critic":
            if actions is None:
                # Use the actions taken by the agents
                actions = batch["actions_onehot"].reshape(bs * ts, self.n_agents * self.n_actions)
            else:
                # It will arrive as (bs, ts, agents, actions), we need to reshape it
                actions = actions.reshape(bs * ts, self.n_agents * self.n_actions)
            inputs = th.cat([states, actions], dim=1)
        # For qtran_paper architecture, state, hidden state and action are combined into the network, and the actions of each agent are encoded and summed
        elif self.arch == "qtran_paper":
            if actions is None:
                # Use the actions taken by the agents
                actions = batch["actions_onehot"].reshape(bs * ts, self.n_agents, self.n_actions)
            else:
                # It will arrive as (bs, ts, agents, actions), we need to reshape it
                actions = actions.reshape(bs * ts, self.n_agents, self.n_actions)

            hidden_states = hidden_states.reshape(bs * ts, self.n_agents, -1)
            agent_state_action_input = th.cat([hidden_states, actions], dim=2)
            agent_state_action_encoding = self.action_encoding(agent_state_action_input.reshape(bs * ts * self.n_agents, -1)).reshape(bs * ts, self.n_agents, -1)
            agent_state_action_encoding = agent_state_action_encoding.sum(dim=1) # Sum across agents

            inputs = th.cat([states, agent_state_action_encoding], dim=1)

        q_outputs = self.Q(inputs)

        states = batch["state"].reshape(bs * ts, self.state_dim)
        v_outputs = self.V(states)

        return q_outputs, v_outputs

5.mixers——vdn.py

class VDNMixer(nn.Module):
    # This constructor does not define any additional parameters or network structures
    # because the core idea of VDN is simply to sum the Q values of individual agents without requiring additional network layers
    def __init__(self):
        super(VDNMixer, self).__init__()

    def forward(self, agent_qs, batch):
        return th.sum(agent_qs, dim=2, keepdim=True)

PyMARL框架学习（二）——main.py & run.py & learners & modules

PyMARL框架学习系列目录

文章目录

一、main.py

1.引入库

2.创建sacred对象

3.my_main函数

4._get_config函数

5.recursive_dict_update函数

6.config_copy函数

6.主程序_main_

二、run.py

1.引入库

2.run函数

3.evaluate_sequential函数

4.run_sequential函数

5.args_sanity_check函数

三、learners模块

1.coma_learner.py

2.q_learner.py

3.qtran_learner.py

四、modules模块

1.agents——rnn_agent.py

2.critics——coma.py

3.mixers——qmix.py

4.mixers——qtran.py

5.mixers——vdn.py