Implementing the 'simple' scenario of the multi-agent particle envs (MPE) with DDPG

This post describes how to use DDPG to solve the single-agent navigation task in the 'simple' scenario of the multi-agent particle envs (MPE). DDPG is an improvement on the actor-critic (AC) framework: it adds target networks to stabilize training and outputs continuous actions directly. The code is adapted from Morvan's (莫烦) reinforcement learning tutorials; in the author's view, a solid understanding of the algorithm and familiarity with the environment are the keys to getting this working.

About the 'simple' scenario:

'simple' is the most basic scenario in the multi-agent particle envs (MPE): a single agent is rewarded for navigating to a single landmark, so it is mainly useful for testing an algorithm and getting familiar with the environment. I used DDPG in MPE to implement this single-agent navigation task.
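Before training it is worth creating the scenario once and checking the observation and action formats. Below is a minimal sanity check, assuming the make_env helper from openai/multiagent-particle-envs that the main code also imports:

from make_env import make_env

env = make_env('simple')              # one agent, one landmark
obs_n = env.reset()                   # list with one observation per agent
print(len(obs_n), obs_n[0].shape)     # 1, (4,)
print(env.observation_space[0])       # matches STATE_DIM = 4 below
print(env.action_space[0])            # 5 movement slots: [noop, +x, -x, +y, -y]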

DDPG builds on the actor-critic (AC) framework: it adds target networks to keep learning stable and outputs continuous actions directly. I won't repeat the details here; if you are not familiar with it, see Morvan's (莫烦) reinforcement learning tutorials.
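For reference, the two updates that the code below implements are the standard DDPG ones (actor \mu with parameters \theta^{\mu}, critic Q with parameters \theta^{Q}, primed symbols for the target networks):

y_i = r_i + \gamma \, Q'\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)

L = \frac{1}{N} \sum_i \left( y_i - Q(s_i, a_i \mid \theta^{Q}) \right)^2

\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i \nabla_{a} Q(s_i, a \mid \theta^{Q}) \Big|_{a = \mu(s_i)} \nabla_{\theta^{\mu}} \mu(s_i \mid \theta^{\mu})

One difference from the original paper: instead of soft (Polyak) target updates, the code here hard-copies the evaluation networks into the target networks every REPLACE_ITER_A / REPLACE_ITER_C learning steps (the t_replace_iter counter in the Actor class, and its counterpart for the critic), which is how Morvan's tutorial code does it.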

Now for the code:

# -*- coding: utf-8 -*-
"""
Created on Tue Feb 26 09:17:43 2019

@author: Jack Lee
"""
from make_env import make_env
import tensorflow as tf
import numpy as np
import os
import shutil


np.random.seed(1)
tf.set_random_seed(1)

MAX_EPISODES = 600
MAX_EP_STEPS = 200
LR_A = 1e-3  # learning rate for actor
LR_C = 1e-3  # learning rate for critic
GAMMA = 0.9  # reward discount
REPLACE_ITER_A = 1100  # hard-replace the target actor every 1100 learning steps
REPLACE_ITER_C = 1000  # hard-replace the target critic every 1000 learning steps
MEMORY_CAPACITY = 5000
BATCH_SIZE = 16
VAR_MIN = 0.1  # lower bound on the exploration-noise scale
RENDER = True
LOAD = False  # whether to load a previously trained model
MODE = ['easy', 'hard']
n_model = 1


env = make_env('simple')
STATE_DIM = 4               # observation: [vel_x, vel_y, rel_landmark_x, rel_landmark_y]
ACTION_DIM = 2              # continuous force along x and y
ACTION_BOUND = [-0.2, 0.2]  # range of each force component


with tf.name_scope('S'):
    S = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s')
with tf.name_scope('R'):
    R = tf.placeholder(tf.float32, [None, 1], name='r')
with tf.name_scope('S_'):
    S_ = tf.placeholder(tf.float32, shape=[None, STATE_DIM], name='s_')


def Act225(a):
    # Map the 2-D continuous action [ax, ay] onto MPE's 5-D action vector
    # [noop, +x, -x, +y, -y]: slots 1 and 3 carry the forces, the rest stay 0.
    a = a[np.newaxis, :]
    return [[0, a[0][0], 0, a[0][1], 0]]

def Act522(a):
    # Inverse of Act225: pull the two force components back out of the 5-D vector.
    return [[a[0][1], a[0][3]]]
    
    
    
class Actor(object):
    def __init__(self, sess, action_dim, action_bound, learning_rate, t_replace_iter):
        self.sess = sess
        self.a_dim = action_dim
        self.action_bound = action_bound
        self.lr = learning_rate
        self.t_replace_iter = t_replace_iter
        self.t_replace_counter = 0

        with tf.variable_scope('Actor'):
            # input s, output a
            self.a = self._build_net(S, scope='eval_net', trainable=True)

            # input s_, output a, get a_ for critic
            self.a_ = self._build_net(S_, scope='target_net', trainable=False)

        self.e_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_net')
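        # ----------------------------------------------------------------------
        # The original post appears to be truncated at this point. What follows
        # is NOT the author's code: it is a minimal sketch, in the same
        # Morvan-style TF1 layout, of how the Actor class usually continues
        # (target-variable collection, hard replacement, choose_action and the
        # policy-gradient train op). The Critic, the replay memory and the
        # training loop are still missing.
        # ----------------------------------------------------------------------
        self.t_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_net')

    def _build_net(self, s, scope, trainable):
        # One hidden layer, tanh output scaled into ACTION_BOUND.
        with tf.variable_scope(scope):
            init_w = tf.random_normal_initializer(0., 0.01)
            net = tf.layers.dense(s, 100, activation=tf.nn.relu,
                                  kernel_initializer=init_w, name='l1', trainable=trainable)
            a = tf.layers.dense(net, self.a_dim, activation=tf.nn.tanh,
                                kernel_initializer=init_w, name='a', trainable=trainable)
            return tf.multiply(a, self.action_bound[1], name='scaled_a')  # action_bound = [-0.2, 0.2]

    def add_grad_to_graph(self, a_grads):
        # Deterministic policy gradient: dQ/da (supplied by the critic) chained
        # through da/dtheta; the negative learning rate turns descent into ascent.
        with tf.variable_scope('policy_grads'):
            self.policy_grads = tf.gradients(ys=self.a, xs=self.e_params, grad_ys=a_grads)
        with tf.variable_scope('A_train'):
            opt = tf.train.RMSPropOptimizer(-self.lr)
            self.train_op = opt.apply_gradients(zip(self.policy_grads, self.e_params))

    def choose_action(self, s):
        s = s[np.newaxis, :]                                  # single observation -> batch of 1
        return self.sess.run(self.a, feed_dict={S: s})[0]     # 2-D continuous action

    def learn(self, s):
        self.sess.run(self.train_op, feed_dict={S: s})
        # Hard replacement: copy eval params into target params every t_replace_iter steps.
        if self.t_replace_counter % self.t_replace_iter == 0:
            self.sess.run([tf.assign(t, e) for t, e in zip(self.t_params, self.e_params)])
        self.t_replace_counter += 1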
        