### A Simple Application of DDPG in the MPE Environment
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm for continuous action spaces that combines the strengths of policy-gradient methods and Q-learning. To build a simple DDPG application on top of the Multi-Agent Particle Environment (MPE), the model and experiment can be set up as follows.
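For reference, standard DDPG regresses the critic onto a bootstrapped target and updates the actor along the deterministic policy gradient; the training loop in step 7 follows this scheme:

$$
y = r + \gamma\,(1-d)\,Q'\big(s',\,\mu'(s')\big),\qquad
\mathcal{L}_{\text{critic}} = \big(y - Q(s,a)\big)^{2},\qquad
\mathcal{L}_{\text{actor}} = -\,Q\big(s,\,\mu(s)\big)
$$

where $\mu$, $Q$ are the online actor and critic, $\mu'$, $Q'$ are their target copies, and $d$ is the terminal flag.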
#### 1. Install Dependencies
First, install the required Python packages, including `gym`, an MPE package, and a machine learning framework such as PyTorch or TensorFlow. The example below is based on PyTorch. Note that MPE ships under several package names (e.g. PettingZoo distributes it as `pettingzoo[mpe]`); adjust the package name and the import in step 2 to match your installation:
```bash
pip install gym torch numpy matplotlib mpe-envs
```
#### 2. Import Required Modules
Import the required libraries; the environment initialization itself is defined in the next step:
```python
import os
from collections import deque
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
from mpe_envs.simple_spread import SimpleSpreadEnv
```
#### 3. Initialize the MPE Environment
Here we take `simple_spread.py` as an example and create a multi-agent spread (cooperative navigation) environment instance:
```python
env = SimpleSpreadEnv(n_agents=3, local_ratio=0.5, max_cycles=25)
state_dim = env.observation_space[0].shape[0]
# DDPG requires a continuous action space; `.shape[0]` assumes a Box space
# (the `.n` attribute is only defined for discrete spaces).
action_dim = env.action_space[0].shape[0]
max_episodes = 10000
max_timesteps = 25
batch_size = 64
gamma = 0.99
tau = 0.001
actor_lr = 1e-4
critic_lr = 1e-3
replay_buffer_capacity = int(1e6)
writer = SummaryWriter()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
```
The snippet above sets the maximum number of episodes, the per-episode time-step limit, and the remaining hyperparameters[^1].
#### 4. Build the Actor-Critic Network Architecture
The designs of the Actor and Critic networks are given below:
```python
class Actor(nn.Module):
    """Deterministic policy network: maps a state to a bounded action."""
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        output = torch.tanh(self.fc3(x)) * 2  # Assuming actions are bounded in [-2, 2].
        return output


class Critic(nn.Module):
    """Action-value network: maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.fc3 = nn.Linear(300, 1)

    def forward(self, state, action):
        concat_input = torch.cat([state, action], dim=-1)
        x = torch.relu(self.fc1(concat_input))
        x = torch.relu(self.fc2(x))
        value = self.fc3(x)
        return value
```
These networks approximate the deterministic control policy μ(s) (DDPG learns a deterministic actor rather than a stochastic π(a|s)) and its corresponding action-value function Q(s, a).
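As a quick sanity check (a minimal sketch, assuming the `state_dim` and `action_dim` values defined in step 3), a dummy forward pass confirms the expected output shapes:
```python
# Illustrative shape check with random inputs; not part of the training loop.
dummy_state = torch.randn(4, state_dim)                              # a batch of 4 observations
dummy_action = Actor(state_dim, action_dim)(dummy_state)             # shape: (4, action_dim)
dummy_q = Critic(state_dim, action_dim)(dummy_state, dummy_action)   # shape: (4, 1)
print(dummy_action.shape, dummy_q.shape)
```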
#### 5. Create the Experience Replay Buffer
A fixed-capacity experience pool stores the interaction data, and training mini-batches are sampled from it:
```python
class ReplayBuffer:
    def __init__(self, capacity=replay_buffer_capacity):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, transition_tuple):
        """transition_tuple should contain (state, action, reward, next_state, done)."""
        self.buffer.append(transition_tuple)

    def sample(self, batch_size=batch_size):
        transitions = random.sample(self.buffer, k=batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        # Cast to float32 to match the network weights; keep `dones` boolean for masking.
        states, actions, rewards, next_states = (
            torch.as_tensor(np.array(x), dtype=torch.float32)
            for x in (states, actions, rewards, next_states))
        dones = torch.as_tensor(np.array(dones), dtype=torch.bool)
        return states, actions, rewards, next_states, dones
```
This design breaks the temporal correlation between consecutive samples and thereby stabilizes the value estimates.
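A minimal usage sketch (with made-up transitions, mirroring the per-agent push scheme used in step 7):
```python
# Illustrative only: fill a throwaway buffer with random per-agent transitions, then sample.
demo_buffer = ReplayBuffer()
for _ in range(batch_size):
    s = np.random.randn(state_dim).astype(np.float32)
    a = np.random.uniform(-2, 2, size=action_dim).astype(np.float32)
    demo_buffer.push((s, a, 0.0, s, False))
s_b, a_b, r_b, ns_b, d_b = demo_buffer.sample()
# shapes: (batch_size, state_dim), (batch_size, action_dim), (batch_size,); dones dtype: torch.bool
print(s_b.shape, a_b.shape, r_b.shape, d_b.dtype)
```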
#### 6. Update the Target Network Weights
A soft-update mechanism keeps the target parameters synchronized with the main networks:
```python
def soft_update(target_net, source_net, tau=tau):
for target_param, param in zip(target_net.parameters(), source_net.parameters()):
target_param.data.copy_(target_param.data * (1 - tau) + param.data * tau)
```
This smooths the adjustment of the target networks and reduces the likelihood of oscillations during training.
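Written out, each target parameter $\theta'$ is nudged toward its online counterpart $\theta$:

$$
\theta' \leftarrow \tau\,\theta + (1-\tau)\,\theta'
$$

with $\tau = 0.001$ as set in step 3, so the target networks trail the online networks slowly.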
#### 7. Training Loop
Finally, the full training script ties all of the pieces together:
```python
# Instantiate the online and target networks and copy the initial weights.
actor = Actor(state_dim, action_dim).to(device)
critic = Critic(state_dim, action_dim).to(device)
target_actor = Actor(state_dim, action_dim).to(device)
target_critic = Critic(state_dim, action_dim).to(device)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())

actor_optimizer = optim.Adam(actor.parameters(), lr=actor_lr)
critic_optimizer = optim.Adam(critic.parameters(), lr=critic_lr)
buffer = ReplayBuffer()

total_steps = 0
episode_rewards = []
for episode in range(max_episodes):
    obs_n = env.reset()
    ep_reward = 0
    for t in range(max_timesteps):
        total_steps += 1
        # Select an action for each agent with the shared actor,
        # adding clipped Gaussian noise for exploration.
        with torch.no_grad():
            a_n = []
            for o_i in obs_n:
                o_i = torch.from_numpy(o_i).float().unsqueeze(0).to(device)
                a_i = actor(o_i).squeeze(0).cpu().numpy()
                noise = np.random.normal(loc=0., scale=0.1, size=a_i.shape)
                clipped_noise = np.clip(noise, -0.5, 0.5)
                a_noisy = np.clip(a_i + clipped_noise, -2, 2)
                a_n.append(a_noisy)
        next_obs_n, r_n, d_n, _ = env.step(a_n)
        # Store one transition per agent so sampled batches have the flat
        # (batch, state_dim) / (batch, action_dim) shapes the networks expect.
        for i in range(len(obs_n)):
            buffer.push((obs_n[i], a_n[i], r_n[i], next_obs_n[i], d_n[i]))
        # Update the networks every 10 environment steps.
        if len(buffer) >= batch_size and total_steps % 10 == 0:
            s_b, a_b, r_b, ns_b, d_b = [x.to(device) for x in buffer.sample()]
            # Critic update: regress Q(s, a) onto the bootstrapped target.
            with torch.no_grad():
                y_target = r_b.unsqueeze(-1) + gamma * \
                    target_critic(ns_b, target_actor(ns_b)) * (~d_b).float().unsqueeze(-1)
            q_value = critic(s_b, a_b)
            loss_q = ((y_target - q_value) ** 2).mean()
            critic_optimizer.zero_grad()
            loss_q.backward()
            critic_optimizer.step()
            # Actor update: maximize Q(s, mu(s)).
            policy_loss = -critic(s_b, actor(s_b)).mean()
            actor_optimizer.zero_grad()
            policy_loss.backward()
            actor_optimizer.step()
            # Soft-update both target networks.
            soft_update(target_actor, actor)
            soft_update(target_critic, critic)
        obs_n = next_obs_n
        ep_reward += sum(r_n)
        if all(d_n):
            break
    writer.add_scalar('Episode Reward', ep_reward, global_step=episode)
    episode_rewards.append(ep_reward)
    print(f"Episode {episode}: Total Reward={ep_reward:.2f}")
```
The above is a basic DDPG implementation for the multi-agent particle simulation scenario.
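Since `matplotlib` is already listed among the dependencies, the collected `episode_rewards` can be plotted after training to inspect the learning curve (a minimal sketch; the 100-episode smoothing window is an arbitrary choice):
```python
import matplotlib.pyplot as plt

# Plot raw episode returns together with a simple moving average.
window = 100  # arbitrary smoothing window
smoothed = np.convolve(episode_rewards, np.ones(window) / window, mode='valid')
plt.plot(episode_rewards, alpha=0.3, label='raw episode reward')
plt.plot(range(window - 1, len(episode_rewards)), smoothed, label=f'{window}-episode moving average')
plt.xlabel('Episode')
plt.ylabel('Total reward')
plt.legend()
plt.savefig('ddpg_mpe_rewards.png')
```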