This code implements a reinforcement learning agent based on the Deep Q-Network (DQN) algorithm to solve the balancing problem in the CartPole-v1 environment. The agent interacts with the environment and learns an optimal action policy that maximizes cumulative reward. The code covers defining the Q-network architecture, setting hyperparameters, creating the environment, initializing the networks and optimizer, implementing experience replay, and running the training loop.
Detailed code walkthrough
Importing the required libraries
```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
```
- `gymnasium`: creates and manages the reinforcement learning environment, here CartPole-v1.
- `torch`: the PyTorch deep learning framework, used to build the neural network and run tensor operations.
- `torch.nn`: neural network modules and layer definitions.
- `torch.optim`: optimization algorithms such as the Adam optimizer.
- `numpy`: numerical computation and array operations (imported here but not actually used by this script).
- `random`: random number generation, used to balance exploration and exploitation when selecting actions.
Defining the Q-network
```python
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```
- The `QNetwork` class inherits from `nn.Module` and defines a multilayer perceptron (MLP) that serves as the Q-network, estimating the state-action value function.
- `__init__`: sets up the fully connected layers, with input dimension `input_dim`, output dimension `output_dim`, and two hidden layers of 64 neurons each.
- `forward`: defines the forward pass: the input state `x` goes through the two ReLU-activated hidden layers, and the final layer outputs one Q-value per action. (A quick shape check is sketched below.)
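As a quick sanity check (not part of the original script), the network can be instantiated with CartPole-v1's dimensions and fed a dummy state to confirm the output shape; the `4` and `2` below are CartPole-v1's state and action dimensions.

```python
# Illustrative only: one forward pass through the Q-network.
import torch

net = QNetwork(input_dim=4, output_dim=2)   # CartPole-v1: 4 state features, 2 actions
dummy_state = torch.zeros(1, 4)             # a batch containing a single state
q_values = net(dummy_state)
print(q_values.shape)                       # torch.Size([1, 2]) -> one Q-value per action
```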
Hyperparameters
```python
GAMMA = 0.99
EPSILON = 1.0
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01
LEARNING_RATE = 0.001
BATCH_SIZE = 64
MEMORY_SIZE = 10000
EPISODES = 500
```
- `GAMMA`: discount factor used to compute the present value of future rewards; it lies in [0, 1], and values closer to 1 weight future rewards more heavily.
- `EPSILON`: exploration rate; the agent explores randomly with probability `EPSILON` and otherwise picks the action it currently believes is best (probability `1 - EPSILON`). The initial value of 1.0 means pure exploration.
- `EPSILON_DECAY`: decay factor; after each episode `EPSILON` is multiplied by this factor, gradually reducing exploration.
- `EPSILON_MIN`: lower bound on the exploration rate, so it never decays too close to zero.
- `LEARNING_RATE`: learning rate of the optimizer, controlling the step size of each parameter update.
- `BATCH_SIZE`: number of transitions sampled from the replay buffer per update.
- `MEMORY_SIZE`: maximum capacity of the replay buffer.
- `EPISODES`: total number of training episodes.
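To make the exploration schedule concrete, here is a small, self-contained calculation (not in the original code) of how `EPSILON` evolves under this decay rule; with these values it only drops to roughly 0.08 after 500 episodes, so `EPSILON_MIN` is never actually reached within this run.

```python
# Illustrative only: how the exploration rate decays over training.
EPSILON, EPSILON_DECAY, EPSILON_MIN, EPISODES = 1.0, 0.995, 0.01, 500

eps = EPSILON
for episode in range(EPISODES):
    if eps > EPSILON_MIN:
        eps *= EPSILON_DECAY
print(round(eps, 4))   # ~0.0816 after 500 episodes, still above EPSILON_MIN
```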
Creating the environment
```python
env = gym.make('CartPole-v1')
input_dim = env.observation_space.shape[0]
output_dim = env.action_space.n
```
- `gym.make` creates the CartPole-v1 environment.
- `input_dim` is the dimensionality of the observation space, i.e. the number of state features the agent observes (4 for CartPole-v1).
- `output_dim` is the size of the action space, i.e. the number of discrete actions the agent can take (2 for CartPole-v1).
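For reference, a quick inspection of the environment created above (not part of the original script) confirms these dimensions:

```python
# Illustrative check of the environment's spaces, reusing the env created above.
print(env.observation_space)   # Box(..., (4,), float32): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): push left (0) or push right (1)
```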
Initializing the Q-network and the target network
```python
q_network = QNetwork(input_dim, output_dim)
target_network = QNetwork(input_dim, output_dim)
target_network.load_state_dict(q_network.state_dict())
```
- Two Q-network instances are created: `q_network` and `target_network`. The target network stabilizes training; its parameters are periodically synchronized with those of `q_network`.
- `load_state_dict` copies the parameters of `q_network` into `target_network`, so both networks start out identical.
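The script uses a hard update (copying all parameters every 10 episodes). A common alternative, not used here, is a soft (Polyak) update that blends the two parameter sets a little at every step; a minimal sketch, assuming a mixing coefficient `tau`:

```python
# Soft (Polyak) target update -- an alternative to the periodic hard copy used in this script.
def soft_update(target_net, online_net, tau=0.005):
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```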
Defining the optimizer and loss function
```python
optimizer = optim.Adam(q_network.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()
```
- `optimizer`: the Adam optimizer updates the parameters of `q_network` with learning rate `LEARNING_RATE`.
- `criterion`: the mean squared error (MSE) loss measures the gap between the predicted Q-values and the target Q-values.
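MSE is fine for CartPole, but many DQN implementations prefer the Huber loss, which is less sensitive to occasional large TD errors; in PyTorch that would be a one-line swap (shown only as an option, not what this script uses):

```python
# Optional alternative to MSE: the Huber loss is more robust to outlier TD errors.
criterion = nn.SmoothL1Loss()
```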
Experience replay buffer
```python
memory = []
```
`memory` is a plain Python list that stores the agent's transitions; each entry contains the state, action, reward, next state, and a done flag. (A bounded alternative is sketched below.)
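Because the training loop later evicts old transitions with `memory.pop(0)`, which is O(n) on a Python list, a `collections.deque` with `maxlen` is a common drop-in replacement that handles eviction automatically; a minimal sketch, not what the script uses:

```python
# Optional: a bounded deque evicts the oldest transition automatically and in O(1).
from collections import deque

memory = deque(maxlen=MEMORY_SIZE)
# memory.append(...) then never needs an explicit pop(0) or size check.
```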
Selecting actions
```python
def select_action(state):
    global EPSILON
    if random.uniform(0, 1) < EPSILON:
        return env.action_space.sample()
    else:
        state = torch.FloatTensor(state).unsqueeze(0)
        q_values = q_network(state)
        action = torch.argmax(q_values, dim=1).item()
        return action
```
The `select_action` function chooses an action for the current state using an ε-greedy rule:
- If a uniform random number is below `EPSILON`, a random action is sampled from the action space.
- Otherwise the state is converted to a PyTorch tensor, passed through `q_network`, and the action with the highest estimated Q-value is returned.
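Since action selection only reads the network, the greedy branch can be wrapped in `torch.no_grad()` to avoid building an unnecessary autograd graph; a behavior-preserving variant of the same function:

```python
# Variant of select_action with gradient tracking disabled during inference.
def select_action(state):
    global EPSILON
    if random.uniform(0, 1) < EPSILON:
        return env.action_space.sample()
    with torch.no_grad():
        state = torch.FloatTensor(state).unsqueeze(0)
        q_values = q_network(state)
        return torch.argmax(q_values, dim=1).item()
```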
Experience replay
```python
def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)
    q_values = q_network(states)
    q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    next_q_values = target_network(next_states)
    next_q_values = next_q_values.max(1)[0]
    target_q_values = rewards + (1 - dones) * GAMMA * next_q_values
    loss = criterion(q_values, target_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
The `replay` function implements the experience replay update:
- If the buffer holds fewer than `BATCH_SIZE` transitions, it returns immediately.
- Otherwise it samples a random minibatch and unzips it into states, actions, rewards, next states, and done flags.
- These are converted to PyTorch tensors; the Q-values of the actions actually taken are gathered from `q_network`, and the maximum Q-value of each next state is taken from `target_network`.
- The target follows the Q-learning rule `target = reward + (1 - done) * GAMMA * max_a Q_target(next_state, a)`; the MSE loss between predicted and target Q-values is backpropagated and `q_network` is updated. (Two optional refinements are sketched below.)
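Two small refinements are worth noting, though the original code runs as written: converting the tuple of NumPy arrays to a single `np.array` before building tensors avoids PyTorch's slow-conversion warning, and computing the target under `torch.no_grad()` keeps the target network out of the backward pass. A sketch under those assumptions:

```python
# Variant of replay() with batched NumPy conversion and a no-grad target.
def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.FloatTensor(np.array(states))
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(np.array(next_states))
    dones = torch.FloatTensor(dones)

    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target should not receive gradients
        next_q_values = target_network(next_states).max(1)[0]
        target_q_values = rewards + (1 - dones) * GAMMA * next_q_values

    loss = criterion(q_values, target_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```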
Training loop
```python
for episode in range(EPISODES):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        memory.append((state, action, reward, next_state, done))
        if len(memory) > MEMORY_SIZE:
            memory.pop(0)
        replay()
        state = next_state
    # Update the target network
    if episode % 10 == 0:
        target_network.load_state_dict(q_network.state_dict())
    # Decay the exploration rate
    if EPSILON > EPSILON_MIN:
        EPSILON *= EPSILON_DECAY
    print(f"Episode {episode}: Total Reward = {total_reward}, Epsilon = {EPSILON:.2f}")
```
- The outer loop runs for `EPISODES` episodes. At the start of each episode the environment is reset and the total reward and done flag are initialized.
- In the inner loop the agent selects an action for the current state, executes it, and receives the next state, the reward, and the termination/truncation flags.
- Each transition is appended to the replay buffer; if the buffer exceeds `MEMORY_SIZE`, the oldest transition is removed.
- `replay` is called to sample a minibatch and update the network parameters.
- Every 10 episodes the target network's parameters are synchronized with `q_network`.
- If `EPSILON` is still above `EPSILON_MIN`, it is multiplied by `EPSILON_DECAY`.
- The total reward and current exploration rate are printed at the end of every episode.
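After training, it is often useful to save the learned weights and run a few greedy (ε = 0) evaluation episodes. Neither step is in the original script, so the following is just one possible addition; the file name `dqn_cartpole.pt` is arbitrary.

```python
# Optional post-training steps: save the weights and evaluate the greedy policy.
torch.save(q_network.state_dict(), "dqn_cartpole.pt")   # file name is an arbitrary choice

eval_env = gym.make('CartPole-v1')
for _ in range(5):
    state, _ = eval_env.reset()
    done, episode_reward = False, 0.0
    while not done:
        with torch.no_grad():
            action = q_network(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()
        state, reward, terminated, truncated, _ = eval_env.step(action)
        done = terminated or truncated
        episode_reward += reward
    print(f"Greedy evaluation reward: {episode_reward}")
eval_env.close()
```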
Closing the environment
```python
env.close()
```
- Closing the environment after training releases its resources.

Through these steps, the code implements the training of a DQN-based reinforcement learning agent in the CartPole-v1 environment, continually optimizing the Q-network to find an optimal action policy.
Complete code
```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random

# Define the Q-network
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Hyperparameters
GAMMA = 0.99
EPSILON = 1.0
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01
LEARNING_RATE = 0.001
BATCH_SIZE = 64
MEMORY_SIZE = 10000
EPISODES = 500

# Create the environment
env = gym.make('CartPole-v1')
input_dim = env.observation_space.shape[0]
output_dim = env.action_space.n

# Initialize the Q-network and the target network
q_network = QNetwork(input_dim, output_dim)
target_network = QNetwork(input_dim, output_dim)
target_network.load_state_dict(q_network.state_dict())

# Define the optimizer and loss function
optimizer = optim.Adam(q_network.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()

# Experience replay buffer
memory = []

# Select an action
def select_action(state):
    global EPSILON
    if random.uniform(0, 1) < EPSILON:
        return env.action_space.sample()
    else:
        state = torch.FloatTensor(state).unsqueeze(0)
        q_values = q_network(state)
        action = torch.argmax(q_values, dim=1).item()
        return action

# Experience replay
def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)
    q_values = q_network(states)
    q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    next_q_values = target_network(next_states)
    next_q_values = next_q_values.max(1)[0]
    target_q_values = rewards + (1 - dones) * GAMMA * next_q_values
    loss = criterion(q_values, target_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Training loop
for episode in range(EPISODES):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        memory.append((state, action, reward, next_state, done))
        if len(memory) > MEMORY_SIZE:
            memory.pop(0)
        replay()
        state = next_state
    # Update the target network
    if episode % 10 == 0:
        target_network.load_state_dict(q_network.state_dict())
    # Decay the exploration rate
    if EPSILON > EPSILON_MIN:
        EPSILON *= EPSILON_DECAY
    print(f"Episode {episode}: Total Reward = {total_reward}, Epsilon = {EPSILON:.2f}")

env.close()
```
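Results of the complete script vary from run to run because nothing is seeded. If reproducibility matters, seeds can be fixed near the top of the script; this is an optional addition, not part of the original code.

```python
# Optional: fix random seeds for (approximate) reproducibility.
# These calls would go right after the environment is created, before training starts.
SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
env.reset(seed=SEED)            # gymnasium seeds the environment through reset()
env.action_space.seed(SEED)     # seeds the sampler used for random exploration
```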
Complete code (with visualization)
```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random

# Define the Q-network
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Hyperparameters
GAMMA = 0.99
EPSILON = 1.0
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01
LEARNING_RATE = 0.001
BATCH_SIZE = 64
MEMORY_SIZE = 10000
EPISODES = 500

# Create the environment (with on-screen rendering)
env = gym.make('CartPole-v1', render_mode='human')
input_dim = env.observation_space.shape[0]
output_dim = env.action_space.n

# Initialize the Q-network and the target network
q_network = QNetwork(input_dim, output_dim)
target_network = QNetwork(input_dim, output_dim)
target_network.load_state_dict(q_network.state_dict())

# Define the optimizer and loss function
optimizer = optim.Adam(q_network.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()

# Experience replay buffer
memory = []

# Select an action
def select_action(state):
    global EPSILON
    if random.uniform(0, 1) < EPSILON:
        return env.action_space.sample()
    else:
        state = torch.FloatTensor(state).unsqueeze(0)
        q_values = q_network(state)
        action = torch.argmax(q_values, dim=1).item()
        return action

# Experience replay
def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)
    q_values = q_network(states)
    q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    next_q_values = target_network(next_states)
    next_q_values = next_q_values.max(1)[0]
    target_q_values = rewards + (1 - dones) * GAMMA * next_q_values
    loss = criterion(q_values, target_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Training loop
for episode in range(EPISODES):
    state, _ = env.reset()
    total_reward = 0
    done = False
    while not done:
        env.render()
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        memory.append((state, action, reward, next_state, done))
        if len(memory) > MEMORY_SIZE:
            memory.pop(0)
        replay()
        state = next_state
    # Update the target network
    if episode % 10 == 0:
        target_network.load_state_dict(q_network.state_dict())
    # Decay the exploration rate
    if EPSILON > EPSILON_MIN:
        EPSILON *= EPSILON_DECAY
    print(f"Episode {episode}: Total Reward = {total_reward}, Epsilon = {EPSILON:.2f}")

env.close()
```