Soft Actor-Critic (SAC) is a deep reinforcement learning algorithm built on the maximum entropy framework, and it performs very well on continuous-action control tasks. Below, I give a comprehensive overview of SAC's principles, its implementation details, and its application to robot control.
1. Core Ideas of SAC
1.1 Maximum Entropy Reinforcement Learning
SAC's central innovation is its maximum entropy objective: the policy maximizes not only the cumulative reward but also its own entropy:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t} r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$$
where $\alpha$ is the temperature coefficient that controls how much the entropy term matters.
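To make the entropy term concrete, here is a minimal sketch of mine (not code from the original algorithm) that evaluates the entropy-augmented return of a short trajectory; the reward and log-probability values are made up:

```python
import torch

# Made-up per-step rewards and log-probabilities log π(a_t|s_t) of the sampled actions
rewards = torch.tensor([1.0, 0.5, 2.0])
log_probs = torch.tensor([-1.2, -0.8, -1.5])
alpha = 0.2  # temperature coefficient

# Entropy-augmented return: Σ_t [ r_t - α · log π(a_t|s_t) ]
# (-log π(a_t|s_t) is a single-sample estimate of H(π(·|s_t)))
soft_return = (rewards - alpha * log_probs).sum()
print(soft_return)  # a larger α favors more stochastic (higher-entropy) behavior
```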
1.2 Key Technical Innovations
| Technique | Description | Benefit |
|---|---|---|
| Stochastic policy | Outputs an action distribution instead of a deterministic action | Better exploration |
| Automatic temperature tuning | Dynamically adjusts the entropy-regularization coefficient | No manual tuning of $\alpha$ |
| Twin Q-networks | Uses two Q-function estimators | Reduces overestimation bias |
| Target networks | Uses slowly updated target networks | More stable training |
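The last two rows of the table appear again inside the full implementation in Section 3; as a standalone sketch (with tiny dummy networks, purely for illustration), clipped double-Q and the soft target update look like this:

```python
import torch
import torch.nn as nn

# Two independent critics and a slowly-updated target copy (dummy one-layer nets)
critic = nn.ModuleList([nn.Linear(4, 1), nn.Linear(4, 1)])
critic_target = nn.ModuleList([nn.Linear(4, 1), nn.Linear(4, 1)])
critic_target.load_state_dict(critic.state_dict())

sa = torch.randn(8, 4)                            # fake state-action batch
q_min = torch.min(critic[0](sa), critic[1](sa))   # clipped double-Q estimate

# Soft (Polyak) update of the target network
tau = 0.005
for p, p_targ in zip(critic.parameters(), critic_target.parameters()):
    p_targ.data.copy_(tau * p.data + (1 - tau) * p_targ.data)
```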
2. Mathematical Foundations of SAC
2.1 Value Function and Q-Function
State value function:
$$V(s) = \mathbb{E}_{a \sim \pi}\big[Q(s,a) - \alpha \log \pi(a \mid s)\big]$$
Q-function update target:
$$Q(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s' \sim p}\big[V(s')\big]$$
2.2 Policy Update
The policy is updated by minimizing a KL divergence:
$$\pi_{\text{new}} = \arg\min_{\pi'} D_{\mathrm{KL}}\!\left(\pi'(\cdot \mid s) \,\middle\|\, \frac{\exp\!\big(Q^{\pi_{\text{old}}}(s,\cdot)/\alpha\big)}{Z^{\pi_{\text{old}}}(s)}\right)$$
2.3 Automatic Temperature Adjustment
$\alpha$ is adjusted automatically by optimizing the following objective:
$$\mathbb{E}_{a \sim \pi^*}\big[-\alpha \log \pi^*(a \mid s) - \alpha \mathcal{H}_0\big]$$
where $\mathcal{H}_0$ is the target entropy.
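The same update appears inside `update_parameters` in Section 3; as an isolated sketch (the batch of log-probabilities is a random placeholder), using the common heuristic of setting the target entropy to minus the action dimension:

```python
import torch
from torch.optim import Adam

action_dim = 4
target_entropy = -float(action_dim)             # heuristic: H_0 = -dim(A)

log_alpha = torch.zeros(1, requires_grad=True)  # optimize log α so that α stays positive
alpha_optimizer = Adam([log_alpha], lr=3e-4)

log_prob = torch.randn(256, 1) - 1.0            # placeholder for log π(a|s) over a batch
alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()
alpha = log_alpha.exp().item()                  # current temperature α
```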
3. PyTorch Implementation of SAC
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.optim import Adam


# Policy network (actor)
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.l1 = nn.Linear(state_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.l1(state))
        x = F.relu(self.l2(x))
        mean = self.mean(x)
        log_std = self.log_std(x)
        log_std = torch.clamp(log_std, min=-20, max=2)
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        std = log_std.exp()
        normal = torch.distributions.Normal(mean, std)
        # Reparameterization trick
        x_t = normal.rsample()
        y_t = torch.tanh(x_t)
        action = y_t * self.max_action
        # Log-probability with the tanh-squashing correction
        log_prob = normal.log_prob(x_t)
        log_prob -= torch.log(self.max_action * (1 - y_t.pow(2)) + 1e-6)
        log_prob = log_prob.sum(1, keepdim=True)
        return action, log_prob


# Q-function network (critic)
class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        # Q1 network
        self.l1 = nn.Linear(state_dim + action_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, hidden_dim)
        self.l3 = nn.Linear(hidden_dim, 1)
        # Q2 network
        self.l4 = nn.Linear(state_dim + action_dim, hidden_dim)
        self.l5 = nn.Linear(hidden_dim, hidden_dim)
        self.l6 = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        sa = torch.cat([state, action], 1)
        q1 = F.relu(self.l1(sa))
        q1 = F.relu(self.l2(q1))
        q1 = self.l3(q1)
        q2 = F.relu(self.l4(sa))
        q2 = F.relu(self.l5(q2))
        q2 = self.l6(q2)
        return q1, q2


# SAC agent
class SAC:
    def __init__(self, state_dim, action_dim, max_action, device='cuda'):
        self.device = device

        # Networks
        self.actor = GaussianPolicy(state_dim, action_dim, max_action=max_action).to(device)
        self.critic = QNetwork(state_dim, action_dim).to(device)
        self.critic_target = QNetwork(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())

        # Optimizers
        self.actor_optimizer = Adam(self.actor.parameters(), lr=3e-4)
        self.critic_optimizer = Adam(self.critic.parameters(), lr=3e-4)

        # Automatic temperature tuning (target entropy heuristic: -dim(A))
        self.target_entropy = -float(action_dim)
        self.log_alpha = torch.zeros(1, requires_grad=True, device=device)
        self.alpha_optimizer = Adam([self.log_alpha], lr=3e-4)
        self.alpha = self.log_alpha.exp()

        self.max_action = max_action
        self.gamma = 0.99
        self.tau = 0.005

    def select_action(self, state, evaluate=False):
        state = torch.FloatTensor(state).to(self.device).unsqueeze(0)
        if evaluate:
            with torch.no_grad():
                mean, _ = self.actor(state)
                action = torch.tanh(mean) * self.max_action
        else:
            action, _ = self.actor.sample(state)
        return action.cpu().data.numpy().flatten()

    def update_parameters(self, memory, batch_size=256):
        # Sample a batch from the replay buffer
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = memory.sample(batch_size)
        state_batch = torch.FloatTensor(state_batch).to(self.device)
        action_batch = torch.FloatTensor(action_batch).to(self.device)
        reward_batch = torch.FloatTensor(reward_batch).to(self.device).unsqueeze(1)
        next_state_batch = torch.FloatTensor(next_state_batch).to(self.device)
        done_batch = torch.FloatTensor(done_batch).to(self.device).unsqueeze(1)

        with torch.no_grad():
            # Compute the soft Bellman target
            next_action, next_log_prob = self.actor.sample(next_state_batch)
            q1_next, q2_next = self.critic_target(next_state_batch, next_action)
            q_next = torch.min(q1_next, q2_next) - self.alpha * next_log_prob
            target_q = reward_batch + (1 - done_batch) * self.gamma * q_next

        # Update the critic
        current_q1, current_q2 = self.critic(state_batch, action_batch)
        critic_loss = F.mse_loss(current_q1, target_q) + F.mse_loss(current_q2, target_q)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Update the actor
        actions, log_prob = self.actor.sample(state_batch)
        q1, q2 = self.critic(state_batch, actions)
        q = torch.min(q1, q2)
        actor_loss = (self.alpha * log_prob - q).mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Update the temperature coefficient
        alpha_loss = -(self.log_alpha * (log_prob + self.target_entropy).detach()).mean()
        self.alpha_optimizer.zero_grad()
        alpha_loss.backward()
        self.alpha_optimizer.step()
        self.alpha = self.log_alpha.exp()

        # Soft-update the target network
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)

    def save(self, filename):
        torch.save({
            'actor': self.actor.state_dict(),
            'critic': self.critic.state_dict(),
            'critic_target': self.critic_target.state_dict(),
            'actor_optimizer': self.actor_optimizer.state_dict(),
            'critic_optimizer': self.critic_optimizer.state_dict(),
            'alpha': self.alpha,
            'log_alpha': self.log_alpha,
            'alpha_optimizer': self.alpha_optimizer.state_dict()
        }, filename)

    def load(self, filename):
        checkpoint = torch.load(filename)
        self.actor.load_state_dict(checkpoint['actor'])
        self.critic.load_state_dict(checkpoint['critic'])
        self.critic_target.load_state_dict(checkpoint['critic_target'])
        self.actor_optimizer.load_state_dict(checkpoint['actor_optimizer'])
        self.critic_optimizer.load_state_dict(checkpoint['critic_optimizer'])
        self.alpha = checkpoint['alpha']
        self.log_alpha = checkpoint['log_alpha']
        self.alpha_optimizer.load_state_dict(checkpoint['alpha_optimizer'])
```
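As a quick sanity check of the classes above (not part of the original listing), the agent can be instantiated on a small continuous-control task; the environment id and the reset/step return values may differ depending on your gym version:

```python
import gym

env = gym.make('Pendulum-v1')                    # any continuous-action environment works
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

agent = SAC(state_dim, action_dim, max_action, device='cpu')
state = env.reset()                              # older gym API: reset() returns only the observation
stochastic_action = agent.select_action(state)               # sampled from the policy
greedy_action = agent.select_action(state, evaluate=True)    # tanh of the mean, for evaluation
```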
4. SAC for Robot Control
4.1 Robotic Grasping Task
```python
import gym
import numpy as np
from sac import SAC
from replay_buffer import ReplayBuffer

# Create the environment and the SAC agent
env = gym.make('FetchPickAndPlace-v1')
state_dim = env.observation_space['observation'].shape[0] + env.observation_space['desired_goal'].shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

agent = SAC(state_dim, action_dim, max_action)
memory = ReplayBuffer(1000000)

# Training parameters
max_episodes = 1000
max_steps = 1000
batch_size = 256

# Training loop
for episode in range(1, max_episodes + 1):
    state = env.reset()
    episode_reward = 0
    episode_steps = 0

    for step in range(max_steps):
        # Build the full state representation
        full_state = np.concatenate([state['observation'], state['desired_goal']])

        # Select an action
        action = agent.select_action(full_state)

        # Step the environment
        next_state, reward, done, _ = env.step(action)

        # Store the transition
        next_full_state = np.concatenate([next_state['observation'], next_state['desired_goal']])
        memory.push(full_state, action, reward, next_full_state, done)

        # Train once enough transitions have been collected
        if len(memory) > batch_size:
            agent.update_parameters(memory, batch_size)

        state = next_state
        episode_reward += reward
        episode_steps += 1

        if done:
            break

    # Print training progress
    print(f"Episode {episode}, Steps {episode_steps}, Reward {episode_reward:.2f}")

    # Periodically save the model
    if episode % 50 == 0:
        agent.save(f"sac_fetch_{episode}.pth")

env.close()
```
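The script above imports `ReplayBuffer` from a `replay_buffer` module that is not shown. A minimal stand-in of mine that matches the `push` / `sample` / `len` interface used here could look like this:

```python
import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0

    def push(self, state, action, reward, next_state, done):
        # Overwrite the oldest transition once the buffer is full
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state, action, reward, next_state, done = map(np.stack, zip(*batch))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)
```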
4.2 Practical Tips
- State representation (a short sketch of building such a state vector follows this list):
  - Use the relative position between the target object and the end effector
  - Include force/torque sensor readings
  - Stack a short history of past states
- Reward function design:

  ```python
  def compute_reward(achieved_goal, desired_goal, info):
      # Sparse, distance-based reward: 0 on success, -1 otherwise
      d = np.linalg.norm(achieved_goal - desired_goal, axis=-1)
      return -(d > 0.05).astype(np.float32)  # 5 cm success threshold

      # Dense-reward variant
      # return -d  # continuous distance penalty
  ```

- Curriculum learning:
  - Start with simple scenarios (grasping at a fixed position)
  - Gradually increase the difficulty (varied positions, varied objects)
  - Adapt the initial-state distribution over time
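As referenced in the state-representation item above, here is a minimal sketch of assembling such a state vector; the sensor values and dimensions are hypothetical and depend on your robot and observation layout:

```python
import numpy as np
from collections import deque

# Hypothetical raw observations
grip_pos = np.array([0.8, 0.1, 0.45])      # end-effector position
object_pos = np.array([0.9, 0.0, 0.42])    # target object position
ft_reading = np.zeros(6)                   # force/torque sensor (fx, fy, fz, tx, ty, tz)

# Relative position of the object with respect to the gripper
rel_pos = object_pos - grip_pos

# Stack the last k state vectors to give the policy short-term history
k = 3
history = deque(maxlen=k)
for _ in range(k):
    history.append(np.concatenate([grip_pos, rel_pos, ft_reading]))
stacked_state = np.concatenate(history)    # shape: (k * 12,)
```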
5. Improvements and Variants of SAC
5.1 Common Improvements
- Prioritized experience replay (PER):

  ```python
  # Use the TD error between the current Q estimate and the target as the priority
  td_error = (current_q1 - target_q).abs().cpu().data.numpy()
  memory.update_priority(indices, td_error + 1e-5)
  ```

- Data augmentation:

  ```python
  # Apply a small random perturbation to the state
  augmented_state = state + np.random.normal(0, 0.01, size=state.shape)
  ```

- Multi-step returns (a runnable sketch follows this list):

  ```python
  # Replace the one-step target with an n-step return, e.g. n_step = 3:
  # target_q = r_0 + γ·r_1 + γ²·r_2 + γ³·Q(s_3, a_3)
  ```
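As noted in the multi-step item above, here is a runnable sketch of the n-step target; the rewards and the bootstrap value are placeholders for quantities that would normally come from the replay buffer and the target critic:

```python
import torch

gamma = 0.99
n_step = 3
rewards = torch.tensor([0.5, 0.2, 1.0])    # r_0 ... r_{n-1} (placeholder values)
q_bootstrap = torch.tensor(4.0)            # min(Q1, Q2)(s_n, a_n) - α·log π(a_n|s_n), from the target critic

# target = Σ_{i<n} γ^i · r_i  +  γ^n · Q(s_n, a_n)
discounts = gamma ** torch.arange(n_step, dtype=torch.float32)
target_q = (discounts * rewards).sum() + (gamma ** n_step) * q_bootstrap
```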
5.2 Advanced Variants
- SAC with automatic entropy adjustment:
  - Tunes the temperature coefficient α automatically
  - Removes the need to tune α by hand
  - More stable performance
- SAC-Discrete:
  - Targets discrete action spaces
  - Replaces the Gaussian policy with a categorical one (implementations either compute expectations over actions exactly or use a Gumbel-Softmax reparameterization)
- SAC-X:
  - Hierarchical reinforcement learning framework
  - Learns several related tasks simultaneously
  - Improves sample efficiency
6. Debugging and Performance Tuning
6.1 Troubleshooting Common Issues
- Unstable training (a summary of typical starting values follows this list):
  - Lower the learning rate (try 1e-4 to 3e-4)
  - Adjust the target-network update rate τ (try 0.001 to 0.01)
  - Increase the batch size (256 to 1024)
- Insufficient exploration:
  - Raise the initial temperature coefficient
  - Allow a larger policy standard deviation (more action noise)
  - Check the target-entropy setting
- Slow convergence:
  - Pre-fill the replay buffer with transitions before training
  - Collect experience from parallel environments
  - Try different network architectures
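As promised above, here is one set of starting values, collected from the defaults in Section 3 and the ranges listed in this section; treat them as a tuning starting point rather than universally good settings:

```python
# Typical SAC starting hyperparameters (tune per task)
sac_hparams = {
    "actor_lr": 3e-4,                 # drop toward 1e-4 if training is unstable
    "critic_lr": 3e-4,
    "alpha_lr": 3e-4,
    "gamma": 0.99,
    "tau": 0.005,                     # target-network soft-update rate (0.001 - 0.01)
    "batch_size": 256,                # 256 - 1024
    "replay_buffer_size": 1_000_000,
    "hidden_dim": 256,
    "target_entropy": "-action_dim",  # common heuristic
    "log_std_clamp": (-20, 2),
}
```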
6.2 Evaluation Metrics
- Training metrics (a minimal logging sketch follows this list):
  - Episode reward (smoothed)
  - Policy entropy (to monitor how much the agent is exploring)
  - Q-value trends (to monitor convergence)
- Test metrics:
  - Task success rate
  - Average completion time
  - Energy efficiency (important on real robots)
- Robustness tests:
  - Varying initial conditions
  - Sensor noise
  - Environmental perturbations
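As referenced in the training-metrics item above, here is a small helper of mine for tracking a smoothed episode reward and an entropy estimate of the current policy; `agent` and `episode_reward` refer to the training loop in Section 4.1, and `state_batch` is any batch of recently visited states (for example, sampled from the replay buffer):

```python
from collections import deque

import numpy as np
import torch

def log_metrics(agent, state_batch, reward_window, episode_reward):
    """Return a smoothed episode reward and an entropy estimate of the current policy."""
    reward_window.append(episode_reward)
    smoothed_reward = float(np.mean(reward_window))
    with torch.no_grad():
        _, log_prob = agent.actor.sample(state_batch)  # log π(a|s) for freshly sampled actions
        entropy_estimate = -log_prob.mean().item()     # H ≈ E[-log π(a|s)]
    return smoothed_reward, entropy_estimate

# Usage inside the training loop of Section 4.1:
# reward_window = deque(maxlen=100)
# smoothed, entropy = log_metrics(agent, state_batch, reward_window, episode_reward)
```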
With the SAC implementation and optimization techniques above, you can build an efficient reinforcement learning system for complex robot control problems such as grasping, manipulation, and navigation. Thanks to its sample efficiency and training stability, SAC has become one of the go-to algorithms for continuous control tasks.