DQN_Continuous_Action

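This article discusses how to apply Q-learning to continuous action spaces and compares its advantages with those of policy-gradient-based methods. It presents four solutions: randomly sampling actions, gradient ascent, designing a special network, and the actor-critic algorithm, which combines PPO with Q-learning.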


Q-learning for Continuous Actions

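Q: Why does Q-learning tend to train better and more stably than policy-gradient-based methods?

A: As long as the Q-function can be estimated, the corresponding policy is guaranteed to improve. Estimating the Q-function is a regression problem, so in general it is enough to watch whether the regression loss is decreasing to know how well the model is learning. Estimating a Q-function is therefore easier than learning a policy directly.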
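
To make the regression view concrete, here is a minimal sketch of the quantity one would watch during training: the mean squared TD error. It assumes a tabular Q-function for simplicity; the name `td_loss`, the batch layout, and the toy transitions are all hypothetical and only serve as illustration.

```python
import numpy as np

def td_loss(q, batch, gamma=0.99):
    """Mean squared TD error of a tabular Q-function on a batch of transitions.

    q     : array of shape (n_states, n_actions), a hypothetical tabular Q-function
    batch : list of (state, action, reward, next_state, done) tuples
    """
    errors = []
    for s, a, r, s_next, done in batch:
        # Regression target: r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states
        target = r if done else r + gamma * np.max(q[s_next])
        errors.append((q[s, a] - target) ** 2)
    return float(np.mean(errors))

q = np.zeros((2, 2))
batch = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 0, True)]
loss_before = td_loss(q, batch)

# Move Q(0, 1) halfway toward its target; the regression loss should drop
q[0, 1] += 0.5 * (1.0 - q[0, 1])
loss_after = td_loss(q, batch)
```

If `loss_after` is lower than `loss_before`, the estimate is moving toward its targets, which is the kind of signal the answer above refers to.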

Solution1–sample action

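Randomly sample N candidate actions, then proceed exactly as in a discrete action space: evaluate Q(s, a) for each candidate and take the one with the highest value.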
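
A rough sketch of this idea, assuming a toy `q_value` function that stands in for a learned Q-network (it is hypothetical and peaks at action = 0.5 * state) and an action range of [-1, 1]:

```python
import numpy as np

def q_value(state, actions):
    # Hypothetical stand-in for a learned Q-network: peaks at action = 0.5 * state
    return -np.sum((actions - 0.5 * state) ** 2, axis=-1)

def sample_best_action(state, n_samples=1000, low=-1.0, high=1.0, rng=None):
    """Approximate argmax_a Q(s, a) by scoring N uniformly sampled actions."""
    rng = np.random.default_rng() if rng is None else rng
    # In this toy setup the action has the same dimension as the state
    candidates = rng.uniform(low, high, size=(n_samples, state.shape[0]))
    scores = q_value(state, candidates)
    return candidates[np.argmax(scores)]

state = np.array([0.4, -0.8])
best = sample_best_action(state, n_samples=5000, rng=np.random.default_rng(0))
```

The result is only as good as the best sample, so accuracy depends on N and on the dimension of the action space.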

Solution2–gradient ascent

Treat the action as the optimization variable and use gradient ascent to update it so that its Q-value is maximized.
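
A minimal sketch of the update loop, assuming a toy differentiable Q-function with a known analytic gradient (`q_value` and `q_grad_action` are hypothetical stand-ins; with a real network the gradient would come from automatic differentiation):

```python
import numpy as np

def q_value(state, action):
    # Hypothetical differentiable Q-function with its maximum at action = 0.5 * state
    return -np.sum((action - 0.5 * state) ** 2)

def q_grad_action(state, action):
    # Analytic gradient of q_value with respect to the action
    return -2.0 * (action - 0.5 * state)

def gradient_ascent_action(state, steps=100, lr=0.1):
    """Approximate argmax_a Q(s, a) by gradient ascent on the action itself."""
    action = np.zeros_like(state)  # arbitrary starting point
    for _ in range(steps):
        action = action + lr * q_grad_action(state, action)
    return action

state = np.array([0.4, -0.8])
best = gradient_ascent_action(state)
```

This toy Q-function is concave, so the loop reaches the global maximum. In general, gradient ascent may stop at a local maximum, and running an inner optimization for every action choice is costly.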

Solution3–design a network

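[Image: formula for Q(s, a) produced by the designed network]
  • The Σ it produces is guaranteed to be positive definite, so the first term of the formula is never positive; setting a = μ(s) therefore gives the maximum Q-value.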
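
The figure is not available here, so the sketch below assumes the commonly used quadratic form Q(s, a) = -(a - μ(s))ᵀ Σ(s) (a - μ(s)) + V(s), with Σ built as L Lᵀ to make it positive definite. The values of `mu`, `L` and `v` are hypothetical network outputs for a single state, chosen only to check that no action scores higher than a = μ(s).

```python
import numpy as np

def quadratic_q(action, mu, L, v):
    """Q(s, a) = -(a - mu)^T Sigma (a - mu) + V, with Sigma = L L^T."""
    sigma = L @ L.T  # positive definite when L is lower triangular with a nonzero diagonal
    diff = action - mu
    return -diff @ sigma @ diff + v

# Hypothetical network outputs for one state
mu = np.array([0.3, -0.1])
L = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v = 1.5

# Q evaluated at a = mu equals V, since the quadratic term vanishes
q_at_mu = quadratic_q(mu, mu, L, v)

# Any other action incurs a strictly positive quadratic penalty
rng = np.random.default_rng(0)
others = [quadratic_q(mu + rng.normal(size=2), mu, L, v) for _ in range(100)]
```

Under this form, the greedy action is read directly from the network output μ(s), with no sampling or inner optimization needed.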

Solution4–Don’t use Q-learning


Combining a policy-based method such as PPO with a value-based method such as Q-learning gives the actor-critic algorithm.
