第十一章 TRPO算法
11.1 简介
本书之前介绍的基于策略的方法包括策略梯度算法和 Actor-Critic 算法。这些方法虽然简单、直观,但在实际应用过程中会遇到训练不稳定的情况。基于策略的方法先参数化智能体的策略,并设计衡量策略好坏的目标函数,再通过梯度上升来最大化这个目标函数,使策略达到最优。具体来说,假设 $\theta$ 表示策略 $\pi_\theta$ 的参数,定义 $J(\theta)=\mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right]=\mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right]$,基于策略的方法的目标是找到 $\theta^*=\arg\max_\theta J(\theta)$,策略梯度算法主要沿着 $\nabla_\theta J(\theta)$ 方向迭代更新策略参数 $\theta$。但是这种算法有一个明显的缺点:当策略网络是深度模型时,沿着策略梯度更新参数,很有可能步长太大,策略突然显著变差,进而影响训练效果。
针对以上问题,我们考虑在更新时找到一块信任区域(trust region),在这个区域内更新策略时能够得到某种策略性能的安全保证,这就是信任区域策略优化(trust region policy optimization,TRPO)算法的主要思想。
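作为对比,下面用一段最简的示意代码说明普通策略梯度的参数更新形式(其中 grad_J 只是随手构造的假设梯度,仅用于说明更新方式):它直接沿梯度方向走一步,步长取得过大时对新策略的性能没有任何保证,而 TRPO 正是要把这样的一步限制在信任区域之内。

import torch

theta = torch.zeros(10)       # 策略参数(这里用一个向量示意)
grad_J = torch.randn(10)      # 假设:已用采样数据估计出的策略梯度 ∇θJ(θ)
lr = 1.0                      # 固定步长,没有任何机制限制它的大小
theta = theta + lr * grad_J   # 沿梯度方向走一步,对更新后策略的性能没有任何保证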
11.2 策略目标
假设当前策略为 $\pi_\theta$,参数为 $\theta$。我们考虑如何借助当前的 $\theta$ 找到一个更优的参数 $\theta'$,使得 $J(\theta')\ge J(\theta)$。具体来说,由于初始状态 $s_0$ 的分布和策略无关,因此上述策略 $\pi_\theta$ 下的优化目标 $J(\theta)$ 可以写成在新策略 $\pi_{\theta'}$ 下的期望形式:
$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t V^{\pi_\theta}(s_t)-\sum_{t=1}^{\infty}\gamma^t V^{\pi_\theta}(s_t)\right] \\
&= -\mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\right)\right]
\end{aligned}
$$
基于以上等式,我们可以推导新旧策略的目标函数之间的差距:
$$
\begin{aligned}
J(\theta')-J(\theta) &= \mathbb{E}_{s_0}\left[V^{\pi_{\theta'}}(s_0)\right]-\mathbb{E}_{s_0}\left[V^{\pi_\theta}(s_0)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right]+\mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left(\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\right)\right] \\
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t\left[r(s_t,a_t)+\gamma V^{\pi_\theta}(s_{t+1})-V^{\pi_\theta}(s_t)\right]\right]
\end{aligned}
$$
将时序差分残差定义为优势函数A:
$$
\begin{aligned}
&= \mathbb{E}_{\pi_{\theta'}}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi_\theta}(s_t,a_t)\right] \\
&= \sum_{t=0}^{\infty}\gamma^t\,\mathbb{E}_{s_t\sim P_t^{\pi_{\theta'}}}\mathbb{E}_{a_t\sim\pi_{\theta'}(\cdot|s_t)}\left[A^{\pi_\theta}(s_t,a_t)\right] \\
&= \frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]
\end{aligned}
$$
最后一个等号的成立运用到了状态访问分布的定义:$\nu^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^t P_t^{\pi}(s)$。所以,只要我们能找到一个新策略,使得 $\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]\ge 0$,就能保证策略性能单调递增,即 $J(\theta')\ge J(\theta)$。
但直接求解该式是非常困难的,因为 $\pi_{\theta'}$ 是我们需要求解的策略,但我们又要用它来收集样本。把所有可能的新策略都拿来收集数据,然后判断哪个策略满足上述条件的做法显然是不现实的。于是 TRPO 做了一步近似操作,对状态访问分布进行了相应处理。具体而言,忽略两个策略之间的状态访问分布变化,直接采用旧策略 $\pi_\theta$ 的状态分布,定义如下替代优化目标:
$$
L_\theta(\theta')=J(\theta)+\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_\theta}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}\left[A^{\pi_\theta}(s,a)\right]
$$
当新旧策略非常接近时,状态访问分布的变化很小,这种近似是合理的。其中,动作仍然由新策略 $\pi_{\theta'}$ 采样得到,我们可以用重要性采样对动作分布进行处理:
$$
L_\theta(\theta')=J(\theta)+\mathbb{E}_{s\sim\nu^{\pi_\theta}}\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[\frac{\pi_{\theta'}(a|s)}{\pi_\theta(a|s)}A^{\pi_\theta}(s,a)\right]
$$
这样,我们就可以基于旧策略 $\pi_\theta$ 已经采样出的数据来估计并优化新策略 $\pi_{\theta'}$ 了。为了保证新旧策略足够接近,TRPO 使用库尔贝克-莱布勒(Kullback-Leibler,KL)散度来衡量策略之间的距离,并给出了整体的优化公式:
$$
\begin{aligned}
&\max_{\theta'}\; L_{\theta_k}(\theta') \\
&\text{s.t.}\quad \mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_k}(\cdot|s),\pi_{\theta'}(\cdot|s)\right)\right]\le\delta
\end{aligned}
$$
这里不等式约束定义了策略空间中的一个KL球,被称为信任区域。在这个区域中,可以认为当前学习策略和环境交互的状态分布与上一轮策略最后采样的状态分布一致,进而可以基于一步行动的重要性采样方法使当前学习策略稳定提升。TRPO 背后的原理如图 11-1 所示。
左图表示当完全不设置信任区域时,策略的梯度更新可能导致策略的性能骤降;右图表示当设置了信任区域时,可以保证每次策略的梯度更新都能带来性能的提升。
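为了更直观地理解上面的替代目标和 KL 约束,下面给出离散动作情形下估计这两个量的一段示意代码(仅为示意:假设 states、actions、old_log_probs、old_probs、advantage 均来自旧策略采样并已预先算好,actor 是待优化的新策略网络;完整实现见 11.7 节的 compute_surrogate_obj 等方法):

import torch

def surrogate_and_kl(actor, states, actions, old_log_probs, old_probs, advantage):
    new_probs = actor(states)                     # 新策略在旧数据状态上的动作概率, 形状为(批量, 动作数)
    log_probs = torch.log(new_probs.gather(1, actions))
    ratio = torch.exp(log_probs - old_log_probs)  # 重要性采样比 πθ'(a|s) / πθ(a|s)
    surrogate = torch.mean(ratio * advantage)     # 替代优化目标 Lθ(θ') 的采样估计
    old_dist = torch.distributions.Categorical(old_probs)
    new_dist = torch.distributions.Categorical(new_probs)
    kl = torch.mean(torch.distributions.kl.kl_divergence(old_dist, new_dist))  # 平均KL距离
    return surrogate, kl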
11.3 近似求解
直接求解上述带约束的优化问题比较麻烦,TRPO 在其具体实现中做了一步近似操作来快速求解。为方便起见,我们在接下来的式子中用 $\theta_k$ 表示第 $k$ 次迭代之后的策略参数。对优化目标在 $\theta_k$ 处进行一阶泰勒展开,对 KL 散度约束进行二阶泰勒展开,优化问题便近似为

$$
\theta_{k+1}=\arg\max_{\theta'}\; g^{\mathrm T}(\theta'-\theta_k)\quad \text{s.t.}\quad \frac{1}{2}(\theta'-\theta_k)^{\mathrm T}H(\theta'-\theta_k)\le\delta
$$

其中 $g$ 表示目标函数 $L_{\theta_k}(\theta')$ 关于 $\theta'$ 的梯度在 $\theta_k$ 处的取值,$H$ 表示平均 KL 散度关于 $\theta'$ 的黑塞(Hessian)矩阵在 $\theta_k$ 处的取值。利用 KKT 条件可以直接导出这个近似问题的解:

$$
\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{g^{\mathrm T}H^{-1}g}}H^{-1}g
$$
11.4 共轭梯度
一般来说,用神经网络表示的策略函数的参数数量都是成千上万的,计算和存储黑塞矩阵 $H$ 的逆矩阵会耗费大量的内存资源和时间。TRPO 通过共轭梯度法(conjugate gradient method)回避了这个问题,它的核心思想是直接计算 $x=H^{-1}g$,$x$ 即参数更新方向。假设满足 KL 距离约束的参数更新的最大步长为 $\beta$,那么根据 KL 距离约束条件,有 $\frac{1}{2}(\beta x)^{\mathrm T}H(\beta x)=\delta$。求解 $\beta$,得到 $\beta=\sqrt{\frac{2\delta}{x^{\mathrm T}Hx}}$。因此,此时参数更新方式为
$$
\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{x^{\mathrm T}Hx}}\,x
$$
因此,只要可以直接计算 $x=H^{-1}g$,就可以根据该式更新参数,问题转化为求解 $Hx=g$。实际上 $H$ 为对称正定矩阵,所以我们可以使用共轭梯度法来求解。共轭梯度法的具体流程如下:
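下面是该流程的一段示意实现(与 11.7 节代码中的 conjugate_gradient 方法一致;其中 hvp(v) 表示计算黑塞矩阵与向量乘积 $Hv$ 的函数,这里作为假设的接口传入):

import torch

def conjugate_gradient(hvp, g, max_iter=10, tol=1e-10):
    # 迭代求解 Hx = g,其中 hvp(v) 返回 Hv,不需要显式构造 H
    x = torch.zeros_like(g)
    r = g.clone()            # 残差 r = g - Hx,初始 x=0 时 r=g
    p = r.clone()            # 搜索方向
    rdotr = torch.dot(r, r)
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rdotr / torch.dot(p, Hp)
        x += alpha * p
        r -= alpha * Hp
        new_rdotr = torch.dot(r, r)
        if new_rdotr < tol:  # 残差足够小时提前停止
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x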
11.5 线性搜索
由于 TRPO 算法用到了泰勒展开的一阶和二阶近似,这并非精确求解,因此按上式得到的 $\theta'$ 未必比 $\theta_k$ 好,也未必满足 KL 散度限制。TRPO 在每次迭代的最后进行一次线性搜索,以确保找到满足条件的参数。具体来说,就是找到一个最小的非负整数 $i$,使得按照
$$
\theta_{k+1}=\theta_k+\alpha^i\sqrt{\frac{2\delta}{x^{\mathrm T}Hx}}\,x
$$
求出的 $\theta_{k+1}$ 依然满足最初的 KL 散度限制,并且确实能够提升目标函数 $L_{\theta_k}$,其中 $\alpha\in(0,1)$ 是一个决定线性搜索长度的超参数。
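线性搜索的过程可以用下面这段示意代码表达(仅为示意:假设 surrogate(theta) 与 kl(theta) 分别返回参数 theta 对应的替代目标值和平均 KL 散度,beta_x 表示最大更新量 $\sqrt{2\delta/(x^{\mathrm T}Hx)}\,x$;完整实现见 11.7 节的 line_search 方法):

def line_search(theta_old, beta_x, surrogate, kl, delta, alpha=0.5, max_backtracks=15):
    # 从最大步长开始,按 alpha 的幂次逐步缩小步长,
    # 直到新参数既能提升替代目标,又满足 KL 距离约束
    old_obj = surrogate(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + (alpha ** i) * beta_x
        if surrogate(theta_new) > old_obj and kl(theta_new) < delta:
            return theta_new
    return theta_old  # 若搜索失败则不更新参数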
至此,我们已经基本上清楚了 TRPO 算法的大致过程,其具体的算法流程如下:
- 初始化策略网络参数 $\theta$,价值网络参数 $\omega$
- for 序列 $e=1\to E$ do:
  - 用当前策略 $\pi_\theta$ 采样轨迹 $\{s_1,a_1,r_1,s_2,a_2,r_2,\dots\}$
  - 根据收集到的数据和价值网络估计每个状态动作对的优势 $A(s_t,a_t)$
  - 计算策略目标函数的梯度 $g$
  - 用共轭梯度法计算 $x=H^{-1}g$
  - 用线性搜索找到一个 $i$ 值,并更新策略网络参数 $\theta_{k+1}=\theta_k+\alpha^i\sqrt{\frac{2\delta}{x^{\mathrm T}Hx}}\,x$,其中 $i\in\{1,2,\dots,K\}$ 为能提升策略并满足 KL 距离限制的最小整数
  - 更新价值网络参数(与 Actor-Critic 中的更新方法相同)
- end for
11.6 广义优势估计
在上面的算法流程中,还需要估计每个状态动作对的优势 $A(s_t,a_t)$。一种常用的方法是广义优势估计(Generalized Advantage Estimation,GAE),它将不同步数的优势估计进行指数加权平均:$A_t^{\mathrm{GAE}}=\sum_{l=0}^{\infty}(\gamma\lambda)^l\delta_{t+l}$,其中 $\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$ 是时序差分误差,$\lambda\in[0,1]$ 是在偏差和方差之间做权衡的超参数。下面给出 GAE 的代码实现。
import torch

def compute_advantage(gamma, lmbda, td_delta):
    # 从后往前递推:advantage_t = delta_t + gamma * lmbda * advantage_{t+1}
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
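一个简单的调用示例如下(仅为示意,其中的 TD 误差是随手构造的):

td_delta = torch.tensor([0.5, -0.2, 0.1, 0.3])  # 假设:某条轨迹上每个时刻的 TD 误差
adv = compute_advantage(gamma=0.98, lmbda=0.95, td_delta=td_delta)
print(adv)  # 每个时刻的广义优势估计,最后一个时刻的估计等于它自身的 TD 误差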
11.7 TRPO代码实践
本节将分别在离散动作和连续动作两种环境中进行 TRPO 的实验。我们使用的第一个环境是车杆(CartPole),第二个环境是倒立摆(Pendulum)。
11.7.1 车杆环境(CartPole)
import torch
import numpy as np
import gym
import matplotlib.pyplot as plt
import torch.nn.functional as F
import rl_utils
import copy
class PolicyNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=1)
class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
def compute_advantage(gamma, lmbda, td_delta):
    # 广义优势估计(GAE),见 11.6 节
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
class TRPO:
    """ TRPO算法 """
    def __init__(self, hidden_dim, state_space, action_space, lmbda,
                 kl_constraint, alpha, critic_lr, gamma, device):
        state_dim = state_space.shape[0]
        action_dim = action_space.n
        # 策略网络参数不需要优化器更新
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda  # GAE参数
        self.kl_constraint = kl_constraint  # KL距离最大限制
        self.alpha = alpha  # 线性搜索参数
        self.device = device

    def take_action(self, state):  # 随机策略选择动作
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = torch.distributions.Categorical(probs)
        action = action_dist.sample()
        return action.item()

    def hessian_matrix_vector_product(self, states, old_action_dists, vector):
        # 计算黑塞矩阵和一个向量的乘积
        new_action_dists = torch.distributions.Categorical(self.actor(states))
        kl = torch.mean(
            torch.distributions.kl.kl_divergence(old_action_dists,
                                                 new_action_dists))  # 计算平均KL距离
        kl_grad = torch.autograd.grad(kl,
                                      self.actor.parameters(),
                                      create_graph=True)
        kl_grad_vector = torch.cat([grad.view(-1) for grad in kl_grad])
        # KL距离的梯度先和向量进行点积运算
        kl_grad_vector_product = torch.dot(kl_grad_vector, vector)
        grad2 = torch.autograd.grad(kl_grad_vector_product,
                                    self.actor.parameters())
        grad2_vector = torch.cat([grad.view(-1) for grad in grad2])
        return grad2_vector

    def conjugate_gradient(self, grad, states, old_action_dists):  # 共轭梯度法求解方程
        x = torch.zeros_like(grad)
        r = grad.clone()
        p = grad.clone()
        rdotr = torch.dot(r, r)  # 残差的内积
        for i in range(10):  # 共轭梯度主循环
            Hp = self.hessian_matrix_vector_product(states, old_action_dists,
                                                    p)
            alpha = rdotr / torch.dot(p, Hp)
            x += alpha * p
            r -= alpha * Hp
            new_rdotr = torch.dot(r, r)
            if new_rdotr < 1e-10:
                break
            beta = new_rdotr / rdotr
            p = r + beta * p
            rdotr = new_rdotr
        return x

    def compute_surrogate_obj(self, states, actions, advantage, old_log_probs,
                              actor):  # 计算策略目标
        log_probs = torch.log(actor(states).gather(1, actions))
        ratio = torch.exp(log_probs - old_log_probs)
        return torch.mean(ratio * advantage)

    def line_search(self, states, actions, advantage, old_log_probs,
                    old_action_dists, max_vec):  # 线性搜索
        old_para = torch.nn.utils.convert_parameters.parameters_to_vector(
            self.actor.parameters())  # 将策略网络参数展平成一维向量
        old_obj = self.compute_surrogate_obj(states, actions, advantage,
                                             old_log_probs, self.actor)
        for i in range(15):  # 线性搜索主循环
            coef = self.alpha**i
            new_para = old_para + coef * max_vec
            new_actor = copy.deepcopy(self.actor)
            torch.nn.utils.convert_parameters.vector_to_parameters(
                new_para, new_actor.parameters())
            new_action_dists = torch.distributions.Categorical(
                new_actor(states))
            kl_div = torch.mean(
                torch.distributions.kl.kl_divergence(old_action_dists,
                                                     new_action_dists))
            new_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                 old_log_probs, new_actor)
            if new_obj > old_obj and kl_div < self.kl_constraint:
                return new_para
        return old_para

    def policy_learn(self, states, actions, old_action_dists, old_log_probs,
                     advantage):  # 更新策略函数
        surrogate_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                   old_log_probs, self.actor)
        grads = torch.autograd.grad(surrogate_obj, self.actor.parameters())  # 计算梯度
        obj_grad = torch.cat([grad.view(-1) for grad in grads]).detach()
        # 用共轭梯度法计算x = H^(-1)g
        descent_direction = self.conjugate_gradient(obj_grad, states,
                                                    old_action_dists)
        Hd = self.hessian_matrix_vector_product(states, old_action_dists,
                                                descent_direction)
        max_coef = torch.sqrt(2 * self.kl_constraint /
                              (torch.dot(descent_direction, Hd) + 1e-8))
        new_para = self.line_search(states, actions, advantage, old_log_probs,
                                    old_action_dists,
                                    descent_direction * max_coef)  # 线性搜索
        torch.nn.utils.convert_parameters.vector_to_parameters(
            new_para, self.actor.parameters())  # 用线性搜索后的参数更新策略

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions']).view(-1, 1).to(
            self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        td_target = rewards + self.gamma * self.critic(next_states) * (1 - dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda,
                                      td_delta.cpu()).to(self.device)
        old_log_probs = torch.log(self.actor(states).gather(1,
                                                            actions)).detach()
        old_action_dists = torch.distributions.Categorical(
            self.actor(states).detach())
        critic_loss = torch.mean(
            F.mse_loss(self.critic(states), td_target.detach()))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()  # 更新价值函数
        # 更新策略函数
        self.policy_learn(states, actions, old_action_dists, old_log_probs,
                          advantage)
num_episodes = 500
hidden_dim = 128
gamma = 0.98
lmbda = 0.95
critic_lr = 1e-2
kl_constraint = 0.0005
alpha = 0.5
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
env_name = 'CartPole-v0'
env = gym.make(env_name)
env.seed(0)
torch.manual_seed(0)
agent = TRPO(hidden_dim, env.observation_space, env.action_space, lmbda, kl_constraint, alpha, critic_lr, gamma, device)
return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)
episodes_list = list(range(len(return_list)))  # 回合序号
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
mv_return = rl_utils.moving_average(return_list, 9)  # 滑动平均回报
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
本节代码与 Actor-Critic 算法最大的不同在于策略参数的更新方式:本节算法先划定一个信任区域,再用最优化方法在该区域内寻找最优的策略更新。
Iteration 0: 100%|██████████| 50/50 [00:03<00:00, 15.71it/s, episode=50, return=139.200]
Iteration 1: 100%|██████████| 50/50 [00:03<00:00, 13.08it/s, episode=100, return=150.500]
Iteration 2: 100%|██████████| 50/50 [00:04<00:00, 11.57it/s, episode=150, return=184.000]
Iteration 3: 100%|██████████| 50/50 [00:06<00:00, 7.60it/s, episode=200, return=183.600]
Iteration 4: 100%|██████████| 50/50 [00:06<00:00, 7.17it/s, episode=250, return=183.500]
Iteration 5: 100%|██████████| 50/50 [00:04<00:00, 10.91it/s, episode=300, return=193.700]
Iteration 6: 100%|██████████| 50/50 [00:04<00:00, 10.70it/s, episode=350, return=199.500]
Iteration 7: 100%|██████████| 50/50 [00:04<00:00, 10.89it/s, episode=400, return=200.000]
Iteration 8: 100%|██████████| 50/50 [00:04<00:00, 10.80it/s, episode=450, return=200.000]
Iteration 9: 100%|██████████| 50/50 [00:04<00:00, 11.09it/s, episode=500, return=200.000]
11.7.2 倒立摆环境(Pendulum)
倒立摆环境和车杆环境的区别在于:倒立摆环境的动作是连续的,而车杆环境的动作是离散的。
import torch
import numpy as np
import gym
import matplotlib.pyplot as plt
import torch.nn.functional as F
import rl_utils
import copy
class ValueNet(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim):
        super(ValueNet, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
def compute_advantage(gamma, lmbda, td_delta):
    # 广义优势估计(GAE),见 11.6 节
    td_delta = td_delta.detach().numpy()
    advantage_list = []
    advantage = 0.0
    for delta in td_delta[::-1]:
        advantage = gamma * lmbda * advantage + delta
        advantage_list.append(advantage)
    advantage_list.reverse()
    return torch.tensor(advantage_list, dtype=torch.float)
class PolicyNetContinuous(torch.nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNetContinuous, self).__init__()
        self.fc1 = torch.nn.Linear(state_dim, hidden_dim)
        self.fc_mu = torch.nn.Linear(hidden_dim, action_dim)
        self.fc_std = torch.nn.Linear(hidden_dim, action_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        mu = 2.0 * torch.tanh(self.fc_mu(x))  # 将均值缩放到倒立摆的动作范围[-2, 2]
        std = F.softplus(self.fc_std(x))
        return mu, std  # 高斯分布的均值和标准差
class TRPOContinuous:
    """ 处理连续动作的TRPO算法 """
    def __init__(self, hidden_dim, state_space, action_space, lmbda,
                 kl_constraint, alpha, critic_lr, gamma, device):
        state_dim = state_space.shape[0]
        action_dim = action_space.shape[0]
        self.actor = PolicyNetContinuous(state_dim, hidden_dim,
                                         action_dim).to(device)
        self.critic = ValueNet(state_dim, hidden_dim).to(device)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(),
                                                 lr=critic_lr)
        self.gamma = gamma
        self.lmbda = lmbda
        self.kl_constraint = kl_constraint
        self.alpha = alpha
        self.device = device

    def take_action(self, state):
        state = torch.tensor([state], dtype=torch.float).to(self.device)
        mu, std = self.actor(state)
        action_dist = torch.distributions.Normal(mu, std)  # 用网络输出的均值和标准差构建高斯分布
        action = action_dist.sample()  # 从高斯分布中采样连续动作
        return [action.item()]

    def hessian_matrix_vector_product(self,
                                      states,
                                      old_action_dists,
                                      vector,
                                      damping=0.1):
        mu, std = self.actor(states)
        new_action_dists = torch.distributions.Normal(mu, std)
        kl = torch.mean(
            torch.distributions.kl.kl_divergence(old_action_dists,
                                                 new_action_dists))
        kl_grad = torch.autograd.grad(kl,
                                      self.actor.parameters(),
                                      create_graph=True)
        kl_grad_vector = torch.cat([grad.view(-1) for grad in kl_grad])
        kl_grad_vector_product = torch.dot(kl_grad_vector, vector)
        grad2 = torch.autograd.grad(kl_grad_vector_product,
                                    self.actor.parameters())
        grad2_vector = torch.cat(
            [grad.contiguous().view(-1) for grad in grad2])  # 将各参数的二阶梯度拼接成一维向量
        return grad2_vector + damping * vector  # 加上阻尼项以保证数值稳定

    def conjugate_gradient(self, grad, states, old_action_dists):
        x = torch.zeros_like(grad)
        r = grad.clone()
        p = grad.clone()
        rdotr = torch.dot(r, r)
        for i in range(10):
            Hp = self.hessian_matrix_vector_product(states, old_action_dists,
                                                    p)
            alpha = rdotr / torch.dot(p, Hp)
            x += alpha * p
            r -= alpha * Hp
            new_rdotr = torch.dot(r, r)
            if new_rdotr < 1e-10:
                break
            beta = new_rdotr / rdotr
            p = r + beta * p
            rdotr = new_rdotr
        return x

    def compute_surrogate_obj(self, states, actions, advantage, old_log_probs,
                              actor):
        mu, std = actor(states)
        action_dists = torch.distributions.Normal(mu, std)
        log_probs = action_dists.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)
        return torch.mean(ratio * advantage)

    def line_search(self, states, actions, advantage, old_log_probs,
                    old_action_dists, max_vec):
        old_para = torch.nn.utils.convert_parameters.parameters_to_vector(
            self.actor.parameters())
        old_obj = self.compute_surrogate_obj(states, actions, advantage,
                                             old_log_probs, self.actor)
        for i in range(15):
            coef = self.alpha**i
            new_para = old_para + coef * max_vec
            new_actor = copy.deepcopy(self.actor)
            torch.nn.utils.convert_parameters.vector_to_parameters(
                new_para, new_actor.parameters())
            mu, std = new_actor(states)
            new_action_dists = torch.distributions.Normal(mu, std)
            kl_div = torch.mean(
                torch.distributions.kl.kl_divergence(old_action_dists,
                                                     new_action_dists))
            new_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                 old_log_probs, new_actor)
            if new_obj > old_obj and kl_div < self.kl_constraint:
                return new_para
        return old_para

    def policy_learn(self, states, actions, old_action_dists, old_log_probs,
                     advantage):
        surrogate_obj = self.compute_surrogate_obj(states, actions, advantage,
                                                   old_log_probs, self.actor)
        grads = torch.autograd.grad(surrogate_obj, self.actor.parameters())
        obj_grad = torch.cat([grad.view(-1) for grad in grads]).detach()
        descent_direction = self.conjugate_gradient(obj_grad, states,
                                                    old_action_dists)
        Hd = self.hessian_matrix_vector_product(states, old_action_dists,
                                                descent_direction)
        max_coef = torch.sqrt(2 * self.kl_constraint /
                              (torch.dot(descent_direction, Hd) + 1e-8))
        new_para = self.line_search(states, actions, advantage, old_log_probs,
                                    old_action_dists,
                                    descent_direction * max_coef)
        torch.nn.utils.convert_parameters.vector_to_parameters(
            new_para, self.actor.parameters())

    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'],
                              dtype=torch.float).to(self.device)
        actions = torch.tensor(transition_dict['actions'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'],
                               dtype=torch.float).view(-1, 1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'],
                                   dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'],
                             dtype=torch.float).view(-1, 1).to(self.device)
        rewards = (rewards + 8.0) / 8.0  # 对奖励进行修改,方便训练
        td_target = rewards + self.gamma * self.critic(next_states) * (1 -
                                                                       dones)
        td_delta = td_target - self.critic(states)
        advantage = compute_advantage(self.gamma, self.lmbda,
                                      td_delta.cpu()).to(self.device)
        mu, std = self.actor(states)
        old_action_dists = torch.distributions.Normal(mu.detach(),
                                                      std.detach())
        old_log_probs = old_action_dists.log_prob(actions)
        critic_loss = torch.mean(
            F.mse_loss(self.critic(states), td_target.detach()))
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        self.policy_learn(states, actions, old_action_dists, old_log_probs,
                          advantage)
num_episodes = 2000
hidden_dim = 128
gamma = 0.9
lmbda = 0.9
critic_lr = 1e-2
kl_constraint = 0.00005
alpha = 0.5
device = torch.device("cuda") if torch.cuda.is_available() else torch.device(
"cpu")
env_name = 'Pendulum-v0'
env = gym.make(env_name)
env.seed(0)
torch.manual_seed(0)
agent = TRPOContinuous(hidden_dim, env.observation_space, env.action_space,
lmbda, kl_constraint, alpha, critic_lr, gamma, device)
return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)
episodes_list = list(range(len(return_list)))
plt.plot(episodes_list, return_list)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
mv_return = rl_utils.moving_average(return_list, 9)
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('TRPO on {}'.format(env_name))
plt.show()
11.8 总结
本章讲解了 TRPO 算法,并分别在离散动作和连续动作的环境中进行了实验。TRPO 算法属于在线策略学习方法,每次策略训练仅使用上一轮策略采样的数据,是基于策略的深度强化学习算法中十分有代表性的工作之一。直观地理解,TRPO 的出发点是:策略的改变会导致数据分布的改变,这会严重影响由深度模型实现的策略网络的学习效果;因此,通过划定一个可信任的策略学习区域,来保证策略学习的稳定性和有效性。