1. PPO Optimizations
For an introduction to PPO and a first hands-on implementation, see my earlier article 强化学习_06_pytorch-PPO实践(Pendulum-v1).
The main optimizations made on top of that previous PPO implementation are:
| | my PPO | my PPO2 | ref |
| --- | --- | --- | --- |
| data collection | one episode | several episodes (one batch) | |
| activation | ReLU | Tanh | |
| advantage computation | - | compute advantages over the whole collected series | |
| advantage normalization | per mini-batch | over the batch gathered from several envs | 影响PPO算法性能的10个关键技巧 |
| value function loss clipping | - | $L^{V}=\max\big[(V_{\theta_t} - V_{tar})^2,\ \big(\mathrm{clip}(V_{\theta_t},\ V_{\theta_{t-1}}-\epsilon,\ V_{\theta_{t-1}}+\epsilon) - V_{tar}\big)^2\big]$ (sketch after the table) | The 37 Implementation Details of Proximal Policy Optimization |
| optimizer | separate actor-opt & critic-opt | one shared optimizer | |
| loss | separate actor-loss & critic-loss backward passes | weighted sum of the losses, single backward pass | |
| parameter init | - | 1. hidden-layer weights: orthogonal initialization with gain $\sqrt{2}$; 2. policy output layer weights initialized with scale 0.01; 3. value output layer weights initialized with scale 1.0 (sketch below) | The 37 Implementation Details of Proximal Policy Optimization |
| training envs | single gym env | SyncVectorEnv | |
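The clipped value loss is easy to get subtly wrong, so here is a minimal sketch of the formula in the table. The tensor names `new_v`, `old_v`, `td_target` and the default `eps` are illustrative assumptions, not the exact code from PPO2.py:

```python
import torch

def clipped_value_loss(new_v: torch.Tensor,
                       old_v: torch.Tensor,
                       td_target: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    # L^V = max[(V_new - V_tar)^2, (clip(V_new, V_old - eps, V_old + eps) - V_tar)^2]
    v_clipped = old_v + torch.clamp(new_v - old_v, -eps, eps)
    loss_unclipped = (new_v - td_target).pow(2)
    loss_clipped = (v_clipped - td_target).pow(2)
    return torch.max(loss_unclipped, loss_clipped).mean()

if __name__ == '__main__':
    new_v = torch.randn(64)
    old_v = new_v.detach() + 0.3 * torch.randn(64)
    td_target = torch.randn(64)
    print(clipped_value_loss(new_v, old_v, td_target))
```

With a single shared optimizer (the "loss weight sum" row), this value loss is then combined with the actor loss and an entropy bonus into one weighted scalar before a single `backward()` / `step()` call, instead of running two separate backward passes.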
Compared with PPO2_old.py, this version implements all of the optimizations listed above.
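The `policyNet` / `valueNet` definitions are not part of the excerpt below, so here is a minimal sketch of the parameter-initialization scheme from the table. The helper name `layer_init` and the layer sizes are illustrative assumptions, not the actual code in PPO2.py:

```python
import numpy as np
import torch.nn as nn

def layer_init(layer: nn.Linear, gain: float = np.sqrt(2), bias_const: float = 0.0) -> nn.Linear:
    # Orthogonal weight initialization with the given gain, constant (zero) bias.
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, bias_const)
    return layer

# Hidden layers use gain sqrt(2); the policy head uses 0.01; the value head uses 1.0.
state_dim, action_dim = 3, 1   # placeholder sizes for illustration
hidden = layer_init(nn.Linear(state_dim, 64))
policy_head = layer_init(nn.Linear(64, action_dim), gain=0.01)
value_head = layer_init(nn.Linear(64, 1), gain=1.0)
```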
1.1 PPO2 Code
The full implementation is on Github: PPO2.py
import typing as typ

import torch

# policyNet and valueNet (the actor / critic network definitions) live in the same repo; see PPO2.py.


class PPO:
    """
    PPO algorithm, using the clipped surrogate objective.
    """
    def __init__(self,
                 state_dim: int,
                 actor_hidden_layers_dim: typ.List,
                 critic_hidden_layers_dim: typ.List,
                 action_dim: int,
                 actor_lr: float,
                 critic_lr: float,
                 gamma: float,
                 PPO_kwargs: typ.Dict,
                 device: torch.device,
                 reward_func: typ.Optional[typ.Callable]=None
                 ):
        # action-distribution type; defaults to 'beta'
        dist_type = PPO_kwargs.get('dist_type', 'beta')
        self.dist_type = dist_type
        # policy (actor) and value (critic) networks
        self.actor = policyNet(state_dim, actor_hidden_layers_dim, action_dim, dist_type=dist_type).to(device)
        self.critic = valueNet(state_dim, critic_hidden_layers_dim).to(device)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)