Reinforcement Learning_06_PyTorch PPO2 Practice (Humanoid-v4)

1. PPO Optimizations

For an introduction to PPO and a first implementation, see my earlier post 强化学习_06_pytorch-PPO实践(Pendulum-v1).
Compared with that PPO implementation, the main optimizations are:

|  | my PPO (previous) | my PPO2 (this post) | ref |
| --- | --- | --- | --- |
| data collection | one episode | several episodes (one batch) |  |
| activation | ReLU | Tanh |  |
| adv-compute | - | compute the advantage over the whole series |  |
| adv-normalize | mini-batch normalize | normalize over the multi-env batch (see the GAE sketch below) | 影响PPO算法性能的10个关键技巧 |
| value function loss clipping | - | $L^{V}=\max\left[(V_{\theta_t}-V_{tar})^2,\ \big(\mathrm{clip}(V_{\theta_t},\ V_{\theta_{t-1}}-\epsilon,\ V_{\theta_{t-1}}+\epsilon)-V_{tar}\big)^2\right]$ (see the sketch right after this table) | The 37 Implementation Details of Proximal Policy Optimization |
| optimizer | separate actor-opt & critic-opt | one shared optimizer |  |
| loss | separate actor-loss and critic-loss backward passes | weighted loss sum, single backward pass |  |
| parameter init | - | 1. orthogonal initialization of hidden-layer weights with gain $\sqrt{2}$; 2. policy output layer initialized with scale 0.01; 3. value output layer initialized with scale 1.0 (see the init sketch in §1.1) | The 37 Implementation Details of Proximal Policy Optimization |
| training envs | single gym env | SyncVectorEnv |  |
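To make the value-clipping row concrete, here is a minimal PyTorch sketch of that loss. The function name and the `eps` default are mine for illustration; PPO2.py may organize this differently.

```python
import torch

def clipped_value_loss(v_new: torch.Tensor,
                       v_old: torch.Tensor,
                       v_target: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    # unclipped squared error
    loss_unclipped = (v_new - v_target).pow(2)
    # keep the new value prediction within eps of the pre-update prediction
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    loss_clipped = (v_clipped - v_target).pow(2)
    # pessimistic element-wise max, then average over the batch
    return torch.max(loss_unclipped, loss_clipped).mean()
```

Taking the element-wise max keeps the gradient from pushing the value estimate too far away from its pre-update prediction within a single set of epochs.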

Compared with PPO2_old.py, this version implements all of the optimizations above.
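For the adv-compute and adv-normalize rows: the advantage is computed once over the whole collected batch with GAE, then normalized across all envs at once rather than per mini-batch. A sketch under those assumptions (the function name and array shapes are mine, not from PPO2.py):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """rewards/dones: (T, num_envs); values: (T+1, num_envs) incl. bootstrap value."""
    T = len(rewards)
    adv = np.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - dones[t]        # cut the recursion across episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last_gae = delta + gamma * lam * mask * last_gae
        adv[t] = last_gae
    # normalize over the whole (T, num_envs) batch, not per mini-batch
    return (adv - adv.mean()) / (adv.std() + 1e-8)
```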

1.1 PPO2 Code

For the full source, see Github: PPO2.py
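The parameter-init row follows the recipe from The 37 Implementation Details of Proximal Policy Optimization. A sketch of how that recipe is typically applied (the `layer_init` helper and layer sizes are illustrative, not copied from PPO2.py):

```python
import numpy as np
import torch.nn as nn

def layer_init(layer: nn.Linear, std: float = float(np.sqrt(2)),
               bias_const: float = 0.0) -> nn.Linear:
    # orthogonal weights with the given gain, constant (zero) bias
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

hidden = layer_init(nn.Linear(376, 64))                 # hidden layers: gain sqrt(2)
policy_head = layer_init(nn.Linear(64, 17), std=0.01)   # 17 = Humanoid-v4 action dim
value_head = layer_init(nn.Linear(64, 1), std=1.0)
```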


```python
import typing as typ

import torch

# policyNet / valueNet are defined alongside PPO in the repo
class PPO:
    """
    PPO with the clipped surrogate objective.
    """
    def __init__(self,
                state_dim: int,
                actor_hidden_layers_dim: typ.List,
                critic_hidden_layers_dim: typ.List,
                action_dim: int,
                actor_lr: float,
                critic_lr: float,
                gamma: float,
                PPO_kwargs: typ.Dict,
                device: torch.device,
                reward_func: typ.Optional[typ.Callable]=None
                ):
        dist_type = PPO_kwargs.get('dist_type', 'beta')
        self.dist_type = dist_type
        self.actor = policyNet(state_dim, actor_hidden_layers_dim, action_dim, dist_type=dist_type).to(device)
        self.critic = valueNet(state_dim, critic_hidden_layers_dim).to(device)
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_opt = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        # snippet truncated here; see PPO2.py for the rest of the class
```
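Finally, a rough sketch of how the remaining two rows fit together: SyncVectorEnv collects a batch from several env copies, and the actor and critic losses are combined into one weighted sum so a single optimizer can update both networks. This assumes the gymnasium-style API; the stand-in networks and placeholder losses are only there to make the snippet self-contained, and the coefficients are common defaults, not necessarily the ones in PPO2.py.

```python
import torch
import torch.nn as nn
import gymnasium as gym

# several synchronous env copies collect one batch per update
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("Humanoid-v4") for _ in range(8)]
)
obs_dim = envs.single_observation_space.shape[0]   # 376 for Humanoid-v4
act_dim = envs.single_action_space.shape[0]        # 17

actor = nn.Linear(obs_dim, act_dim)   # stand-ins for policyNet / valueNet
critic = nn.Linear(obs_dim, 1)

# one Adam over both networks instead of separate actor_opt / critic_opt
opt = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)

obs, _ = envs.reset(seed=42)
obs_t = torch.as_tensor(obs, dtype=torch.float32)
policy_loss = actor(obs_t).pow(2).mean()   # placeholder for the clipped surrogate
value_loss = critic(obs_t).pow(2).mean()   # placeholder for the (clipped) value loss
loss = policy_loss + 0.5 * value_loss      # weighted sum, one backward pass
opt.zero_grad()
loss.backward()
opt.step()
```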