[RL 10] Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO (ICLR, 2020)

This paper examines how code-level optimizations in PPO and TRPO implementations, in particular reward normalization, learning-rate annealing, and network initialization, affect agent behavior and performance in deep reinforcement learning. The notes below observe that TRPO precisely enforces its KL trust-region constraint, whereas PPO struggles to maintain comparable KL control; the code-level optimizations account for most of PPO's performance gain, while PPO's ratio clipping by itself does not do the work.



1 Introduction

  1. references for brittle DRL
  2. motivation
    • how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior, and thus performance?

3 Ablation study on code-level optimizations

  • Code-level optimizations (nine in total)
  • Results on the first four
    • reward normalization, Adam learning-rate annealing, and network initialization each significantly impact performance
  • PS
    • the learning rate needs to be tuned before experiments start
    • these optimizations can be implemented for any policy gradient method (a minimal sketch follows this list)
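To make the first bullet concrete, below is a minimal, hypothetical PyTorch sketch of three of these optimizations: reward scaling by the running standard deviation of a discounted return estimate, linear annealing of the Adam learning rate, and orthogonal network initialization. The names (`RewardScaler`, `orthogonal_init`) and hyperparameters are placeholders of my own, not the paper's released code.

```python
# Illustrative sketches only; not the authors' implementation.
import numpy as np
import torch
import torch.nn as nn


class RewardScaler:
    """Scale rewards by the running std of a discounted return estimate
    (Welford's online variance), one of the studied optimizations."""

    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                       # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def __call__(self, reward: float) -> float:
        self.ret = self.gamma * self.ret + reward
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        if self.count < 2:                   # not enough data for a std estimate
            return reward
        std = (self.m2 / (self.count - 1)) ** 0.5
        return reward / (std + self.eps)


def orthogonal_init(module: nn.Module, gain: float = np.sqrt(2)) -> None:
    """Orthogonal weights and zero biases for every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight, gain=gain)
        nn.init.zeros_(module.bias)


policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
policy.apply(orthogonal_init)

# Adam learning-rate annealing: decay the LR linearly to zero over training;
# call scheduler.step() once per policy update.
total_updates = 500
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(1.0 - step / total_updates, 0.0)
)

scaler = RewardScaler()
scaled_rewards = [scaler(float(r)) for r in np.random.randn(1000)]  # toy reward stream
```

The point of the paper's ablation is that each of these seemingly cosmetic choices measurably moves final performance.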

4 Algorithmic effect of code-level optimizations

  1. Algorithm core
    Enforcing a trust region is a core algorithmic property of different policy gradient methods.
  2. Trust Region in TRPO and PPO
    1. TRPO
      constrains the KL divergence between successive policies
    2. PPO
      enforces a trust region with a different, clipping-based objective
      1. the trust region actually enforced depends heavily on the method with which the clipped PPO objective is optimized, rather than on the objective itself.
      2. as long as a gradient step is taken from a point where the clipped objective still has a non-zero gradient, there is no trust-region constraint on that step: its size is determined solely by the steepness of the surrogate landscape (see the toy sketch after this list)
      3. as a result, the policy can end up moving arbitrarily far outside the intended trust region
  3. Results
    1. TRPO
      • precisely enforces this KL trust region
    2. PPO
      1. both PPO and PPO-M (PPO without the code-level optimizations) fail to maintain a ratio-based trust region
      2. both PPO and PPO-M nevertheless keep the mean KL well constrained
      3. the KL trust region that is effectively enforced differs between PPO and PPO-M
        • while the PPO-M KL trends upward as the number of iterations increases, the PPO KL peaks halfway through training before trending down again
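To illustrate the clipping argument in item 2 above, here is a toy, self-contained PyTorch check (my own sketch, not the paper's code). With a positive advantage and a probability ratio already above 1 + ε, the active term of the clipped surrogate L^CLIP = E[min(r·A, clip(r, 1−ε, 1+ε)·A)] is the clamped one, so the gradient with respect to the policy parameter is exactly zero and nothing pulls the ratio back toward the trust region. TRPO, by contrast, explicitly constrains E[KL(π_old ‖ π_new)] ≤ δ at every step.

```python
# Toy illustration (not the paper's code) of why PPO's clipped objective is
# not a hard trust region: once the probability ratio r leaves [1-eps, 1+eps]
# in the "bad" direction, the per-sample gradient is zero, so nothing pushes
# the policy back inside.
import torch

eps = 0.2

def clipped_surrogate(ratio: torch.Tensor, adv: torch.Tensor) -> torch.Tensor:
    """L^CLIP = E[ min(r * A, clip(r, 1 - eps, 1 + eps) * A) ]."""
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return torch.min(unclipped, clipped).mean()

# One sample with positive advantage whose ratio is already above 1 + eps.
log_ratio = torch.tensor([0.5], requires_grad=True)   # ratio ~ 1.65 > 1.2
ratio = log_ratio.exp()
adv = torch.tensor([1.0])

loss = -clipped_surrogate(ratio, adv)
loss.backward()
print(log_ratio.grad)   # tensor([0.]) -- clamped branch is active, zero gradient
```

Because PPO takes many gradient steps on the same batch, a step that leaves the clipping range is never corrected afterwards, which is why the trust region observed in practice depends on the optimization procedure rather than on the clipped objective itself.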

5 TRPO vs PPO

  1. PPO-M performs on par with TRPO, and PPO on par with TRPO+ (TRPO augmented with PPO's code-level optimizations)
  2. the code-level optimizations contribute most of PPO's improvement over TRPO
  3. PPO's clipping mechanism on its own does not account for the gains