Lecture 2 Extra: Optimization for Deep Learning
Table of Contents
- New Optimizers for Deep Learning
- What you already know
- Some Notations
- What is Optimization about?
- On-line vs Off-line
- Optimizers: Real Application
- Adam vs SGDM
- Towards Improving Adam
- Towards Improving SGDM
- Does Adam need warm-up?
- $k$ steps forward, 1 step back
- More than momentum
- Adam in the future
- Do you really know your optimizer?
- Something helps optimization
- Summary
New Optimizers for Deep Learning
What you already know

- SGD

- SGD with momentum (SGDM)

- Adagrad

- RMSProp

- Adam

Some Notations
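As a quick recap of the five optimizers listed above, here they are in their standard textbook form. Notation (a common convention, used throughout the sketches below): $\theta_t$ are the parameters at step $t$, $g_t = \nabla L(\theta_{t-1})$ is the gradient, $m_t$ and $v_t$ are the first- and second-moment accumulators, $\eta$ is the learning rate, and $\lambda, \alpha, \beta_1, \beta_2$ are the momentum / decay hyperparameters.

$$
\begin{aligned}
\textbf{SGD:}\quad & \theta_t = \theta_{t-1} - \eta\, g_t \\
\textbf{SGDM:}\quad & m_t = \lambda\, m_{t-1} + \eta\, g_t, \qquad \theta_t = \theta_{t-1} - m_t \\
\textbf{Adagrad:}\quad & \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\sum_{i=1}^{t} g_i^2} + \varepsilon}\; g_t \\
\textbf{RMSProp:}\quad & v_t = \alpha\, v_{t-1} + (1-\alpha)\, g_t^2, \qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{v_t} + \varepsilon}\; g_t \\
\textbf{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
& \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon}\; \hat{m}_t
\end{aligned}
$$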

What is Optimization about?

On-line vs Off-line


Optimizers: Real Application



Adam vs SGDM





Towards Improving Adam
Simply combine Adam with SGDM?
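One straightforward combination, explored by SWATS [Keskar & Socher, arXiv'17] (compared against RAdam later in these notes), is to begin training with Adam and hand over to SGDM part-way through. The sketch below only illustrates that idea with a hand-picked switch step and SGDM learning rate; SWATS itself decides the switch point and the SGD learning rate automatically, which this sketch does not attempt.

```python
import torch

# Simplified illustration of "start with Adam, then switch to SGDM".
# The switch step and the SGDM learning rate are hand-picked here;
# SWATS chooses both automatically during training.
model = torch.nn.Linear(10, 1)                                   # toy model
loss_fn = torch.nn.MSELoss()
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(2000)]  # toy data

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
switch_step = 1000                                               # hypothetical switch point

for step, (x, y) in enumerate(data):
    if step == switch_step:
        # Hand the same parameters over to SGD with momentum.
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```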

Troubleshooting
How can we make Adam converge both fast and well?


AMSGrad [Reddi, et al., ICLR’18]
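For reference, AMSGrad's change to Adam is a single line: keep a running maximum of the second moment, so the adaptive learning rate can only shrink over time. Standard formulation, same notation as above (bias correction on $m_t$ is dropped, as in the original paper):

$$
\hat{v}_t = \max(\hat{v}_{t-1},\, v_t), \qquad
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon}\; m_t
$$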


AMSGrad only handles large learning rates
AdaBound [Luo, et al., ICLR’19]
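AdaBound instead clips the per-parameter step size into a band that narrows as training proceeds, so the optimizer behaves like Adam early on and like SGD late in training. A sketch of the bounded update; the constants in the bounds follow the authors' example schedule and should be read as illustrative defaults:

$$
\theta_t = \theta_{t-1} - \mathrm{Clip}\!\left(\frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon},\; \eta_l(t),\; \eta_u(t)\right)\hat{m}_t,
\qquad
\eta_l(t) = 0.1 - \frac{0.1}{(1-\beta_2)\,t + 1}, \quad
\eta_u(t) = 0.1 + \frac{0.1}{(1-\beta_2)\,t}
$$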

Towards Improving SGDM

LR range test [Smith, WACV’17]
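The LR range test is a recipe rather than an optimizer: run a short training sweep while growing the learning rate exponentially from tiny to huge, log the loss at every step, and read off the range where the loss drops fastest before it diverges. A minimal sketch (the model, data loader, and loss function are passed in as placeholders):

```python
import math
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-6, lr_max=10.0, num_steps=200):
    """Sweep the learning rate exponentially over one short run and record (lr, loss)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min, momentum=0.9)
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)   # per-step multiplicative growth
    history = []
    for (x, y), _ in zip(loader, range(num_steps)):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        if not math.isfinite(loss.item()) or loss.item() > 4 * history[0][1]:
            break                                    # loss has blown up; the useful range ends here
        for group in optimizer.param_groups:
            group["lr"] *= gamma                     # grow the learning rate for the next step
    return history                                   # plot loss vs. lr, pick the steep-descent region
```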

Cyclical LR [Smith, WACV’17]

SGDR [Loshchilov, et al., ICLR’17]

One-cycle LR [Smith, et al., arXiv’17]
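Cyclical LR, SGDR, and One-cycle LR all ship with PyTorch in `torch.optim.lr_scheduler`; the sketch below only shows how each scheduler is constructed. Pick one per run, and treat the step sizes and learning-rate bounds here as placeholder values to be tuned (e.g. with the LR range test above):

```python
import torch
from torch.optim.lr_scheduler import CyclicLR, CosineAnnealingWarmRestarts, OneCycleLR

model = torch.nn.Linear(10, 1)                       # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Pick exactly one scheduler per run; each one takes control of the optimizer's lr.

# Cyclical LR [Smith, WACV'17]: bounce between base_lr and max_lr with a fixed period.
scheduler = CyclicLR(opt, base_lr=1e-4, max_lr=0.1, step_size_up=2000)

# SGDR [Loshchilov, et al., ICLR'17]: cosine annealing with warm restarts.
# scheduler = CosineAnnealingWarmRestarts(opt, T_0=10, T_mult=2)

# One-cycle LR [Smith, et al., arXiv'17]: a single rise-then-fall cycle over the whole run.
# scheduler = OneCycleLR(opt, max_lr=0.1, total_steps=10_000)

# In the training loop, call scheduler.step() after optimizer.step()
# (every batch for CyclicLR / OneCycleLR, every epoch for CosineAnnealingWarmRestarts).
```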

Does Adam need warm-up?


RAdam [Liu, et al., ICLR’20]
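RAdam's answer is to estimate, at every step $t$, how reliable the second-moment estimate already is ($\rho_t$ below) and to rectify the adaptive step instead of relying on a hand-tuned warm-up. My transcription of the update, in the same notation as above:

$$
\rho_\infty = \frac{2}{1-\beta_2} - 1, \qquad
\rho_t = \rho_\infty - \frac{2\, t\, \beta_2^{\,t}}{1-\beta_2^{\,t}}, \qquad
r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\,\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\,\rho_t}}
$$

$$
\theta_t =
\begin{cases}
\theta_{t-1} - \eta\, r_t\, \hat{m}_t \big/ \big(\sqrt{\hat{v}_t}+\varepsilon\big) & \text{if } \rho_t > 4, \\[1ex]
\theta_{t-1} - \eta\, \hat{m}_t & \text{otherwise (early steps: no adaptive term, SGDM-like).}
\end{cases}
$$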

RAdam vs SWATS

$k$ steps forward, 1 step back
Lookahead [Zhang, et al., arXiv’19]
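Lookahead wraps any inner optimizer: let it take $k$ fast steps, then pull a set of slow weights a fraction $\alpha$ of the way toward where the fast weights ended up, and restart the fast weights from there. A simplified wrapper sketch (not the authors' implementation; `k` and `alpha` follow the paper's naming):

```python
import torch

class Lookahead:
    """Simplified Lookahead wrapper: k fast steps, then one slow-weight interpolation."""

    def __init__(self, params, inner_optimizer, k=5, alpha=0.5):
        self.params = list(params)
        self.inner = inner_optimizer          # e.g. Adam or SGDM built on the same params
        self.k, self.alpha = k, alpha
        self.step_count = 0
        # Slow weights start as a copy of the current (fast) weights.
        self.slow = [p.detach().clone() for p in self.params]

    def zero_grad(self):
        self.inner.zero_grad()

    @torch.no_grad()
    def step(self):
        self.inner.step()                     # one fast step ("k steps forward ...")
        self.step_count += 1
        if self.step_count % self.k == 0:     # ("... 1 step back")
            for p, slow in zip(self.params, self.slow):
                slow += self.alpha * (p - slow)   # slow weights move toward fast weights
                p.copy_(slow)                     # fast weights restart from the slow weights
```

Wrapping, for example, `torch.optim.Adam` this way gives the "$k$ steps forward, 1 step back" behaviour named in the heading above.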


More than momentum
Nesterov accelerated gradient (NAG) [Nesterov, Dokl. Akad. Nauk SSSR'83]
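For reference, NAG evaluates the gradient at the point the momentum is about to carry us to, rather than at the current parameters. In the same notation as the SGDM recap above (a standard formulation, not copied from the slide):

$$
m_t = \lambda\, m_{t-1} + \eta\, \nabla L\!\left(\theta_{t-1} - \lambda\, m_{t-1}\right), \qquad
\theta_t = \theta_{t-1} - m_t
$$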


Adam in the future

Do you really know your optimizer?
A story of $L_2$ regularization

AdamW & SGDW with momentum [Loshchilov, arXiv’17]
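The punchline of the $L_2$ story: if weight decay is folded into the gradient, it also gets filtered by the momentum and divided by the adaptive denominator; AdamW and SGDW instead decouple it and subtract it from the weights directly. Schematically, with weight-decay coefficient $\gamma$ (my summary, same notation as above):

$$
\text{Adam} + L_2:\quad g_t = \nabla L(\theta_{t-1}) + \gamma\,\theta_{t-1}, \qquad
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t
$$

$$
\text{AdamW}:\quad g_t = \nabla L(\theta_{t-1}), \qquad
\theta_t = \theta_{t-1} - \eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon} + \gamma\,\theta_{t-1}\right)
$$

$$
\text{SGDW}:\quad m_t = \lambda\, m_{t-1} + \eta\, g_t, \qquad
\theta_t = \theta_{t-1} - m_t - \eta\,\gamma\,\theta_{t-1}
$$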

Something helps optimization



Summary


Advice
