# Optimization for Deep Learning

- Background Knowledge
- New Optimizers for Deep Learning
  - SGD
  - SGDM
  - Adagrad
  - RMSProp
  - Adam
- Optimizers: Real Application
- Does Adam need warm-up?
  - A warm-up method
  - RAdam vs SWATS
- One-step back
- More than momentum
  - NAG
  - Nadam
- Do you really know your optimizer?
- Something helps optimization
- Advices

## Background Knowledge

### $\mu$-strong convexity

### Lipschitz continuity
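For reference, here is a minimal sketch of the two definitions named above in their standard textbook forms; the statements in the rest of this post may use different constants or notation. A differentiable function $f$ is $\mu$-strongly convex if

$$
f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\mu}{2}\lVert y - x\rVert^2 \quad \forall x, y,
$$

and $f$ is $L$-Lipschitz continuous if

$$
\lvert f(x) - f(y)\rvert \le L\,\lVert x - y\rVert \quad \forall x, y.
$$

In convergence analyses of optimizers, the Lipschitz condition is often placed on the gradient instead, $\lVert\nabla f(x) - \nabla f(y)\rVert \le L\,\lVert x - y\rVert$, which is the usual $L$-smoothness assumption.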