Reading notes on the paper "Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" (ADMM) by Boyd, Parikh, Chu, Peleato, and Eckstein.
Introduction
- ADMM: developed in the 1970s, with roots in the 1950s. Closely related to other methods such as Douglas-Rachford splitting, Spingarn's method of partial inverses, proximal point methods, etc.
- Why ADMM today: with the arrival of the big-data era and the growing need for large-scale machine learning, ADMM proves well suited to solving large-scale optimization problems in a distributed fashion.
- What big data brings: with lots of data, simple methods can be very effective at solving complex problems.
- ADMM can be seen as a blend of dual decomposition and the augmented Lagrangian method (method of multipliers). The latter is more robust and converges under milder assumptions, but does not decompose directly the way dual decomposition does.
- ADMM can decompose a problem across examples or across features. [To be explored in later chapters]
- Note that even when run serially, ADMM is still comparable to other methods and often converges to modest accuracy within a few tens of iterations.
Precursors
- What is the conjugate function exactly?
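For reference, this is the Legendre-Fenchel conjugate; the paper uses it to express the dual function of an equality-constrained problem:

```latex
\begin{aligned}
f^*(y) &= \sup_x \big( y^T x - f(x) \big), \\
g(y)   &= -b^T y - f^*(-A^T y)
  \quad \text{(dual function of } \min_x f(x) \ \text{s.t.}\ Ax = b\text{).}
\end{aligned}
```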
- Dual ascent and dual subgradient methods: if the step size is chosen appropriately and some other assumptions hold, they converge.
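As a reminder, for min f(x) s.t. Ax = b with Lagrangian L(x, y) = f(x) + y^T(Ax - b), dual ascent iterates

```latex
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \; L(x, y^k) \\
y^{k+1} &:= y^k + \alpha^k \, (A x^{k+1} - b)
\end{aligned}
```

where alpha^k is the step size; the y-update is a (sub)gradient ascent step on the dual function g(y) = inf_x L(x, y).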
- Why augmented Lagrangian (iterations sketched below):
  - More robust, fewer assumptions (no strict convexity or finiteness of f needed): in practice the convergence assumptions of dual ascent are often not met, e.g. an affine objective (min x s.t. x >= 10) makes the Lagrangian unbounded below in x for most dual values, so the dual function is -inf there and the x-minimization breaks down.
  - For equality constraints, the augmented version converges faster. This can be viewed from the penalty method's point of view.
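The method of multipliers iterations, using the augmented Lagrangian with penalty parameter rho > 0:

```latex
\begin{aligned}
L_\rho(x, y) &= f(x) + y^T (Ax - b) + \tfrac{\rho}{2} \|Ax - b\|_2^2 \\
x^{k+1} &:= \operatorname*{argmin}_x \; L_\rho(x, y^k) \\
y^{k+1} &:= y^k + \rho \, (A x^{k+1} - b)
\end{aligned}
```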
- Dual decomposition: relax (dualize) the coupling constraints so that the problem can be decomposed. This naturally lends itself to parallel computation. (Sketch below.)
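A minimal numerical sketch of dual decomposition (the toy problem, data c, and step size alpha below are my own illustration, not from the paper): the Lagrangian separates, so each x_i-update can run on its own worker, and only the scalar price update needs the sum of the local solutions.

```python
import numpy as np

# Toy problem: minimize sum_i 0.5*(x_i - c_i)^2  subject to  sum_i x_i = b.
# Dualizing the coupling constraint gives a separable Lagrangian, so the
# x_i-updates below are independent and could run in parallel.
c = np.array([1.0, 2.0, 4.0])   # local targets c_i (made-up data)
b = 5.0                         # coupling constraint: sum(x) = b
alpha = 0.1                     # dual step size (assumed small enough)

y = 0.0                         # dual variable ("price" on the coupling constraint)
for k in range(200):
    # x_i-update: argmin_{x_i} 0.5*(x_i - c_i)^2 + y*x_i  =>  x_i = c_i - y
    x = c - y
    # dual update: gradient ascent on the dual function, driven by the residual
    y = y + alpha * (x.sum() - b)

print("x =", x, "sum(x) =", x.sum(), "y =", y)
# converges to x_i = c_i - y* with y* = (sum(c) - b) / len(c) = 2/3
```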
- The rho in the augmented Lagrangian is actually the step size used in the dual update (see below).
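From the paper: using rho as the step size makes each iterate dual feasible. Writing the optimality condition of the x-update (informally, with subdifferentials):

```latex
0 \in \partial_x L_\rho(x^{k+1}, y^k)
  = \partial f(x^{k+1}) + A^T\big( y^k + \rho (A x^{k+1} - b) \big)
  = \partial f(x^{k+1}) + A^T y^{k+1},
```

so (x^{k+1}, y^{k+1}) satisfies dual feasibility after every iteration; what remains is to drive the primal residual A x^{k+1} - b to zero.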