LASSO by Proximal Gradient Descent
Prepare:
```python
from itertools import cycle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, enet_path
from sklearn import datasets
from copy import deepcopy

X = np.random.randn(100, 10)
y = np.dot(X, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```
Proximal Gradient Descent Framework
- Randomly set $\beta^{(0)}$ for iteration 0.
- For the $k$th iteration:
    - Compute the gradient $\nabla f(\beta^{(k-1)})$.
    - Set $z = \beta^{(k-1)} - \frac{1}{L} \nabla f(\beta^{(k-1)})$.
    - Update $\beta^{(k)} = \operatorname{sgn}(z) \cdot \max\left[|z| - \frac{\lambda}{L},\ 0\right]$ (elementwise soft-thresholding).
    - Check convergence: if converged, stop; otherwise continue updating.
- End for.
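The update step above is the elementwise soft-thresholding operator. A minimal sketch (the function name `soft_threshold` and the sample values are illustrative, not from the original):

```python
import numpy as np

def soft_threshold(z, t):
    # sgn(z) * max(|z| - t, 0), applied elementwise:
    # shrinks every entry toward zero by t, and zeroes out entries with |z| <= t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
out = soft_threshold(z, 1.0)
print(out)
```

Entries with magnitude below the threshold become exactly zero, which is what makes the LASSO iterate sparse.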
Here $f(\beta) = \frac{1}{2N}(Y - X\beta)^T (Y - X\beta)$ and $\nabla f(\beta) = -\frac{1}{N} X^T (Y - X\beta)$,
where $X$, $Y$, $\beta$ have sizes $N \times p$, $N \times 1$, $p \times 1$ respectively, i.e. $N$ samples and $p$ features. The parameter $L \ge 1$ can be chosen, and $\frac{1}{L}$ can be regarded as the step size.
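Putting the framework together, here is a minimal NumPy sketch run on the `X`, `y` from the Prepare section. The choice of $L$ as the largest eigenvalue of $X^T X / N$, the value `lam = 0.1`, the iteration cap, and the tolerance `1e-8` are illustrative assumptions, not prescribed by the framework:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 10)
y = np.dot(X, np.arange(1, 11))  # true coefficients 1..10, noiseless

N, p = X.shape
# Step size 1/L: here L is taken as the Lipschitz constant of the gradient,
# i.e. the largest eigenvalue of X^T X / N (any larger L also works).
L = np.linalg.eigvalsh(X.T @ X / N).max()
lam = 0.1  # regularization strength (hypothetical value)

beta = np.zeros(p)  # beta^(0)
for k in range(1000):
    grad = -X.T @ (y - X @ beta) / N                 # gradient of f at beta^(k-1)
    z = beta - grad / L                              # gradient step
    beta_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    if np.max(np.abs(beta_new - beta)) < 1e-8:       # convergence check
        beta = beta_new
        break
    beta = beta_new

print(np.round(beta, 2))
```

Because the data are noiseless, the recovered coefficients land close to $1, \dots, 10$, shrunk slightly toward zero by the $\ell_1$ penalty.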
Proximal Gradient Descent Details
Consider the optimization problem:
$$\min_x\ f(x) + \lambda \cdot g(x),$$
where $x \in \mathbb{R}^{p \times 1}$ and $f(x) \in \mathbb{R}$. Here $f(x)$ is a differentiable convex function, and $g(x)$ is convex but not necessarily differentiable.
For $f(x)$, assume its derivative is Lipschitz continuous: there exists a constant $L$ such that for all $x, y$,
$$|f'(y) - f'(x)| \le L |y - x|.$$
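For a quadratic $f$ such as the LASSO loss above, the smallest such $L$ is the largest eigenvalue of $X^T X / N$. A quick numerical check of the inequality (the random data and seeds are illustrative):

```python
import numpy as np

np.random.seed(1)
N, p = 100, 10
X = np.random.randn(N, p)
y = np.random.randn(N)

def grad_f(beta):
    # gradient of f(beta) = (1/2N) * ||y - X beta||^2
    return -X.T @ (y - X @ beta) / N

# Lipschitz constant of grad_f: largest eigenvalue of X^T X / N
L = np.linalg.eigvalsh(X.T @ X / N).max()

# verify ||grad_f(b1) - grad_f(b2)|| <= L * ||b1 - b2|| on random pairs
rng = np.random.default_rng(2)
for _ in range(100):
    b1, b2 = rng.standard_normal(p), rng.standard_normal(p)
    lhs = np.linalg.norm(grad_f(b1) - grad_f(b2))
    rhs = L * np.linalg.norm(b1 - b2)
    assert lhs <= rhs + 1e-12
```

The check passes with equality (up to floating point) when $b_1 - b_2$ aligns with the top eigenvector, which is why no smaller constant works.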
This problem can then be solved using proximal gradient descent.
Denote $x^{(k)}$ as the $k$th update of $x$; then for $x \to x^{(k)}$