Introduction
Linear regression is perhaps the most fundamental algorithm in machine learning. In this setting, given a dataset $D=\{(x^i,y^i) \mid x^i\in \mathbb{R}^n,\; y^i\in\mathbb{R}\}_{i=1}^m$ (x is the feature vector, y is the label), we fit a model of the form $h_\theta(x) = \theta^T\phi(x)$, where $\theta$ is the parameter vector and $\phi(x)$ is a transformed feature vector (for example, $\phi(x) = [1, x_1, x_2, \dots, x_1x_2, \dots, x_nx_{n-1}]$). That is, the model is linear IN TERMS OF the parameters rather than the input vector $x$, since feature transformation is allowed.
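As a concrete illustration of such a $\phi(x)$, here is a minimal Python sketch that builds degree-2 polynomial features with a bias term (the helper name `poly2_features` is mine, for illustration only):

```python
import numpy as np

def poly2_features(x):
    """Map x in R^n to [1, x_1, ..., x_n, x_i * x_j for i <= j] (degree-2 features)."""
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], x, cross))

x = np.array([2.0, 3.0])
print(poly2_features(x))  # [1. 2. 3. 4. 6. 9.]
```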
Our goal is to fit the model $h_\theta(x) = \theta^T\phi(x)$ as well as possible. That is, after tuning the parameters, given an unseen $x^*$, we should have $h_\theta(x^*) \to y^*$. In a nutshell, find the BEST $\theta$.
Sometimes our model fits the training dataset well, yet fails to generalize to unseen data. This is the problem of OVERFITTING. To address it, we can use regularized variants such as ridge and lasso regression; robust linear regression, also covered below, instead tackles a related weakness: sensitivity to outliers.
In what follows, I will derive the various linear regression models (standard, robust, ridge, lasso) from two perspectives: deterministic and probabilistic. Generalized linear regression will also be discussed.
Deterministic Perspective
Intuitively, we could let our cost function be $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2$, also known as the residual sum of squares (RSS) or sum of squared errors (SSE). Since $h_\theta$ is linear in $\theta$, $J$ is a convex function of $\theta$.
Then, (standard) linear regression is formulated as $\theta^* := \arg \min_\theta J(\theta)$. [How to solve it? 1. Gradient descent; 2. Analytically set $\partial J/\partial \theta=0$. We get a particularly nice solution if $\bar{x} = [1,x]$ and $h_\theta(x)=\theta^T\bar{x}$: $\partial J/\partial \theta = \sum_i^m(\theta^T\bar{x}_i-y_i)\bar{x}_i=X^TX\theta - X^Ty=0 \Rightarrow \theta^* = (X^TX)^{-1}X^Ty$, where $X$ is the design matrix whose rows are $\bar{x}_i^T$ and $y$ is the vector of labels.]
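A minimal NumPy sketch of this closed-form solution (assuming $X^TX$ is invertible; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Toy data: m samples, n features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, 2.0, -0.5, 0.3])          # bias + 3 weights
y = true_theta[0] + X @ true_theta[1:] + 0.1 * rng.normal(size=50)

X_bar = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend the bias column
theta_hat = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)   # (X^T X)^{-1} X^T y
print(theta_hat)   # should be close to true_theta
```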
One drawback of standard linear regression is that it is SENSITIVE to outliers. If we instead build the cost function from the L1 norm, i.e. $J(\theta)=\frac{1}{2}\sum_i^m |h_\theta(x^i)-y^i|$, we reduce the influence of outliers; this gives robust linear regression.
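There is no closed-form solution for this L1 cost, but a simple subgradient-descent sketch works; the helper name `fit_robust`, the step size, and the iteration count below are mine, for illustration only:

```python
import numpy as np

def fit_robust(X_bar, y, lr=0.01, n_iter=2000):
    """Minimize 0.5 * sum_i |theta^T x_i - y_i| by subgradient descent.

    A subgradient of |r| is sign(r), so each step uses X^T sign(X theta - y),
    averaged over the m samples.
    """
    theta = np.zeros(X_bar.shape[1])
    for _ in range(n_iter):
        residual = X_bar @ theta - y
        theta -= lr * 0.5 * X_bar.T @ np.sign(residual) / len(y)
    return theta
```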
Another issue is that the parameter weights may grow too large, resulting in overfitting. One example is fitting a high-order polynomial model $\sum_{i=0}^{100}a_ix^{i}$: it might fit the training data perfectly, yet the curve is “wiggly”. If we constrain the parameters, the curve becomes smoother and generalizes better. How do we constrain them? If we modify the cost function to $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2 + \lambda ||\theta||_2^2$, we get ridge regression. If we modify it to $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2 + \lambda ||\theta||_1$, we get lasso regression.
[Special Remark: the bias term should not be penalized.]
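Ridge regression also has a closed-form solution. Here is a minimal NumPy sketch under the assumption that the bias column is the first column of the design matrix and is therefore left unpenalized (the function name `ridge_fit` is mine):

```python
import numpy as np

def ridge_fit(X_bar, y, lam):
    """Minimize 0.5 * ||X_bar @ theta - y||^2 + lam * ||theta[1:]||^2.

    X_bar: (m, n+1) design matrix whose first column is all ones (bias).
    Setting the gradient to zero gives (X^T X + 2*lam*D) theta = X^T y,
    where D is the identity matrix with the bias entry zeroed out.
    """
    D = np.eye(X_bar.shape[1])
    D[0, 0] = 0.0                      # do not penalize the bias term
    return np.linalg.solve(X_bar.T @ X_bar + 2 * lam * D, X_bar.T @ y)
```

Usage follows the earlier sketch, e.g. `theta_ridge = ridge_fit(X_bar, y, lam=1.0)`. Lasso has no closed form; it is typically solved with iterative methods such as coordinate descent (e.g. `sklearn.linear_model.Lasso`).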
Probabilistic Perspective
Alternatively, we can derive them from a probabilistic perspective. To be more specific, we want to find $\theta$ that maximizes the posterior (this is called the MAP approach; its difference from the MLE approach is that it EXPLICITLY assumes a prior distribution for $\theta$): $\theta^* = \arg \max_\theta p(\theta|D) = \arg \max_\theta p(D|\theta)p(\theta)/p(D) = \arg \max_\theta p(D|\theta)p(\theta)$. In other words, we want to find $\theta$ such that $L(\theta|D)\,p(\theta)$ is maximized, where $L$ is the likelihood.
The essence lies in what we assume for the conditional distribution $p(y|x,\theta)$ and the prior $p(\theta)$.
(1) For example, suppose we assume the relationship between $x$ and $y$ is $y = h_\theta(x) + e = \theta^T x + e$, with $e \sim N(0,\sigma^2)$, where $e$ is called the observation noise or residual error and is independent of $x$. Then $p(y|x,\theta) = N(y \mid \theta^T x, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\big(-\frac{(y-\theta^T x)^2}{2\sigma^2}\big)$. Suppose we further assume the parameters are uniformly distributed, i.e. equally likely in some range, say $[-M, M]$. Our problem becomes $\theta^* = \arg \max_\theta L(\theta|D)\cdot(\text{some constant}) = \arg \max_\theta \log L(\theta|D)$.
Since $\log L(\theta|D)=\sum_i^m \log p(y_i|x_i,\theta)=\sum_i^m \log N(y_i \mid \theta^Tx_i,\sigma^2) = -m\log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_i^m (y_i-\theta^T x_i)^2$, we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2$. (Oops! This is just what we derived for standard linear regression from the cost-function / deterministic perspective!)
(2) If we instead assume the noise $e$ follows a Laplace distribution, i.e. $p(y|x,\theta) = \mathrm{Lap}(y \mid \theta^T x, b) = \frac{1}{2b} \exp\big(-\frac{|y-\theta^T x|}{b}\big)$, with the parameters again uniformly distributed, then after a similar derivation we see $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m |y_i - \theta^T x_i|$. (Oops again! This is just what we derived for robust linear regression by using the L1 norm in the cost function.)
(3) If, on top of (1), we assume the parameters follow a Gaussian distribution $N(0, k^2)$, i.e. smaller parameter values are a priori more likely, then we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{2k^2}||\theta||_2^2$. Since $k$ is somewhat arbitrary, setting $\lambda = \frac{\sigma^2}{2k^2}$ gives us ridge regression. (So if we make $\lambda$ large, we implicitly assume the Gaussian prior is “taller and narrower”.)
(4) If, on top of (1), we assume the parameters follow a Laplace distribution $\mathrm{Lap}(0,b)$, then we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{b}||\theta||_1$. Again, setting $\lambda = \sigma^2/b$ gives us lasso regression.
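To make the step in (3) explicit, here is the short MAP derivation under the Gaussian-noise and Gaussian-prior assumptions above:

$$
\begin{aligned}
\theta^* &= \arg\max_\theta \; \log L(\theta|D) + \log p(\theta) \\
         &= \arg\max_\theta \; -\frac{1}{2\sigma^2}\sum_{i=1}^m (y_i-\theta^T x_i)^2 \;-\; \frac{1}{2k^2}\|\theta\|_2^2 \;+\; \text{const} \\
         &= \arg\min_\theta \; \frac{1}{2}\sum_{i=1}^m (y_i-\theta^T x_i)^2 \;+\; \frac{\sigma^2}{2k^2}\|\theta\|_2^2 ,
\end{aligned}
$$

which is exactly ridge regression with $\lambda = \sigma^2/(2k^2)$; replacing the Gaussian prior with a Laplace prior replaces $\frac{1}{2k^2}\|\theta\|_2^2$ by $\frac{1}{b}\|\theta\|_1$ and yields lasso, as in (4).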
In summary: from the deterministic perspective we design a suitable cost function; from the probabilistic perspective we use the MAP approach and assume different distributions for $p(y|x,\theta)$ and the prior $p(\theta)$.
Generalized Linear Regression
Up to now, linear regression might seem pretty basic. True, but it is a very fundamental building block: if we generalize it a bit, we get many interesting variations.
In symbolic language, the generalized linear regression is:
$$\mu(x|\theta)=g^{-1}(\theta^T \phi(x)), \qquad y(x|\theta) \sim f(\mu(x|\theta))$$
Clearly, by letting $g(x)=x$ and $f_\mu(y) = N(y \mid \mu, \sigma^2)$, we arrive at the standard linear regression. So linear regression is really just a special case of generalized linear regression.
Further, if $g^{-1}(z) = \frac{1}{1+e^{-z}}$ (the sigmoid function), so that $\mu = g^{-1}(\theta^T x)$, and $y \sim \mathrm{Bernoulli}(\mu)$, i.e. $f_\mu(y)$ gives $p(y=1)=\mu$, we arrive at logistic regression.
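A minimal NumPy sketch of this Bernoulli GLM (logistic regression), fit by gradient ascent on the log-likelihood; the function names, learning rate, and iteration count are mine, for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X_bar, y, lr=0.1, n_iter=1000):
    """Bernoulli GLM with sigmoid inverse link: p(y=1|x) = sigmoid(theta^T x).

    The gradient of the log-likelihood is X^T (y - mu); we ascend it with a
    fixed step, averaged over the m samples.
    X_bar: (m, d) design matrix (include a ones column for the bias), y in {0,1}^m.
    """
    theta = np.zeros(X_bar.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X_bar @ theta)              # predicted P(y=1 | x)
        theta += lr * X_bar.T @ (y - mu) / len(y)
    return theta
```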