ML(1) Linear Regression

Linear regression is one of the most fundamental algorithms in machine learning. This post explores standard, robust, ridge, and lasso regression from both deterministic and probabilistic perspectives, and introduces generalized linear regression. By adjusting the loss function we can reduce the influence of outliers (using the L1 norm on the residuals) and control the parameter weights (using an L2 or L1 penalty) to avoid overfitting.


Introduction

Linear regression is perhaps the most fundamental algorithm in machine learning. Given a dataset $D=\{(x^i, y^i) \mid x^i\in\mathbb{R}^n,\ y^i\in\mathbb{R}\}_{i=1}^m$ (where $x$ is the feature vector and $y$ is the label), we fit a model of the form $h_\theta(x)=\theta^T\phi(x)$, where $\theta$ is the parameter vector and $\phi(x)$ is a transformed feature vector (for example, $\phi(x)=[1, x_1, x_2, \dots, x_1x_2, \dots, x_nx_{n-1}]$). That is, the model is linear IN the parameters rather than in the input vector $x$, since feature transformations are allowed.
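To make the linear-in-parameters idea concrete, here is a minimal numpy sketch; the feature map `phi` below (bias, raw features, and one interaction term) is just an illustrative assumption, not something from the original post:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map: bias, raw features, and one interaction term."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2])

def h(theta, x):
    """Model h_theta(x) = theta^T phi(x): linear in theta, not in x."""
    return theta @ phi(x)

theta = np.array([0.5, 2.0, -1.0, 0.3])
print(h(theta, np.array([1.0, 3.0])))  # a single prediction
```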

Our goal is to fit the model $h_\theta(x) = \theta^T\phi(x)$ as well as possible. That is, after tuning the parameters, given an unseen $x^*$ we should have $h_\theta(x^*)\to y^*$. In a nutshell: find the BEST $\theta$.

Sometimes our model might fit the training dataset well, yet fail to generalize to unseen data. This is the problem of OVERFITTING. To address such issues, we can use robust linear regression, ridge regression, or lasso regression.

In what follows, I will derive the various linear regression models (standard, robust, ridge, lasso) from two perspectives (deterministic and probabilistic). Generalized linear regression will also be discussed.

Deterministic Perspective

Intuitively, we can let the cost function be $J(\theta)=\frac{1}{2}\sum_{i=1}^m \big(h_\theta(x^i)-y^i\big)^2$, also known as the residual sum of squares (RSS) or sum of squared errors (SSE). Clearly, $J$ is a convex function of $\theta$.

Then, (standard) linear regression is formulated as $\theta^* := \arg\min_\theta J(\theta)$. [How to solve it? 1. Gradient descent; 2. Analytically set $\partial J/\partial\theta = 0$. We get a particularly nice solution if $\bar{x}=[1,x]$ and $h_\theta(x)=\theta^T\bar{x}$: stacking the $\bar{x}_i^T$ as the rows of the design matrix $X$, we have $\partial J/\partial\theta = \sum_{i=1}^m\big(\theta^T\bar{x}_i - y_i\big)\bar{x}_i = X^TX\theta - X^Ty = 0 \;\Rightarrow\; \theta^* = (X^TX)^{-1}X^Ty$, provided $X^TX$ is invertible.]
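A minimal numpy sketch of both solution routes; the synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X_raw = rng.normal(size=(m, n))
true_theta = np.array([2.0, -1.0, 0.5, 3.0])          # includes the bias term
X = np.hstack([np.ones((m, 1)), X_raw])               # design matrix with a bias column
y = X @ true_theta + 0.1 * rng.normal(size=m)         # y = theta^T x_bar + noise

# 2. Analytic solution: theta* = (X^T X)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# 1. Gradient descent on J(theta) = 1/2 * sum_i (x_bar_i^T theta - y_i)^2
theta_gd = np.zeros(n + 1)
lr = 0.01
for _ in range(2000):
    grad = X.T @ (X @ theta_gd - y)
    theta_gd -= lr / m * grad

print(theta_closed, theta_gd)   # both should be close to true_theta
```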

One drawback of standard linear regression is that it is SENSITIVE to outliers. If we modify the cost function to use the L1 norm, i.e. $J(\theta)=\frac{1}{2}\sum_{i=1}^m \big|h_\theta(x^i)-y^i\big|$, we reduce the influence of outliers; this gives robust linear regression.
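The L1 cost has no closed-form solution, but a rough subgradient descent works as a sketch; the synthetic data, outlier injection, and step-size schedule below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 1))])   # bias + one feature
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=m)
y[:5] += 20.0                                                # a few gross outliers

def fit_l1(X, y, lr=0.01, iters=5000):
    """Rough subgradient descent on J(theta) = 1/2 * sum_i |x_i^T theta - y_i|."""
    theta = np.zeros(X.shape[1])
    for t in range(iters):
        subgrad = 0.5 * X.T @ np.sign(X @ theta - y)         # a subgradient of J
        theta -= lr / np.sqrt(t + 1) * subgrad               # decaying step size
    return theta

theta_l1 = fit_l1(X, y)
theta_l2 = np.linalg.solve(X.T @ X, X.T @ y)                 # ordinary least squares
print(theta_l1, theta_l2)   # the L1 fit should stay much closer to [1, 2]
```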

Another case is when the parameter weights become too large, resulting in overfitting. One example is fitting a high-order polynomial model $\sum_{i=0}^{100} a_i x^{i}$: it might fit the training data perfectly, yet the curve is "wiggly". If we control (shrink) the parameters, the curve becomes smoother and generalizes better. How do we control the parameters? If we modify the cost function to $J(\theta)=\frac{1}{2}\sum_{i=1}^m \big(h_\theta(x^i)-y^i\big)^2 + \lambda \lVert\theta\rVert_2^2$, we get ridge regression. If we modify it to $J(\theta)=\frac{1}{2}\sum_{i=1}^m \big(h_\theta(x^i)-y^i\big)^2 + \lambda \lVert\theta\rVert_1$, we get lasso regression.

[Special Remark: the bias term should not be penalized.]
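A minimal ridge-regression sketch with a closed-form solution that leaves the bias term unpenalized; the sine data, polynomial degree, and λ value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 30
x = rng.uniform(-1, 1, size=m)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=m)

degree, lam = 10, 0.1
X = np.vander(x, degree + 1, increasing=True)    # phi(x) = [1, x, x^2, ..., x^degree]

# Closed form (constant factors absorbed into lam): theta* = (X^T X + lam*P)^{-1} X^T y,
# where P is the identity except that the bias entry is zeroed so it is not penalized.
P = np.eye(degree + 1)
P[0, 0] = 0.0
theta_ridge = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.abs(theta_ols).max(), np.abs(theta_ridge).max())   # ridge weights are typically much smaller
```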

Probabilistic Perspective

Alternatively, we can derive these models from a probabilistic perspective. Specifically, we want to find $\theta$ that maximizes the posterior (this is the MAP approach; its difference from the MLE approach is that it EXPLICITLY assumes a prior distribution over $\theta$): $\theta^* = \arg\max_\theta p(\theta\mid D) = \arg\max_\theta p(D\mid\theta)\,p(\theta)/p(D) = \arg\max_\theta p(D\mid\theta)\,p(\theta)$. In other words, we want to find $\theta$ that maximizes $L(\theta\mid D)\,p(\theta)$, where $L$ is the likelihood.
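Written out explicitly, taking logarithms turns the MAP objective into a data-fit term plus a regularization term; this is the bridge used in cases (1)–(4) below:
$$
\theta^* = \arg\max_\theta \big[\log L(\theta\mid D) + \log p(\theta)\big]
         = \arg\min_\theta \Big[-\sum_{i=1}^m \log p(y_i\mid x_i,\theta) \;-\; \log p(\theta)\Big].
$$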

The essence lies in what we assume for the conditional distribution $p(y\mid x,\theta)$ and the prior $p(\theta)$.

(1) For example, assume the relationship between $x$ and $y$ is $y = \theta^T x + e$, $e \sim N(0,\sigma^2)$ (writing $\theta^T x$ for $h_\theta(x)$ for brevity), where $e$ is called the observation noise or residual error and is independent of $x$. Therefore $p(y\mid x,\theta) = N(y\mid\theta^T x, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y-\theta^T x)^2}{2\sigma^2}\right)$. If we further assume the parameters are uniformly distributed, i.e. equally likely over some range, say $[-M, M]$, the prior contributes only a constant, and the problem becomes $\theta^* = \arg\max_\theta \big[\log L(\theta\mid D) + \text{const}\big] = \arg\max_\theta \log L(\theta\mid D)$.

Since $\log L(\theta\mid D)=\sum_{i=1}^m \log p(y_i\mid x_i,\theta)=\sum_{i=1}^m \log N(y_i\mid\theta^T x_i,\sigma^2) = -m\log\big(\sqrt{2\pi}\,\sigma\big) - \frac{1}{2\sigma^2}\sum_{i=1}^m (y_i-\theta^T x_i)^2$, we obtain $\theta^* = \arg\min_\theta \frac{1}{2}\sum_{i=1}^m (y_i - \theta^T x_i)^2$. (Oops! This is exactly what we derived for standard linear regression from the cost-function / deterministic perspective!)

(2) If we instead assume the noise $e$ follows a Laplace distribution, i.e. $p(y\mid x,\theta) = \mathrm{Lap}(y\mid\theta^T x, b) = \frac{1}{2b}\exp\!\left(-\frac{|y-\theta^T x|}{b}\right)$, and keep the uniform prior on the parameters, then after a similar derivation we get $\theta^* = \arg\min_\theta \frac{1}{2}\sum_{i=1}^m |y_i - \theta^T x_i|$. (Oops again! This is exactly what we derived for robust linear regression by using the L1 norm in the cost function.)

(3) If, on the basis of (1), we assume the parameters follow a Gaussian distribution $N(0, k^2)$ (i.e. small parameters are more likely a priori), then we get $\theta^* = \arg\min_\theta \frac{1}{2}\sum_{i=1}^m (y_i - \theta^T x_i)^2 + \frac{1}{2k^2}\lVert\theta\rVert_2^2$. Since $k$ is somewhat arbitrary, setting $\frac{1}{2k^2} = \lambda$ gives ridge regression. (So a large $\lambda$ implicitly assumes that the Gaussian prior is "taller and narrower".)

(4) If, on the basis of (1), we assume the parameters follow a Laplace distribution $\mathrm{Lap}(0, b)$, then we get $\theta^* = \arg\min_\theta \frac{1}{2}\sum_{i=1}^m (y_i - \theta^T x_i)^2 + \frac{1}{b}\lVert\theta\rVert_1$. Again, setting $1/b = \lambda$ gives lasso regression.
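A minimal lasso sketch using proximal gradient descent (ISTA) with the soft-thresholding operator; the data, λ value, and iteration count are illustrative assumptions, and the bias column is left unpenalized:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam=1.0, iters=2000):
    """Minimize 1/2 * ||X theta - y||^2 + lam * ||theta[1:]||_1 (bias unpenalized)."""
    L = np.linalg.norm(X, ord=2) ** 2           # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)
        z = theta - grad / L                    # gradient step
        theta = soft_threshold(z, lam / L)      # proximal step
        theta[0] = z[0]                         # do not shrink the bias term
    return theta

rng = np.random.default_rng(3)
m, n = 100, 20
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.zeros(n + 1)
true_theta[[0, 3, 7]] = [1.0, 4.0, -2.0]        # sparse ground truth
y = X @ true_theta + 0.1 * rng.normal(size=m)

theta_lasso = lasso_ista(X, y, lam=5.0)
print(np.round(theta_lasso, 2))                 # most coefficients driven to (near) zero
```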

In summary: from the deterministic perspective, we design a suitable cost function; from the probabilistic perspective, we use the MAP approach and assume different distributions for $p(y\mid x,\theta)$ and the prior $p(\theta)$.

Generalized Linear Regression

Up to now, you might think linear regression is pretty basic. True, but it is a very fundamental building block: if we generalize it a bit, we get many interesting variations.

In symbolic form, generalized linear regression is:
$$\mu(x\mid\theta) = g^{-1}\big(\theta^T\phi(x)\big), \qquad y \mid x,\theta \sim f\big(\mu(x\mid\theta)\big),$$
where $g$ is the link function and $f$ is the output distribution with mean $\mu$.
Clearly, by letting $g(x)=x$ and $f_\mu(y) = N(y\mid\mu, \sigma^2)$ we arrive at standard linear regression. So linear regression is really just a special case of the generalized linear model.

Further, if $\mu = g^{-1}(\theta^T x)$ where $g^{-1}(z) = \frac{1}{1+e^{-z}}$ (the sigmoid function) and $f_\mu(y) = \mathrm{Bernoulli}(\mu)$, i.e. $p(y=1)=\mu$, we obtain logistic regression.
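A minimal sketch of this GLM special case, fitting logistic regression by gradient ascent on the Bernoulli log-likelihood; the synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
m = 200
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=m) < sigmoid(X @ true_theta)).astype(float)  # Bernoulli(mu) labels

# Gradient ascent on sum_i [ y_i*log(mu_i) + (1 - y_i)*log(1 - mu_i) ]
theta = np.zeros(3)
lr = 0.1
for _ in range(2000):
    mu = sigmoid(X @ theta)                 # mu = g^{-1}(theta^T x)
    grad = X.T @ (y - mu)                   # gradient of the log-likelihood
    theta += lr / m * grad

print(theta)                                # should be roughly close to true_theta
```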

Reference

https://machinelearningmastery.com/maximum-a-posteriori-estimation/

https://www.cnblogs.com/easoncheng/archive/2012/11/08/2760675.html

https://www.datasciencecentral.com/profiles/blogs/explaining-logistic-regression-as-generalized-linear-model-in-use
