Introduction
Linear regression is perhaps the most fundamental algorithm in machine learning. In this setting, given a dataset $D=\{(x^i,y^i) \mid x^i\in \mathbb{R}^n,\; y^i\in\mathbb{R}\}_{i=1}^m$ (x is the feature vector, y is the label), we fit a model of the form $h_\theta(x) = \theta^T\phi(x)$, where $\theta$ is the parameter vector and $\phi(x)$ is a transformed feature vector (for example, $\phi(x) = [1, x_1, x_2, \dots, x_1x_2, \dots, x_nx_{n-1}]$). That is, the model is linear IN TERMS OF the parameters rather than the input vector $x$, since feature transformation is allowed.
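As a concrete illustration of such a $\phi(x)$, here is a minimal Python sketch that builds degree-2 polynomial features with a bias term (the helper name `poly2_features` is mine, for illustration only):

```python
import numpy as np

def poly2_features(x):
    """Map x in R^n to [1, x_1, ..., x_n, x_i * x_j for i <= j] (degree-2 features)."""
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], x, cross))

x = np.array([2.0, 3.0])
print(poly2_features(x))  # [1. 2. 3. 4. 6. 9.]
```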
Our goal is to fit the model $h_\theta(x) = \theta^T\phi(x)$ as well as possible. That is, after tuning the parameters, given an unseen $x^*$, we should have $h_\theta(x^*) \to y^*$. In a nutshell, find the BEST $\theta$.
Sometimes our model fits the training dataset well, yet fails to generalize to unseen data. This is the problem of OVERFITTING. To address it, we can use regularized variants such as ridge and lasso regression; robust linear regression, also covered below, instead tackles a related weakness: sensitivity to outliers.
In what follows, I will derive the various linear regression models (standard, robust, ridge, lasso) from two perspectives: deterministic and probabilistic. Generalized linear regression will also be discussed.
Deterministic Perspective
Intuitively, we could let our cost function be $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2$, also known as the residual sum of squares (RSS) or sum of squared errors (SSE). Since $h_\theta$ is linear in $\theta$, $J$ is a convex function of $\theta$.
Then, (standard) linear regression is formulated as $\theta^* := \arg \min_\theta J(\theta)$. [How to solve it? 1. Gradient descent; 2. Analytically set $\partial J/\partial \theta=0$. We get a particularly nice solution if $\bar{x} = [1,x]$ and $h_\theta(x)=\theta^T\bar{x}$: $\partial J/\partial \theta = \sum_i^m(\theta^T\bar{x}_i-y_i)\bar{x}_i=X^TX\theta - X^Ty=0 \Rightarrow \theta^* = (X^TX)^{-1}X^Ty$, where $X$ is the design matrix whose rows are $\bar{x}_i^T$ and $y$ is the vector of labels.]
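A minimal NumPy sketch of this closed-form solution (assuming $X^TX$ is invertible; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Toy data: m samples, n features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([1.0, 2.0, -0.5, 0.3])          # bias + 3 weights
y = true_theta[0] + X @ true_theta[1:] + 0.1 * rng.normal(size=50)

X_bar = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend the bias column
theta_hat = np.linalg.solve(X_bar.T @ X_bar, X_bar.T @ y)   # (X^T X)^{-1} X^T y
print(theta_hat)   # should be close to true_theta
```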
One drawback of standard linear regression is that it is SENSITIVE to outliers. If we instead build the cost function from the L1 norm, i.e. $J(\theta)=\frac{1}{2}\sum_i^m |h_\theta(x^i)-y^i|$, we reduce the influence of outliers; this gives robust linear regression.
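There is no closed-form solution for this L1 cost, but a simple subgradient-descent sketch works; the helper name `fit_robust`, the step size, and the iteration count below are mine, for illustration only:

```python
import numpy as np

def fit_robust(X_bar, y, lr=0.01, n_iter=2000):
    """Minimize 0.5 * sum_i |theta^T x_i - y_i| by subgradient descent.

    A subgradient of |r| is sign(r), so each step uses X^T sign(X theta - y),
    averaged over the m samples.
    """
    theta = np.zeros(X_bar.shape[1])
    for _ in range(n_iter):
        residual = X_bar @ theta - y
        theta -= lr * 0.5 * X_bar.T @ np.sign(residual) / len(y)
    return theta
```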
Another issue is that the parameter weights may grow too large, resulting in overfitting. One example is fitting a high-order polynomial model $\sum_{i=0}^{100}a_ix^{i}$: it might fit the training data perfectly, yet the curve is “wiggly”. If we constrain the parameters, the curve becomes smoother and generalizes better. How do we constrain them? If we modify the cost function to $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2 + \lambda ||\theta||_2^2$, we get ridge regression. If we modify it to $J(\theta)=\frac{1}{2}\sum_i^m (h_\theta(x^i)-y^i)^2 + \lambda ||\theta||_1$, we get lasso regression.
[Special Remark: the bias term should not be penalized.]
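Ridge regression also has a closed-form solution. Here is a minimal NumPy sketch under the assumption that the bias column is the first column of the design matrix and is therefore left unpenalized (the function name `ridge_fit` is mine):

```python
import numpy as np

def ridge_fit(X_bar, y, lam):
    """Minimize 0.5 * ||X_bar @ theta - y||^2 + lam * ||theta[1:]||^2.

    X_bar: (m, n+1) design matrix whose first column is all ones (bias).
    Setting the gradient to zero gives (X^T X + 2*lam*D) theta = X^T y,
    where D is the identity matrix with the bias entry zeroed out.
    """
    D = np.eye(X_bar.shape[1])
    D[0, 0] = 0.0                      # do not penalize the bias term
    return np.linalg.solve(X_bar.T @ X_bar + 2 * lam * D, X_bar.T @ y)
```

Usage follows the earlier sketch, e.g. `theta_ridge = ridge_fit(X_bar, y, lam=1.0)`. Lasso has no closed form; it is typically solved with iterative methods such as coordinate descent (e.g. `sklearn.linear_model.Lasso`).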
Probabilistic Perspective
Alternatively, we can derive them from a probabilistic perspective. To be more specific, we want to find $\theta$ that maximizes the posterior (this is called the MAP approach; its difference from the MLE approach is that it EXPLICITLY assumes a prior distribution for $\theta$): $\theta^* = \arg \max_\theta p(\theta|D) = \arg \max_\theta p(D|\theta)p(\theta)/p(D) = \arg \max_\theta p(D|\theta)p(\theta)$. In other words, we want to find $\theta$ such that $L(\theta|D)\,p(\theta)$ is maximized, where $L$ is the likelihood.
The essence lies in what we assume for the conditional distribution $p(y|x,\theta)$ and the prior $p(\theta)$.
(1) For example, suppose we assume the relationship between $x$ and $y$ is $y = h_\theta(x) + e = \theta^T x + e$, with $e \sim N(0,\sigma^2)$, where $e$ is called the observation noise or residual error and is independent of $x$. Then $p(y|x,\theta) = N(y \mid \theta^T x, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\big(-\frac{(y-\theta^T x)^2}{2\sigma^2}\big)$. Suppose we further assume the parameters are uniformly distributed, i.e. equally likely in some range, say $[-M, M]$. Our problem becomes $\theta^* = \arg \max_\theta L(\theta|D)\cdot(\text{some constant}) = \arg \max_\theta \log L(\theta|D)$.
Since $\log L(\theta|D)=\sum_i^m \log p(y_i|x_i,\theta)=\sum_i^m \log N(y_i \mid \theta^Tx_i,\sigma^2) = -m\log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2} \sum_i^m (y_i-\theta^T x_i)^2$, we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2$. (Oops! This is just what we derived for standard linear regression from the cost-function / deterministic perspective!)
(2) If we instead assume the noise $e$ follows a Laplace distribution, i.e. $p(y|x,\theta) = \mathrm{Lap}(y \mid \theta^T x, b) = \frac{1}{2b} \exp\big(-\frac{|y-\theta^T x|}{b}\big)$, with the parameters again uniformly distributed, then after a similar derivation we see $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m |y_i - \theta^T x_i|$. (Oops again! This is just what we derived for robust linear regression by using the L1 norm in the cost function.)
(3) If, on top of (1), we assume the parameters follow a Gaussian distribution $N(0, k^2)$, i.e. smaller parameter values are a priori more likely, then we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{2k^2}||\theta||_2^2$. Since $k$ is somewhat arbitrary, setting $\lambda = \frac{\sigma^2}{2k^2}$ gives us ridge regression. (So if we make $\lambda$ large, we implicitly assume the Gaussian prior is “taller and narrower”.)
(4) If, on top of (1), we assume the parameters follow a Laplace distribution $\mathrm{Lap}(0,b)$, then we get $\theta^* = \arg \min_\theta \frac{1}{2}\sum_i^m (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{b}||\theta||_1$. Again, setting $\lambda = \sigma^2/b$ gives us lasso regression.
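To make the step in (3) explicit, here is the short MAP derivation under the Gaussian-noise and Gaussian-prior assumptions above:

$$
\begin{aligned}
\theta^* &= \arg\max_\theta \; \log L(\theta|D) + \log p(\theta) \\
         &= \arg\max_\theta \; -\frac{1}{2\sigma^2}\sum_{i=1}^m (y_i-\theta^T x_i)^2 \;-\; \frac{1}{2k^2}\|\theta\|_2^2 \;+\; \text{const} \\
         &= \arg\min_\theta \; \frac{1}{2}\sum_{i=1}^m (y_i-\theta^T x_i)^2 \;+\; \frac{\sigma^2}{2k^2}\|\theta\|_2^2 ,
\end{aligned}
$$

which is exactly ridge regression with $\lambda = \sigma^2/(2k^2)$; replacing the Gaussian prior with a Laplace prior replaces $\frac{1}{2k^2}\|\theta\|_2^2$ by $\frac{1}{b}\|\theta\|_1$ and yields lasso, as in (4).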
In summary: from the deterministic perspective we design a suitable cost function; from the probabilistic perspective we use the MAP approach and assume different distributions for $p(y|x,\theta)$ and the prior $p(\theta)$.
Generalized Linear Regression
Up to now, linear regression might seem pretty basic. True, but it is a very fundamental building block: if we generalize it a bit, we get many interesting variations.
In symbolic language, the generalized linear regression is:
$$\mu(x|\theta)=g^{-1}(\theta^T \phi(x)), \qquad y(x|\theta) \sim f(\mu(x|\theta))$$
Clearly, by letting $g(x)=x$ and $f_\mu(y) = N(y \mid \mu, \sigma^2)$, we arrive at the standard linear regression. So linear regression is really just a special case of generalized linear regression.
Further, if $g^{-1}(z) = \frac{1}{1+e^{-z}}$ (the sigmoid function), so that $\mu = g^{-1}(\theta^T x)$, and $y \sim \mathrm{Bernoulli}(\mu)$, i.e. $f_\mu(y)$ gives $p(y=1)=\mu$, we arrive at logistic regression.
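A minimal NumPy sketch of this Bernoulli GLM (logistic regression), fit by gradient ascent on the log-likelihood; the function names, learning rate, and iteration count are mine, for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X_bar, y, lr=0.1, n_iter=1000):
    """Bernoulli GLM with sigmoid inverse link: p(y=1|x) = sigmoid(theta^T x).

    The gradient of the log-likelihood is X^T (y - mu); we ascend it with a
    fixed step, averaged over the m samples.
    X_bar: (m, d) design matrix (include a ones column for the bias), y in {0,1}^m.
    """
    theta = np.zeros(X_bar.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X_bar @ theta)              # predicted P(y=1 | x)
        theta += lr * X_bar.T @ (y - mu) / len(y)
    return theta
```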