Maximum Likelihood for Linear Regression
Suppose we have a data set $S = \{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ (each $x$ has $n$ features) and

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. Let's assume that the $\epsilon^{(i)}$'s are distributed i.i.d. (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$, written $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. The pdf of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
Since $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$, the pdf can also be written as
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
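As a quick sanity check, this model is easy to simulate. The sketch below draws synthetic data from $y = \theta^T x + \epsilon$ with Gaussian noise and evaluates the conditional density; the values of `theta_true`, `sigma`, `m`, and `n` are illustrative choices, not from the text.

```python
import numpy as np

# Simulate the assumed model: y = theta^T x + eps, eps ~ N(0, sigma^2).
# theta_true, sigma, m, n are illustrative values.
rng = np.random.default_rng(0)
m, n = 100, 3
theta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3

X = rng.normal(size=(m, n))            # rows are the x^(i)'s
eps = rng.normal(0.0, sigma, size=m)   # i.i.d. Gaussian noise
y = X @ theta_true + eps               # y^(i) = theta^T x^(i) + eps^(i)

def p_y_given_x(y_i, x_i, theta, sigma):
    """Density p(y^(i) | x^(i); theta), per the Gaussian formula above."""
    mu = theta @ x_i
    return np.exp(-(y_i - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```

At $y^{(i)} = \theta^T x^{(i)}$ the density attains its maximum value $1 / (\sqrt{2\pi}\,\sigma)$, which is a convenient spot check.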
Notice that the notation $p(y^{(i)} \mid x^{(i)}; \theta)$ indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$. Here $\theta$ is not a random variable, so the formula is not a probability conditioned on $\theta$. We can write the distribution as $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. Given an input matrix $X = (x^{(1)}, x^{(2)}, \dots, x^{(m)})^T$ and $\theta$, the distribution of the $y^{(i)}$'s is given by $p(\vec{y} \mid X; \theta)$. When we wish to explicitly view this as a function of $\theta$, we call it the likelihood function:
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$
Note that by the independence assumption on the $\epsilon^{(i)}$'s, this can be written as
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
Now we have a probabilistic model relating the $y^{(i)}$'s and the $x^{(i)}$'s. The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as probable as possible, so we face an optimization problem:
$$\max_{\theta} L(\theta)$$
Since $\log$ is strictly increasing, maximizing $L(\theta)$ is equivalent to maximizing any monotone transform of it. We therefore define the log likelihood:
$$\begin{aligned}
\ell(\theta) = \log L(\theta)
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}$$
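The last step of this derivation is easy to verify numerically: summing the per-example log densities should reproduce the closed form. The sketch below uses synthetic, purely illustrative data.

```python
import numpy as np

# Check that sum_i log p(y^(i) | x^(i); theta) equals the closed form
# m*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum of squared residuals.
# All data here is synthetic and illustrative.
rng = np.random.default_rng(1)
m, n = 50, 2
X = rng.normal(size=(m, n))
theta = np.array([1.5, -0.7])
sigma = 0.5
y = X @ theta + rng.normal(0.0, sigma, size=m)

def log_likelihood(theta, X, y, sigma):
    # Sum over i of log p(y^(i) | x^(i); theta)
    resid = y - X @ theta
    return np.sum(-resid ** 2 / (2 * sigma ** 2) - np.log(np.sqrt(2 * np.pi) * sigma))

closed_form = m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) \
    - np.sum((y - X @ theta) ** 2) / (2 * sigma ** 2)
```

The two quantities agree to floating-point precision for any choice of $\theta$ and $\sigma > 0$.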
Since the first term does not depend on $\theta$, maximizing $\ell(\theta)$ is the same as minimizing the sum of squared residuals. Scaling the objective does not change the minimizer, so we may also divide by $m$ and view it as an empirical expectation. The maximum likelihood estimate is therefore

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

which is exactly the least-squares cost function: under the Gaussian noise assumption, maximum likelihood estimation of $\theta$ corresponds to ordinary least squares.
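One standard way to solve this minimization in closed form (not derived in this excerpt) is via the normal equations $\hat{\theta} = (X^T X)^{-1} X^T y$. The sketch below applies them to synthetic data; `theta_true` and the noise level are illustrative choices.

```python
import numpy as np

# Solve arg min_theta sum_i (y^(i) - theta^T x^(i))^2 via the normal
# equations theta_hat = (X^T X)^{-1} X^T y, on synthetic data.
rng = np.random.default_rng(2)
m, n = 200, 3
theta_true = np.array([1.0, 2.0, -0.5])
X = rng.normal(size=(m, n))
y = X @ theta_true + rng.normal(0.0, 0.1, size=m)

# Solve the linear system (X^T X) theta = X^T y instead of forming an
# explicit inverse, which is cheaper and more numerically stable.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and enough samples, `theta_hat` lands close to `theta_true`, and it matches the least-squares solution returned by `np.linalg.lstsq`.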