The Basic Form of Linear Regression
Given an instance described by $d$ attributes, $\boldsymbol{x}=(x_1, x_2, \ldots, x_d)$, where $x_i$ is the value of $\boldsymbol{x}$ on the $i$-th attribute, a linear model tries to learn a function that predicts via a linear combination of the attributes, i.e.:
$$f(\boldsymbol{x})=w_{1} x_{1}+w_{2} x_{2}+\ldots+w_{d} x_{d}+b$$
This is the same linear regression found in probability theory; it is also commonly used when fitting data from physics experiments.
Univariate (simple) linear regression is the case where the input has only one attribute.
For discrete attributes: if an order relation exists (i.e., the values can be ranked), they can be converted into continuous values, e.g. "good-looking" = 1, "not good-looking" = 0. If no order relation exists, an attribute with $k$ possible values can be converted into a $k$-dimensional vector with a 1 in the position corresponding to the value (one-hot encoding).
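Both conversions can be sketched in a few lines (the attribute names and categories below are made up for illustration):

```python
# Ordered discrete attribute: map ranked categories to continuous values.
height_order = {"short": 0.0, "medium": 0.5, "tall": 1.0}  # hypothetical ranking

# Unordered discrete attribute with k values: one-hot encode into a k-dim vector.
def one_hot(value, categories):
    """Return a k-dimensional 0/1 list with a 1 at the value's position."""
    return [1 if value == c else 0 for c in categories]

colors = ["green", "black", "white"]  # hypothetical unordered attribute
print(height_order["tall"])      # 1.0
print(one_hot("black", colors))  # [0, 1, 0]
```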
Univariate Linear Regression
Least-Squares Estimation
Linear regression tries to learn
$$f(x_i)=w x_i+b$$
and the task is to solve for $w$ and $b$.
The method of solving for the model by minimizing the mean squared error is called the "least-squares method":
$$\begin{aligned} E_{(w, b)} &=\sum_{i=1}^{m}\left(y_{i}-f\left(x_{i}\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-\left(w x_{i}+b\right)\right)^{2} \\ &=\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2} \end{aligned}$$
The goal is the pair $(w, b)$ that minimizes this expression.
The least-squares method tries to find a line that minimizes the total Euclidean distance from all samples to the line.
Maximum Likelihood Estimation
Maximum likelihood estimation is used to estimate the parameter values of a probability distribution.
The method: for a discrete (resp. continuous) random variable $X$, suppose its probability mass function is $P(x ; \theta)$ (resp. its probability density function is $p(x ; \theta)$), where $\theta$ denotes the parameter(s) to be estimated (there may be several). Given $x_{1}, x_{2}, x_{3}, \ldots, x_{n}$, $n$ independent and identically distributed samples from $X$, their joint probability is
$$L(\theta)=\prod_{i=1}^{n} P\left(x_{i} ; \theta\right)$$
The $\theta$ that maximizes $L(\theta)$ is the desired parameter estimate.
Since this is a product, one can take the logarithm to turn it into a sum, then differentiate to find the maximizer.
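A small sketch of this recipe, assuming i.i.d. Gaussian samples with known variance (a standard textbook case, not from the text above): the logarithm turns the product of densities into a sum, and the maximizer lands on the sample mean.

```python
import math

def gaussian_log_likelihood(mu, samples, sigma=1.0):
    """ln L(mu) = sum over samples of ln p(x_i; mu) for a N(mu, sigma^2) model."""
    return sum(
        -math.log(math.sqrt(2 * math.pi) * sigma) - (x - mu) ** 2 / (2 * sigma**2)
        for x in samples
    )

samples = [2.1, 1.9, 2.3, 2.0, 1.7]
# Grid-search mu; the maximizer should coincide with the sample mean.
grid = [i / 1000 for i in range(1000, 3000)]
mu_hat = max(grid, key=lambda mu: gaussian_log_likelihood(mu, samples))
print(mu_hat, sum(samples) / len(samples))
```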
Deriving the Cost Function
For the linear regression model, introduce an error term:
$$y=w x+b+\epsilon$$
where $\epsilon$ is an uncontrollable random error, usually assumed to follow a normal distribution with zero mean, $\epsilon \sim N\left(0, \sigma^{2}\right)$ (an assumption due to Gauss; it can also be motivated by the central limit theorem). The probability density of $\epsilon$ is therefore
$$p(\epsilon)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\epsilon^{2}}{2 \sigma^{2}}\right)$$
Substituting $y-(w x+b)$ for $\epsilon$ gives
$$p(y)=\frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{(y-(w x+b))^{2}}{2 \sigma^{2}}\right)$$
This can clearly be read as $y \sim N\left(w x+b, \sigma^{2}\right)$, so maximum likelihood estimation can now be used to estimate $w$ and $b$. The likelihood function is
$$\begin{aligned} L(w, b)=& \prod_{i=1}^{m} p\left(y_{i}\right)=\prod_{i=1}^{m} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y_{i}-\left(w x_{i}+b\right)\right)^{2}}{2 \sigma^{2}}\right) \\ \ln L(w, b) &=\sum_{i=1}^{m} \ln \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{\left(y_{i}-w x_{i}-b\right)^{2}}{2 \sigma^{2}}\right) \\ &=\sum_{i=1}^{m} \ln \frac{1}{\sqrt{2 \pi} \sigma}+\sum_{i=1}^{m} \ln \exp \left(-\frac{\left(y_{i}-w x_{i}-b\right)^{2}}{2 \sigma^{2}}\right) \end{aligned}$$
Finally,
$$\left(w^{*}, b^{*}\right)=\underset{(w, b)}{\arg \max } \ln L(w, b)=\underset{(w, b)}{\arg \min } \sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}$$
which arrives at the same objective as the least-squares method.
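A quick numerical check of this equivalence on made-up data (assuming $\sigma = 1$): minimizing the squared error and maximizing the Gaussian log-likelihood select the same point on a parameter grid, because $\ln L$ is a constant minus the squared error divided by $2\sigma^2$.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.9, 5.1, 7.0, 9.2]  # roughly y = 2x + 1 plus noise

def squared_error(w, b):
    return sum((y - w * x - b) ** 2 for x, y in zip(xs, ys))

def log_likelihood(w, b, sigma=1.0):
    m = len(xs)
    # ln L = -m*ln(sqrt(2*pi)*sigma) - E(w,b) / (2*sigma^2)
    return -m * math.log(math.sqrt(2 * math.pi) * sigma) - squared_error(w, b) / (2 * sigma**2)

grid = [(w / 50, b / 50) for w in range(0, 200) for b in range(0, 200)]
best_ls = min(grid, key=lambda p: squared_error(*p))   # least squares
best_ml = max(grid, key=lambda p: log_likelihood(*p))  # maximum likelihood
print(best_ls, best_ml)
```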
Solving for w and b
- Convex sets and convex functions
Convex set: let $D \subset \mathbb{R}^{n}$. If for any $\boldsymbol{x}, \boldsymbol{y} \in D$ and any $\alpha \in[0,1]$,
$$\alpha \boldsymbol{x}+(1-\alpha) \boldsymbol{y} \in D,$$
then $D$ is called a convex set. Geometrically: if two points belong to the set, then every point on the line segment joining them also belongs to the set (a figure would help here). Common convex sets include the empty set $\varnothing$ and the $n$-dimensional Euclidean space $\mathbb{R}^{n}$.
Convex function: let $D$ be a nonempty convex set and $f$ a function defined on $D$. If for any $\boldsymbol{x}^{1}, \boldsymbol{x}^{2} \in D$ and $\alpha \in(0,1)$,
$$f\left(\alpha \boldsymbol{x}^{1}+(1-\alpha) \boldsymbol{x}^{2}\right) \leqslant \alpha f\left(\boldsymbol{x}^{1}\right)+(1-\alpha) f\left(\boldsymbol{x}^{2}\right),$$
then $f$ is a convex function on $D$.
- Gradient and Hessian matrix
The gradient is the vector of a function's first-order partial derivatives with respect to its variables. The Hessian collects the second-order partial derivatives into a matrix:
$$\nabla^{2} f(\boldsymbol{x})=\left[\begin{array}{cccc}\frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1}^{2}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f(\boldsymbol{x})}{\partial x_{n}^{2}}\end{array}\right]$$
Theorem: let $D \subset \mathbb{R}^{n}$ be a nonempty open convex set and $f: D \subset \mathbb{R}^{n} \rightarrow \mathbb{R}$ twice continuously differentiable on $D$. If the Hessian matrix of $f(\boldsymbol{x})$ is positive semidefinite on $D$, then $f(\boldsymbol{x})$ is a convex function on $D$ (compare judging the convexity of a one-variable function by its second derivative).
Therefore it suffices to show that the Hessian matrix is positive semidefinite; then $E_{(w, b)}$ is a convex function of $w$ and $b$.
$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial w} &=\frac{\partial}{\partial w}\left[\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial w}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot\left(-x_{i}\right) \\ &=2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right) \end{aligned}$$

$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial w^{2}} &=\frac{\partial}{\partial w}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\ &=\frac{\partial}{\partial w}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial w}\left(2 w \sum_{i=1}^{m} x_{i}^{2}\right) \\ &=2 \sum_{i=1}^{m} x_{i}^{2} \end{aligned}$$
$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial w \partial b} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial w}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(w \sum_{i=1}^{m} x_{i}^{2}-\sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right)\right] \\ &=\frac{\partial}{\partial b}\left[-2 \sum_{i=1}^{m}\left(y_{i}-b\right) x_{i}\right] \\ &=\frac{\partial}{\partial b}\left(-2 \sum_{i=1}^{m} y_{i} x_{i}+2 \sum_{i=1}^{m} b x_{i}\right) \\ &=2 \sum_{i=1}^{m} x_{i} \end{aligned}$$
$$\begin{aligned} \frac{\partial E_{(w, b)}}{\partial b} &=\frac{\partial}{\partial b}\left[\sum_{i=1}^{m}\left(y_{i}-w x_{i}-b\right)^{2}\right] \\ &=\sum_{i=1}^{m} \frac{\partial}{\partial b}\left(y_{i}-w x_{i}-b\right)^{2} \\ &=\sum_{i=1}^{m} 2 \cdot\left(y_{i}-w x_{i}-b\right) \cdot(-1) \\ &=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right) \end{aligned}$$
$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial b \partial w} &=\frac{\partial}{\partial w}\left(\frac{\partial E_{(w, b)}}{\partial b}\right) \\ &=\frac{\partial}{\partial w}\left[2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\right] \\ &=\frac{\partial}{\partial w}\left(2 \sum_{i=1}^{m} w x_{i}\right) \\ &=2 \sum_{i=1}^{m} x_{i} \end{aligned}$$
$$\begin{aligned} \frac{\partial^{2} E_{(w, b)}}{\partial b^{2}} &=\frac{\partial}{\partial b}\left(\frac{\partial E_{(w, b)}}{\partial b}\right) \\ &=\frac{\partial}{\partial b}\left[2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)\right] \\ &=2 m \end{aligned}$$
$$\nabla^{2} E_{(w, b)}=\left[\begin{array}{cc}\frac{\partial^{2} E_{(w, b)}}{\partial w^{2}} & \frac{\partial^{2} E_{(w, b)}}{\partial w \partial b} \\ \frac{\partial^{2} E_{(w, b)}}{\partial b \partial w} & \frac{\partial^{2} E_{(w, b)}}{\partial b^{2}}\end{array}\right]=\left[\begin{array}{cc}2 \sum_{i=1}^{m} x_{i}^{2} & 2 \sum_{i=1}^{m} x_{i} \\ 2 \sum_{i=1}^{m} x_{i} & 2 m\end{array}\right]$$
It can be shown that this matrix is positive semidefinite (each leading principal minor is $\geq 0$).
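A small numerical check of this claim on made-up sample values, computing the two leading principal minors of the 2×2 Hessian:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up attribute values
m = len(xs)
sx = sum(xs)
sx2 = sum(x * x for x in xs)

# Hessian of E(w, b): [[2*sum x_i^2, 2*sum x_i], [2*sum x_i, 2*m]]
H = [[2 * sx2, 2 * sx], [2 * sx, 2 * m]]

minor1 = H[0][0]                                # first leading principal minor
minor2 = H[0][0] * H[1][1] - H[0][1] * H[1][0]  # determinant
# Both are nonnegative: minor2 = 4*(m*sum x_i^2 - (sum x_i)^2) >= 0
# by the Cauchy-Schwarz inequality.
print(minor1, minor2)  # 110.0 200.0
```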
By the convexity sufficiency theorem, the $w$ and $b$ at which the partial derivatives equal 0 are the desired solution.
Therefore:
$$\frac{\partial E_{(w, b)}}{\partial b}=2\left(m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)\right)=0$$
$$m b-\sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)=0$$
$$b=\frac{1}{m} \sum_{i=1}^{m}\left(y_{i}-w x_{i}\right)$$

$$b=\frac{1}{m} \sum_{i=1}^{m} y_{i}-w \cdot \frac{1}{m} \sum_{i=1}^{m} x_{i}=\bar{y}-w \bar{x}$$
Similarly, setting $\frac{\partial E_{(w, b)}}{\partial w}=0$ and substituting the expression for $b$ gives
$$w=\frac{\sum_{i=1}^{m} y_{i} x_{i}-\bar{x} \sum_{i=1}^{m} y_{i}}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}=\frac{\sum_{i=1}^{m} y_{i}\left(x_{i}-\bar{x}\right)}{\sum_{i=1}^{m} x_{i}^{2}-\frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}$$
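The closed-form formulas for $w$ and $b$ can be sketched directly in pure Python (toy data, roughly $y = 2x + 1$; the numbers are made up):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]  # roughly y = 2x + 1

m = len(xs)
x_bar = sum(xs) / m

# w = sum y_i (x_i - x_bar) / (sum x_i^2 - (sum x_i)^2 / m)
w = sum(y * (x - x_bar) for x, y in zip(xs, ys)) / (
    sum(x * x for x in xs) - sum(xs) ** 2 / m
)
# b = y_bar - w * x_bar
b = sum(ys) / m - w * x_bar
print(w, b)  # w ≈ 1.95, b ≈ 1.15
```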
The Three Elements of Machine Learning
- Model: determine the hypothesis space according to the specific problem.
- Strategy: determine the criterion for selecting the best model (this usually yields a "loss function").
- Algorithm: solve the loss function to determine the optimal model.
Multivariate Linear Regression
When a sample is described by multiple attributes, we try to learn:
$$f\left(\boldsymbol{x}_{i}\right)=\boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_{i}+b$$
That is:
$$\begin{gathered} f\left(\boldsymbol{x}_{i}\right)=\left(\begin{array}{cccc} w_{1} & w_{2} & \cdots & w_{d} \end{array}\right)\left(\begin{array}{c} x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d} \end{array}\right)+b \\ f\left(\boldsymbol{x}_{i}\right)=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+b \end{gathered}$$
Let $b=w_{d+1} \cdot 1$; then:
$$f\left(\boldsymbol{x}_{i}\right)=w_{1} x_{i 1}+w_{2} x_{i 2}+\ldots+w_{d} x_{i d}+w_{d+1} \cdot 1$$
$$\begin{gathered} f\left(\boldsymbol{x}_{i}\right)=\left(\begin{array}{lllll} w_{1} & w_{2} & \cdots & w_{d} & w_{d+1} \end{array}\right)\left(\begin{array}{c} x_{i 1} \\ x_{i 2} \\ \vdots \\ x_{i d} \\ 1 \end{array}\right) \\ f\left(\hat{\boldsymbol{x}}_{i}\right)=\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i} \end{gathered}$$
By the least-squares method:
$$E_{\hat{\boldsymbol{w}}}=\sum_{i=1}^{m}\left(y_{i}-f\left(\hat{\boldsymbol{x}}_{i}\right)\right)^{2}=\sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}\right)^{2}$$
Vectorizing:
$$\begin{aligned} &E_{\hat{\boldsymbol{w}}}=\sum_{i=1}^{m}\left(y_{i}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{i}\right)^{2}=\left(y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1}\right)^{2}+\left(y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2}\right)^{2}+\ldots+\left(y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m}\right)^{2} \\ &E_{\hat{\boldsymbol{w}}}=\left(\begin{array}{cccc} y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} & y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} & \cdots & y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right)\left(\begin{array}{c} y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} \\ y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} \\ \vdots \\ y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right) \end{aligned}$$
where
$$\left(\begin{array}{c} y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} \\ y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} \\ \vdots \\ y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right)=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{array}\right)-\left(\begin{array}{c} \hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} \\ \hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} \\ \vdots \\ \hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right)=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{array}\right)-\left(\begin{array}{c} \hat{\boldsymbol{x}}_{1}^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \hat{\boldsymbol{x}}_{2}^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \vdots \\ \hat{\boldsymbol{x}}_{m}^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right)$$

$$\boldsymbol{y}=\left(\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{array}\right), \qquad \left(\begin{array}{c} \hat{\boldsymbol{x}}_{1}^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \hat{\boldsymbol{x}}_{2}^{\mathrm{T}} \hat{\boldsymbol{w}} \\ \vdots \\ \hat{\boldsymbol{x}}_{m}^{\mathrm{T}} \hat{\boldsymbol{w}} \end{array}\right)=\left(\begin{array}{c} \hat{\boldsymbol{x}}_{1}^{\mathrm{T}} \\ \hat{\boldsymbol{x}}_{2}^{\mathrm{T}} \\ \vdots \\ \hat{\boldsymbol{x}}_{m}^{\mathrm{T}} \end{array}\right) \cdot \hat{\boldsymbol{w}}=\left(\begin{array}{cc} \boldsymbol{x}_{1}^{\mathrm{T}} & 1 \\ \boldsymbol{x}_{2}^{\mathrm{T}} & 1 \\ \vdots & \vdots \\ \boldsymbol{x}_{m}^{\mathrm{T}} & 1 \end{array}\right) \cdot \hat{\boldsymbol{w}}=\mathbf{X} \cdot \hat{\boldsymbol{w}}$$
Therefore
$$E_{\hat{\boldsymbol{w}}}=\left(\begin{array}{cccc} y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} & y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} & \cdots & y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right)\left(\begin{array}{c} y_{1}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{1} \\ y_{2}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{2} \\ \vdots \\ y_{m}-\hat{\boldsymbol{w}}^{\mathrm{T}} \hat{\boldsymbol{x}}_{m} \end{array}\right)$$
Then:
$$E_{\hat{\boldsymbol{w}}}=(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})$$
Next, take the partial derivative with respect to $\hat{\boldsymbol{w}}$; rules for differentiating with respect to a vector can be looked up in a mathematical handbook.
$$\begin{aligned} \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})^{\mathrm{T}}(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\left(\boldsymbol{y}^{\mathrm{T}}-\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}}\right)(\boldsymbol{y}-\mathbf{X} \hat{\boldsymbol{w}})\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[\boldsymbol{y}^{\mathrm{T}} \boldsymbol{y}-\boldsymbol{y}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}+\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}\right] \\ &=\frac{\partial}{\partial \hat{\boldsymbol{w}}}\left[-\boldsymbol{y}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}-\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}+\hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}\right] \\ &=-\frac{\partial \boldsymbol{y}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}}-\frac{\partial \hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}}{\partial \hat{\boldsymbol{w}}}+\frac{\partial \hat{\boldsymbol{w}}^{\mathrm{T}} \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}}{\partial \hat{\boldsymbol{w}}} \end{aligned}$$
By the matrix differentiation identities $\frac{\partial \boldsymbol{x}^{\mathrm{T}} \boldsymbol{a}}{\partial \boldsymbol{x}}=\frac{\partial \boldsymbol{a}^{\mathrm{T}} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a}$ and $\frac{\partial \boldsymbol{x}^{\mathrm{T}} \mathbf{A} \boldsymbol{x}}{\partial \boldsymbol{x}}=\left(\mathbf{A}+\mathbf{A}^{\mathrm{T}}\right) \boldsymbol{x}$, we get:
$$\begin{aligned} \frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}} &=-\mathbf{X}^{\mathrm{T}} \boldsymbol{y}-\mathbf{X}^{\mathrm{T}} \boldsymbol{y}+\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}+\mathbf{X}^{\mathrm{T}} \mathbf{X}\right) \hat{\boldsymbol{w}} \\ &=2 \mathbf{X}^{\mathrm{T}}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y}) \end{aligned}$$
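Before setting the gradient to zero, the formula $2 \mathbf{X}^{\mathrm{T}}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})$ can be sanity-checked against finite differences (a pure-Python sketch; X, y, and w_hat are small made-up values):

```python
X = [[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]  # trailing column of ones absorbs b
y = [3.0, 5.0, 7.0]
w_hat = [1.5, 0.5]

def E(w):
    """E_w = sum of squared residuals (y - X w)."""
    return sum((yi - sum(xij * wj for xij, wj in zip(row, w))) ** 2
               for row, yi in zip(X, y))

def analytic_grad(w):
    """2 X^T (X w - y), the formula derived above."""
    resid = [sum(xij * wj for xij, wj in zip(row, w)) - yi
             for row, yi in zip(X, y)]
    return [2 * sum(X[i][j] * resid[i] for i in range(len(X)))
            for j in range(len(w))]

def numeric_grad(w, h=1e-6):
    """Central finite differences of E."""
    g = []
    for j in range(len(w)):
        wp, wm = list(w), list(w)
        wp[j] += h
        wm[j] -= h
        g.append((E(wp) - E(wm)) / (2 * h))
    return g

print(analytic_grad(w_hat), numeric_grad(w_hat))
```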
Setting it to 0:
$$\begin{aligned} &\frac{\partial E_{\hat{\boldsymbol{w}}}}{\partial \hat{\boldsymbol{w}}}=2 \mathbf{X}^{\mathrm{T}}(\mathbf{X} \hat{\boldsymbol{w}}-\boldsymbol{y})=0 \\ &2 \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}-2 \mathbf{X}^{\mathrm{T}} \boldsymbol{y}=0 \\ &2 \mathbf{X}^{\mathrm{T}} \mathbf{X} \hat{\boldsymbol{w}}=2 \mathbf{X}^{\mathrm{T}} \boldsymbol{y} \\ &\hat{\boldsymbol{w}}=\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \boldsymbol{y} \end{aligned}$$
So, assuming $\mathbf{X}^{\mathrm{T}} \mathbf{X}$ is invertible, the final model is:
$$f\left(\hat{\boldsymbol{x}}_{i}\right)=\hat{\boldsymbol{x}}_{i}^{\mathrm{T}}\left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \boldsymbol{y}$$
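A minimal sketch of this closed-form solution on made-up data with $d=2$ attributes (pure Python; assumes $\mathbf{X}^{\mathrm{T}} \mathbf{X}$ is invertible):

```python
# Solve the normal equations (X^T X) w_hat = X^T y on made-up data.
# Each row of X is (x_i1, x_i2, 1); the trailing 1 absorbs the bias b.
X = [[1.0, 2.0, 1.0],
     [2.0, 0.0, 1.0],
     [3.0, 1.0, 1.0],
     [4.0, 3.0, 1.0]]
y = [6.0, 3.0, 6.0, 11.0]  # generated exactly from y = x1 + 2*x2 + 1

def solve(A, c):
    """Solve A w = c by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [c[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    w = [0.0] * n
    for k in range(n - 1, -1, -1):
        w[k] = (M[k][-1] - sum(M[k][j] * w[j] for j in range(k + 1, n))) / M[k][k]
    return w

rows, cols = len(X), len(X[0])
XtX = [[sum(X[i][r] * X[i][s] for i in range(rows)) for s in range(cols)]
       for r in range(cols)]                                           # X^T X
Xty = [sum(X[i][r] * y[i] for i in range(rows)) for r in range(cols)]  # X^T y
w_hat = solve(XtX, Xty)  # w_hat = (X^T X)^(-1) X^T y
print(w_hat)  # ≈ [1.0, 2.0, 1.0]
```

Since y was generated exactly from the linear model, w_hat recovers the generating coefficients (1, 2) and bias 1.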