Proof of the Normal Equation
The linear regression model (the predicted value $\hat{y}_i$ for the $i$-th instance):
$$\hat{y}_i=\theta_0+\theta_1 x_{i,1}+\theta_2 x_{i,2} + \cdots + \theta_n x_{i,n}$$
In matrix form:
$$\hat{y}_i= \begin{bmatrix} 1 & x_{i,1} & x_{i,2} & \cdots & x_{i,n} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
which simplifies to:
$$\hat{y}_i=\mathbf{x_i}^{T}\mathbf{\theta}$$
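As a concrete check on the compact form above, here is a minimal NumPy sketch (all numbers are made up for illustration) showing that the dot product $\mathbf{x_i}^{T}\mathbf{\theta}$ reproduces the term-by-term sum:

```python
import numpy as np

# One instance with n = 3 features; x_i is augmented with a leading 1
# so that theta_0 acts as the intercept.
x_i = np.array([1.0, 2.0, -1.0, 0.5])    # [1, x_i1, x_i2, x_i3]
theta = np.array([4.0, 3.0, 0.5, -2.0])  # [theta_0, theta_1, theta_2, theta_3]

y_hat = x_i @ theta  # x_i^T theta

# The same value written out term by term, as in the first formula:
y_hat_manual = 4.0 + 3.0 * 2.0 + 0.5 * (-1.0) + (-2.0) * 0.5
```

Both expressions give the same prediction; the matrix form just packs the sum into one product.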
The mean squared error is:
$$MSE(\mathbf{\theta})=\frac{1}{m} \sum_{i=1}^m (\hat{y}_i-y_i)^{2} =\frac{1}{m}\sum_{i=1}^{m}(\mathbf{x_i}^{T} \mathbf{\theta} -y_i)^{2}$$
Let:
$$\mathbf c = \begin{bmatrix} \mathbf{x_1}^{T} \mathbf{\theta} -y_1 \\ \mathbf{x_2}^{T} \mathbf{\theta} -y_2 \\ \vdots \\ \mathbf{x_m}^{T} \mathbf{\theta} -y_m \end{bmatrix}= \begin{bmatrix} \mathbf{x_1}^{T} \mathbf{\theta} \\ \mathbf{x_2}^{T} \mathbf{\theta} \\ \vdots \\ \mathbf{x_m}^{T} \mathbf{\theta} \end{bmatrix}- \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}= \begin{bmatrix} \mathbf{x_1}^{T} \\ \mathbf{x_2}^{T} \\ \vdots \\ \mathbf{x_m}^{T} \end{bmatrix} \mathbf{\theta} -\mathbf{y} =\mathbf{X}\mathbf{\theta}-\mathbf{y}$$
Then:
$$MSE(\mathbf{\theta})=\frac{1}{m} \left \| \mathbf{c} \right \|^{2} =\frac{1}{m} \left \| \mathbf{X}\mathbf{\theta}-\mathbf{y} \right \|^{2}$$
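The two forms of the error are numerically identical, which a short NumPy sketch can confirm (the random data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 2
# Design matrix X with a leading column of ones for the intercept
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

# Sum-over-instances definition of MSE
mse_sum = np.mean((X @ theta - y) ** 2)

# Vector-norm definition: (1/m) * ||X theta - y||^2, with ||c||^2 = c^T c
c = X @ theta - y
mse_norm = (c @ c) / m
```

Rewriting the sum as a squared norm is what makes the matrix-calculus derivation below possible.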
To minimize $MSE(\mathbf{\theta})=MSE(\theta_0,\theta_1,\cdots,\theta_n)=E$, we look for the point where the gradient of this multivariate function is zero. The gradient vector consists of the partial derivatives of $E$ with respect to $\mathbf{\theta}$:
$$\frac{\partial{E}}{\partial{\mathbf{\theta}}}= \begin{bmatrix} \frac{\partial{E}}{\partial{\theta_0}} & \frac{\partial{E}}{\partial{\theta_1}} & \cdots & \frac{\partial{E}}{\partial{\theta_n}} \end{bmatrix}$$
By the rules of matrix differentiation, together with the formula proved in the next section, we can show the following.
Let $g(\mathbf \theta)=\mathbf X \mathbf \theta - \mathbf y=\mathbf u$; then
$$f(\mathbf u)=MSE(\mathbf \theta)=\frac{1}{m}\left\| g(\mathbf \theta) \right\|^2=\frac{1}{m} \left\| \mathbf u \right\|^2$$
$$\frac{\partial MSE(\mathbf \theta)}{\partial \mathbf \theta}=\frac{\partial f(\mathbf u)}{\partial \mathbf \theta}=\frac{\partial f(\mathbf u)}{\partial \mathbf u} \frac{\partial \mathbf u}{\partial \mathbf \theta}=\frac{\partial \frac{1}{m} \left\| \mathbf u \right\|^2}{\partial \mathbf u} \frac{\partial \left( \mathbf X \mathbf \theta - \mathbf y \right)}{\partial \mathbf \theta} =\frac{1}{m}\frac{\partial \mathbf u^T\mathbf u}{\partial \mathbf u}\mathbf X=\frac{2}{m}\mathbf u^T\mathbf X$$
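The closed-form gradient $\frac{2}{m}\mathbf u^T\mathbf X$ can be sanity-checked numerically against central finite differences; a minimal sketch on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
y = rng.normal(size=m)
theta = rng.normal(size=n + 1)

def mse(t):
    r = X @ t - y          # u = X theta - y
    return (r @ r) / m     # (1/m) ||u||^2

# Analytic gradient (a row vector): (2/m) * u^T X
grad = (2.0 / m) * (X @ theta - y) @ X

# Central finite differences, one coordinate of theta at a time
eps = 1e-6
grad_fd = np.array([
    (mse(theta + eps * e) - mse(theta - eps * e)) / (2 * eps)
    for e in np.eye(n + 1)
])
```

Agreement between the two confirms the chain-rule computation above.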
Setting the gradient to zero and solving gives the value of $\mathbf \theta$, denoted $\hat{\mathbf \theta}$:
$$\frac{2}{m}\left( \mathbf X\hat{\mathbf \theta}-\mathbf y \right)^T\mathbf X=\mathbf 0$$
$$\hat{\mathbf \theta}^T \mathbf X^T \mathbf X-\mathbf y^T \mathbf X=\mathbf 0$$
$$\hat{\mathbf \theta}^T=\mathbf y^T\mathbf X\left( \mathbf X^T\mathbf X \right)^{-1}$$
$$\hat{\mathbf \theta}=\left( \mathbf X^T\mathbf X \right)^{-1}\mathbf X^T\mathbf y$$
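The final formula translates directly into NumPy. A minimal sketch on synthetic data (the sizes and noise level are made up for illustration), cross-checked against NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 4
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = rng.normal(size=n + 1)
y = X @ true_theta + 0.01 * rng.normal(size=m)

# Normal equation: theta_hat = (X^T X)^{-1} X^T y, mirroring the formula.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check with NumPy's least-squares solver, which minimizes the
# same objective without forming the explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice `np.linalg.solve(X.T @ X, X.T @ y)` or `lstsq` is preferred over the explicit inverse for numerical stability, but the explicit form above matches the derivation.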
At its core this is an application of matrix calculus: finding the minimum of a particular quadratic objective. The computation involves a matrix inversion, and inverting an $n \times n$ matrix typically costs between $O(n^{2.4})$ and $O(n^{3})$, so when the number of features is large (e.g. 100,000) the normal equation becomes extremely slow.