Multiple linear model
$$y=\beta_0+\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n\tag{1}$$
For m samples, the model can be written as a system of linear equations:
$$y_1=\beta_0+\beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_n x_{1n}$$
$$y_2=\beta_0+\beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_n x_{2n}$$
$$\cdots$$
$$y_m=\beta_0+\beta_1 x_{m1} + \beta_2 x_{m2} + \cdots + \beta_n x_{mn}$$
In matrix form:
$$\begin{pmatrix}1&x_{11}&\cdots&x_{1n}\\1&x_{21}&\cdots&x_{2n}\\\vdots&\vdots&\ddots&\vdots\\1&x_{m1}&\cdots&x_{mn}\end{pmatrix} \begin{pmatrix}\beta_0\\\beta_1\\\vdots\\\beta_n\end{pmatrix}=\begin{pmatrix}y_1\\y_2\\\vdots\\y_m\end{pmatrix}$$
$$A\beta = Y\tag{2}$$
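As a minimal sketch of how the matrix $A$ in (2) can be assembled, the NumPy snippet below prepends a column of ones to a sample matrix so that $\beta_0$ acts as the intercept. The helper name `design_matrix` and the toy numbers are illustrative assumptions, not part of the original text.

```python
import numpy as np

def design_matrix(X):
    """Prepend a column of ones to X (shape m x n), giving A (shape m x (n+1))."""
    X = np.asarray(X, dtype=float)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Toy data (made up for illustration): m = 3 samples, n = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
beta = np.array([0.5, 1.0, -1.0])   # (beta_0, beta_1, beta_2)

A = design_matrix(X)
Y = A @ beta                        # equation (2): A beta = Y
print(A)
print(Y)
```

Each row of `A` corresponds to one sample, and the leading 1 in each row multiplies $\beta_0$.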
For the least-squares criterion, the error-minimization problem in matrix form is:
$$\min\lVert A\beta - Y\rVert_2^2$$
The subscript 2 denotes the Euclidean norm of a vector. Expanding the objective:
$$\begin{align*}\lVert A\beta - Y\rVert_2^2&=(A\beta - Y)^T(A\beta-Y)\\&=(\beta^TA^T-Y^T)(A\beta - Y)\\&=\beta^TA^TA\beta - \beta^TA^TY - Y^TA\beta+Y^TY\end{align*}$$
Both $\beta^TA^TY$ and $Y^TA\beta$ are scalars, and each is the transpose of the other, so they are equal and the two cross terms combine:
$$\lVert A\beta - Y\rVert_2^2=\beta^TA^TA\beta - 2\beta^TA^TY + Y^TY\tag{3}$$
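The expansion in (3) can be sanity-checked numerically. The sketch below uses random data (an arbitrary choice made here for illustration) and compares the squared norm with the expanded right-hand side. It also confirms that the two cross terms are the same scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))      # arbitrary 5 x 3 matrix
beta = rng.normal(size=3)
Y = rng.normal(size=5)

residual = A @ beta - Y
lhs = residual @ residual        # ||A beta - Y||_2^2
rhs = beta @ A.T @ A @ beta - 2 * beta @ A.T @ Y + Y @ Y   # right side of (3)

# The two cross terms from the expansion are the same scalar.
cross_1 = beta @ A.T @ Y
cross_2 = Y @ A @ beta

print(np.isclose(lhs, rhs), np.isclose(cross_1, cross_2))
```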
Next, differentiate (3) with respect to $\beta$, using the following rules for derivatives of vector products:
$$\frac{\text{d}(\textbf{u}^T\textbf{v})}{\text{d}\textbf{x}}=\frac{\text{d}(\textbf{u}^T)}{\text{d}\textbf{x}}\cdot \textbf{v} + \frac{\text{d}(\textbf{v}^T)}{\text{d}\textbf{x}}\cdot \textbf{u}\tag{1*}$$
$$\frac{\text{d}(\textbf{x}^T\textbf{x})}{\text{d}\textbf{x}}=\frac{\text{d}(\textbf{x}^T)}{\text{d}\textbf{x}}\cdot \textbf{x} + \frac{\text{d}(\textbf{x}^T)}{\text{d}\textbf{x}}\cdot \textbf{x}=2\textbf{x}\tag{2*}$$
$$\frac{\text{d}(\textbf{x}^T\textbf{A}\textbf{x})}{\text{d}\textbf{x}}=\frac{\text{d}(\textbf{x}^T)}{\text{d}\textbf{x}}\cdot \textbf{Ax} + \frac{\text{d}(\textbf{x}^T\textbf{A}^T)}{\text{d}\textbf{x}}\cdot \textbf{x}=(\textbf{A}+\textbf{A}^T)\textbf{x}\tag{3*}$$
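Rules (2*) and (3*) can be checked against a finite-difference gradient. The following is a rough sketch under stated assumptions: the helper `numerical_gradient`, the step size `h`, and the random test matrix `M` are all choices made here for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))      # a general (non-symmetric) square matrix
x = rng.normal(size=4)

quadratic = lambda v: v @ M @ v  # x^T M x
analytic = (M + M.T) @ x         # rule (3*)
numeric = numerical_gradient(quadratic, x)

print(np.allclose(analytic, numeric, atol=1e-6))
```

When the matrix is symmetric, as $A^TA$ is, the result of (3*) simplifies to twice the matrix times $\textbf{x}$.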
Therefore,
$$\begin{align*}\frac{\partial \lVert A\beta - Y\rVert_2^2}{\partial \beta}&=\frac{\partial (\beta^TA^TA\beta - 2\beta^TA^TY + Y^TY)}{\partial \beta}\\&=\frac{\partial (\beta^TA^TA\beta - 2\beta^TA^TY)}{\partial \beta}\\&=A^TA\beta+A^TA\beta - 2A^TY\\&=2(A^TA\beta - A^T Y)\end{align*}\tag{4}$$
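Setting (4) equal to zero gives the normal equations $A^TA\beta = A^TY$. Provided $A^TA$ is invertible, it follows that: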
$$\beta = (A^TA)^{-1}A^TY\tag{5}$$
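A minimal NumPy sketch of the closed-form solution (5) follows. The synthetic data, the coefficients in `true_beta`, and the noise level are made up for illustration. The system $A^TA\beta = A^TY$ is solved with `np.linalg.solve` rather than by forming the inverse explicitly, and the result is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = rng.normal(size=(m, n))
A = np.hstack([np.ones((m, 1)), X])            # design matrix as in (2)
true_beta = np.array([1.0, 2.0, -3.0, 0.5])    # made-up coefficients
Y = A @ true_beta + 0.1 * rng.normal(size=m)   # noisy observations

# Equation (5), computed by solving A^T A beta = A^T Y.
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)

# The gradient (4) should vanish at the solution.
gradient = 2 * (A.T @ A @ beta_hat - A.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(beta_hat)
print(np.allclose(gradient, 0, atol=1e-8), np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system is generally cheaper and more numerically stable than computing $(A^TA)^{-1}$ directly. When $A^TA$ is singular or ill-conditioned, `np.linalg.lstsq` is the more robust choice.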