The mean squared error function: $f(w) = \sum_{i=1}^m (y_i - x_i^T w)^2$
Take the partial derivative of $f(w)$ with respect to each of $w_1, w_2, \dots, w_d$:
$$-\frac{\partial f(w)}{\partial w_1} = 2(y_1 - x_1^T w)\,x_1^1 + 2(y_2 - x_2^T w)\,x_2^1 + \cdots + 2(y_m - x_m^T w)\,x_m^1$$
$$-\frac{\partial f(w)}{\partial w_2} = 2(y_1 - x_1^T w)\,x_1^2 + 2(y_2 - x_2^T w)\,x_2^2 + \cdots + 2(y_m - x_m^T w)\,x_m^2$$
$$\cdots$$
$$-\frac{\partial f(w)}{\partial w_d} = 2(y_1 - x_1^T w)\,x_1^d + 2(y_2 - x_2^T w)\,x_2^d + \cdots + 2(y_m - x_m^T w)\,x_m^d$$
Note the form of our sample matrix $X$, where $x_i^j$ denotes the $j$-th feature of the $i$-th sample:
$$X = \begin{bmatrix}
x_1^1 & x_1^2 & \cdots & x_1^d \\
x_2^1 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \ddots & \vdots \\
x_m^1 & x_m^2 & \cdots & x_m^d
\end{bmatrix}$$
and the vector of true values $Y$:
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$$
Collecting the partial derivatives into matrix form gives the gradient vector:
$$\begin{bmatrix} w_1' \\ w_2' \\ \vdots \\ w_d' \end{bmatrix}
= 2\begin{bmatrix}
x_1^1 & x_2^1 & \cdots & x_m^1 \\
x_1^2 & x_2^2 & \cdots & x_m^2 \\
\vdots & \vdots & \ddots & \vdots \\
x_1^d & x_2^d & \cdots & x_m^d
\end{bmatrix} \cdot
\left( \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
- \begin{bmatrix}
x_1^1 & x_1^2 & \cdots & x_1^d \\
x_2^1 & x_2^2 & \cdots & x_2^d \\
\vdots & \vdots & \ddots & \vdots \\
x_m^1 & x_m^2 & \cdots & x_m^d
\end{bmatrix} \cdot
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} \right)
= 2X^T(Y - Xw)$$
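The matrix form above can be checked numerically. This is a minimal NumPy sketch on synthetic data (the sizes `m`, `d` and the random data are assumptions for illustration); it compares $2X^T(Y - Xw)$ against the coordinate-wise sums from the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 3                 # m samples, d features (assumed sizes)
X = rng.normal(size=(m, d))   # sample matrix, one row per sample
Y = rng.normal(size=m)        # true values
w = rng.normal(size=d)        # current weight vector

# Matrix form of the (negative) gradient derived above
grad = 2 * X.T @ (Y - X @ w)

# Coordinate-wise sums, exactly as written in the per-w_j derivation
grad_loop = np.array([
    sum(2 * (Y[i] - X[i] @ w) * X[i, j] for i in range(m))
    for j in range(d)
])
assert np.allclose(grad, grad_loop)
```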
With the available data we can compute this gradient vector, so we could use gradient descent to optimize $w$ and minimize $f(w)$. Alternatively, as when finding a function's extremum, we can set all partial derivatives to zero, which gives:
$$2X^T(Y - Xw) = 2X^T Y - 2X^T X w = 0$$
$$2X^T X w = 2X^T Y$$
$$\hat{w} = (X^T X)^{-1} X^T Y$$
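As a sanity check, here is a sketch of the closed-form solution on synthetic data (the dimensions and noise level are assumptions); the result should agree with a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 50, 4
X = rng.normal(size=(m, d))
true_w = np.arange(1.0, d + 1)          # ground-truth weights for the test
Y = X @ true_w + 0.01 * rng.normal(size=m)

# Closed-form normal-equation solution: w_hat = (X^T X)^{-1} X^T Y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Compare against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(w_hat, w_lstsq)
```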
Note that this $w$ is not the true $w$; it is only an estimate computed from the observed data, so we write $\hat{w}$ for the best estimate of $w$.
It is worth noting that $X^T X$ is not always invertible, so a robust implementation should check invertibility before applying the closed-form solution. Singular value decomposition (via the pseudo-inverse) can also handle this case.
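A sketch of that robustness check, under the assumption that rank-deficiency is detected via the matrix rank and handled with the SVD-based pseudo-inverse (here column 2 of $X$ deliberately duplicates column 1, so $X^T X$ is singular):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 30
x0 = rng.normal(size=m)
X = np.column_stack([x0, x0])   # rank-deficient: identical columns
Y = 3.0 * x0

A = X.T @ X
if np.linalg.matrix_rank(A) < A.shape[0]:
    # A is singular: fall back to the SVD-based pseudo-inverse,
    # which returns the minimum-norm least-squares solution
    w_hat = np.linalg.pinv(X) @ Y
else:
    w_hat = np.linalg.inv(A) @ X.T @ Y

assert np.allclose(X @ w_hat, Y)   # the fit still reproduces Y
```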
This article showed in detail how to optimize the weight vector $w$ by computing the gradient of the mean squared error function: starting from the original formula, we derived the concrete steps of gradient descent and finally obtained the closed-form solution for the best weight vector $\hat{w}$.