Dataset Format
In machine learning, a dataset is typically organized as follows. The features and the label of the $i$-th sample are written as:

$$x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T \in \mathbb{R}^d, \quad y^i \in \mathbb{R}$$
The complete dataset can then be written as:

$$X=[x^1,x^2,\ldots,x^n]= \begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^n_1 \\ x^1_2 & x^2_2 & \ldots & x^n_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_d & x^2_d & \ldots & x^n_d \end{bmatrix} \in \mathbb{R}^{d \times n}$$

$$y=[y^1,y^2,\ldots,y^n] \in \mathbb{R}^n$$
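As a minimal NumPy sketch of this column-wise layout (the shapes and random data here are made-up toy values), each column of `X` holds one sample:

```python
import numpy as np

# Toy dataset with d = 3 features and n = 4 samples, stored column-wise:
# column X[:, i] is the i-th sample x^i, so X has shape (d, n).
d, n = 3, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))  # feature matrix in R^{d x n}
y = rng.standard_normal(n)       # label vector in R^n

print(X.shape)  # (3, 4)
print(y.shape)  # (4,)
```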
Linear Regression Expression
The weight vector $w$ of linear regression is conventionally written as:

$$w=[w_1,w_2,\ldots,w_d]^T \in \mathbb{R}^d$$
For a single sample $x$, the linear regression model can be written as:

$$f(x) = w^Tx+b = w_1x_1+ w_2x_2+\ldots+w_dx_d+b$$
For the feature matrix $X$, the vector of predictions $\hat{Y}$ can be expressed via matrix-vector multiplication as:

$$\hat{Y} = w^TX+b$$
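The matrix-vector form can be sketched in NumPy (toy shapes and random data): the one-line `w @ X + b` produces the same predictions as applying $f$ to each sample:

```python
import numpy as np

# Predictions for all n samples at once: y_hat = w^T X + b.
d, n = 3, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((d, n))  # columns are samples
w = rng.standard_normal(d)
b = 0.5

y_hat = w @ X + b  # shape (n,): one prediction per column

# Per-sample check: f(x^i) = w^T x^i + b for each column i.
y_loop = np.array([w @ X[:, i] + b for i in range(n)])
print(np.allclose(y_hat, y_loop))  # True
```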
If mean squared error, MSE (Mean Squared Error), is used, the loss function (also called the cost function) can be written as:

$$L =\frac{1}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)^2 =\frac{1}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right)^2$$
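The loss above can be computed directly in NumPy; this is a small sketch, and the function name `mse_loss` is an illustrative choice:

```python
import numpy as np

# MSE loss L = (1/n) * sum_i (f(x^i) - y^i)^2, vectorized over samples.
def mse_loss(w, b, X, y):
    residual = w @ X + b - y   # the terms (f(x^i) - y^i) for all i
    return np.mean(residual ** 2)

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 10))
y = rng.standard_normal(10)

# With w = 0 and b = 0 the loss reduces to mean(y^2).
print(np.isclose(mse_loss(np.zeros(3), 0.0, X, y), np.mean(y ** 2)))  # True
```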
General Form
Introducing the gradient operator $\nabla$, we have:

$$\nabla L(w)=\begin{bmatrix} \frac{\partial L}{\partial w_1} \\ \frac{\partial L}{\partial w_2} \\ \vdots \\ \frac{\partial L}{\partial w_d} \end{bmatrix}, \quad \nabla L(b) = \left[\frac{\partial L}{\partial b}\right] = \frac{\partial L}{\partial b}$$
The generic gradient descent update can therefore be written as:

$$w \leftarrow w - \eta \nabla L(w), \quad b \leftarrow b - \eta \nabla L(b) = b - \eta \frac{\partial L}{\partial b}$$
where $\eta$ is the learning rate.
Gradient Descent for the Weight $w$
For the weight $w_j$, taking the derivative of $L$ gives:

$$\frac{\partial L}{\partial w_j}=\frac{2}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right)x_j^i = \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_j^i$$
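One way to sanity-check this derivative (a sketch with made-up data) is to compare the analytic expression against a central finite difference of the loss:

```python
import numpy as np

# Analytic gradient dL/dw_j = (2/n) * sum_i (f(x^i) - y^i) * x_j^i,
# cross-checked against a numerical central difference.
rng = np.random.default_rng(3)
d, n = 3, 8
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w, b = rng.standard_normal(d), 0.1

def loss(w):
    return np.mean((w @ X + b - y) ** 2)

j = 1
grad_j = (2 / n) * np.sum((w @ X + b - y) * X[j, :])

# Central difference: perturb only coordinate j of w.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[j] += eps
w_minus[j] -= eps
numeric = (loss(w_plus) - loss(w_minus)) / (2 * eps)
print(np.isclose(grad_j, numeric, atol=1e-5))  # True
```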
The gradient descent update for $w_j$ can then be written as:

$$w_j=w_j-\eta\frac{\partial L}{\partial w_j} = w_j -\eta\frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_j^i$$
Similarly, the updates for $w_1, w_2, \ldots, w_d$ can be written as:

$$w_1=w_1-\eta\frac{\partial L}{\partial w_1} = w_1 -\eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_1^i$$

$$w_2=w_2-\eta\frac{\partial L}{\partial w_2} = w_2 -\eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_2^i$$

$$\vdots$$

$$w_d=w_d-\eta\frac{\partial L}{\partial w_d} = w_d -\eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)x_d^i$$
Observe that $(f(x^i)-y^i)$ is a scalar, while

$$w=[w_1,w_2,\ldots,w_d]^T \in \mathbb{R}^d, \quad x^i=(x_1^i,x_2^i,x_3^i,\ldots,x_d^i)^T$$
The gradient descent formula for the weight vector $w$ can therefore be written as:

$$\begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} = \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta \begin{bmatrix} \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_1^i\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_2^i\\ \vdots\\ \frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i)x_d^i \end{bmatrix} = \begin{bmatrix} w_1\\ w_2\\ \vdots \\ w_d \end{bmatrix} - \eta\frac{2}{n} \sum_{i=1}^{n}(f(x^i)-y^i) \begin{bmatrix} x_1^i\\ x_2^i\\ \vdots\\ x_d^i \end{bmatrix}$$
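The factoring step — pulling the scalar residual out of the stacked vector — can be verified numerically. In this toy sketch, `X @ r` computes the residual-weighted sum of sample columns in one shot:

```python
import numpy as np

# The stacked per-coordinate gradients equal the vectorized expression
# (2/n) * sum_i (f(x^i) - y^i) * x^i, computed as X @ r.
rng = np.random.default_rng(4)
d, n = 4, 6
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
w, b = rng.standard_normal(d), -0.3

r = w @ X + b - y                       # residuals, shape (n,)
grad_vec = (2 / n) * (X @ r)            # vectorized gradient, shape (d,)
grad_per = np.array([(2 / n) * np.sum(r * X[j, :]) for j in range(d)])
print(np.allclose(grad_vec, grad_per))  # True
```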
This can be written compactly as:

$$w=w-\eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right) x^i$$
that is,

$$w=w-\eta \frac{2}{n} \sum_{i=1}^{n}\left(w^Tx^i+b-y^i\right) x^i$$
Gradient Descent for the Bias $b$
Similarly, the gradient with respect to the bias $b$ is:

$$\frac{\partial L}{\partial b}=\frac{2}{n} \sum_{i=1}^{n}\left(w_1x_1^i+ w_2x_2^i+\ldots+w_dx_d^i+b-y^i\right) = \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)$$
The gradient update is then:

$$b = b -\eta \frac{\partial L}{\partial b} = b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right)$$
that is:

$$b=b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(f(x^i)-y^i\right) = b - \eta \frac{2}{n} \sum_{i=1}^{n}\left(w^Tx^i+b-y^i\right)$$
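Putting the two update rules together gives a complete batch gradient descent loop. The sketch below fits noiseless synthetic data; `w_true`, `b_true`, and `eta` are made-up illustrative values:

```python
import numpy as np

# Batch gradient descent for linear regression using the derived updates.
rng = np.random.default_rng(5)
d, n = 3, 200
X = rng.standard_normal((d, n))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.7
y = w_true @ X + b_true                # noiseless synthetic labels

w, b, eta = np.zeros(d), 0.0, 0.1
for _ in range(500):
    r = w @ X + b - y                  # residuals f(x^i) - y^i
    w = w - eta * (2 / n) * (X @ r)    # w <- w - eta * dL/dw
    b = b - eta * (2 / n) * np.sum(r)  # b <- b - eta * dL/db

print(w, b)  # should approach w_true and b_true
```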
References
Zhang, Aston; Lipton, Zachary C.; Li, Mu; Smola, Alexander J. Dive into Deep Learning.