Concept:
A neural network is an artificial-intelligence approach inspired by the human brain: it tries to learn intelligent behavior by modeling networks of simple connected units.
Model:
We call the first part the input layer, the second part the hidden layer, and the third part the output layer.
Each line that runs from one layer to the next represents one step of the computation.
We can therefore write:
$$
a_1^{(2)} = g(\theta_{10}^{(1)}x_0+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3) \\
a_2^{(2)} = g(\theta_{20}^{(1)}x_0+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3) \\
a_3^{(2)} = g(\theta_{30}^{(1)}x_0+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3) \\
h_\theta(x) = a_1^{(3)} = g(\theta_{10}^{(2)}a_0^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)})
$$
That is, each line corresponds to one weight (parameter), and if we look at the hidden layer and the output layer on their own, each unit is just a linear combination of the previous layer's outputs passed through g, i.e. a logistic regression.
We introduce the following notation:
$$
x = [x_0,x_1,x_2,x_3] \\
\theta = [\theta_0,\theta_1,\theta_2,\theta_3] \\
z = \theta^Tx \\
a = [a_1,a_2,a_3] \\
a_1 = g(z)
$$
We call this process forward propagation: the first layer's values are passed to the second layer, and the second layer's values are passed to the third.
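As a minimal sketch of forward propagation for the 3-input, 3-hidden-unit, 1-output network above (NumPy assumed; theta1 and theta2 are hypothetical weight matrices, not values from the original post):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights: theta1 maps the input layer (3 features + bias) to the
# 3 hidden units; theta2 maps the hidden layer (3 units + bias) to the output.
theta1 = np.random.randn(3, 4)   # shape: (hidden units, input units + 1)
theta2 = np.random.randn(1, 4)   # shape: (output units, hidden units + 1)

x = np.array([0.5, -1.2, 0.3])            # one example with 3 features
a1 = np.r_[1.0, x]                        # prepend the bias term x0 = 1
a2 = np.r_[1.0, sigmoid(theta1 @ a1)]     # hidden activations, plus bias a0 = 1
h = sigmoid(theta2 @ a2)                  # h_theta(x), the network's output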
Simplified model:
Here g is simply the sigmoid function, g(z) = 1 / (1 + e^{-z}), which was covered in an earlier post, so we will not repeat it here.
Suppose we have a two-dimensional model (two input features):
$$
h_\theta(x) = g(\theta_0+\theta_1x_1+\theta_2x_2)
$$
We enumerate the four possible binary inputs $(x_1, x_2)$: (0, 0), (0, 1), (1, 0), (1, 1).
Substituting each into the model gives
$$
g(\theta_0),\quad g(\theta_0+\theta_2),\quad g(\theta_0+\theta_1),\quad g(\theta_0+\theta_1+\theta_2)
$$
Given concrete parameter values, the model therefore realizes a specific logical expression.
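For example (a standard illustration with assumed parameter values, not taken from the original text), choosing $\theta_0 = -30$, $\theta_1 = 20$, $\theta_2 = 20$ gives
$$
g(-30) \approx 0,\quad g(-30+20) \approx 0,\quad g(-30+20) \approx 0,\quad g(-30+20+20) \approx 1
$$
so $h_\theta(x) \approx x_1 \text{ AND } x_2$: the unit behaves like a logical AND gate.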
Multi-class classification:
For example, suppose we want to recognize four different kinds of objects (images). The input might have three channels (RGB), and the output has four cases, one per object class.
In binary classification we define the output y as {0, 1} or {-1, +1}; can we simply encode the four classes here as {1, 2, 3, 4}?
Clearly not. As discussed above, each output unit produces a value between 0 and 1, and a single number such as 3 could be produced in many different ways. Instead we use a one-hot encoding, i.e.
the four cases [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]. The corresponding basic neural network follows:
We define the network's output as $h_\theta(x) \in R^k$, where k = 4 (the total number of classes to distinguish).
So the labels are
$$
y = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]]
$$
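As a minimal sketch (hypothetical integer labels, NumPy assumed), converting class labels 1–4 into these one-hot vectors could look like:

import numpy as np

labels = np.array([1, 3, 4, 2])      # hypothetical labels in {1, 2, 3, 4}
num_classes = 4
# Each row of Y is the one-hot vector for the corresponding label.
Y = np.eye(num_classes)[labels - 1]
print(Y)
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]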
From here we can construct the cost function.
From the above, we have
$$
h_\theta(x) = g(\theta_{10}^{(3)} a_1^{(3)} + \theta_{20}^{(3)} a_2^{(3)} + \theta_{30}^{(3)} a_3^{(3)} + \theta_{40}^{(3)} a_4^{(3)} + \theta_{50}^{(3)} a_5^{(3)})
$$
Following the cost function from logistic regression, we have
$$
Cost(h_\theta(x),y) = y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \\
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})
$$
In the multi-class case each output unit is effectively its own classifier, so the final cost sums the contributions of all K output units over all m examples:
$$
J(\theta) = -\frac{1}{m} \sum^{K}_{k=1} \sum^m_{i=1} \left[ y_k^{(i)}\log\big(h_\theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-h_\theta(x^{(i)})\big)_k \right]
$$
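As a minimal sketch (NumPy assumed; H is a hypothetical m×K matrix of network outputs $h_\theta(x^{(i)})$ and Y the matching one-hot label matrix), this cost can be computed as:

import numpy as np

def multiclass_cost(H, Y):
    # H: (m, K) predicted probabilities h_theta(x^(i))_k
    # Y: (m, K) one-hot labels y_k^(i)
    m = Y.shape[0]
    eps = 1e-12  # guard against log(0)
    return -np.sum(Y * np.log(H + eps) + (1 - Y) * np.log(1 - H + eps)) / m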
In addition, we introduce a regularization term to prevent overfitting:
$$
\lambda \sum_j \theta_j^2
$$
So here we have:
$$
regularization = \frac{\lambda}{2} \sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta_{ji}^{(l)})^2 \\
J(\theta) = -\frac{1}{m} \sum^{K}_{k=1} \sum^m_{i=1} \left[ y_k^{(i)}\log\big(h_\theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-h_\theta(x^{(i)})\big)_k \right] + \frac{\lambda}{2} \sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta_{ji}^{(l)})^2
$$
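Continuing the sketch above (thetas is a hypothetical list of the network's weight matrices with their bias columns already removed; multiclass_cost is the function defined earlier), the regularized cost might look like:

def regularized_cost(H, Y, thetas, lam):
    # thetas: [theta^(1), ..., theta^(L-1)], each without its bias column
    reg = (lam / 2) * sum(np.sum(t ** 2) for t in thetas)
    return multiclass_cost(H, Y) + reg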
Here $\theta_{ji}^{(l)}$ denotes the weight in layer l that connects unit i of layer l to unit j of layer l+1.
Backpropagation algorithm:
Given the cost function $J(\theta)$, we control $J(\theta)$ by adjusting $h_\theta(x)$, and $h_\theta(x)$ in turn depends on the weight matrices $\theta$ and the bias vectors $b$. Following the gradient descent rule, these two quantities are updated as follows:
$$
\theta^{(l)} = \theta^{(l)} - \alpha \frac{\partial E}{\partial \theta^{(l)}} \\
b^{(l)} = b^{(l)} - \alpha \frac{\partial E}{\partial b^{(l)}}
$$
Again using the network in the figure above as an example, we know the label y and the output $a^{(4)}$, so we can define:
$$
\delta_i^{(l)} = \frac{\partial E}{\partial z_i^{(l)}} \\
E = \frac{1}{2} \|y-a\|^2 = \frac{1}{2}\sum_k(y_k-a_k)^2 \\
\frac{\partial E}{\partial \theta^{(4)}_{11}} = -(y_1 - a_1^{(4)}) \frac{\partial a_1^{(4)}}{\partial \theta^{(4)}_{11}} = -(y_1 - a_1^{(4)})\,g'(z_1^{(4)})\,a_1^{(3)} = \frac{\partial E}{\partial z_1^{(4)}} \frac{\partial z_1^{(4)}}{\partial \theta_{11}^{(4)}} = \delta_1^{(4)}a_1^{(3)}
$$
Next we derive the formula for layer l in terms of layer l+1:
$$
\delta^{(l)}_i = \frac{\partial E}{\partial z_i^{(l)}} = \sum_{j=1}^{n_{l+1}} \frac{\partial E}{\partial z_j^{(l+1)}} \frac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} = \sum_{j=1}^{n_{l+1}} \delta_j^{(l+1)}\frac{\partial z_j^{(l+1)}}{\partial z_i^{(l)}} = \sum_{j=1}^{n_{l+1}} \delta_j^{(l+1)} \theta_{ji}^{(l+1)} g'(z_i^{(l)})
$$
For the bias parameter b, we have:
$$
\frac{\partial E}{\partial b_i^{(l)}} = \frac{\partial E}{\partial z_i^{(l)}} \frac{\partial z_i^{(l)}}{\partial b_i^{(l)}} = \delta_i^{(l)}
$$
In summary:
$$
\delta^{(L)} = -(y-a^{(L)}) .* g'(z^{(L)}) \\
\delta^{(l)} = (\theta^{(l+1)})^T\delta^{(l+1)} .* g'(z^{(l)})
$$
Strictly speaking, $\delta$ is not really an "error"; it measures how much that unit's input affects the overall cost.
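As a minimal sketch of these rules for a single training example (NumPy assumed; a three-layer network with hypothetical weight matrices theta1, theta2 and bias vectors b1, b2; using g'(z) = g(z)(1 - g(z)) for the sigmoid; not the original post's code):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop_step(x, y, theta1, theta2, b1, b2, alpha=0.1):
    # Forward pass
    z2 = theta1 @ x + b1
    a2 = sigmoid(z2)
    z3 = theta2 @ a2 + b2
    a3 = sigmoid(z3)                        # network output a^(L)
    # Backward pass: delta^(L), then delta^(l) one layer back
    delta3 = -(y - a3) * a3 * (1 - a3)
    delta2 = (theta2.T @ delta3) * a2 * (1 - a2)
    # Gradient descent updates for weights and biases
    theta2 -= alpha * np.outer(delta3, a2)  # dE/dtheta_ji = delta_j * a_i
    theta1 -= alpha * np.outer(delta2, x)
    b2 -= alpha * delta3                    # dE/db_i = delta_i
    b1 -= alpha * delta2
    return theta1, theta2, b1, b2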
Implementing multi-class classification with a neural network
Load the data:
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt

input_layer_size = 400   # 400 = 20x20 pixel images
hidden_layer_size = 45
num_labels = 10
# part 1: loading and visualizing data
data = sio.loadmat('ex3data1.mat')
X = data["X"]
y = data["y"]
m = X.shape[0]
Randomly sample some examples and visualize them:
def displayData(x):
    # Infer a roughly square patch size for each example
    example_width = int(np.round(np.sqrt(np.size(x, 1))))
    m, n = x.shape
    example_height = int(n / example_width)
    # Grid layout for the selected examples
    display_rows = int(np.floor(np.sqrt(m)))
    display_cols = int(np.ceil(m / display_rows))
    pad = 1
    display_array = -np.ones((pad + display_rows * (example_height + pad),
                              pad + display_cols * (example_width + pad)))
    curr_ex = 0
    for j in range(display_rows):
        for i in range(display_cols):
            if curr_ex >= m:
                break
            # Normalize each patch by its largest absolute value
            max_val = np.max(np.abs(x[curr_ex, :]))
            display_array[pad + j * (example_height + pad):pad + j * (example_height + pad) + example_height,
                          pad + i * (example_width + pad):pad + i * (example_width + pad) + example_width] \
                = x[curr_ex, :].reshape((example_height, example_width)) / max_val
            curr_ex += 1
        if curr_ex >= m:
            break
    plt.figure()
    plt.imshow(display_array.T, cmap='gray', extent=[-1, 1, -1, 1])
    plt.axis('off')
    plt.show()


rand_indices = np.random.permutation(m)
sel = X[rand_indices[0:100], :]
displayData(sel)
_ = input('Press [enter] to continue')
Load the pre-trained parameters:
print('Loading saved neural network parameters')
para = sio.loadmat('ex3weights.mat')
theta1 = para['Theta1']
theta2 = para['Theta2']
Predict:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def predict(theta1, theta2, X):
    m = X.shape[0]
    num_labels = theta2.shape[0]
    p = np.zeros(m)
    X = np.c_[np.ones(m), X]                # add the bias column
    a2 = sigmoid(X.dot(theta1.T))           # hidden-layer activations
    a2 = np.c_[np.ones(a2.shape[0]), a2]    # add the bias unit to the hidden layer
    a3 = sigmoid(a2.dot(theta2.T))          # output-layer activations
    p = np.argmax(a3, axis=1)               # most probable class (0-based index)
    return p


p = predict(theta1, theta2, X)
# argmax is 0-based while the labels in y run from 1 to 10, hence p + 1
print('Training Set Accuracy: ', np.mean(np.double(p + 1 == y.flatten())) * 100)
Result:
Training Set Accuracy: 97.52