1. Neural Networks

Input $[x_1, x_2, \dots, x_n]$, output $[y_1, y_2, \dots, y_k]$.
When there are $k > 2$ output classes, use the one-hot vectors
$$\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},\begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix},\dots,\begin{bmatrix}0\\\vdots\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix}$$
as the outputs.
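For example, integer class labels can be converted to these vectors with a few lines of NumPy; `one_hot` below is a hypothetical helper, a minimal sketch rather than library code.

```python
import numpy as np

def one_hot(labels, k):
    """Encode integer labels 0..k-1 as one-hot column vectors."""
    Y = np.zeros((k, len(labels)))
    Y[labels, np.arange(len(labels))] = 1.0
    return Y

# Labels [0, 2, 1] with k = 3 classes become the columns e_1, e_3, e_2.
print(one_hot(np.array([0, 2, 1]), 3))
```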
2. Cost Function
$$J(\Theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K}y_k^{(i)}\log(h_{\Theta}(x^{(i)}))_k+(1-y_k^{(i)})\log(1-(h_{\Theta}(x^{(i)}))_k)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$$
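As a sanity check of the formula, here is a minimal NumPy sketch of $J(\Theta)$; the argument names (`Y`, `H`, `Thetas`, `lam`) are assumptions of this sketch, not part of the original notes.

```python
import numpy as np

def cost(Y, H, Thetas, lam):
    """Regularized cross-entropy cost J(Theta).

    Y, H   -- (K, m) arrays: one-hot labels and network outputs (h_Theta(x))_k
    Thetas -- list of weight matrices Theta^(l), bias column first
    """
    m = Y.shape[1]
    # Cross-entropy term, summed over all K classes and m examples.
    data = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Regularization term: the sum over i starts at 1, so the bias
    # column Theta[:, 0] is excluded, as in the formula above.
    reg = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data + reg
```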
3. Forward Propagation
$$a^{(1)}=x$$
$$a^{(l)} = h_{\Theta^{(l-1)}}\left(\widehat{a^{(l-1)}}\right)=g\left(\Theta^{(l-1)}\begin{bmatrix}1\\a^{(l-1)}\end{bmatrix}\right),\quad 1<l\le L$$
$$\Theta^{(l)} \in \Bbb R^{s_{l+1}\times(s_l+1)}$$
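A minimal NumPy sketch of this recurrence with a sigmoid activation; the helper names `sigmoid` and `forward` are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """Return the list of activations [a^(1), ..., a^(L)].

    Thetas[l-1] plays the role of Theta^(l): an
    (s_{l+1} x (s_l + 1)) matrix mapping layer l to layer l+1.
    """
    activations = [x]
    for Theta in Thetas:
        a_hat = np.concatenate(([1.0], activations[-1]))  # prepend bias unit
        activations.append(sigmoid(Theta @ a_hat))
    return activations
```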
4. Backpropagation
$$\delta^{(L)} = (a^{(L)}-y).*a^{(L)}.*(1-a^{(L)})$$
$$\delta^{(l)} = (\widehat{\Theta^{(l)}})^T\delta^{(l+1)}.*a^{(l)}.*(1-a^{(l)}),\quad 1<l<L$$
where $\widehat{\Theta^{(l)}}$ denotes $\Theta^{(l)}$ with the bias column $\theta_0$ removed.
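Under the same conventions as the `forward` sketch above, the deltas can be computed as follows (again a sketch, not the notes' original code):

```python
def backward_deltas(activations, Thetas, y):
    """Return [delta^(2), ..., delta^(L)] from the formulas above."""
    a_L = activations[-1]
    deltas = [(a_L - y) * a_L * (1 - a_L)]               # delta^(L)
    # Hidden layers l = L-1, ..., 2: pair Theta^(l) with a^(l).
    for Theta, a in zip(Thetas[:0:-1], activations[-2:0:-1]):
        Theta_hat = Theta[:, 1:]                         # drop the bias column
        deltas.append((Theta_hat.T @ deltas[-1]) * a * (1 - a))
    deltas.reverse()
    return deltas
```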
5. Backpropagation Derivation
From forward propagation we have:
$$\begin{bmatrix}\theta_{1,0}^{(l)} & \theta_{1,1}^{(l)} & \cdots & \theta_{1,s_l}^{(l)}\\ \theta_{2,0}^{(l)} & \theta_{2,1}^{(l)} & \cdots & \theta_{2,s_l}^{(l)}\\ \vdots & \vdots & \ddots & \vdots\\ \theta_{s_{l+1},0}^{(l)} & \theta_{s_{l+1},1}^{(l)} & \cdots & \theta_{s_{l+1},s_l}^{(l)}\end{bmatrix}\begin{bmatrix}1\\ a_1^{(l)}\\ a_2^{(l)}\\ \vdots\\ a_{s_l}^{(l)}\end{bmatrix}=\begin{bmatrix}z_1^{(l+1)}\\ z_2^{(l+1)}\\ \vdots\\ z_{s_{l+1}}^{(l+1)}\end{bmatrix}\xrightarrow{\;g\;}\begin{bmatrix}a_1^{(l+1)}\\ a_2^{(l+1)}\\ \vdots\\ a_{s_{l+1}}^{(l+1)}\end{bmatrix}$$
$$a_i^{(l+1)} = g(z_i^{(l+1)}) = g\left(\theta_{i,0}^{(l)}+\theta_{i,1}^{(l)}a_1^{(l)}+\theta_{i,2}^{(l)}a_2^{(l)}+\dots+\theta_{i,s_l}^{(l)}a_{s_l}^{(l)}\right)$$
Define the error of a unit as the partial derivative of the cost with respect to that unit's weighted input $z$; then:
$$dJ(x,\theta)=\delta_i^{(l+1)}\,d(z_i^{(l+1)}) = \delta_i^{(l+1)}\,d\left(\sum_{k=0}^{s_l}\theta_{i,k}^{(l)}a_k^{(l)}\right) \tag{1}$$
Therefore (note $\frac{d(a_k^{(l)})}{d(z_m^{(l)})}$ vanishes unless $k=m$, and for the sigmoid $g'(z_m)=a_m(1-a_m)$):
$$\begin{aligned}\frac{dJ(x,\theta)}{d(z_m^{(l)})} &= \frac{dJ(x,\theta)}{d(z^{(l+1)})}\frac{d(z^{(l+1)})}{d(z_m^{(l)})}=\sum_{i=1}^{s_{l+1}}\delta_i^{(l+1)}\,d\left(\sum_{k=0}^{s_l}\theta_{i,k}^{(l)}a_k^{(l)}\right)\Big/d(z_m^{(l)})\\ &=\sum_{i=1}^{s_{l+1}}\delta_i^{(l+1)}\sum_{k=0}^{s_l}\theta_{i,k}^{(l)}\frac{d(a_k^{(l)})}{d(z_m^{(l)})}=\sum_{i=1}^{s_{l+1}}\delta_i^{(l+1)}\theta_{i,m}^{(l)}a_m^{(l)}(1-a_m^{(l)})\end{aligned}$$
This gives:
$$\delta_m^{(l)}=a_m^{(l)}(1-a_m^{(l)})\sum_{i=1}^{s_{l+1}}\delta_i^{(l+1)}\theta_{i,m}^{(l)}$$
Expanding to vector form:
$$\begin{bmatrix}\delta_1^{(l)}\\ \delta_2^{(l)}\\ \vdots\\ \delta_{s_l}^{(l)}\end{bmatrix}=\begin{bmatrix}a_1^{(l)}\\ a_2^{(l)}\\ \vdots\\ a_{s_l}^{(l)}\end{bmatrix}.*\left(1-\begin{bmatrix}a_1^{(l)}\\ a_2^{(l)}\\ \vdots\\ a_{s_l}^{(l)}\end{bmatrix}\right).*\left(\begin{bmatrix}\theta_{1,1}^{(l)} & \theta_{2,1}^{(l)} & \cdots & \theta_{s_{l+1},1}^{(l)}\\ \theta_{1,2}^{(l)} & \theta_{2,2}^{(l)} & \cdots & \theta_{s_{l+1},2}^{(l)}\\ \vdots & \vdots & \ddots & \vdots\\ \theta_{1,s_l}^{(l)} & \theta_{2,s_l}^{(l)} & \cdots & \theta_{s_{l+1},s_l}^{(l)}\end{bmatrix}\begin{bmatrix}\delta_1^{(l+1)}\\ \delta_2^{(l+1)}\\ \vdots\\ \delta_{s_{l+1}}^{(l+1)}\end{bmatrix}\right)$$
Finally:
$$\delta^{(l)}=a^{(l)}.*(1-a^{(l)}).*\left((\widehat{\Theta^{(l)}})^T\delta^{(l+1)}\right) \tag{2}$$
From equation (1) we also get:
$$\frac{dJ(x,\theta)}{d\theta_{i,j}^{(l)}}=\delta_i^{(l+1)}a_j^{(l)} \tag{3}$$
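In vectorized form, equation (3) says the gradient with respect to $\Theta^{(l)}$ is the outer product of $\delta^{(l+1)}$ and $\widehat{a^{(l)}}$; a tiny numeric sketch (the sizes and values are made up):

```python
import numpy as np

# Hypothetical sizes: layer l has 3 units, layer l+1 has 2.
a_l = np.array([0.2, 0.7, 0.5])        # a^(l)
delta_next = np.array([0.1, -0.3])     # delta^(l+1)

a_hat = np.concatenate(([1.0], a_l))   # prepend the bias unit a_0 = 1
grad = np.outer(delta_next, a_hat)     # grad[i, j] = delta_i^(l+1) * a_j^(l)
print(grad.shape)                      # (2, 4) = (s_{l+1}, s_l + 1)
```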
Assuming $J(x,\theta)=\frac{1}{2}(h(x)-y)^2$, then:
$$\delta^{(L)}=\frac{dJ(x,\theta)}{dz^{(L)}}=\frac{dJ(x,\theta)}{da^{(L)}}\frac{da^{(L)}}{dz^{(L)}}$$
which gives:
$$\delta^{(L)}=(a^{(L)}-y)a^{(L)}(1-a^{(L)}) \tag{4}$$
6. Backpropagation Algorithm
- Initialize the accumulator matrices $\Delta_{ij}^{(l)}$ to zero
- For $i = 1, \dots, m$:
  (1). $a^{(1)}=x^{(i)}$, where the superscript on $a$ indexes the layer and the superscript on $x$ indexes the training example
  (2). Compute $a^{(2)},a^{(3)},\dots,a^{(L)}$ by forward propagation
  (3). Set $\delta^{(L)}=(a^{(L)}-y^{(i)})a^{(L)}(1-a^{(L)})$
  (4). Compute the remaining $\delta^{(l)}$ by backpropagation
  (5). $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$
- Compute the partial derivatives $D_{ij}^{(l)}=\frac{1}{m}\Delta_{ij}^{(l)}+\lambda\Theta_{ij}^{(l)}$ (the $\lambda\Theta_{ij}^{(l)}$ term is conventionally omitted for the bias weights $j=0$)
- Run gradient descent with these partial derivatives; the sketch after this list puts the steps together
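A sketch of the accumulation loop, reusing the hypothetical `forward` and `backward_deltas` helpers from the earlier sections:

```python
import numpy as np

def gradients(X, Y, Thetas, lam):
    """Accumulate Delta over all m examples and return the derivatives D^(l).

    X -- (m, n), one example per row;  Y -- (K, m), one-hot columns.
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for i in range(m):
        acts = forward(X[i], Thetas)                       # steps (1)-(2)
        deltas = backward_deltas(acts, Thetas, Y[:, i])    # steps (3)-(4)
        for l, (a, d) in enumerate(zip(acts[:-1], deltas)):
            a_hat = np.concatenate(([1.0], a))
            Deltas[l] += np.outer(d, a_hat)                # step (5)
    Ds = [Delta / m for Delta in Deltas]
    for D, Theta in zip(Ds, Thetas):
        D[:, 1:] += lam * Theta[:, 1:]   # no regularization on the bias column
    return Ds
```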
7. Training a Neural Network
A. Randomly initialize the parameters: if $\Theta$ were all zeros, every unit in a layer would compute the same value, so the initial values must be random.
B. Compute the activations $a^{(l)}$ of every layer by forward propagation
C. Compute the current cost function $J(\Theta)$
D. Compute all partial derivatives by backpropagation
E. Verify the partial derivatives with a numerical check, $D\approx\frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2\epsilon}$ (see the sketch after this list)
F. Minimize the cost function with an optimization algorithm (e.g., gradient descent)
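Step E can be implemented with central differences; a minimal sketch, where `J` is assumed to be a function of a single weight matrix (in practice the check is run once per $\Theta^{(l)}$ on a small network, then disabled because it is slow):

```python
import numpy as np

def numerical_gradient(J, Theta, eps=1e-4):
    """Estimate dJ/dTheta element-wise via (J(t+eps) - J(t-eps)) / (2*eps)."""
    approx = np.zeros_like(Theta)
    for idx in np.ndindex(Theta.shape):
        T_plus, T_minus = Theta.copy(), Theta.copy()
        T_plus[idx] += eps
        T_minus[idx] -= eps
        approx[idx] = (J(T_plus) - J(T_minus)) / (2 * eps)
    return approx
```

Each entry of the result should agree with the corresponding backpropagation derivative $D_{ij}^{(l)}$ to several decimal places.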
This article examined the basic concepts of neural networks, including the representation of inputs and outputs, the definition and computation of the cost function, and the detailed steps of the forward and backpropagation algorithms. The key parts of the training process, such as computing the partial derivatives and running gradient descent, were worked out through explicit formulas.