Darknet Forward Pass and Backpropagation
This article has no figures; for many of these things, a figure would actually make them harder to explain.
For the forward and backward passes, I only work through the computation by analogy with gradient descent, together with the source code. I will not analyze the meaning or effect of doing it this way; that is beyond my ability, and I do not know what each layer of the network is really doing. It remains something of a black art.
Consider the directional derivative of a differentiable function $f$ along a unit direction with direction cosines $\cos\beta_k$:
$$\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos\beta_k$$
where
$$\sum_{k=1}^n\cos^2\beta_k=1$$
Then:
$$\Big(\sum_{k=1}^n\frac{\partial f(x)}{\partial x_k}\cos\beta_k\Big)^2\leq\Big(\sum_{k=1}^{n}\Big(\frac{\partial f(x)}{\partial x_k}\Big)^2\Big)\Big(\sum_{k=1}^{n}\cos^2\beta_k\Big)$$
Gradient descent is simple. From the Schwarz inequality and the total differential, we know that for a differentiable function the gradient direction is the direction of fastest increase. "Fastest" here has an instantaneous flavor: it only holds at the current point. The opposite direction is then the direction of fastest decrease, while moving orthogonally to the gradient amounts to moving along a level surface, so the function value does not change.
Searching along the negative gradient direction guarantees that the function value decreases until the gradient vanishes; when it does, the algorithm stops at a local minimum. Some people say there is no optimal result for the training set, but that reading is wrong. What is true is that sometimes there is no closed-form solution; an optimal solution still exists, because the $loss$ function is bounded below.
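As a minimal illustration of this update rule (a toy example, not Darknet code; the function $f(x,y)=x^2+2y^2$ and the learning rate are chosen purely for demonstration):

```c
#include <stdio.h>

/* Toy gradient descent on f(x, y) = x^2 + 2*y^2.
 * The gradient is (2x, 4y); we step against it until it (nearly) vanishes. */
int main(void)
{
    float x = 3.0f, y = -2.0f;
    float lr = 0.1f;                  /* learning rate */
    for (int i = 0; i < 100; ++i) {
        float gx = 2.0f * x;          /* df/dx */
        float gy = 4.0f * y;          /* df/dy */
        x -= lr * gx;                 /* move against the gradient */
        y -= lr * gy;
    }
    printf("x = %f, y = %f\n", x, y); /* both approach 0, the minimizer */
    return 0;
}
```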
A fully connected layer can in fact be viewed as a special convolutional layer. Conversely, you could call the convolutional layer a "not fully connected" layer; I think that is a fitting name.
Fully connected layer
Forward pass of the fully connected layer
(If you do not know what a fully connected layer is, you can look it up on Baidu; there are plenty of introductions.)
Let $c_t$ be the number of neurons in layer $t$. The fully connected operation, written as a matrix product, is:
$$\left[\begin{matrix} w_t(0,0)&w_t(0,1)&\dots&w_t(0,c_{t-1})\\
w_t(1,0)&w_t(1,1)&\dots&w_t(1,c_{t-1})\\
\vdots&\vdots&\ddots&\vdots\\
w_t(c_t,0)&w_t(c_t,1)&\dots&w_t(c_t,c_{t-1})
\end{matrix}\right]\left[\begin{matrix}y_{t-1}(0)\\
y_{t-1}(1)\\
\vdots\\
y_{t-1}(c_{t-1})
\end{matrix}\right]=v_t$$
Here $y_{t-1}$ is the output of the previous layer, $w_t$ holds the synaptic weights, and $v_t$ is the not-yet-activated signal (the local induced field).
Write:
$$\varphi_t(v_t)=\left[\begin{matrix}\varphi_t(v_{t}(0))\\
\varphi_t(v_{t}(1))\\
\vdots\\
\varphi_t(v_{t}(c_t))
\end{matrix}\right]$$
where $\varphi_t$ is the activation function of layer $t$.
In compact matrix form:
$$v_t=w_ty_{t-1}\\
y_t=\varphi_t(v_t)$$
These operations correspond directly to the code:
```c
void forward_connected_layer(connected_layer l, network_state state)
{
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);
    int m = l.batch;
    int k = l.inputs;
    int n = l.outputs;
    float *a = state.input;
    float *b = l.weights;
    float *c = l.output;
    gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n);
    activate_array(l.output, l.outputs*l.batch, l.activation);
}
```
The batch normalize part of the code has been removed above; you can skip it for now.
Here the gemm function performs the matrix multiplication. batch is a training parameter (look it up if you are not familiar with it); at prediction time it is forced to $1$.
The first two arguments of gemm indicate whether each matrix is used transposed. $m$ is the number of samples, and $n, k$ are the dimensions. The result is stored in c, that is, in l.output.
The correspondence is: c holds $v_t$, state.input (the matrix a) holds $y_{t-1}$, and activate_array() applies $\varphi_t(v_t)$.
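To make this mapping concrete, here is a minimal sketch of what the gemm(0, 1, ...) call computes (illustrative code only, not Darknet's gemm; the names mirror the variables above):

```c
/* Naive equivalent of gemm(0, 1, m, n, k, 1, a, k, b, k, 1, c, n):
 * c[i][j] += sum_p a[i][p] * b[j][p]
 * For each sample i and output neuron j this is v_t(j) = sum_p w_t(j,p) * y_{t-1}(p). */
void naive_connected_forward(int m, int n, int k,
                             const float *a,  /* inputs  y_{t-1}: m x k */
                             const float *b,  /* weights w_t:     n x k */
                             float *c)        /* outputs v_t:     m x n */
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0;
            for (int p = 0; p < k; ++p)
                sum += a[i*k + p] * b[j*k + p]; /* b read row-wise: the "transposed" flag */
            c[i*n + j] += sum;                  /* l.output was zeroed by fill_cpu */
        }
    }
}
```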
Backpropagation for the fully connected layer
Here we take $batch=1$ and define a cost function for each layer:
$$J_t=\frac{1}{2}\sum_{k=0}^{c_t}(d_t(k)-y_t(k))^2=\frac{1}{2}[e_t,e_t]\\
e_t(k)=d_t(k)-y_t(k)$$
where $d_t(k)$ is the desired output of layer $t$. For the final output layer, this desired output is simply our labeled data.
When $t$ is the output layer, we can obtain the error $e_t$ between the desired output and the actual output directly, so it is straightforward to compute the outermost weight gradient $\nabla J_t$:
$$\frac{\partial J_t}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial w_t(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_t(a,b)}\\
=-e_t(a)\varphi'_t(v_t(a))\frac{\partial \sum_{i}w_{t}(a,i)y_{t-1}(i)}{\partial w_t(a,b)}\\
=-e_t(a)\varphi'_t(v_t(a))y_{t-1}(b)$$
But that only covers the output layer's neurons. For the inner layers, although we assumed a desired output for every layer, the desired output of an inner layer is unknown; this is a limitation of how little we currently understand about the network. In this case the algorithm uses the outermost error, propagated inward, as the inner-layer error.
Continuing the computation above, write $g_t(a)=e_t(a)\varphi'_t(v_t(a))$. Then:
$$\frac{\partial J_t}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)$$
Now consider computing:
$$\frac{\partial J_{t}}{\partial w_{t-1}(a,b)}=\sum_{k=0}^{c_t}e_t(k)\frac{\partial e_t(k)}{\partial v_t(k)}\frac{\partial v_t(k)}{\partial w_{t-1}(a,b)}\\
=\sum_{k=0}^{c_t}-g_t(k)\frac{\partial \sum_iw_t(k,i)y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\
=\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}\frac{\partial w_t(k,i)y_{t-1}(i)}{\partial y_{t-1}(i)}\frac{\partial y_{t-1}(i)}{\partial w_{t-1}(a,b)}\\
=\sum_{k=0}^{c_t}-g_t(k)\sum_{i=0}^{c_{t-1}}w_{t}(k,i)\varphi'_{t-1}(v_{t-1}(i))\frac{\partial \sum_jw_{t-1}(i,j)y_{t-2}(j)}{\partial w_{t-1}(a,b)}\\
=\sum_{k=0}^{c_t}-g_{t}(k)w_t(k,a)\varphi'_{t-1}(v_{t-1}(a))y_{t-2}(b)\\
=y_{t-2}(b)\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}-g_t(k)w_t(k,a)$$
Hence:
$$g_{t-1}(a)=\varphi'_{t-1}(v_{t-1}(a))\sum_{k=0}^{c_t}g_t(k)w_t(k,a)\\
\frac{\partial J_t}{\partial w_{t-1}(a,b)}=-g_{t-1}(a)y_{t-2}(b)$$
This relation is not limited to layer $t-1$ when $t$ is the output layer; it keeps propagating backward.
Consider computing the gradient of fully connected layer $l$, taking the loss of output layer $t$ as the objective:
$$\frac{\partial J_t}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\frac{\partial y_l(a)}{\partial w_{l}(a,b)}=\frac{\partial J_t}{\partial y_{l}(a)}\varphi_l'(v_l(a))y_{l-1}(b)$$
Because the errors are mutually independent, and looking back at the forward pass, we have (this step is hard to grasp; see the explanation at the end of the article):
$$\frac{\partial J_t}{\partial y_l(a)} = \sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial y_l(a)}=\sum_{k=0}^{c_{l+1}}\frac{\partial J_t}{\partial y_{l+1}(k)}\frac{\partial y_{l+1}(k)}{\partial v_{l+1}(k)}w_{l+1}(k,a)$$
where:
$$\frac{\partial J_t}{\partial w_{l}(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}\frac{\partial v_l(k)}{\partial w_l(k,a)}=\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}y_{l-1}(a)$$
This amounts to redefining $g$ as $g_l(k)=-\frac{\partial J_t}{\partial y_l(k)}\frac{\partial y_{l}(k)}{\partial v_l(k)}$ (which agrees with the earlier definition at the output layer, since $\frac{\partial J_t}{\partial y_t(k)}=-e_t(k)$), and we get:
$$\frac{\partial J_t}{\partial w_l(k,a)}=-g_l(k)y_{l-1}(a)$$
To summarize: in the derivation above $t$ was taken as the output layer; in the summary below, $n$ denotes the output layer:
$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\
g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Note that $g_t$ here is not the error itself; it implicitly carries the propagated error.
Matrix form of backpropagation
$$\frac{\partial J_n}{\partial w_t(a,b)}=-g_t(a)y_{t-1}(b)\\
g_{t}(a)=\varphi'_{t}(v_{t}(a))\sum_{k=0}^{c_{t+1}}g_{t+1}(k)w_{t+1}(k,a)$$
Based on this pair of relations, let:
$$g_t = \left[\begin{matrix}g_{t}(0)\\g_{t}(1)\\ \vdots\\g_{t}(c_t)\end{matrix}\right]$$
and take all vectors to be column vectors.
Then:
$$\frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T\\
g_t=\mathrm{diag}(\varphi'_t(v_t))\,w_{t+1}^Tg_{t+1}=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)$$
The operator $\bigodot$ denotes elementwise multiplication of matrices, giving a new matrix with $c_{i,j}=a_{i,j}b_{i,j}$.
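To illustrate these matrix formulas, here is a minimal sketch of the fully connected backward step for $batch=1$ (illustrative code only, not Darknet's backward_connected_layer; all names are made up):

```c
/* Sketch of the fully connected backward step for batch = 1.
 * g_next:  g_{t+1}, length c_next        (already computed by the next layer)
 * w_next:  w_{t+1}, c_next x c, row-major
 * dphi:    phi'_t(v_t), length c
 * y_prev:  y_{t-1}, length c_prev
 * g:       output, g_t, length c
 * dw:      output, dJ/dw_t = -g_t * y_{t-1}^T, c x c_prev                    */
void fc_backward_sketch(int c, int c_next, int c_prev,
                        const float *g_next, const float *w_next,
                        const float *dphi, const float *y_prev,
                        float *g, float *dw)
{
    for (int a = 0; a < c; ++a) {
        float s = 0;
        for (int k = 0; k < c_next; ++k)
            s += g_next[k] * w_next[k*c + a];      /* (w_{t+1}^T g_{t+1})(a)  */
        g[a] = dphi[a] * s;                        /* elementwise product     */
        for (int b = 0; b < c_prev; ++b)
            dw[a*c_prev + b] = -g[a] * y_prev[b];  /* dJ_n / dw_t(a,b)        */
    }
}
```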
Convolutional layer (the not-fully-connected layer)
```c
void forward_convolutional_layer(convolutional_layer l, network_state state)
{
    int out_h = convolutional_out_height(l);
    int out_w = convolutional_out_width(l);
    int i;
    fill_cpu(l.outputs*l.batch, 0, l.output, 1);
    int m = l.n;
    int k = l.size*l.size*l.c;
    int n = out_h*out_w;
    float *a = l.weights;
    float *b = state.workspace;
    float *c = l.output;
    static int u = 0;
    u++;
    for(i = 0; i < l.batch; ++i){
        im2col_cpu_custom(state.input, l.c, l.h, l.w, l.size, l.stride, l.pad, b);
        gemm(0, 0, m, n, k, 1, a, k, b, n, 1, c, n);
        c += n*m;
        state.input += l.c*l.h*l.w;
    }
    add_bias(l.output, l.biases, l.batch, l.n, out_h*out_w);
    activate_array_cpu_custom(l.output, m*n*l.batch, l.activation);
}
```
The basic operation: im2col_cpu_custom
This operation rearranges the previous layer's input into a matrix of shape $[(size)^2\,l.c]\times[out\_h\times out\_w]$, which is the current layer's input $y$.
The current layer's weights $w$ form a $filters \times (size)^2\,l.c$ matrix.
Here $filters$ is the number of channels of the next layer; you can also think of it as the number of neurons, or the number of convolution kernels.
$size$ is the side length of the convolution kernel.
Seen this way it is easy to understand: the convolution operation becomes a matrix multiplication, and $w\times y$ yields an output of shape $filters\times out\_h\times out\_w$.
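A minimal sketch of the im2col idea (this is not the im2col_cpu_custom source; it handles only stride 1 and no padding, to show how each column collects one receptive field):

```c
/* Rearrange an input image (channels x h x w) into a matrix with
 * (size*size*channels) rows and (out_h*out_w) columns, stride 1, no padding.
 * Column (y, x) holds the receptive field of output position (y, x). */
void im2col_sketch(const float *im, int channels, int h, int w,
                   int size, float *col)
{
    int out_h = h - size + 1;
    int out_w = w - size + 1;
    for (int c = 0; c < channels; ++c)
        for (int ky = 0; ky < size; ++ky)
            for (int kx = 0; kx < size; ++kx) {
                int row = (c*size + ky)*size + kx;          /* row in the col matrix */
                for (int y = 0; y < out_h; ++y)
                    for (int x = 0; x < out_w; ++x)
                        col[row*out_h*out_w + y*out_w + x] =
                            im[(c*h + y + ky)*w + x + kx];  /* pixel inside the window */
            }
}
```

After this rearrangement, the gemm(0, 0, m, n, k, ...) call above multiplies the $filters\times k$ weight matrix by this $k\times(out\_h\cdot out\_w)$ matrix, one batch item at a time.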
Consider computing the outermost weight gradient when the convolutional layer is the output layer:
$$\frac{\partial J_n}{\partial w_n(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_n(a,b)}\\
=\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_n(k,m)}\frac{\partial v_n(k,m)}{\partial w_n(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_{i}w_n(k,i)y_{n-1}(i,m)}{\partial w_n(a,b)}\\
=\sum_{m}-e_n(a,m)\varphi'_n(v_n(a,m))y_{n-1}(b,m)$$
For layer $n-1$, still taking the outermost loss as the objective:
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\frac{1}{2}\sum_{k}\sum_{m}\frac{\partial e^2_n(k,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}e_n(k,m)\frac{\partial e_n(k,m)}{\partial v_{n}(k,m)}\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial v_n(k,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\frac{\partial \sum_iw_n(k,i)y_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_i\frac{\partial w_n(k,i)y_{n-1}(i,m)}{\partial v_{n-1}(i,m)}\frac{\partial v_{n-1}(i,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))\sum_iw_n(k,i)\varphi'_{n-1}(v_{n-1}(i,m))\frac{\partial \sum_{j}w_{n-1}(i,j)y_{n-2}(j,m)}{\partial w_{n-1}(a,b)}\\
=\sum_{k}\sum_{m}-e_n(k,m)\varphi'_n(v_n(k,m))w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Based on the computation above, let:
$$g_n(a,m)=e_n(a,m)\varphi'_n(v_n(a,m))$$
Then:
$$\frac{\partial J_n}{\partial w_n(a,b)}=\sum_{m}-g_n(a,m)y_{n-1}(b,m)$$
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{k}\sum_{m}-g_n(k,m)w_n(k,a)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)\\
=\sum_{m}\Big(\sum_{k}-g_n(k,m)w_n(k,a)\Big)\varphi'_{n-1}(v_{n-1}(a,m))y_{n-2}(b,m)$$
Here we let:
$$g_{n-1}(a,m)=\varphi'_{n-1}(v_{n-1}(a,m))\sum_{k}g_n(k,m)w_{n}(k,a)$$
so that:
$$\frac{\partial J_n}{\partial w_{n-1}(a,b)}=\sum_{m}-g_{n-1}(a,m)y_{n-2}(b,m)$$
In fact, this recursive property of $g$ carries over in general. To redefine $g$, first compute:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial w_{t}(a,b)}\\
=\sum_{k}\sum_{m}\frac{\partial J_n}{\partial y_{t}(k,m)}\frac{\partial y_t(k,m)}{\partial v_{t}(k,m)}\frac{\partial v_t(k,m)}{\partial w_t(a,b)}\\
=\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)$$
Here let $g_t(a,m)=-\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}$ (the minus sign keeps this consistent with the definition of $g_n$ above, because $\frac{\partial J_n}{\partial y_n(a,m)}=-e_n(a,m)$). Clearly this holds for $t=n$ and $t=n-1$.
By induction:
$$\frac{\partial J_n}{\partial w_{t}(a,b)}=\sum_{m}\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}y_{t-1}(b,m)=\sum_{m}-g_t(a,m)y_{t-1}(b,m)$$
As for the following step (again hard to grasp; see the explanation at the end):
$$\frac{\partial J_n}{\partial y_{t}(a,m)}\frac{\partial y_t(a,m)}{\partial v_{t}(a,m)}=\frac{\partial J_n}{\partial y_{t}(a,m)}\varphi_{t}'(v_{t}(a,m))\\
=\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}\frac{\partial v_{t+1}(i,m)}{\partial y_t(a,m)}\\
=\varphi_{t}'(v_{t}(a,m)) \sum_{i}\frac{\partial J_n}{\partial y_{t+1}(i,m)}\frac{\partial y_{t+1}(i,m)}{\partial v_{t+1}(i,m)}w_{t+1}(i,a)$$
Hence (multiplying both sides by $-1$):
$$g_t(a,m)=\varphi'_t(v_t(a,m))\sum_{k}g_{t+1}(k,m)w_{t+1}(k,a)$$
Convolutional backpropagation can also be written in matrix form:
$$g_t=\varphi'_t(v_t)\bigodot \big(w_{t+1}^Tg_{t+1}\big)\\
\frac{\partial J_n}{\partial w_t}=-g_ty_{t-1}^T$$
Convolutional and fully connected layers are updated in the same way.
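For completeness, a minimal sketch of that shared update step (illustrative only; Darknet's actual update routines also handle momentum, weight decay, and batch scaling):

```c
/* Plain SGD step shared by both layer types:
 * w <- w - lr * dJ/dw, where dJ/dw = -g_t * y_{t-1}^T was derived above. */
void sgd_update_sketch(int n, float lr, const float *grad, float *weights)
{
    for (int i = 0; i < n; ++i)
        weights[i] -= lr * grad[i];  /* move each weight against its gradient */
}
```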
Here is an explanation for the parts that are hard to grasp.
For the two hard-to-grasp steps above, you can think of them in terms of a differentiable function $h(x_1,x_2,\dots,x_n)$.
Let $f(t)=h(t,t,\dots,t)$. Then:
$$\frac{df(t)}{dt}=\sum_{k=1}^n\frac{\partial h}{\partial x_k}\Big|_{x_k=t}$$
The output $y_t(a)$ of the $a$-th neuron in layer $t$ influences the loss of output layer $n$ along many dimensions; it just happens that the input variable of every dimension is this one quantity. Clearly this influence is also differentiable.
$$\frac{\partial J_n}{\partial y_t(a)}=\sum_{k=0}^{c_{t+1}}\frac{\partial J_n}{\partial y_{t+1}(k)}\frac{\partial y_{t+1}(k)}{\partial y_t(a)}$$
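A tiny concrete check of this multivariate chain rule (my own example, not from the original argument): take $h(x_1,x_2)=x_1x_2$, so $f(t)=h(t,t)=t^2$ and
$$\frac{df}{dt}=2t=\underbrace{x_2\big|_{(t,t)}}_{\partial h/\partial x_1}+\underbrace{x_1\big|_{(t,t)}}_{\partial h/\partial x_2}=t+t$$
which sums the contribution of every path, exactly as $\frac{\partial J_n}{\partial y_t(a)}$ sums contributions over all neurons of layer $t+1$.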