Background
The residual network was proposed in:
He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
A theoretical analysis was then given in a second paper:
He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks[C]//European conference on computer vision. Springer, Cham, 2016: 630-645.
Residuals
In mathematical statistics, a residual is the difference between an observed value and its estimated (fitted) value:
$$\hat\varepsilon = y_1 - \hat y_1$$
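As a tiny numerical illustration (my own example, not from the papers above): fit a straight line by least squares and compute the residuals as observed minus fitted values.

```python
import numpy as np

# Hypothetical data: noisy observations of roughly y = 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
y_hat = slope * x + intercept                # fitted values
residuals = y - y_hat                        # eps_hat_i = y_i - y_hat_i
print(residuals)
```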
The degradation problem in deep networks
- As the network depth increases, a degradation problem appears at convergence: a deeper plain network can end up with higher training error than a shallower one.
Consider a plain stacked network:
Input: $x_1$
Output: $\hat y$
Activation function: $f(\cdot)$
Computation: $x_i = f(x_{i-1} w_{i-1})$, with $\hat y = f(x_4 w_4)$
Loss function: $E = \frac{1}{2}(\hat y - y)^2$
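A minimal numerical sketch of this plain network (the concrete values and the choice of a sigmoid for $f(\cdot)$ are assumptions made only for illustration):

```python
import numpy as np

def f(z):
    """Activation function f(.); a sigmoid is assumed here."""
    return 1.0 / (1.0 + np.exp(-z))

x = [0.5]                    # x[0] = x_1, a hypothetical scalar input
w = [0.8, 0.6, 0.7, 0.9]     # hypothetical weights w_1 .. w_4
y = 1.0                      # target

# Forward pass: x_i = f(x_{i-1} * w_{i-1}), then y_hat = f(x_4 * w_4)
for i in range(3):           # produces x_2, x_3, x_4
    x.append(f(x[-1] * w[i]))
y_hat = f(x[-1] * w[3])

E = 0.5 * (y_hat - y) ** 2   # loss E = 1/2 (y_hat - y)^2
print(y_hat, E)
```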
Training minimizes the loss function to obtain the optimal weights
$$\hat w^* = \mathop{\arg\min}\limits_{\hat w} \frac{1}{2}(\hat y - y)^2$$
which requires differentiating with respect to the weights, for example $w_1$:
$$\frac{\partial \hat y}{\partial w_1} = \frac{\partial \hat y}{\partial x_4}\frac{\partial x_4}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial w_1} = w_4 f'(x_4 w_4)\, w_3 f'(x_3 w_3)\, w_2 f'(x_2 w_2)\, x_1 f'(x_1 w_1)$$
This is the chain rule: every layer contributes one multiplicative factor. When those factors are smaller than 1 in magnitude, the gradient reaching the early weights shrinks rapidly with depth (and can likewise explode when they are larger than 1), which is one way to see why very deep plain networks are hard to optimize.
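The chain-rule product can be evaluated numerically. In the sketch below (same assumed sigmoid activation; random weights in $[0.5, 1]$), each factor $w_i f'(\cdot)$ is smaller than 1, so the gradient on $w_1$ shrinks rapidly as the chain gets deeper:

```python
import numpy as np

def f(z):
    """Sigmoid activation (assumed, as above)."""
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    """Derivative of the sigmoid."""
    s = f(z)
    return s * (1.0 - s)

def grad_w1(x1, w):
    """d(y_hat)/d(w_1) for the plain chain x_i = f(x_{i-1} w_{i-1}), y_hat = f(x_n w_n)."""
    xs = [x1]
    for wi in w[:-1]:                        # forward pass: x_2 .. x_n
        xs.append(f(xs[-1] * wi))
    grad = xs[0] * f_prime(xs[0] * w[0])     # d(x_2)/d(w_1) = x_1 f'(x_1 w_1)
    for xi, wi in zip(xs[1:], w[1:]):        # remaining factors w_i f'(x_i w_i)
        grad *= wi * f_prime(xi * wi)
    return grad

rng = np.random.default_rng(0)
for depth in (4, 10, 30):
    print(depth, grad_w1(0.5, rng.uniform(0.5, 1.0, size=depth)))
```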
Residual units
A deep residual network is built by stacking many residual units (Residual Units); each unit can be written in the following form.
$$y_l = h(x_l) + \mathcal{F}(x_l, \mathcal{W}_l)$$
$$x_{l+1} = f(y_l)$$
When $h(x_l) = x_l$ and $f$ is taken to be an identity mapping, this simplifies to
$$x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$$
$$x_{l+2} = x_{l+1} + \mathcal{F}(x_{l+1}, \mathcal{W}_{l+1}) = x_l + \mathcal{F}(x_l, \mathcal{W}_l) + \mathcal{F}(x_{l+1}, \mathcal{W}_{l+1})$$
$$x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$$
where
- $x_l$: input feature of the $l$-th unit
- $x_{l+1}$: output feature of the $l$-th unit
- $\mathcal{F}(\cdot)$: the residual function
- $h(x_l) = x_l$: an identity mapping
- $f(\cdot)$: an activation function
- $L$: the index of any deeper unit (here, the number of stacked Residual Units)
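A minimal sketch of this recursion. Only the skip structure $x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$ is taken from the paper; the particular residual branch $\mathcal{F}$ below (a small two-layer scalar transform) is an arbitrary choice for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def F(x, W):
    """Residual branch F(x, W): an assumed two-layer scalar transform."""
    w1, w2 = W
    return w2 * relu(w1 * x)

def residual_stack(x_l, Ws):
    """Apply x_{i+1} = x_i + F(x_i, W_i) for each unit; return x_L and every branch output."""
    x, branch_outputs = x_l, []
    for W in Ws:
        out = F(x, W)
        branch_outputs.append(out)
        x = x + out                               # identity shortcut + residual branch
    return x, branch_outputs

rng = np.random.default_rng(0)
Ws = [rng.normal(size=2) for _ in range(5)]       # hypothetical weights of 5 residual units
x_L, outs = residual_stack(0.7, Ws)

# x_L equals x_l plus the sum of all residual-branch outputs, as in the formula above.
print(np.isclose(x_L, 0.7 + sum(outs)))           # True
```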
Why residuals work
Denote the loss by $\varepsilon$. Back-propagating from $x_L$ to $x_l$ gives
$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\right) = \frac{\partial \varepsilon}{\partial x_L} + \frac{\partial \varepsilon}{\partial x_L}\,\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)$$
and, for an early weight $w_1$,
$$\frac{\partial \varepsilon}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l}\frac{\partial x_l}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\right)\frac{\partial x_l}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_l}{\partial w_1} + \frac{\partial \varepsilon}{\partial x_L}\,\frac{\partial}{\partial x_l}\!\left(\sum_{i=l}^{L-1} \mathcal{F}(x_i, \mathcal{W}_i)\right)\frac{\partial x_l}{\partial w_1}$$
The first term passes $\frac{\partial \varepsilon}{\partial x_L}$ straight back to the shallower unit without going through any weight layers, so the gradient does not vanish even when the residual-branch term is small.
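The effect of the extra "1" can also be checked numerically. The sketch below is a scalar caricature (my own simplification, not the actual ResNet parameterization): a plain chain contributes a factor $w_i f'(x_i w_i)$ per layer, while a chain with identity shortcuts contributes $1 + w_i f'(x_i w_i)$, so the residual gradient never collapses toward zero:

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid (assumed)

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def dxL_dxl(x_l, w, residual):
    """d(x_L)/d(x_l) for a scalar chain, with or without identity shortcuts.

    plain:    x_{i+1} = f(x_i w_i)        -> per-layer factor w_i f'(x_i w_i)
    residual: x_{i+1} = x_i + f(x_i w_i)  -> per-layer factor 1 + w_i f'(x_i w_i)
    """
    x, grad = x_l, 1.0
    for wi in w:
        factor = wi * f_prime(x * wi)
        grad *= (1.0 + factor) if residual else factor
        x = (x + f(x * wi)) if residual else f(x * wi)
    return grad

w = np.random.default_rng(0).uniform(0.5, 1.0, size=30)
print(dxL_dxl(0.5, w, residual=False))   # ~0: the plain chain's gradient vanishes
print(dxL_dxl(0.5, w, residual=True))    # >= 1: the shortcut keeps a direct gradient path
```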
Residual-related network structures
- A conv shortcut (a convolution on the skip path instead of the identity) trains reasonably well at shallower depths (both shortcut variants are sketched below).
- Dropout on the shortcut acts like a scaling factor $\lambda$.
These structural variants correspond to different experimental results, and the specific choices probably have no solid theoretical justification. A more theoretical discussion would have to start from the data distribution and from network optimization.
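For concreteness, a hedged PyTorch sketch of one residual unit where the shortcut $h(\cdot)$ can be either the identity or a 1×1 convolution; the module layout is my own simplification, not the papers' exact architecture:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """x_{l+1} = f(h(x_l) + F(x_l, W_l)) with a configurable shortcut h."""

    def __init__(self, channels, shortcut="identity"):
        super().__init__()
        self.branch = nn.Sequential(                        # residual function F(x, W)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # h(x): identity mapping, or the "conv shortcut" variant noted above
        self.shortcut = nn.Conv2d(channels, channels, 1) if shortcut == "conv" else nn.Identity()
        self.relu = nn.ReLU(inplace=True)                   # f(.)

    def forward(self, x):
        return self.relu(self.shortcut(x) + self.branch(x))

x = torch.randn(1, 16, 8, 8)
print(ResidualUnit(16, "identity")(x).shape)                # torch.Size([1, 16, 8, 8])
print(ResidualUnit(16, "conv")(x).shape)
```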