Residual Networks Explained

Background

Residual networks (ResNets) were introduced in:

He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.

and were given a theoretical analysis in a follow-up paper:

He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks[C]//European conference on computer vision. Springer, Cham, 2016: 630-645.

Residuals

In mathematical statistics, a residual is the difference between an observed value and the estimated (fitted) value:
$\hat\varepsilon = y_1 - \hat y_1$
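As a concrete worked example (the numbers here are made up): if the observed value is $y_1 = 5.0$ and the fitted value is $\hat y_1 = 4.2$, then
$\hat\varepsilon = y_1 - \hat y_1 = 5.0 - 4.2 = 0.8$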

The Degradation Problem in Deep Networks
  1. As the number of layers increases, a degradation problem appears at convergence: accuracy saturates and then degrades, and this is not caused by overfitting.
    Consider a small plain feed-forward network with weights $w_1, \dots, w_4$:

Input: $x_1$
Output: $\hat y$
Activation function: $f(\cdot)$
Forward computation: $x_i = f(x_{i-1} w_{i-1})$, with $\hat y = f(x_4 w_4)$
Loss function: $E = \frac{1}{2}(\hat y - y)^2$
Minimizing the loss gives the optimal weights:
$\hat w^* = \mathop{\arg\min}\limits_{\hat w} \frac{1}{2}(\hat y - y)^2$
Differentiating with respect to the shallowest weight $w_1$ by the chain rule:
$\frac{\partial \hat y}{\partial w_1} = \frac{\partial \hat y}{\partial x_4}\frac{\partial x_4}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial w_1} = w_4 f'(x_4 w_4)\, w_3 f'(x_3 w_3)\, w_2 f'(x_2 w_2)\, x_1 f'(x_1 w_1)$
Every layer contributes a multiplicative factor $w_i f'(\cdot)$ to this chain-rule product, so when the factors are small the gradient of a shallow weight shrinks exponentially with depth (and explodes when they are large).
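A minimal numeric sketch of this effect, assuming a toy scalar chain with sigmoid activations and hypothetical random weights (not any network from the paper): the running product of chain-rule factors collapses as depth grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for depth in (5, 20, 50):
    x_i, grad = 0.5, 1.0              # scalar input x_1 and running product
    for _ in range(depth):
        w = rng.normal(scale=0.5)     # hypothetical weight w_i
        s = sigmoid(x_i * w)          # x_{i+1} = f(x_i * w_i)
        grad *= w * s * (1.0 - s)     # chain-rule factor w_i * f'(x_i * w_i)
        x_i = s
    print(depth, abs(grad))           # the product collapses as depth grows
```

Since $f'(z) = s(1-s) \le 0.25$ for the sigmoid, each factor is typically well below 1, which is exactly why the 50-layer case prints an essentially zero gradient.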

Residual Units

A deep residual network is composed of many stacked Residual Units. Each unit can be expressed as:
$y_l = h(x_l) + \mathcal F(x_l, \mathcal W_l)$
$x_{l+1} = f(y_l)$
If both the shortcut $h$ and the after-addition function $f$ are identity mappings, the recursion simplifies to:
$x_{l+1} = x_l + \mathcal F(x_l, \mathcal W_l)$
$x_{l+2} = x_{l+1} + \mathcal F(x_{l+1}, \mathcal W_{l+1}) = x_l + \mathcal F(x_l, \mathcal W_l) + \mathcal F(x_{l+1}, \mathcal W_{l+1})$

$x_L = x_l + \sum\limits_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)$

where:
$x_l$: input feature of the $l$-th unit
$x_{l+1}$: output feature of the $l$-th unit
$\mathcal F(\cdot)$: the residual function
$h(x_l) = x_l$: the identity mapping
$f(\cdot)$: an activation function
$L$: the index of any deeper unit (equivalently, the number of stacked Residual Units when counting from the first)
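A minimal sketch checking the unrolled identity $x_L = x_l + \sum_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)$ numerically, assuming a toy numpy setup with identity shortcuts and a tanh stand-in for the residual branch (the real $\mathcal F$ is a conv/BN/ReLU stack):

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 4, 6                              # feature width, number of units
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]

def residual_branch(x, W):
    # Stand-in residual function F(x, W).
    return np.tanh(x @ W)

x_l = rng.normal(size=d)                 # input feature of the shallow unit
branch_outputs = []                      # the individual F(x_i, W_i) terms
x_i = x_l
for W in Ws:
    F_i = residual_branch(x_i, W)
    branch_outputs.append(F_i)
    x_i = x_i + F_i                      # x_{i+1} = x_i + F(x_i, W_i)

# The deep feature equals the shallow feature plus the sum of all residuals.
assert np.allclose(x_i, x_l + np.sum(branch_outputs, axis=0))
```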

The Principle Behind Residuals

Denote the loss by $\varepsilon$. Backpropagating to the shallow feature $x_l$:
$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum\limits_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)\right) = \frac{\partial \varepsilon}{\partial x_L} + \frac{\partial \varepsilon}{\partial x_L}\frac{\partial}{\partial x_l}\sum\limits_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)$

and likewise to a shallow weight $w_1$:
$\frac{\partial \varepsilon}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_L}{\partial x_l}\frac{\partial x_l}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum\limits_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)\right)\frac{\partial x_l}{\partial w_1} = \frac{\partial \varepsilon}{\partial x_L}\frac{\partial x_l}{\partial w_1} + \frac{\partial \varepsilon}{\partial x_L}\left(\frac{\partial}{\partial x_l}\sum\limits_{i=l}^{L-1} \mathcal F(x_i, \mathcal W_i)\right)\frac{\partial x_l}{\partial w_1}$

The additive term $\frac{\partial \varepsilon}{\partial x_L}$ is propagated directly to every shallower unit without passing through any weight layers, so the gradient does not vanish even when the residual derivatives are small; it could only vanish if the summed derivative were exactly $-1$, which is unlikely to hold for all samples in a batch.
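A minimal autograd sketch of this, assuming PyTorch and toy linear-plus-tanh branches (purely illustrative, not the paper's layers): with the skip connection off, the gradient at the shallow feature underflows toward zero; with it on, the "$1+$" term keeps the gradient on the order of $\partial\varepsilon/\partial x_L$.

```python
import torch

torch.manual_seed(0)
d, depth = 8, 50
Ws = [torch.randn(d, d) * 0.1 for _ in range(depth)]

def loss(x, residual):
    h = x
    for W in Ws:
        branch = torch.tanh(h @ W)
        h = h + branch if residual else branch   # toggle the skip connection
    return h.sum()                               # scalar stand-in for the loss

for residual in (False, True):
    x_l = torch.randn(d, requires_grad=True)
    loss(x_l, residual).backward()
    # Plain chain: gradient norm ~ 0.  Residual chain: O(1), because the
    # identity path contributes a direct additive gradient term.
    print(residual, x_l.grad.norm().item())
```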

Residual Network Variants

The identity-mappings paper also compares design variants empirically: different shortcut types (constant scaling, gating, 1×1 convolution, dropout on the shortcut) and different orderings of BN/ReLU relative to the convolutions (post-activation vs. pre-activation). Two observations:

  1. The 1×1 conv shortcut tends to train reasonably well only in relatively shallow networks.
  2. Dropout on the shortcut is statistically equivalent to scaling the identity mapping by a constant $\lambda$ (the keep probability).
These variants differ in empirical performance, but arguably without any solid theoretical justification for the differences; a more principled discussion would have to start from the data distribution and from how the network is optimized.
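To make the full-pre-activation arrangement from the identity-mappings paper concrete, here is a minimal PyTorch sketch of one Residual Unit; the channel count and 3×3 kernel choice are illustrative assumptions, not a prescribed configuration:

```python
import torch
import torch.nn as nn

class PreActResidualUnit(nn.Module):
    """Pre-activation Residual Unit: BN -> ReLU -> conv, twice, plus identity."""
    def __init__(self, channels):
        super().__init__()
        # Residual branch F(x, W); the addition itself has no activation,
        # so the shortcut path stays a pure identity mapping.
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return x + self.branch(x)    # x_{l+1} = x_l + F(x_l, W_l)

unit = PreActResidualUnit(channels=16)
y = unit(torch.randn(2, 16, 8, 8))   # output shape matches input: (2, 16, 8, 8)
```

Keeping the addition free of any BN or activation is the design point: it is what makes the derivations above hold with $h$ and $f$ both equal to the identity.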