[NeurIPS 2016] Understanding the effective receptive field in deep convolutional neural networks

Paper link: Understanding the effective receptive field in deep convolutional neural networks | Proceedings of the 30th International Conference on Neural Information Processing Systems

The English is typed entirely by hand, as a summary and paraphrase of the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is written as personal notes, so read it with that in mind.

Contents

1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Properties of Effective Receptive Fields

2.3.1. The simplest case: a stack of convolutional layers of weights all equal to one

2.3.2. Random weights

2.3.3. Non-uniform kernels

2.3.4. Nonlinear activation functions

2.3.5. Dropout, Subsampling, Dilated Convolution and Skip-Connections

2.4. Experiments

2.4.1. Verifying theoretical results

2.4.2. How the ERF evolves during training

2.5. Reduce the Gaussian Damage

2.6. Discussion

2.7. Conclusion

3. Reference


1. Takeaways

(1) This paper is from 2016, yet it makes the 2025 me feel like a joker

2. Section-by-Section Close Reading

2.1. Abstract

        ①They aim to study the receptive fields (RF) of units in deep CNNs

        ②The effective receptive field (ERF) only occupies a fraction of the theoretical receptive field (TRF)

2.2. Introduction

        ①Not all pixels in the TRF contribute equally to the output

        ②The ERF is distributed like a Gaussian

        ③⭐Compared with convolutions using large kernels, a deep net stacked from small kernels starts with a rather small ERF, which the authors view as a potentially harmful initialization bias

2.3. Properties of Effective Receptive Fields

        ①Pixel index: \left ( i,j \right )

        ②Image center: \left ( 0,0 \right )

        ③Pixel in the p-th conv layer: x^p_{i,j}, where x^0_{i,j} is the input 

        ④Output: y_{i,j}=x_{i,j}^{n}

        ⑤Task: measure how x^0_{i,j} contributes to y_{0,0} via \partial y_{0,0}/\partial x_{i,j}^{0}

        ⑥Back propagation function: \frac{\partial l}{\partial x_{i,j}^{0}}=\sum_{i^{\prime},j^{\prime}}\frac{\partial l}{\partial y_{i^{\prime},j^{\prime}}}\frac{\partial y_{i^{\prime},j^{\prime}}}{\partial x_{i,j}^{0}} where l denotes the loss

        ⑦To get this quantity, they set the error gradient to \partial l/\partial y_{0,0}=1 and \partial l/\partial y_{i,j}=0 for all (i,j) \neq (0,0); then \partial l/\partial x_{i,j}^0=\partial y_{0,0}/\partial x_{i,j}^{0} (substituting these values into ⑥ does make the equality hold, but why is this allowed? In effect it is just a probe: l need not be a real loss, and choosing a one-hot error signal at y_{0,0} turns a single backward pass into a direct read-out of \partial y_{0,0}/\partial x_{i,j}^{0}, as in the sketch below)
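
A minimal sketch of this probe (my own illustration, not the authors' code), assuming PyTorch and a hypothetical toy stack of 3×3 convolutions: set the error gradient to 1 at the center output unit and 0 elsewhere, run one backward pass, and read the gradient on the input.

```python
# One-hot gradient probe: the input gradient after one backward pass is
# exactly dy_{center}/dx^0_{i,j}, i.e. the empirical ERF of the center unit.
import torch
import torch.nn as nn

n_layers, k = 10, 3
# hypothetical toy network: n_layers stacked 3x3 convs, stride 1, 1 channel
net = nn.Sequential(*[nn.Conv2d(1, 1, k, padding=k // 2, bias=False)
                      for _ in range(n_layers)])

x = torch.randn(1, 1, 64, 64, requires_grad=True)     # x^0
y = net(x)                                             # y = x^n (same spatial size)

grad_y = torch.zeros_like(y)                           # dl/dy_{i,j} = 0 ...
grad_y[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0   # ... except 1 at the center
y.backward(grad_y)                                     # one backward pass

erf = x.grad[0, 0].abs()   # |dy_{center}/dx^0_{i,j}| over the input plane
print(erf.shape, erf.max())
```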

2.3.1. The simplest case: a stack of convolutional layers of weights all equal to one

        ①Consider n conv layers with kernel size k, stride 1, a single channel, and no nonlinear activation

        ②g(i,j,p)=\partial l/\partial x_{i,j}^{p} is the gradient on the p-th layer, g(i,j,n)=\partial l/\partial y_{i,j}

        ③Desired result: g(i,j,0)=\partial l/\partial x_{i,j}^0

        ④They define initial gradient signal u\left ( t \right ) and kernel v\left ( t \right ):

u(t)=\delta(t),\quad v(t)=\sum_{m=0}^{k-1}\delta(t-m),\quad\mathrm{where}\ \delta(t)=\left\{ \begin{array} {cc}1, & t=0 \\ 0, & t\neq0 \end{array}\right.

where t=0,1,-1,2,-2,...

        ⑤The gradient signal on each input pixel is then o=u*v*\cdots*v, convolving v with itself n times

        ⑥Compute these convolutions via the Discrete-Time Fourier Transform:

U(\omega)=\sum_{t=-\infty}^{\infty}u(t)e^{-j\omega t}=1,\quad V(\omega)=\sum_{t=-\infty}^{\infty}v(t)e^{-j\omega t}=\sum_{m=0}^{k-1}e^{-j\omega m}

        ⑦The Fourier transform of o is

\mathcal{F}(o)=\mathcal{F}(u*v*\cdots*v)(\omega)=U(\omega)\cdot V(\omega)^n=\left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^n

        ⑧Inverse Fourier transform:

o(t)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^{n}e^{j\omega t}\mathrm{d}\omega,\quad\mathrm{where}\quad\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-j\omega s}e^{j\omega t}\mathrm{d}\omega=\left\{\begin{array}{ll}1, & s=t \\ 0, & s\neq t\end{array}\right.

so o(t) is exactly the coefficient of e^{-j\omega t} in the expansion of \left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^{n}
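
A small numpy check of this subsection (my own sketch): repeatedly convolving the one-hot signal u with the all-ones kernel v gives the gradient profile o, whose support is the TRF of size n(k-1)+1 while its standard deviation only grows like \sqrt{n}.

```python
# Sketch of 2.3.1: o = u * v * ... * v (n times) for an all-ones kernel v.
import numpy as np

k, n = 3, 10
v = np.ones(k)                 # v(t) = sum_m delta(t - m)
o = np.array([1.0])            # u(t) = delta(t)
for _ in range(n):
    o = np.convolve(o, v)      # n-fold convolution with the box kernel

o /= o.sum()                   # normalize to a distribution over the TRF
t = np.arange(len(o))
mean = (t * o).sum()
std = np.sqrt(((t - mean) ** 2 * o).sum())
print(len(o), mean, std)       # support = n*(k-1)+1 = 21, std ≈ sqrt(n * 2/3)
```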

2.3.2. Random weights

        ①With random weights w^p_{a,b}, backpropagating through one conv layer shifts the pixel indices:

g(i,j,p-1)=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}w_{a,b}^pg(i+a,j+b,p)

        ②Since \mathbb{E}_w[w_{a,b}^p]=0 and the weights are independent of the gradients from the layer above, the expected gradient vanishes everywhere:

\mathbb{E}_{w,input}[g(i,j,p-1)]=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathbb{E}_{w}[w_{a,b}^{p}]\mathbb{E}_{input}[g(i+a,j+b,p)]=0,\forall p

        ③With i.i.d. weights of variance C=\mathrm{Var}[w_{a,b}^{p}], the gradient variances propagate exactly as if convolved with a k \times k kernel of all 1's (checked in the sketch below):

\mathrm{Var}[g(i,j,p-1)]=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathrm{Var}[w_{a,b}^{p}]\mathrm{Var}[g(i+a,j+b,p)]=C\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathrm{Var}[g(i+a,j+b,p)]
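
A quick 1-D Monte Carlo sketch of ②③ (my own check, not from the paper): with i.i.d. zero-mean weights, the backpropagated gradient has zero mean and its variance profile matches the all-ones-kernel profile of 2.3.1 scaled by C^n.

```python
# Monte Carlo check (1-D, illustration only): variance of g(., 0) vs. theory.
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 3, 5, 20000
C = 0.1                                    # Var[w]

samples = []
for _ in range(trials):
    g = np.array([1.0])                    # one-hot gradient at the output
    for _ in range(n):
        w = rng.normal(0.0, np.sqrt(C), k)
        g = np.convolve(g, w)              # backprop through one random conv layer
    samples.append(g)
var_profile = np.array(samples).var(axis=0)

box = np.array([1.0])
for _ in range(n):
    box = np.convolve(box, np.ones(k))     # all-ones-kernel profile from 2.3.1

center = len(box) // 2
print(var_profile[center], C ** n * box[center])   # empirical vs. theoretical, ≈ equal
```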

2.3.3. Non-uniform kernels

        ①The weights are non-negative and normalized: w(m)\geq 0,\ \sum_mw(m)=1

        ②So:

U(\omega)\cdot V(\omega)\cdots V(\omega)=\left(\sum_{m=0}^{k-1}w(m)e^{-j\omega m}\right)^n

and, writing S_n=\sum_{i=1}^{n}X_{i} for the sum whose distribution o(t) describes, the mean and variance of this Gaussian are:

\mathbb{E}[S_n]=n\sum_{m=0}^{k-1}mw(m),\operatorname{Var}[S_n]=n\left(\sum_{m=0}^{k-1}m^2w(m)-\left(\sum_{m=0}^{k-1}mw(m)\right)^2\right)

        ③Taking one standard deviation as the radius of the ERF:

\sqrt{\mathrm{Var}[S_{n}]}=\sqrt{n\mathrm{Var}[X_{i}]}=O(\sqrt{n})

where the X_i's are i.i.d. multinomial variables distributed according to the w(m)'s, i.e. p(X_{i}=m)=w(m)

        ④The TRF grows linearly in n while the ERF radius only grows as O(\sqrt{n}), so relative to the TRF the ERF shrinks at a rate of O(1/\sqrt{n}) (see the numeric sketch below)
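
A tiny numeric sketch of ②–④ (with my own example weights, not taken from the paper): the ERF radius \sqrt{n\,\mathrm{Var}[X_i]} grows like \sqrt{n}, while the theoretical RF radius grows like n, so their ratio decays like 1/\sqrt{n}.

```python
# ERF radius vs. TRF radius for a normalized 1-D kernel w(m) (example weights).
import numpy as np

k = 3
w = np.array([0.25, 0.5, 0.25])        # hypothetical weights, w(m) >= 0, sum = 1
m = np.arange(k)
var_x = (m ** 2 * w).sum() - ((m * w).sum()) ** 2   # Var[X_i]

for n in (5, 20, 80):
    erf_radius = np.sqrt(n * var_x)    # O(sqrt(n))
    trf_radius = n * (k - 1) / 2       # theoretical RF radius, O(n)
    print(n, round(erf_radius, 2), trf_radius, round(erf_radius / trf_radius, 3))
```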

2.3.4. Nonlinear activation functions

        ①With a nonlinear activation function \sigma followed by a convolution, the backpropagated gradient becomes:

g(i,j,p-1)=\sigma_{i,j}^{p} {}^\prime \sum_{a=0}^{k-1}\sum_{b=0}^{k-1}w_{a,b}^{p}g(i+a,j+b,p)

where \sigma_{i,j}^{p} {}^\prime is the derivative of the activation function at pixel \left ( i,j \right ) of layer p

        ②Assuming the \sigma^{\prime} values are independent of the weights and of the gradients g from the layer above, the variance simplifies to:

\mathrm{Var}[g(i,j,p-1)]=\mathbb{E}[\sigma_{i,j}^{p}{}^{\prime2}]\sum_{a}\sum_{b}\mathrm{Var}[w_{a,b}^{p}]\mathrm{Var}[g(i+a,j+b,p)]

This is tractable for ReLU, whose \sigma^{\prime} is a 0/1 indicator; Sigmoid and Tanh are harder to analyse (see the quick sketch below)
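
A quick sketch of the ReLU case (my own check, assuming zero-mean symmetric pre-activations, which a real network does not guarantee): \sigma^{\prime} is the indicator 1[x>0], so \mathbb{E}[\sigma^{\prime2}] is just the probability of a positive pre-activation, roughly 0.5.

```python
# ReLU: sigma'(x) = 1[x > 0], so E[sigma'^2] = P(pre-activation > 0) ≈ 0.5
# under the assumed symmetric, zero-mean pre-activation distribution.
import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.normal(size=1_000_000)       # hypothetical pre-activations
relu_grad = (pre_act > 0).astype(float)    # sigma'
print((relu_grad ** 2).mean())             # ≈ 0.5
```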

2.3.5. Dropout, Subsampling, Dilated Convolution and Skip-Connections

        ①Dropout does not change the Gaussian shape of the ERF

        ②Subsampling and dilated convolutions are effective ways to increase the area of the ERF quickly, while skip connections tend to make the ERF smaller (a small TRF comparison follows)
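
A back-of-the-envelope sketch of ② (my own formula for the stride-1 case): each 3×3 layer adds (k-1) pixels to the TRF, while dilation rate d adds d·(k-1), which is one reason dilated convolution enlarges the (effective) receptive field quickly.

```python
# Theoretical RF of a stack of stride-1 convs, with and without dilation.
def trf(n_layers: int, k: int = 3, dilation: int = 1) -> int:
    # each layer adds dilation * (k - 1) pixels to the theoretical RF
    return 1 + n_layers * dilation * (k - 1)

print(trf(6), trf(6, dilation=4))   # 13 vs 49 for six 3x3 layers
```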

2.4. Experiments

2.4.1. Verifying theoretical results

        ①Visualization of the ERF: the nonlinear activation weakens the Gaussian shape; the kernel size is 3 × 3, and the kernel weights are all 1 in the Uniform setting and random in the Random setting

        ②Although the nonlinearity corrupts the Gaussian shape in a single run, averaging over 100 runs recovers a smooth, nearly Gaussian ERF

        ③Absolute growth (left) and relative shrinkage (right) of the ERF as the number of layers increases

        ④Subsampling and dilation significantly expand the ERF

2.4.2. How the ERF evolves during training

        ①The ERF before and after training on CIFAR-10 classification and CamVid semantic segmentation: the ERF grows noticeably during training while the TRF stays the same

2.5. Reduce the Gaussian Damage

        ①Initialize kernels with smaller weights at the center and larger weights toward the border, so that more gradient signal reaches the outer part of the TRF (a toy sketch follows)
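
A toy sketch of this idea (my own illustration, not the paper's exact scheme): weight the k×k kernel taps by their distance from the center and renormalize, so the center gets less weight and the border more.

```python
# Toy re-weighted kernel initialization: less weight at the center, more at the
# border (the exact weighting below is my own hypothetical choice).
import numpy as np

k = 5
yy, xx = np.mgrid[:k, :k] - k // 2
dist = np.sqrt(xx ** 2 + yy ** 2)   # distance of each tap from the kernel center
w = 1.0 + dist                      # hypothetical: weight grows with distance
w /= w.sum()                        # keep the kernel normalized
print(np.round(w, 3))
```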

2.6. Discussion

        ①Some open-ended speculations from the authors; I will not expand on them here

2.7. Conclusion

        ~

Vocabulary: asymptotically — adv., in an asymptotic manner (approaching a limit gradually)

3. Reference

Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2016) Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS 2016), Red Hook, NY, USA.
