[NeurIPS 2016] Understanding the effective receptive field in deep convolutional neural networks

Paper link: Understanding the effective receptive field in deep convolutional neural networks | Proceedings of the 30th International Conference on Neural Information Processing Systems

The English is typed entirely by hand, as a summary and paraphrase of the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is written as personal notes, so read it with that in mind.

Contents

1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Properties of Effective Receptive Fields

2.3.1. The simplest case: a stack of convolutional layers of weights all equal to one

2.3.2. Random weights

2.3.3. Non-uniform kernels

2.3.4. Nonlinear activation functions

2.3.5. Dropout, Subsampling, Dilated Convolution and Skip-Connections

2.4. Experiments

2.4.1. Verifying theoretical results

2.4.2. How the ERF evolves during training

2.5. Reduce the Gaussian Damage

2.6. Discussion

2.7. Conclusion

3. Reference


1. Takeaways

(1) This paper is from 2016, yet it makes the 2025 me feel like a joker

2. Section-by-Section Close Reading

2.1. Abstract

        ①They aim to study the receptive fields (RF) of units in deep CNNs

        ②The effective receptive field (ERF) only occupies a fraction of the theoretical receptive field (TRF)

2.2. Introduction

        ①Not all pixels in the TRF contribute equally to the output

        ②The ERF is distributed like a Gaussian

        ③⭐Compared with convolutions using large kernels, a deep net stacked from small kernels starts with a rather small ERF, which the authors view as a potentially harmful initialization bias

2.3. Properties of Effective Receptive Fields

        ①Pixel index: \left ( i,j \right )

        ②Image center: \left ( 0,0 \right )

        ③Pixel in the p-th conv layer: x^p_{i,j}, where x^0_{i,j} is the input 

        ④Output: y_{i,j}=x_{i,j}^{n}

        ⑤Task: measure how x^0_{i,j} contributes to y_{0,0} via \partial y_{0,0}/\partial x_{i,j}^{0}

        ⑥Back propagation function: \frac{\partial l}{\partial x_{i,j}^{0}}=\sum_{i^{\prime},j^{\prime}}\frac{\partial l}{\partial y_{i^{\prime},j^{\prime}}}\frac{\partial y_{i^{\prime},j^{\prime}}}{\partial x_{i,j}^{0}} where l denotes the loss

        ⑦To get this quantity, they set the error gradient to \partial l/\partial y_{0,0}=1 and \partial l/\partial y_{i,j}=0 for all (i,j) \neq (0,0); then \partial l/\partial x_{i,j}^0=\partial y_{0,0}/\partial x_{i,j}^{0} (substituting these values into ⑥ does make the equality hold, but why is this allowed? In effect it is just a probe: l need not be a real loss, and choosing a one-hot error signal at y_{0,0} turns a single backward pass into a direct read-out of \partial y_{0,0}/\partial x_{i,j}^{0}, as in the sketch below)
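
A minimal sketch of this probe (my own illustration, not the authors' code), assuming PyTorch and a hypothetical toy stack of 3×3 convolutions: set the error gradient to 1 at the center output unit and 0 elsewhere, run one backward pass, and read the gradient on the input.

```python
# One-hot gradient probe: the input gradient after one backward pass is
# exactly dy_{center}/dx^0_{i,j}, i.e. the empirical ERF of the center unit.
import torch
import torch.nn as nn

n_layers, k = 10, 3
# hypothetical toy network: n_layers stacked 3x3 convs, stride 1, 1 channel
net = nn.Sequential(*[nn.Conv2d(1, 1, k, padding=k // 2, bias=False)
                      for _ in range(n_layers)])

x = torch.randn(1, 1, 64, 64, requires_grad=True)     # x^0
y = net(x)                                             # y = x^n (same spatial size)

grad_y = torch.zeros_like(y)                           # dl/dy_{i,j} = 0 ...
grad_y[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0   # ... except 1 at the center
y.backward(grad_y)                                     # one backward pass

erf = x.grad[0, 0].abs()   # |dy_{center}/dx^0_{i,j}| over the input plane
print(erf.shape, erf.max())
```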

2.3.1. The simplest case: a stack of convolutional layers of weights all equal to one

        ①Consider n conv layers with kernel size k, stride 1, a single channel, and no nonlinear activation

        ②g(i,j,p)=\partial l/\partial x_{i,j}^{p} is the gradient on the p-th layer, g(i,j,n)=\partial l/\partial y_{i,j}

        ③Desired result: g(i,j,0)=\partial l/\partial x_{i,j}^0

        ④They define initial gradient signal u\left ( t \right ) and kernel v\left ( t \right ):

u(t)=\delta(t),\quad v(t)=\sum_{m=0}^{k-1}\delta(t-m),\quad\mathrm{where}\ \delta(t)=\left\{ \begin{array} {cc}1, & t=0 \\ 0, & t\neq0 \end{array}\right.

where t=0,1,-1,2,-2,...

        ⑤The gradient signal on each input pixel is then o=u*v*\cdots*v, convolving v with itself n times

        ⑥Compute these convolutions via the Discrete-Time Fourier Transform:

U(\omega)=\sum_{t=-\infty}^{\infty}u(t)e^{-j\omega t}=1,\quad V(\omega)=\sum_{t=-\infty}^{\infty}v(t)e^{-j\omega t}=\sum_{m=0}^{k-1}e^{-j\omega m}

        ⑦The Fourier transform of o is

\mathcal{F}(o)=\mathcal{F}(u*v*\cdots*v)(\omega)=U(\omega)\cdot V(\omega)^n=\left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^n

        ⑧Inverse Fourier transform:

o(t)=\frac{1}{2\pi}\int_{-\pi}^{\pi}\left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^{n}e^{j\omega t}\mathrm{d}\omega,\quad\mathrm{where}\quad\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-j\omega s}e^{j\omega t}\mathrm{d}\omega=\left\{\begin{array}{ll}1, & s=t \\ 0, & s\neq t\end{array}\right.

so o(t) is exactly the coefficient of e^{-j\omega t} in the expansion of \left(\sum_{m=0}^{k-1}e^{-j\omega m}\right)^{n}
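
A small numpy check of this subsection (my own sketch): repeatedly convolving the one-hot signal u with the all-ones kernel v gives the gradient profile o, whose support is the TRF of size n(k-1)+1 while its standard deviation only grows like \sqrt{n}.

```python
# Sketch of 2.3.1: o = u * v * ... * v (n times) for an all-ones kernel v.
import numpy as np

k, n = 3, 10
v = np.ones(k)                 # v(t) = sum_m delta(t - m)
o = np.array([1.0])            # u(t) = delta(t)
for _ in range(n):
    o = np.convolve(o, v)      # n-fold convolution with the box kernel

o /= o.sum()                   # normalize to a distribution over the TRF
t = np.arange(len(o))
mean = (t * o).sum()
std = np.sqrt(((t - mean) ** 2 * o).sum())
print(len(o), mean, std)       # support = n*(k-1)+1 = 21, std ≈ sqrt(n * 2/3)
```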

2.3.2. Random weights

        ①With random weights w^p_{a,b}, backpropagating through one conv layer shifts the pixel indices:

g(i,j,p-1)=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}w_{a,b}^pg(i+a,j+b,p)

        ②Since \mathbb{E}_w[w_{a,b}^p]=0 and the weights are independent of the gradients from the layer above, the expected gradient vanishes everywhere:

\mathbb{E}_{w,input}[g(i,j,p-1)]=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathbb{E}_{w}[w_{a,b}^{p}]\mathbb{E}_{input}[g(i+a,j+b,p)]=0,\forall p

        ③With i.i.d. weights of variance C=\mathrm{Var}[w_{a,b}^{p}], the gradient variances propagate exactly as if convolved with a k \times k kernel of all 1's (checked in the sketch below):

\mathrm{Var}[g(i,j,p-1)]=\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathrm{Var}[w_{a,b}^{p}]\mathrm{Var}[g(i+a,j+b,p)]=C\sum_{a=0}^{k-1}\sum_{b=0}^{k-1}\mathrm{Var}[g(i+a,j+b,p)]
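
A quick 1-D Monte Carlo sketch of ②③ (my own check, not from the paper): with i.i.d. zero-mean weights, the backpropagated gradient has zero mean and its variance profile matches the all-ones-kernel profile of 2.3.1 scaled by C^n.

```python
# Monte Carlo check (1-D, illustration only): variance of g(., 0) vs. theory.
import numpy as np

rng = np.random.default_rng(0)
k, n, trials = 3, 5, 20000
C = 0.1                                    # Var[w]

samples = []
for _ in range(trials):
    g = np.array([1.0])                    # one-hot gradient at the output
    for _ in range(n):
        w = rng.normal(0.0, np.sqrt(C), k)
        g = np.convolve(g, w)              # backprop through one random conv layer
    samples.append(g)
var_profile = np.array(samples).var(axis=0)

box = np.array([1.0])
for _ in range(n):
    box = np.convolve(box, np.ones(k))     # all-ones-kernel profile from 2.3.1

center = len(box) // 2
print(var_profile[center], C ** n * box[center])   # empirical vs. theoretical, ≈ equal
```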

2.3.3. Non-uniform kernels

        ①The weights are non-negative and normalized: w(m)\geq 0,\ \sum_mw(m)=1

        ②So:

U(\omega)\cdot V(\omega)\cdots V(\omega)=\left(\sum_{m=0}^{k-1}w(m)e^{-j\omega m}\right)^n

and, writing S_n=\sum_{i=1}^{n}X_{i} for the sum whose distribution o(t) describes, the mean and variance of this Gaussian are:

\mathbb{E}[S_n]=n\sum_{m=0}^{k-1}mw(m),\operatorname{Var}[S_n]=n\left(\sum_{m=0}^{k-1}m^2w(m)-\left(\sum_{m=0}^{k-1}mw(m)\right)^2\right)

        ③Taking one standard deviation as the radius of the ERF:

\sqrt{\mathrm{Var}[S_{n}]}=\sqrt{n\mathrm{Var}[X_{i}]}=O(\sqrt{n})

where the X_i's are i.i.d. multinomial variables distributed according to the w(m)'s, i.e. p(X_{i}=m)=w(m)

        ④The TRF grows linearly in n while the ERF radius only grows as O(\sqrt{n}), so relative to the TRF the ERF shrinks at a rate of O(1/\sqrt{n}) (see the numeric sketch below)
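
A tiny numeric sketch of ②–④ (with my own example weights, not taken from the paper): the ERF radius \sqrt{n\,\mathrm{Var}[X_i]} grows like \sqrt{n}, while the theoretical RF radius grows like n, so their ratio decays like 1/\sqrt{n}.

```python
# ERF radius vs. TRF radius for a normalized 1-D kernel w(m) (example weights).
import numpy as np

k = 3
w = np.array([0.25, 0.5, 0.25])        # hypothetical weights, w(m) >= 0, sum = 1
m = np.arange(k)
var_x = (m ** 2 * w).sum() - ((m * w).sum()) ** 2   # Var[X_i]

for n in (5, 20, 80):
    erf_radius = np.sqrt(n * var_x)    # O(sqrt(n))
    trf_radius = n * (k - 1) / 2       # theoretical RF radius, O(n)
    print(n, round(erf_radius, 2), trf_radius, round(erf_radius / trf_radius, 3))
```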

2.3.4. Nonlinear activation functions

        ①With a nonlinear activation function \sigma followed by a convolution, the backpropagated gradient becomes:

g(i,j,p-1)=\sigma_{i,j}^{p} {}^\prime \sum_{a=0}^{k-1}\sum_{b=0}^{k-1}w_{a,b}^{p}g(i+a,j+b,p)

where \sigma_{i,j}^{p} {}^\prime is the derivative of the activation function at pixel \left ( i,j \right ) of layer p

        ②Assuming the \sigma^{\prime} values are independent of the weights and of the gradients g from the layer above, the variance simplifies to:

\mathrm{Var}[g(i,j,p-1)]=\mathbb{E}[\sigma_{i,j}^{p}{}^{\prime2}]\sum_{a}\sum_{b}\mathrm{Var}[w_{a,b}^{p}]\mathrm{Var}[g(i+a,j+b,p)]

This is tractable for ReLU, whose \sigma^{\prime} is a 0/1 indicator; Sigmoid and Tanh are harder to analyse (see the quick sketch below)
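
A quick sketch of the ReLU case (my own check, assuming zero-mean symmetric pre-activations, which a real network does not guarantee): \sigma^{\prime} is the indicator 1[x>0], so \mathbb{E}[\sigma^{\prime2}] is just the probability of a positive pre-activation, roughly 0.5.

```python
# ReLU: sigma'(x) = 1[x > 0], so E[sigma'^2] = P(pre-activation > 0) ≈ 0.5
# under the assumed symmetric, zero-mean pre-activation distribution.
import numpy as np

rng = np.random.default_rng(0)
pre_act = rng.normal(size=1_000_000)       # hypothetical pre-activations
relu_grad = (pre_act > 0).astype(float)    # sigma'
print((relu_grad ** 2).mean())             # ≈ 0.5
```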

2.3.5. Dropout, Subsampling, Dilated Convolution and Skip-Connections

        ①Dropout does not change the Gaussian shape of the ERF

        ②Subsampling and dilated convolutions are effective ways to increase the area of the ERF quickly, while skip connections tend to make the ERF smaller (a small TRF comparison follows)
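
A back-of-the-envelope sketch of ② (my own formula for the stride-1 case): each 3×3 layer adds (k-1) pixels to the TRF, while dilation rate d adds d·(k-1), which is one reason dilated convolution enlarges the (effective) receptive field quickly.

```python
# Theoretical RF of a stack of stride-1 convs, with and without dilation.
def trf(n_layers: int, k: int = 3, dilation: int = 1) -> int:
    # each layer adds dilation * (k - 1) pixels to the theoretical RF
    return 1 + n_layers * dilation * (k - 1)

print(trf(6), trf(6, dilation=4))   # 13 vs 49 for six 3x3 layers
```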

2.4. Experiments

2.4.1. Verifying theoretical results

        ①Visualization of the ERF: the nonlinear activation weakens the Gaussian shape; the kernel size is 3 × 3, and the kernel weights are all 1 in the Uniform setting and random in the Random setting

        ②Although the nonlinearity corrupts the Gaussian shape in a single run, averaging over 100 runs recovers a smooth, nearly Gaussian ERF

        ③Absolute growth (left) and relative shrinkage (right) of the ERF as the number of layers increases

        ④Subsampling and dilation significantly expand the ERF

2.4.2. How the ERF evolves during training

        ①The ERF before and after training on CIFAR-10 classification and CamVid semantic segmentation: the ERF grows noticeably during training while the TRF stays the same

2.5. Reduce the Gaussian Damage

        ①Initialize kernels with smaller weights at the center and larger weights toward the border, so that more gradient signal reaches the outer part of the TRF (a toy sketch follows)
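
A toy sketch of this idea (my own illustration, not the paper's exact scheme): weight the k×k kernel taps by their distance from the center and renormalize, so the center gets less weight and the border more.

```python
# Toy re-weighted kernel initialization: less weight at the center, more at the
# border (the exact weighting below is my own hypothetical choice).
import numpy as np

k = 5
yy, xx = np.mgrid[:k, :k] - k // 2
dist = np.sqrt(xx ** 2 + yy ** 2)   # distance of each tap from the kernel center
w = 1.0 + dist                      # hypothetical: weight grows with distance
w /= w.sum()                        # keep the kernel normalized
print(np.round(w, 3))
```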

2.6. Discussion

        ①Some open-ended speculations from the authors; I will not expand on them here

2.7. Conclusion

        ~

Vocabulary: asymptotically — adv., in an asymptotic manner (approaching a limit gradually)

3. Reference

Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2016) Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS 2016), Red Hook, NY, USA.
