[ECCV 2024] Wavelet Convolutions for Large Receptive Fields

夏莉莉iy

Last modified 2024-11-14 22:09:02


License: CC 4.0 BY-SA

Category: Paper Close Reading. Tags: deep learning, artificial intelligence, machine learning, notes, computer vision, neural networks, python

First published 2024-11-01 23:39:21

Article link: https://blog.youkuaiyun.com/Sherlily/article/details/143438779


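Paper: [2407.05848] Wavelet Convolutions for Large Receptive Fields

Code: https://github.com/BGU-CS-VIL/WTConv

The English here was typed entirely by hand! It summarizes and paraphrases the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read it with caution.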


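1. Thoughts

（1）The formulas are screenshots because the formula-recognition software now charges money!! Truly sad; a broke student can't spare a cent! Typing formulas by hand works, but the long ones are a pain!

2. Section-by-section close reading of the paper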

2.1. Abstract

①To enlarge the receptive field, researchers have tried increasing the convolution kernel size. However, this approach hits an upper bound and saturates quickly

②The authors propose WTConv to obtain a large receptive field

2.2. Introduction

①⭐Blindly enlarging the conv kernel makes the parameter count grow rapidly

②The larger the kernel, the stronger the ability to capture low-frequency features

③Attention heads tend to focus on low frequencies

④⭐Compared with the Fourier transform, the wavelet transform (WT) retains some spatial resolution

⑤Their WTs are cascaded, so the receptive field grows with each level

⑥The effective receptive fields of ConvNeXt-T:

⑦Evaluation tasks: semantic segmentation and object detection

2.3. Related Work

2.3.1. Wavelet Transforms in Deep Learning

①Lists how the WT has been used in signal processing

②Explains the most closely related work, which downsamples with the WT and upsamples with the inverse WT

2.3.2. Large-Kernel Convolutions

①Introduces several large-kernel convolution methods

②Notes that some models employ attention in vision tasks

2.4. Method

2.4.1. Preliminaries: The Wavelet Transform as Convolutions

①WT: the Haar wavelet (other wavelet bases work as well)

②For an image $X$ , the 2D Haar WT:

where, as usual, LL is the low-frequency (low-pass) filter and the other three are high-frequency (high-pass) filters. Applying the four filters gives the outputs:

③The inverse wavelet transform (IWT) is implemented with a transposed convolution:

④"The cascade wavelet decomposition is then given by recursively decomposing the low-frequency component:"

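where $X^{(0)}_{LL}=X$ and $i$ denotes the current level. Each level increases the frequency resolution and reduces the spatial resolution of the low frequencies (Honestly, this sentence is a real tongue-twister. What is it actually saying? Does "low frequencies" modify the whole sentence or only the second half?)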
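Since the formulas in this subsection are screenshots, here is a minimal pure-Python sketch of the three ideas: the WT as a stride-2 convolution with four 2×2 Haar filters, the IWT as the matching transposed convolution, and the cascade that keeps decomposing the LL band. The filter values are the standard orthonormal Haar ones (every entry is ±1/2). The function names `haar_wt`, `haar_iwt` and `cascade`, the LH/HL naming order, and the 8×8 test image are my own assumptions, not taken from the paper's code.

```python
# Standard 2x2 Haar filters; every entry is +-1/2, so the four filters are orthonormal.
HAAR = {
    "LL": [[0.5, 0.5], [0.5, 0.5]],
    "LH": [[0.5, -0.5], [0.5, -0.5]],
    "HL": [[0.5, 0.5], [-0.5, -0.5]],
    "HH": [[0.5, -0.5], [-0.5, 0.5]],
}

def haar_wt(x):
    """Stride-2 convolution of a single-channel image (even H and W) with each filter."""
    h2, w2 = len(x) // 2, len(x[0]) // 2
    return {
        name: [[sum(f[a][b] * x[2 * i + a][2 * j + b] for a in (0, 1) for b in (0, 1))
                for j in range(w2)]
               for i in range(h2)]
        for name, f in HAAR.items()
    }

def haar_iwt(bands):
    """Transposed stride-2 convolution: scatter each coefficient back through its filter."""
    h2, w2 = len(bands["LL"]), len(bands["LL"][0])
    x = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for name, f in HAAR.items():
        for i in range(h2):
            for j in range(w2):
                for a in (0, 1):
                    for b in (0, 1):
                        x[2 * i + a][2 * j + b] += f[a][b] * bands[name][i][j]
    return x

def cascade(x, levels):
    """Recursively decompose the LL band; level i is computed from X_LL^(i-1)."""
    out, low = [], x  # X_LL^(0) = X
    for _ in range(levels):
        bands = haar_wt(low)
        out.append(bands)
        low = bands["LL"]
    return out

image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
restored = haar_iwt(haar_wt(image))                        # one-level round trip
sizes = [len(b["LL"]) for b in cascade(image, levels=3)]   # side length of LL per level
```

In this sketch the LL side length halves at every level, so one coefficient at level $i$ summarizes a $2^i\times 2^i$ patch of the input. That is one way to read "reduced spatial resolution of the low frequencies", while the repeated splitting of the lowest band is the "increased frequency resolution".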

2.4.2. Convolution in the Wavelet Domain

①⭐Increasing the convolution kernel size makes the number of parameters grow quadratically

②Apply a small-kernel depth-wise convolution between the WT and the IWT:

$Y=IWT(Conv(W,WT(X)))$

where $X$ is the input image (the authors write "tensor") and $W$ is the weight tensor of a $k\times k$ depth-wise kernel with four times as many input channels as $X$
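The formula $Y=IWT(Conv(W,WT(X)))$ can be sketched for a single level and a single input channel as below. This is a toy under stated assumptions, not the authors' implementation: the Haar filters are the standard ones, the depth-wise convolution is a zero-padded "same" cross-correlation with one $k\times k$ kernel per sub-band, and all function names are hypothetical.

```python
HAAR = {  # standard 2x2 Haar filters, entries +-1/2
    "LL": [[0.5, 0.5], [0.5, 0.5]],
    "LH": [[0.5, -0.5], [0.5, -0.5]],
    "HL": [[0.5, 0.5], [-0.5, -0.5]],
    "HH": [[0.5, -0.5], [-0.5, 0.5]],
}

def wt(x):
    """One-level Haar WT of a single-channel image with even height and width."""
    h2, w2 = len(x) // 2, len(x[0]) // 2
    return {n: [[sum(f[a][b] * x[2 * i + a][2 * j + b] for a in (0, 1) for b in (0, 1))
                 for j in range(w2)] for i in range(h2)]
            for n, f in HAAR.items()}

def iwt(bands):
    """Inverse of wt: scatter every coefficient back through its filter."""
    h2, w2 = len(bands["LL"]), len(bands["LL"][0])
    x = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for n, f in HAAR.items():
        for i in range(h2):
            for j in range(w2):
                for a in (0, 1):
                    for b in (0, 1):
                        x[2 * i + a][2 * j + b] += f[a][b] * bands[n][i][j]
    return x

def conv_same(x, k):
    """Zero-padded 'same' cross-correlation of one channel with an odd-sized square kernel."""
    h, w, r = len(x), len(x[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(len(k)):
                for b in range(len(k)):
                    ii, jj = i + a - r, j + b - r
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += k[a][b] * x[ii][jj]
    return out

def wtconv_single_level(x, kernels):
    """Y = IWT(Conv(W, WT(X))) with one kernel per sub-band (depth-wise)."""
    bands = wt(x)
    return iwt({n: conv_same(bands[n], kernels[n]) for n in bands})

# Sanity check: identity kernels leave every sub-band untouched, so Y equals X.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
y = wtconv_single_level(x, {n: identity for n in HAAR})
```

Because each sub-band is half the size of $X$, a $k\times k$ kernel applied there touches roughly a $2k\times 2k$ region of the original image, which is where the enlarged receptive field comes from.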

③Visualization of convolution in the wavelet domain, which yields a relatively large receptive field:

④The mapping operator:

where $X_H$ denotes all the high-frequency components (LH, HL and HH)

⑤For the inverse operation, the IWT is linear:

$IWT(X+Y)=IWT(X)+IWT(Y)$

so the outputs of the different levels can be merged recursively:

$Z^{(i)}=\mathrm{IWT}(Y_{LL}^{(i)}+Z^{(i+1)},Y_{H}^{(i)})$
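The linearity used in ⑤ is easy to check numerically. The sketch below inverts a single quadruple of Haar coefficients into a 2×2 pixel block (standard Haar filters with entries ±1/2 assumed) and compares both sides of $IWT(X+Y)=IWT(X)+IWT(Y)$. The helper names and the sample coefficients are made up for illustration.

```python
def iwt_block(ll, lh, hl, hh):
    """Inverse Haar transform of one coefficient quadruple into a 2x2 pixel block."""
    return [
        [0.5 * (ll + lh + hl + hh), 0.5 * (ll - lh + hl - hh)],
        [0.5 * (ll + lh - hl - hh), 0.5 * (ll - lh - hl + hh)],
    ]

def add_blocks(p, q):
    """Element-wise sum of two 2x2 blocks."""
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

x = (4.0, 1.0, -2.0, 0.5)   # (LL, LH, HL, HH) coefficients of a hypothetical X
y = (2.0, -3.0, 1.0, 1.5)   # the same for a hypothetical Y

lhs = iwt_block(*(a + b for a, b in zip(x, y)))   # IWT(X + Y)
rhs = add_blocks(iwt_block(*x), iwt_block(*y))    # IWT(X) + IWT(Y)
```

This is why $Y_{LL}^{(i)}$ and $Z^{(i+1)}$ can simply be added before a single IWT: summing in the wavelet domain and inverting once gives the same result as inverting each term separately and summing afterwards.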

⑥Visualization of 2-level WT:

2.4.3. The Benefits of Using WTConv

①Advantages of WTConv: a) it expands the receptive field, and b) it captures low-frequency features better
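A back-of-the-envelope sketch of advantage a). The parameter count is my own reading of the FLOPs expression in 2.4.4 (one $k\times k$ depth-wise kernel at full resolution plus four per WT level, ignoring any bias or scaling parameters), and the receptive field of roughly $2^\ell\cdot k$ is a rough estimate based on each level halving the resolution. Both are assumptions for illustration, as are the function names and the example values.

```python
def wtconv_params(channels, k, levels):
    """k x k depth-wise weights: one set at full resolution plus four sub-bands per level."""
    return channels * k * k * (1 + 4 * levels)

def wtconv_receptive_field(k, levels):
    """Approximate side length, in input pixels, covered by the deepest level's kernel."""
    return (2 ** levels) * k

def plain_depthwise_params(channels, kernel_size):
    """Weights of a single depth-wise kernel of the given side length."""
    return channels * kernel_size * kernel_size

channels, k = 64, 5
for levels in range(1, 4):
    rf = wtconv_receptive_field(k, levels)
    print(levels, rf, wtconv_params(channels, k, levels), plain_depthwise_params(channels, rf))
```

Under these assumptions the WTConv weights grow linearly with the number of levels, whereas a plain depth-wise kernel covering the same area grows quadratically with its side length.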

2.4.4. Computational Cost

①The computational cost (FLOPs) of a depth-wise convolution:

$C\cdot K_W\cdot K_H\cdot N_W\cdot N_H\cdot\frac1{S_W}\cdot\frac1{S_H}$

where $C$ is the number of channels, $N_W$ and $N_H$ are the spatial dimensions of the input, $K_W$ and $K_H$ are the kernel sizes, and $S_W$ and $S_H$ are the strides

②Each WT level halves the spatial dimensions but quadruples the number of channels, so the FLOPs of the depth-wise convolutions become:

$C\cdot K_W\cdot K_H\cdot\left(N_W\cdot N_H+\sum_{i=1}^\ell4\cdot\frac{N_W}{2^i}\cdot\frac{N_H}{2^i}\right)$

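where $\ell$ is the number of WT levels. The WT itself, if implemented naively as standard convolutions (four $2\times 2$ kernels with stride 2 per channel), additionally costs: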

$4C\cdot\sum_{i=0}^{\ell-1}\frac{N_W}{2^i}\cdot\frac{N_H}{2^i}$
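To get a feel for the numbers, the three expressions above can be evaluated directly. In this sketch I read the last expression as the naive cost of the WT itself. The input size (512×512), the single channel, and the kernel sizes are example values I picked, and the function names are hypothetical.

```python
def depthwise_flops(c, k_w, k_h, n_w, n_h, s_w=1, s_h=1):
    """C * K_W * K_H * N_W * N_H / (S_W * S_H), assuming the strides divide evenly."""
    return c * k_w * k_h * n_w * n_h // (s_w * s_h)

def wtconv_conv_flops(c, k_w, k_h, n_w, n_h, levels):
    """Depth-wise convs only: full resolution plus four sub-bands at each WT level."""
    sub_bands = sum(4 * (n_w // 2 ** i) * (n_h // 2 ** i) for i in range(1, levels + 1))
    return c * k_w * k_h * (n_w * n_h + sub_bands)

def wt_flops(c, n_w, n_h, levels):
    """Naive WT cost: 4C * sum over i = 0..levels-1 of (N_W / 2^i) * (N_H / 2^i)."""
    return 4 * c * sum((n_w // 2 ** i) * (n_h // 2 ** i) for i in range(levels))

plain_7x7 = depthwise_flops(1, 7, 7, 512, 512)          # 12,845,056 per channel
wtconv_5x5 = wtconv_conv_flops(1, 5, 5, 512, 512, 3)    # 15,155,200 per channel
wt_cost = wt_flops(1, 512, 512, 3)                      # 1,376,256 per channel
```

With these example values, a 3-level WTConv with 5×5 kernels costs about 15.2M FLOPs per channel for its convolutions, against about 12.8M for a single 7×7 depth-wise convolution, with the naive WT adding roughly 1.4M on top.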

2.5. Results

2.5.1. ImageNet-1K Classification

①Base architecture: ConvNeXt

②Replacement: WTConv replaces the 7 × 7 depth-wise convolution

③Kernel size: 5 × 5

④Input size: 224 × 224

⑤Comparison table for the 120-epoch training schedule:

⑥Comparison table for the 300-epoch training schedule:

2.5.2. Semantic Segmentation

①Backbone: WTConvNeXt

②Comparison table on ADE20K:

2.5.3. Object Detection

①Comparison table on COCO:

2.5.4. WTConv Analysis

（1）Scalability

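①They create subsets of ImageNet, ImageNet-50/100/200, containing 50/100/200 classes respectively

②The hyperparameter settings are in Appendix B; anyone who wants to reproduce the results should consult the original paper, so I won't repeat them here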

③Comparison table:

（2）Robustness

①Robustness to corruptions in classification on ImageNet-C/$\bar{C}$, ImageNet-R, ImageNet-A, and ImageNet-Sketch:

②Robustness to corruptions in object detection measured in mean and relative performance under corruption (mPC and rPC):

（3）Shape-bias

①Shape bias is quantified with the model-vs-human benchmark:

（4）Effective Receptive Field

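①Visualization of the effective receptive field (ERF):

(I already showed this figure above, but I'm including it again here so everyone can see it more directly. The authors did not duplicate it! I just didn't want my blog to refer back to the earlier copy. The authors measured it with the method from the paper they cite as [11], which has no name. WTConv clearly has the larger receptive field.)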

（5）Ablation Study

①Ablation study on different configurations:

2.6. Limitations

①WTConv's running time is somewhat high; the authors suggest "performing WT in parallel to convolution in each level to reduce memory reads or performing WT and IWT in-place to reduce memory allocations"