Paper link: A ConvNet for the 2020s | IEEE Conference Publication | IEEE Xplore
Paper code: GitHub - facebookresearch/ConvNeXt: Code release for ConvNeXt model
The English below is typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable typos and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is written as personal notes, so read with that in mind.
1. Thoughts
(1) Why does this paper leave so much blank space above the title? Give it to me, I'd fill all of it with writing
(2) Starting a new paper at 4 a.m.! How nice it would be if there were no deadlines in this world; I could read papers without a care
(3) This thing is actually this recent!! I'm genuinely shocked; can CNNs really still be innovated on in 2022? Then I glanced at the author affiliations and they're basically all from Facebook. Sorry for intruding. Why am I always the one intruding? Probably because I'm just too lowly
(4) Honestly, this kind of paper reads much more pleasantly in English; it feels more literary and readable. I recommend reading the English original, it gives a different impression. I think this is partly because these authors are generally more confident (they really are that good, so they don't write timidly), which makes the writing in the paper bold rather than dry
(5) Wait, why does the later part of this read like a hyperparameter-tuning exercise?
2. Section-by-Section Close Reading of the Paper
2.1. Abstract
①They aim to explore the limits of what a pure ConvNet can achieve, by gradually "modernizing" a standard ResNet toward the design of a vision Transformer
2.2. Introduction
①The biggest challenge for ViT is the quadratic complexity of global attention with respect to the input size
②They aim to identify how design decisions in Transformers impact the performance of ConvNets
precipitate — v. to hasten (something bad); to plunge suddenly into (a state); to bring about abruptly; adj. hasty, rash; n. precipitate (a deposited solid)
odyssey — n. a long, eventful journey; an arduous trek
2.3. Modernizing a ConvNet: A Roadmap
①They use ResNet-50 / Swin-T, models with roughly 4.5×10⁹ FLOPs, to present the results
②Performance of ConvNeXt on ImageNet under the different designs:
③Comparison diagram on ImageNet:
2.3.1. Training Techniques
①Modernized training techniques alone (switching to the AdamW optimizer, training for 300 epochs, and stronger augmentation/regularization) already enhance the performance of the plain ResNet-50 (76.1% → 78.8%)
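A minimal sketch of this kind of modernized recipe, assuming a dummy dataset and the stock torchvision ResNet-50; the hyperparameter values follow the recipe described in the paper, but the heavier augmentations (Mixup, CutMix, RandAugment, Random Erasing) and stochastic depth are omitted here:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50

# Dummy data so the sketch is self-contained; replace with a real ImageNet loader.
train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 224, 224),
                                        torch.randint(0, 1000, (8,))), batch_size=4)

model = resnet50()
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)            # label smoothing
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3,      # AdamW instead of SGD
                              weight_decay=0.05)
EPOCHS = 300                                                     # the paper trains 300 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                                             # cosine learning-rate decay
```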
2.3.2. Macro Design
①They adjust the number of blocks per stage of ResNet-50 from (3, 4, 6, 3) to (3, 3, 9, 3), aligning the stage compute ratio and FLOPs with Swin-T
②They replace ResNet's stem (a 7×7 conv with stride 2, followed by max pooling) with a "patchify" stem: a 4×4 conv with stride 4
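A minimal sketch of these two macro changes (the stage-ratio tuple and the two stems); BN/ReLU in the ResNet stem are omitted for brevity:

```python
import torch
import torch.nn as nn

# Stage compute ratio: blocks per stage go from ResNet-50's (3, 4, 6, 3)
# to (3, 3, 9, 3), roughly matching Swin-T's 1:1:3:1 ratio.
depths = (3, 3, 9, 3)

# ResNet-style stem: 7x7 conv with stride 2, then 3x3 max pooling with stride 2.
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: non-overlapping 4x4 conv with stride 4, like a patch embedding.
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```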
2.3.3. ResNeXt-ify
①They adopt depthwise convolution (a grouped conv with groups equal to the number of channels) and widen the network from 64 to 96 channels (same as Swin-T), which improves accuracy and brings the network to 5.3 GFLOPs
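In code, the ResNeXt-ify step boils down to setting `groups` equal to the channel count; a minimal sketch with the widened 96-dim setting:

```python
import torch.nn as nn

dim = 96  # width increased from ResNet's 64 to Swin-T's 96

# Depthwise conv: a grouped conv with groups == channels, so each channel is
# filtered independently (ResNeXt's grouped conv taken to the extreme).
depthwise = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

# Cross-channel mixing is then left to the 1x1 (pointwise) convs in the block.
pointwise = nn.Conv2d(dim, dim, kernel_size=1)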
2.3.4. Inverted Bottleneck
①Bottleneck design:
where (a) is the ResNeXt block, (b) is their inverted bottleneck, and (c) is the inverted bottleneck with the depthwise conv layer moved up
②The inverted bottleneck design reduces the network to 4.6 GFLOPs
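A minimal sketch of (a) versus (b), assuming a 96-dim block with a 4× expansion (96 ↔ 384); normalization and activation layers are omitted for brevity:

```python
import torch.nn as nn

dim, hidden = 96, 384  # 4x expansion, as in a Transformer MLP block

# (a) ResNeXt-style bottleneck: wide -> narrow -> wide
bottleneck = nn.Sequential(
    nn.Conv2d(hidden, dim, 1),                                # 1x1, 384 -> 96
    nn.Conv2d(dim, dim, 3, padding=1, groups=dim),            # depthwise 3x3, 96 -> 96
    nn.Conv2d(dim, hidden, 1),                                # 1x1, 96 -> 384
)

# (b) Inverted bottleneck: narrow -> wide -> narrow (hidden part is 4x wider)
inverted = nn.Sequential(
    nn.Conv2d(dim, hidden, 1),                                # 1x1, 96 -> 384
    nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),   # depthwise 3x3, 384 -> 384
    nn.Conv2d(hidden, dim, 1),                                # 1x1, 384 -> 96
)
```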
2.3.5. Large Kernel Sizes
①Moving the depthwise conv layer up, i.e. from (b) to (c) in the figure above, reduces the network to 4.1 GFLOPs
②With the depthwise conv now in the narrow 96-dim part, its kernel size is increased from 3×3 to 7×7 (matching Swin's 7×7 windows) at little extra compute (a rough cost count is sketched below)
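A back-of-the-envelope count of the depthwise conv's cost per spatial position (roughly channels × k² multiply-adds) shows why moving it to the 96-dim side is cheaper and why the 7×7 kernel then stays affordable; the numbers are illustrative only and ignore the 1×1 layers:

```python
# Depthwise conv cost per output position: roughly channels * k * k multiply-adds.
def dw_macs(channels, k):
    return channels * k * k

print(dw_macs(384, 3))  # 3456 -- dw conv inside the wide 384-dim part, 3x3 kernel
print(dw_macs(96, 3))   # 864  -- moved up to the 96-dim side: ~4x cheaper
print(dw_macs(96, 7))   # 4704 -- 7x7 kernel at 96 dims: back in the same ballpark
```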
2.3.6. Micro Design
①Replace ReLU with GELU
②Use fewer activation functions: remove all GELU layers except the one between the two 1×1 layers
③Use fewer normalization layers (only one remains, before the 1×1 layers), and replace the remaining BatchNorm (BN) with Layer Normalization (LN)
④Use separate downsampling layers between stages: a 2×2 conv with stride 2 (with an LN layer added before it for training stability)
⑤The FLOPs, #params., throughput, and memory use of Swin Transformer and ConvNeXt are similar, but ConvNeXt does not need shifted window attention or relative position biases (the micro-design changes are put together in the block sketch below)
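Putting the micro-design changes together, here is a minimal sketch of a ConvNeXt-style block (7×7 depthwise conv → LN → 1×1 expand → GELU → 1×1 reduce, plus a residual connection); the official implementation additionally uses LayerScale and stochastic depth, which are omitted here:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Sketch of a ConvNeXt-style block: dw 7x7 -> LN -> 1x1 expand -> GELU -> 1x1 reduce."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)            # single LN instead of several BNs
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv written as a Linear on channels-last
        self.act = nn.GELU()                     # single GELU instead of several ReLUs
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to channels-last for LN / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + x

print(Block(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```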
2.4. Empirical Evaluations on ImageNet
①All the configurations (per-stage channel counts C and block numbers B; a build sketch follows the table):
| Model | Channels C per stage | Blocks B per stage |
| --- | --- | --- |
| ConvNeXt-T | (96, 192, 384, 768) | (3, 3, 9, 3) |
| ConvNeXt-S | (96, 192, 384, 768) | (3, 3, 27, 3) |
| ConvNeXt-B | (128, 256, 512, 1024) | (3, 3, 27, 3) |
| ConvNeXt-L | (192, 384, 768, 1536) | (3, 3, 27, 3) |
| ConvNeXt-XL | (256, 512, 1024, 2048) | (3, 3, 27, 3) |
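A rough sketch of how one of these configs assembles into a hierarchical model: patchify stem, then for each stage a 2×2 stride-2 downsampling conv followed by B blocks; the "blocks" here are plain depthwise convs standing in for full ConvNeXt blocks, and the LN before each downsampling layer is omitted:

```python
import torch
import torch.nn as nn

# Per-stage (blocks B, channels C) from the table above; ConvNeXt-T shown here.
depths, dims = (3, 3, 9, 3), (96, 192, 384, 768)

layers = [nn.Conv2d(3, dims[0], kernel_size=4, stride=4)]                       # patchify stem
for i, (d, c) in enumerate(zip(depths, dims)):
    if i > 0:
        layers.append(nn.Conv2d(dims[i - 1], c, kernel_size=2, stride=2))       # 2x2 stride-2 downsampling
    layers += [nn.Conv2d(c, c, kernel_size=7, padding=3, groups=c) for _ in range(d)]  # stand-ins for d blocks

model = nn.Sequential(*layers)
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 768, 7, 7])
```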
2.4.1. Results
①Performance table:
2.4.2. Isotropic ConvNeXt vs. ViT
①They remove the hierarchical downsampling structure and keep a constant feature resolution throughout (a ViT-style isotropic design); the isotropic ConvNeXt still performs comparably to ViT:
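A minimal sketch of what "isotropic" means structurally: one patch-embedding stem and then a stack of identical blocks at constant resolution, with no downsampling in between; the width/depth below are placeholders (not the paper's exact isotropic configs) and the blocks are again simple depthwise-conv stand-ins:

```python
import torch
import torch.nn as nn

dim, depth = 384, 18  # placeholder width/depth for an isotropic model

isotropic = nn.Sequential(
    nn.Conv2d(3, dim, kernel_size=16, stride=16),   # ViT-style 16x16 patch embedding
    # constant-resolution stack: the same block repeated, no downsampling stages
    *[nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim) for _ in range(depth)],
)
print(isotropic(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 384, 14, 14])
```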
2.5. Empirical Evaluation on Downstream Tasks
①Performance on COCO:
②Performance on ADE20K:
2.6. Related Work
①Other models are larger
2.7. Conclusions
~
3. Supplementary Knowledge
3.1. Inductive bias
(1) Reference for further reading: 【机器学习】浅谈 归纳偏置 (Inductive Bias) - CSDN blog
4. Reference
Liu, Z. et al. (2022). A ConvNet for the 2020s. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.