Liu Z, Mao H, Wu C Y, et al. A ConvNet for the 2020s[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11976-11986.
- Introduction
Vision Transformers (ViTs) face difficulties on general computer vision tasks; this paper constructs a family of pure ConvNets: ConvNeXt
idea: the effectiveness of hierarchical Transformers is still largely credited to the intrinsic superiority of Transformers rather than the inherent inductive biases of convolutions; this work re-examines that view
results: competes favorably with Transformers on ImageNet classification, COCO detection, and ADE20K segmentation
convolutional neural networks
from engineering features to designing architectures
LeNet (LeCun et al.): ConvNets introduced in the 1980s
AlexNet (2012), followed by VGG, ResNet, MobileNet, EfficientNet
- Methods: from ResNet to ConvNeXt
macro design
ResNeXt
inverted bottleneck
large kernel size
layer-wise micro designs
- Methods: training techniques
train a baseline ResNet with a vision Transformer training procedure (close to the recipes of DeiT and Swin Transformer)
AdamW optimizer
augmentation techniques
hyper-parameters
average of three random seeds
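A minimal PyTorch sketch of this modernized training setup, assuming AdamW with a cosine schedule and label smoothing; the backbone and the exact hyper-parameter values are placeholders for illustration, not the paper's full recipe (which also uses Mixup, CutMix, RandAugment, Random Erasing, and stochastic depth).

```python
import torch
from torch import nn, optim

# Placeholder backbone; any image classifier would slot in here.
model = nn.Sequential(
    nn.Conv2d(3, 64, 7, stride=2, padding=3),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 1000),
)

epochs = 300  # extended schedule (vs. the original 90 epochs for ResNets)
optimizer = optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)  # illustrative values
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

x = torch.randn(8, 3, 224, 224)            # dummy batch
y = torch.randint(0, 1000, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()
```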
- Methods: macro design
Swin Transformers
multi-stage design (each stage has a different feature map resolution)
stage compute ratio
"stem cell": replace the ResNet stem (7x7 conv with stride 2, followed by max pooling) with a "patchify" layer (4x4 conv with stride 4)
Swin-T: 1:1:3:1 Swin-L: 1:1:9:1
number of blocks per stage: (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3)
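A PyTorch sketch of the two macro-design changes, the per-stage block counts and the "patchify" stem; the stage tuples and channel widths here are only stand-ins to illustrate the idea.

```python
import torch
from torch import nn

resnet50_blocks = (3, 4, 6, 3)   # original ResNet-50 stage compute ratio
convnext_blocks = (3, 3, 9, 3)   # adjusted to roughly follow Swin-T's 1:1:3:1 ratio
print("blocks per stage:", resnet50_blocks, "->", convnext_blocks)

# ResNet stem: 7x7 conv with stride 2 followed by max pooling (4x reduction overall)
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: non-overlapping 4x4 conv with stride 4, like ViT/Swin patch embedding
patchify_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)

x = torch.randn(1, 3, 224, 224)
print(resnet_stem(x).shape)    # torch.Size([1, 64, 56, 56])
print(patchify_stem(x).shape)  # torch.Size([1, 96, 56, 56])
```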
- Methods: ResNeXt-ify
ResNeXt: better FLOPs/accuracy trade-off than a plain ResNet
core component: grouped convolution
grouped convolution for the 3x3 conv layer in a bottleneck block
depthwise convolution (grouped convolution with the number of groups equal to the number of channels): mixes information only in the spatial dimension, per channel
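A small PyTorch illustration of the ResNeXt-ify step: setting groups equal to the channel count turns a standard 3x3 conv into a depthwise conv, which keeps the output shape but sharply reduces parameters and FLOPs, mixing information only spatially.

```python
import torch
from torch import nn

channels = 96
dense_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # standard conv
depthwise_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                           groups=channels)  # depthwise: one filter per channel

x = torch.randn(1, channels, 56, 56)
print(dense_conv(x).shape, depthwise_conv(x).shape)  # same output shape
print(sum(p.numel() for p in dense_conv.parameters()),
      sum(p.numel() for p in depthwise_conv.parameters()))  # far fewer parameters
```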
- Methods: large kernel sizes
large convolutional kernels
non-local self-attention
stacking small 3x3 kernels has been the standard since VGGNet
Swin Transformers use a local window of at least 7x7
move up the position of the depthwise conv layer (from (b) to (c) in the paper's block figures), then enlarge the kernel to 7x7
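A hedged PyTorch sketch of block variant (c): the depthwise conv is moved to the top of the inverted bottleneck and enlarged to 7x7; normalization and activation placement are simplified here compared to the final ConvNeXt block.

```python
import torch
from torch import nn

dim, expansion = 96, 4
block_c = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # 7x7 depthwise conv, now first
    nn.Conv2d(dim, dim * expansion, kernel_size=1),              # inverted bottleneck: expand 4x
    nn.GELU(),
    nn.Conv2d(dim * expansion, dim, kernel_size=1),              # project back down
)

x = torch.randn(1, dim, 56, 56)
print((x + block_c(x)).shape)  # residual connection keeps the shape
```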
- Methods: micro design
changes at the layer level: activation functions and normalization layers
replace the Rectified Linear Unit (ReLU) with the Gaussian Error Linear Unit (GELU), as used in BERT, GPT-2, and ViTs
fewer activation functions: a single GELU per block
fewer normalization layers: a single normalization layer per block
replace BatchNorm with Layer Normalization (LN)
separate downsampling layers: ResNet downsamples inside the residual block (3x3 conv with stride 2, plus a 1x1 conv with stride 2 on the shortcut connection); ConvNeXt instead uses standalone 2x2 conv layers with stride 2 for spatial downsampling
add normalization layers wherever spatial resolution is changed, which helps stabilize training
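A PyTorch sketch combining these micro-design choices into a ConvNeXt-style block plus a separate downsampling layer; this is a simplified reading of the paper (for example, the channels-first LayerNorm before downsampling is approximated here with GroupNorm(1, ...)), not the official implementation.

```python
import torch
from torch import nn

class Block(nn.Module):
    """ConvNeXt-style block: 7x7 depthwise conv, one LayerNorm, one GELU."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                    # single normalization layer
        self.pwconv1 = nn.Linear(dim, dim * expansion)   # 1x1 convs written as linear layers
        self.act = nn.GELU()                             # single activation
        self.pwconv2 = nn.Linear(dim * expansion, dim)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)   # to channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)   # back to channels-first
        return shortcut + x

def downsample(in_dim, out_dim):
    # Separate downsampling layer: normalize where resolution changes,
    # then a 2x2 conv with stride 2 (GroupNorm(1, ...) stands in for channels-first LayerNorm).
    return nn.Sequential(nn.GroupNorm(1, in_dim),
                         nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2))

x = torch.randn(1, 96, 56, 56)
x = Block(96)(x)             # same resolution
x = downsample(96, 192)(x)   # halves resolution, doubles channels
print(x.shape)               # torch.Size([1, 192, 28, 28])
```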
- Results
- Conclusion
In the 2020s, vision Transformers, especially hierarchical ones such as the Swin Transformer, began to overtake ConvNets as the preferred choice for generic vision backbones. It is widely believed that vision Transformers are more accurate, more efficient, and more scalable than ConvNets. This paper proposes ConvNeXts, a family of ConvNet models that compete with state-of-the-art hierarchical Transformers across multiple computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets. In some ways these observations are surprising, since the ConvNeXt model itself is not entirely new: many of its design choices have been examined separately over the past decade, but not collectively. The authors hope the new results in this study will challenge several widely held views and prompt a rethinking of the importance of convolution in computer vision.