In recent years, the Vision Transformer and its many variants have been widely adopted across computer vision tasks. Transformers offer strong global feature extraction, but they also have well-known drawbacks: they lack the inductive biases of convolutions (which limits local feature extraction), and self-attention is computationally heavy, with cost that grows quadratically with input resolution. A steady stream of work has targeted these problems, producing a wide range of optimized architectures and papers.

In my view, if you want to master transformers as applied to CV, use them in your own work or projects, and perhaps even propose new architectures, you should first build a thorough understanding of their strengths and weaknesses, and then of how transformer-based networks have evolved: what changed at each stage, which improvements were introduced, and how the designs relate to one another. Reading the relevant papers and their code in volume is how you build a relatively complete body of knowledge. Below is a list I have compiled of transformer-related CV papers from roughly the last three to five years; a short code sketch after the list illustrates the quadratic-versus-linear attention cost that motivates many of them. When the chance arises, I will also share summaries of some of these papers.
1. EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
2. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
3. MetaFormer Is Actually What You Need for Vision
4. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
5. Multiscale Vision Transformers
6. PVT v2: Improved Baselines with Pyramid Vision Transformer
7. SOFT: Softmax-free Transformer with Linear Complexity
8. SimA: Simple Softmax-free Attention for Vision Transformers
9. FlowFormer: A Transformer Architecture for Optical Flow
10. Hydra Attention: Efficient Attention with Many Heads
11. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
12. Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization
13. EcoFormer: Energy-Saving Attention with Linear Complexity
14. Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
15. CMT: Convolutional Neural Networks Meet Vision Transformers
16. Rethinking Spatial Dimensions of Vision Transformers
17. Focal Self-attention for Local-Global Interactions in Vision Transformers
18. QuadTree Attention for Vision Transformers
19. Scalable Vision Transformers with Hierarchical Pooling
20. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
21. Co-Scale Conv-Attentional Image Transformers
22. FasterViT: Fast Vision Transformers with Hierarchical Attention
23. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
24. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
25. LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference
26. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
27. LocalViT: Bringing Locality to Vision Transformers
28. Twins: Revisiting the Design of Spatial Attention in Vision Transformers
29. RegionViT: Regional-to-Local Attention for Vision Transformers
30. KVT: k-NN Attention for Boosting Vision Transformers
31. Fast Vision Transformers with HiLo Attention
32. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
33. FLatten Transformer: Vision Transformer using Focused Linear Attention
34. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
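To make the complexity point from the introduction concrete, here is a minimal sketch contrasting standard softmax attention, whose N x N score matrix makes cost and memory grow quadratically with the token count N (and hence with the square of the resolution), against a kernelized linear attention in the spirit of "Transformers are RNNs" (item 11). This is an illustrative toy in PyTorch, not code from any of the papers above; the ELU-based feature map and all shapes are assumptions chosen for demonstration.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes a (B, N, N) score matrix,
    # so compute is O(N^2 * d) and memory is O(N^2).
    scale = q.shape[-1] ** -0.5
    scores = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)  # (B, N, N)
    return scores @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: replace softmax(q k^T) with a positive
    # feature map phi(.) applied to q and k, then reassociate the
    # matrix product so the N x N matrix is never formed; compute
    # drops to O(N * d^2). ELU(x) + 1 is one common choice of phi.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v                                      # (B, d, d)
    denom = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # (B, N, 1)
    return (q @ kv) / denom

B, N, d = 1, 3136, 64   # e.g. the tokens of a 56x56 feature map
q, k, v = (torch.randn(B, N, d) for _ in range(3))
print(softmax_attention(q, k, v).shape)  # torch.Size([1, 3136, 64])
print(linear_attention(q, k, v).shape)   # same shape, but no N x N buffer
```

The reassociation is the whole trick: (phi(q) phi(k)^T) v and phi(q) (phi(k)^T v) are mathematically equal, but the second grouping keeps every intermediate at size d x d or N x d. Quadrupling the token count (i.e., doubling the resolution) therefore raises the cost of the first function by roughly 16x but the second by only 4x, which is why several of the papers above (e.g., SOFT, SimA, EcoFormer, FLatten Transformer, Castling-ViT) pursue variations on this idea with different kernels and approximations.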