In recent years, the Vision Transformer and its many variants have been widely adopted across computer vision tasks. Transformers offer strong global feature extraction, but they also have well-known drawbacks: they lack the inductive biases of convolutions (which limits local feature extraction), and self-attention is computationally heavy, with cost that grows quadratically with input resolution. A steady stream of work has targeted these problems, producing a wide range of optimized architectures and papers.

In my view, if you want to master transformers as applied to CV, use them in your own work or projects, and perhaps even propose new architectures, you should first build a thorough understanding of their strengths and weaknesses, and then of how transformer-based networks have evolved: what changed at each stage, which improvements were introduced, and how the designs relate to one another. Reading the relevant papers and their code in volume is how you build a relatively complete body of knowledge. Below is a list I have compiled of transformer-related CV papers from roughly the last three to five years; a short code sketch after the list illustrates the quadratic-versus-linear attention cost that motivates many of them. When the chance arises, I will also share summaries of some of these papers.
1. EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction
2. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
3. MetaFormer Is Actually What You Need for Vision
4. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
5. Multiscale Vision Transformers
6. PVT v2: Improved Baselines with Pyramid Vision Transformer
7. SOFT: Softmax-free Transformer with Linear Complexity
8. SimA: Simple Softmax-free Attention for Vision Transformers
9. FlowFormer: A Transformer Architecture for Optical Flow
10. Hydra Attention: Efficient Attention with Many Heads
11. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
12. Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization
13. EcoFormer: Energy-Saving Attention with Linear Complexity
14. Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
15. CMT: Convolutional Neural Networks Meet Vision Transformers
16. Rethinking Spatial Dimensions of Vision Transformers
17. Focal Self-attention for Local-Global Interactions in Vision Transformers
18. QuadTree Attention for Vision Transformers
19. Scalable Vision Transformers with Hierarchical Pooling
20. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
21. Co-Scale Conv-Attentional Image Transformers
22. FasterViT: Fast Vision Transformers with Hierarchical Attention
23. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
24. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
25. LeViT: A Vision Transformer in ConvNet's Clothing for Faster Inference
26. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
27. LocalViT: Bringing Locality to Vision Transformers
28. Twins: Revisiting the Design of Spatial Attention in Vision Transformers
29. RegionViT: Regional-to-Local Attention for Vision Transformers
30. KVT: k-NN Attention for Boosting Vision Transformers
31. Fast Vision Transformers with HiLo Attention
32. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
33. FLatten Transformer: Vision Transformer using Focused Linear Attention
34. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer
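To make the complexity point from the introduction concrete, here is a minimal sketch contrasting standard softmax attention, whose N x N score matrix makes cost and memory grow quadratically with the token count N (and hence with the square of the resolution), against a kernelized linear attention in the spirit of "Transformers are RNNs" (item 11). This is an illustrative toy in PyTorch, not code from any of the papers above; the ELU-based feature map and all shapes are assumptions chosen for demonstration.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: materializes a (B, N, N) score matrix,
    # so compute is O(N^2 * d) and memory is O(N^2).
    scale = q.shape[-1] ** -0.5
    scores = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)  # (B, N, N)
    return scores @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: replace softmax(q k^T) with a positive
    # feature map phi(.) applied to q and k, then reassociate the
    # matrix product so the N x N matrix is never formed; compute
    # drops to O(N * d^2). ELU(x) + 1 is one common choice of phi.
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v                                      # (B, d, d)
    denom = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps   # (B, N, 1)
    return (q @ kv) / denom

B, N, d = 1, 3136, 64   # e.g. the tokens of a 56x56 feature map
q, k, v = (torch.randn(B, N, d) for _ in range(3))
print(softmax_attention(q, k, v).shape)  # torch.Size([1, 3136, 64])
print(linear_attention(q, k, v).shape)   # same shape, but no N x N buffer
```

The reassociation is the whole trick: (phi(q) phi(k)^T) v and phi(q) (phi(k)^T v) are mathematically equal, but the second grouping keeps every intermediate at size d x d or N x d. Quadrupling the token count (i.e., doubling the resolution) therefore raises the cost of the first function by roughly 16x but the second by only 4x, which is why several of the papers above (e.g., SOFT, SimA, EcoFormer, FLatten Transformer, Castling-ViT) pursue variations on this idea with different kernels and approximations.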