ICLR-2021-ViT: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE (Reading Notes)

Vision Transformer (ViT) is the pioneering work that applies a fully attentional model to vision tasks. It discards the image-specific inductive biases of traditional architectures, achieving strong performance simply by splitting an image into patches and processing them with a standard Transformer encoder. Although ViT may underperform convolutional networks without large-scale pre-training, a pre-trained ViT matches or exceeds the state of the art on many image classification datasets.


Paper:
https://arxiv.org/pdf/2010.11929.pdf
Code:
https://github.com/google-research/vision_transformer

Vision Transformer (ViT) framework:
(Figure: ViT architecture overview)
Model overview:
The input image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed into a standard Transformer encoder. The model design follows the original Transformer as closely as possible.
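The image-to-sequence conversion described above can be sketched in a few lines of numpy. This is a minimal illustration, not the official implementation: the projection matrix, class token, and position embeddings are drawn at random here, whereas in the real model they are learned parameters; the sizes (224x224 input, 16x16 patches, D = 768) follow the ViT-Base configuration.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an (N, patch*patch*C) array with N = (H//patch) * (W//patch).
    """
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)           # (nH, nW, patch, patch, C)
    return img.reshape(-1, patch * patch * C)    # (N, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))         # toy input image
patch, D = 16, 768

patches = patchify(img, patch)                   # (196, 768): 14*14 patches, 16*16*3 dims
E = rng.standard_normal((patch * patch * 3, D))  # stand-in for the learned linear projection
tokens = patches @ E                             # (196, D) patch embeddings

cls = rng.standard_normal((1, D))                # stand-in for the learnable [class] token
seq = np.concatenate([cls, tokens], axis=0)      # (197, D)
pos = rng.standard_normal(seq.shape)             # stand-in for learned 1-D position embeddings
z0 = seq + pos                                   # input sequence to the Transformer encoder
print(z0.shape)                                  # (197, 768)
```

Note that for 16x16 RGB patches the flattened patch dimension is 16*16*3 = 768, which happens to equal ViT-Base's hidden size D; in general the linear projection maps patch pixels to whatever D the model uses.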
Conclusions:
[Original abstract]
We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
[SwinTrack]
The Vision Transformer (ViT, the first fully attentional model in vision tasks) and many of its successors were inferior to convnets in terms of performance, until the appearance of the Swin-Transformer.
[MixFormer]
The Vision Transformer (ViT) first presented a pure vision transformer architecture, obtaining an impressive performance on image classification.
[Swin Transformer]
The pioneering work of ViT directly applies a Transformer architecture on nonoverlapping medium-sized image patches for image classification. It achieves an impressive speed-accuracy tradeoff on image classification compared to convolutional networks.
