4.1 SETUP
Datasets.
To explore model scalability, we use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images (we refer to it as ImageNet in what follows), its superset ImageNet-21k with 21k classes and 14M images (Deng et al., 2009), and JFT (Sun et al., 2017) with 18k classes and 303M high-resolution images. We de-duplicate the pre-training datasets w.r.t. the test sets of the downstream tasks following Kolesnikov et al. (2020). We transfer the models trained on these datasets to several benchmark tasks: ImageNet on the original validation labels and the cleaned-up ReaL labels (Beyer et al., 2020), CIFAR-10/100 (Krizhevsky, 2009), Oxford-IIIT Pets (Parkhi et al., 2012), and Oxford Flowers-102 (Nilsback & Zisserman, 2008). For these datasets, pre-processing follows Kolesnikov et al. (2020).

We also evaluate on the 19-task VTAB classification suite (Zhai et al., 2019b). VTAB evaluates low-data transfer to diverse tasks, using 1 000 training examples per task. The tasks are divided into three groups: Natural – tasks like the above, Pets, CIFAR, etc.; Specialized – medical and satellite imagery; and Structured – tasks that require geometric understanding like localization.
Model Variants.
We base ViT configurations on those used for BERT (Devlin et al., 2019), as summarized in Table 1. The "Base" and "Large" models are directly adopted from BERT and we add the larger "Huge" model. In what follows we use brief notation to indicate the model size and the input patch size: for instance, ViT-L/16 means the "Large" variant with 16×16 input patch size. Note that the Transformer's sequence length is inversely proportional to the square of the patch size, thus models with smaller patch size are computationally more expensive.
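To make the quadratic effect of patch size concrete, the short sketch below (illustrative only; the helper name is ours) computes the resulting Transformer sequence length for a few patch sizes at an assumed 224×224 input resolution.

```python
# Minimal sketch: how the Transformer sequence length depends on patch size.
# For a 224x224 input, a /16 model yields 14*14 = 196 patches, while a /8
# variant would yield 28*28 = 784, a 4x longer sequence, and self-attention
# cost grows quadratically with that length.

def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches, i.e. the sequence length
    excluding the [class] token."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

for p in (32, 16, 14, 8):
    print(f"patch {p:>2}x{p:<2} -> sequence length {num_patches(224, p)}")
```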
For the baseline CNNs, we use ResNet (He et al., 2016), but replace the Batch Normalization layers (Ioffe & Szegedy, 2015) with Group Normalization (Wu & He, 2018), and use standardized convolutions (Qiao et al., 2019). These modifications improve transfer (Kolesnikov et al., 2020), and we denote the modified model "ResNet (BiT)". For the hybrids, we feed the intermediate feature maps into ViT with patch size of one "pixel". To experiment with different sequence lengths, we either (i) take the output of stage 4 of a regular ResNet50 or (ii) remove stage 4, place the same number of layers in stage 3 (keeping the total number of layers), and take the output of this extended stage 3. Option (ii) results in a 4x longer sequence length, and a more expensive ViT model.
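The following is a minimal, hedged sketch of the hybrid tokenization described above: each spatial position of an intermediate ResNet feature map becomes one input token ("patch size" of one pixel), followed by the linear projection of Eq. 1. The feature-map shapes, hidden size, and function name are illustrative assumptions, not the actual implementation.

```python
import numpy as np

# Assumed shapes for a 224x224 input:
#   stage-4 output  : 7 x 7 grid    -> 49 tokens
#   extended stage-3: 14 x 14 grid  -> 196 tokens (the 4x longer sequence)

def feature_map_to_tokens(feature_map: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Flatten an (H, W, C) feature map into (H*W, D) tokens, one per position."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)   # "patch size" of one pixel
    return tokens @ projection               # linear projection to the Transformer width

rng = np.random.default_rng(0)
stage4 = rng.normal(size=(7, 7, 2048))       # assumed stage-4 output shape
proj = rng.normal(size=(2048, 1024))         # assumed hidden size D = 1024
print(feature_map_to_tokens(stage4, proj).shape)   # (49, 1024)
```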
Training & Fine-tuning.
We train all models, including ResNets, using Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, a batch size of 4096 and apply a high weight decay of 0.1, which we found to be useful for transfer of all models (Appendix D.1 shows that, in contrast to common practices, Adam works slightly better than SGD for ResNets in our setting). We use a linear learning rate warmup and decay, see Appendix B.1 for details. For fine-tuning we use SGD with momentum, batch size 512, for all models, see Appendix B.1.1. For ImageNet results in Table 2, we fine-tuned at higher resolution: 512 for ViT-L/16 and 518 for ViT-H/14, and also used Polyak & Juditsky (1992) averaging with a factor of 0.9999 (Ramachandran et al., 2019; Wang et al., 2020b).
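As a compact illustration, the optimizer settings stated above can be collected into a configuration like the sketch below, together with a linear warmup-then-decay schedule. The base learning rate and warmup length are placeholders standing in for the values in Appendix B.1, which is not reproduced here.

```python
# Sketch of the pre-training recipe; beta1/beta2, batch size 4096 and weight
# decay 0.1 come from the text, the starred values are placeholders.
PRETRAIN_CONFIG = {
    "optimizer": "adam",
    "beta1": 0.9,
    "beta2": 0.999,
    "batch_size": 4096,
    "weight_decay": 0.1,
    "base_lr": 1e-3,         # placeholder; see Appendix B.1
    "warmup_steps": 10_000,  # placeholder; see Appendix B.1
}

def linear_warmup_then_linear_decay(step: int, total_steps: int,
                                    base_lr: float, warmup_steps: int) -> float:
    """Linear learning-rate warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 1.0 - progress)
```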
Metrics.
We report results on downstream datasets either through few-shot or fine-tuning accuracy. Fine-tuning accuracies capture the performance of each model after fine-tuning it on the respective dataset. Few-shot accuracies are obtained by solving a regularized least-squares regression problem that maps the (frozen) representation of a subset of training images to {−1, 1}^K target vectors. This formulation allows us to recover the exact solution in closed form. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.
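The few-shot evaluation described above admits a simple closed-form solution. The sketch below is an illustration under our own assumptions (in particular the regularization strength and function names), not the exact evaluation code.

```python
import numpy as np

def fewshot_linear_eval(features, labels, num_classes, l2_reg=1e-3):
    """Closed-form ridge regression from frozen features to {-1, +1}^K targets.

    features: (N, D) frozen representations of the few-shot training images.
    labels:   (N,) integer class labels in [0, num_classes).
    Returns a (D, K) weight matrix; l2_reg is an assumed regularization strength.
    """
    n, d = features.shape
    targets = -np.ones((n, num_classes))
    targets[np.arange(n), labels] = 1.0          # {-1, +1}^K target encoding
    # Exact minimizer of ||X W - Y||^2 + l2_reg * ||W||^2
    gram = features.T @ features + l2_reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def fewshot_accuracy(weights, features, labels):
    return float((np.argmax(features @ weights, axis=1) == labels).mean())
```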
4.2 COMPARISON TO STATE OF THE ART
We first compare our largest models – ViT-H/14 and ViT-L/16 – to state-of-the-art CNNs from
the literature. The first comparison point is Big Transfer (BiT) (Kolesnikov et al., 2020), which
performs supervised transfer learning with large ResNets. The second is Noisy Student (Xie et al., 2020), which is a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M with the labels removed. Currently, Noisy Student is the state of the art on ImageNet and BiT-L on the other datasets reported here. All models were trained on TPUv3 hardware, and we report the number of TPUv3-core-days taken to pre-train each of them, that is, the number of TPU v3 cores (2 per chip) used for training multiplied by the training time in days.
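As a rough illustration of this metric: a run on a standard cloud TPUv3 with 8 cores for approximately 30 days, such as the ImageNet-21k pre-training of ViT-L/16 mentioned below, corresponds to roughly 8 × 30 ≈ 240 TPUv3-core-days.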
Table 2 shows the results. The smaller ViT-L/16 model pre-trained on JFT-300M outperforms BiT-L (which is pre-trained on the same dataset) on all tasks, while requiring substantially less computational resources to train. The larger model, ViT-H/14, further improves the performance, especially on the more challenging datasets – ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than prior state of the art. However, we note that pre-training efficiency may be affected not only by the architecture choice, but also other parameters, such as training schedule, optimizer, weight decay, etc. We provide a controlled study of performance vs. compute for different architectures in Section 4.4. Finally, the ViT-L/16 model pre-trained on the public ImageNet-21k dataset performs well on most datasets too, while taking fewer resources to pre-train: it could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days.

Figure 2 decomposes the VTAB tasks and compares ViT to BiT, VIVI (a ResNet co-trained on ImageNet and YouTube) and S4L (trained on ImageNet with mixed supervised and semi-supervised learning). ViT-H/14 outperforms BiT-R152x4 and the other methods on the Natural and Structured tasks, and performs comparably to BiT-R152x4 on the Specialized group.

4.3 PRE-TRAINING DATA REQUIREMENTS
The Vision Transformer performs well when pre-trained on a large JFT-300M dataset. With fewer
inductive biases for vision than ResNets, how crucial is the dataset size? We perform two series of experiments.
First, we pre-train ViT models on datasets of increasing size: ImageNet, ImageNet-21k, and JFT-
300M. To boost the performance on the smaller datasets, we optimize three basic regularization
parameters – weight decay, dropout, and label smoothing. Figure 3 shows the results after fine-tuning to ImageNet (results on other datasets are shown in Table 5). When pre-trained on the smallest dataset, ImageNet, ViT-Large models underperform compared to ViT-Base models, despite (moderate) regularization. With ImageNet-21k pre-training, their performances are similar. Only with JFT-300M do we see the full benefit of larger models. Figure 3 also shows the performance region spanned by BiT models of different sizes. The BiT CNNs outperform ViT on ImageNet, but with the larger datasets, ViT overtakes.


Second, we train our models on random subsets of 9M, 30M, and 90M as well as the full JFT-
300M dataset. We do not perform additional regularization on the smaller subsets and use the same hyper-parameters for all settings. This way, we assess the intrinsic model properties, and not the effect of regularization. We do, however, use early-stopping, and report the best validation accuracy achieved during training. To save compute, we report few-shot linear accuracy instead of full fine-tuning accuracy. Figure 4 contains the results. Vision Transformers overfit more than ResNets with comparable computational cost on smaller datasets. For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets. The same is true for ResNet152x2 and ViT-L/16. This result reinforces the intuition that the convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from data is sufficient, even beneficial.
Overall, the few-shot results on ImageNet (Figure 4), as well as the low-data results on VTAB
(Table 2) seem promising for very low-data transfer. Further analysis of few-shot properties of ViT
is an exciting direction of future work.
4.4 SCALING STUDY
We perform a controlled scaling study of different models by evaluating transfer performance from JFT-300M. In this setting data size does not bottleneck the models' performances, and we assess performance versus pre-training cost of each model. The model set includes: 7 ResNets, R50x1, R50x2, R101x1, R152x1, R152x2, pre-trained for 7 epochs, plus R152x2 and R200x3 pre-trained for 14 epochs; 6 Vision Transformers, ViT-B/32, B/16, L/32, L/16, pre-trained for 7 epochs, plus L/16 and H/14 pre-trained for 14 epochs; and 5 hybrids, R50+ViT-B/32, B/16, L/32, L/16 pre-trained for 7 epochs, plus R50+ViT-L/16 pre-trained for 14 epochs (for hybrids, the number at the end of the model name stands not for the patch size, but for the total downsampling ratio in the ResNet backbone).
Figure 5 contains the transfer performance versus total pre-training compute (see Appendix D.5 for details on computational costs). Detailed results per model are provided in Table 6 in the Appendix. A few patterns can be observed. First, Vision Transformers dominate ResNets on the performance/compute trade-off. ViT uses approximately 2−4× less compute to attain the same performance (average over 5 datasets). Second, hybrids slightly outperform ViT at small computational budgets, but the difference vanishes for larger models. This result is somewhat surprising, since one might expect convolutional local feature processing to assist ViT at any size. Third, Vision Transformers appear not to saturate within the range tried, motivating future scaling efforts.

4.5 INSPECTING VISION TRANSFORMER
To begin to understand how the Vision Transformer processes image data, we analyze its internal representations. The first layer of the Vision Transformer linearly projects the flattened patches into a lower-dimensional space (Eq. 1). Figure 7 (left) shows the top principal components of the learned embedding filters. The components resemble plausible basis functions for a low-dimensional representation of the fine structure within each patch.
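As an illustration of how such a visualization can be produced, the sketch below computes principal components of a patch-embedding matrix. The assumed shape (P·P·3, D) follows Eq. 1, and the default of 28 components is only a plausible choice for display, not a value taken from the figure.

```python
import numpy as np

def embedding_filter_components(embedding, patch_size=16, num_components=28):
    """Top principal components of the patch-embedding filters (cf. Figure 7, left).

    embedding: assumed shape (patch_size*patch_size*3, D), the linear projection
    of flattened RGB patches from Eq. 1. Returns components reshaped to
    (num_components, patch_size, patch_size, 3) for visualization.
    """
    filters = embedding.T                                   # (D, P*P*3): one row per filter
    filters = filters - filters.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in flattened-patch space.
    _, _, vt = np.linalg.svd(filters, full_matrices=False)
    return vt[:num_components].reshape(num_components, patch_size, patch_size, 3)
```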
After the projection, a learned position embedding is added to the patch representations. Figure 7 (center) shows that the model learns to encode distance within the image in the similarity of position embeddings, i.e. closer patches tend to have more similar position embeddings. Further, the row-column structure appears; patches in the same row/column have similar embeddings. Finally, a sinusoidal structure is sometimes apparent for larger grids (Appendix D). That the position embeddings learn to represent 2D image topology explains why hand-crafted 2D-aware embedding variants do not yield improvements (Appendix D.4).
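A sketch of the similarity analysis behind Figure 7 (center), under the assumption that the position embeddings (excluding the [class] token) form an (N, D) array over an n×n patch grid:

```python
import numpy as np

def position_embedding_similarity(pos_embed, grid_size):
    """Cosine similarity of each position embedding with every other one.

    pos_embed: assumed shape (grid_size**2, D), [class] token excluded.
    Returns an array of shape (n, n, n, n): sim[i, j] is the (n, n) similarity
    map between the patch at grid position (i, j) and all other patches.
    """
    normed = pos_embed / np.linalg.norm(pos_embed, axis=-1, keepdims=True)
    sim = normed @ normed.T                     # (N, N) cosine similarities
    n = grid_size
    return sim.reshape(n, n, n, n)
```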

Self-attention allows ViT to integrate information across the entire image even in the lowest layers. We investigate to what degree the network makes use of this capability. Specifically, we compute the average distance in image space across which information is integrated, based on the attention weights (Figure 7, right). This "attention distance" is analogous to receptive field size in CNNs. We find that some heads attend to most of the image already in the lowest layers, showing that the ability to integrate information globally is indeed used by the model. Other attention heads have consistently small attention distances in the low layers. This highly localized attention is less pronounced in hybrid models that apply a ResNet before the Transformer (Figure 7, right), suggesting that it may serve a similar function as early convolutional layers in CNNs. Further, the attention distance increases with network depth. Globally, we find that the model attends to image regions that are semantically relevant for classification (Figure 6).
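The sketch below shows one plausible way to compute such an attention distance from a head's attention weights over patch tokens. Ignoring the [class] token and the exact averaging scheme are our simplifying assumptions, not the precise procedure used for the figure.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """Attention-weighted mean distance in image space (cf. Figure 7, right).

    attn: assumed shape (num_heads, N, N) of attention weights over the
    N = grid_size**2 patch tokens. Returns one average distance (in pixels)
    per head.
    """
    coords = np.stack(np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                                  indexing="ij"), axis=-1).reshape(-1, 2)
    centers = (coords + 0.5) * patch_size                  # patch centers in pixels
    # Pairwise Euclidean distances between patch centers, shape (N, N).
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    # Weight each query-key distance by its attention weight, sum over keys,
    # then average over queries to obtain one number per head.
    return (attn * dists[None]).sum(axis=-1).mean(axis=-1)
```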

4.6 SELF-SUPERVISION
Transformers show impressive performance on NLP tasks. However, much of their success stems not only from their excellent scalability but also from large scale self-supervised pre-training (Devlin et al., 2019; Radford et al., 2018). We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training. Appendix B.1.2 contains further details. We leave exploration of contrastive pre-training (Chen et al., 2020b; He et al., 2020; Bachman et al., 2019; Henaff et al., 2020) to future work.
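A hedged sketch of what such a masked patch prediction objective can look like is given below. The 50% masking ratio and the mean-color regression target are illustrative assumptions standing in for the details of Appendix B.1.2, which is not reproduced here.

```python
import numpy as np

def mask_patches_for_prediction(patch_tokens, patches_rgb, mask_token,
                                mask_ratio=0.5, rng=None):
    """Illustrative masked patch prediction setup (cf. Appendix B.1.2).

    patch_tokens: (N, D) embedded patches; patches_rgb: (N, P*P*3) raw pixels.
    A random subset of tokens is replaced by a learnable mask embedding, and
    the model is trained to predict a simple per-patch target at those
    positions; mean RGB color is used here purely as an example target.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = patch_tokens.shape[0]
    masked = rng.random(n) < mask_ratio
    corrupted = patch_tokens.copy()
    corrupted[masked] = mask_token                   # replace with mask embedding
    targets = patches_rgb.reshape(n, -1, 3).mean(axis=1)   # (N, 3) mean colors
    return corrupted, targets, masked
```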
5 CONCLUSION
We have explored the direct application of Transformers to image recognition. Unlike prior works using self-attention in computer vision, we do not introduce image-specific inductive biases into the architecture apart from the initial patch extraction step. Instead, we interpret an image as a sequence of patches and process it by a standard Transformer encoder as used in NLP. This simple, yet scalable, strategy works surprisingly well when coupled with pre-training on large datasets. Thus, Vision Transformer matches or exceeds the state of the art on many image classification datasets, whilst being relatively cheap to pre-train.
While these initial results are encouraging, many challenges remain. One is to apply ViT to other
computer vision tasks, such as detection and segmentation. Our results, coupled with those in Carion et al. (2020), indicate the promise of this approach. Another challenge is to continue exploring self-supervised pre-training methods. Our initial experiments show improvement from self-supervised pre-training, but there is still a large gap between self-supervised and large-scale supervised pre-training. Finally, further scaling of ViT would likely lead to improved performance.