[2101] [ICCV 2021] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

T2T-ViT is an improved model over ViT that uses a Tokens-to-Token (T2T) module to address ViT's weakness in modeling local information. The T2T module applies a soft split and re-structurization process to progressively encode the local structure of the image while reducing the token length. Experiments show that T2T-ViT, when trained from scratch on ImageNet, outperforms both ViT and ResNet, and still works well with Performer layers under limited GPU memory. In addition, T2T-ViT's deep-narrow architecture design proves more effective than the conventional shallow-wide one, particularly in terms of feature richness and computational efficiency.


paper
code

Introduction

main limitations of ViT

  1. the straightforward tokenization of input images by hard split makes ViT unable to model local information, so it requires more training samples than CNNs to achieve similar performance
  2. the attention backbone of ViT is not as well-designed as CNNs for vision tasks; it contains redundancy, which leads to limited feature richness and more difficult training


Feature visualization of ResNet50, ViT-L/16 and T2T-ViT-24 trained on ImageNet. Green boxes highlight learned low-level structure features such as edges and lines; red boxes highlight invalid feature maps with zero or too large values. Note the feature maps visualized here for ViT and T2T-ViT are not attention maps, but image features reshaped from tokens. For better visualization, we scale the input image to size 1024x1024 or 2048x2048.

ResNet captures the desired local structure (edges, lines, textures, etc.) progressively from the bottom layer (conv1) to the middle layer (conv25)
in ViT, structure information is poorly modeled, while global relations (e.g., the whole dog) are captured by all attention blocks
note that ViT ignores local structure when directly splitting images into tokens of fixed length

many channels in ViT have zero values
note that the backbone of ViT is not as efficient as ResNets and offers limited feature richness when training samples are insufficient

main contributions of T2T-ViT

  1. propose a progressive tokenization module that aggregates neighboring Tokens-to-Token, which models the local structure information of surrounding tokens and reduces the token length iteratively
  2. borrow architecture designs from CNNs to build the transformer layers and improve feature richness, and find that a deep-narrow architecture design brings much better performance to ViT


Comparison between T2T-ViT with ViT, ResNets and MobileNets when trained from scratch on ImageNet. Left: performance curve of MACs vs. top-1 accuracy. Right: performance curve of model size vs. top-1 accuracy.

Method

model architecture
  1. layer-wise T2T module: model local information of images, reduce length of tokens progressively
  2. efficient T2T-ViT backbone: draw global attention relation on tokens from T2T module


The overall network architecture of T2T-ViT. In the T2T module, the input image is first soft split into patches, and then unfolded as a sequence of tokens $T_0$. The length of tokens is reduced progressively in the T2T module (we use 2 iterations here and output $T_f$). Then the T2T-ViT backbone takes the fixed-length tokens as input and outputs the predictions.

token-to-token (T2T)

aim to overcome limitation of simple tokenization in ViT
progressively structurize an image into tokens and model local structure information, so the token length is reduced iteratively


Illustration of the T2T process. The tokens $T_i$ are re-structurized into an image $I_i$ after transformation and reshaping; then $I_i$ is split with overlapping into tokens $T_{i+1}$ again. Specifically, as shown in the pink panel, the four tokens (1, 2, 4, 5) of the input $I_i$ are concatenated to form one token in $T_{i+1}$. The T2T transformer can be a normal Transformer layer or another efficient transformer such as a Performer layer under limited GPU memory.

re-structurization

given a sequence of tokens $T_i$ from the preceding transformer layer, transform it to $T_i'$ with a self-attention block:
$$T_i' = \mathrm{MLP}(\mathrm{MSA}(T_i))$$
where $T_i \in \mathbb{R}^{L\times C}$ and $T_i' \in \mathbb{R}^{L\times C}$
the tokens $T_i'$ are then reshaped as an image in the spatial dimension:
$$I_i = \mathrm{Reshape}(T_i')$$
where $\mathrm{Reshape}(\cdot)$ re-organizes $T_i' \in \mathbb{R}^{L\times C}$ into $I_i \in \mathbb{R}^{H\times W\times C}$, with $L = H\times W$
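
As a concrete illustration, here is a minimal PyTorch sketch of the re-structurization step (not the official implementation; the class name, head count, and MLP ratio are illustrative): a standard transformer layer computes $T_i' = \mathrm{MLP}(\mathrm{MSA}(T_i))$, and the tokens are then reshaped back into a spatial map. Note that the sketch keeps PyTorch's channels-first layout, whereas the notes write $I_i \in \mathbb{R}^{H\times W\times C}$.

```python
import torch
import torch.nn as nn

class ReStructurization(nn.Module):
    """Transformer layer followed by reshaping tokens back to an image."""
    def __init__(self, dim, num_heads=1, mlp_ratio=1.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, tokens, h, w):
        # tokens: (B, L, C) with L = h * w
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x)[0]             # MSA with residual
        tokens = tokens + self.mlp(self.norm2(tokens))      # T_i' = MLP(MSA(T_i))
        B, L, C = tokens.shape
        return tokens.transpose(1, 2).reshape(B, C, h, w)   # I_i (channels-first)
```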

soft split

model local structure information and reduce the length of tokens
similar to a convolution operation, but without convolution filters

to avoid information loss when generating tokens from the re-structurized image, the image is split into overlapping patches

  1. each patch is correlated with its surrounding patches, establishing a prior that there should be stronger correlations between surrounding tokens.
  2. the tokens in each split patch are concatenated into one token (red and blue boxes), so regional information is aggregated from surrounding pixels and patches.

$$T_{i+1} = \mathrm{SS}(I_i)$$
where $\mathrm{SS}(\cdot)$ is the soft split operation, implemented with nn.Unfold
in nn.Unfold, given a tensor $X\in\mathbb{R}^{B\times C\times H\times W}$, a $k\times k$ kernel slides over $X$ to extract each patch $X_1\in\mathbb{R}^{C\times k\times k}$, which is then flattened into $X_1'\in\mathbb{R}^{Ck^2}$
the output tensor is $Y\in\mathbb{R}^{B\times Ck^2\times H_0 W_0}$, with $H_0=\lfloor \frac{H-k+2p}{s}+1\rfloor$ and $W_0=\lfloor \frac{W-k+2p}{s}+1\rfloor$ (kernel size $k$, padding $p$, stride $s$)
similarly, given $I_i\in\mathbb{R}^{H\times W\times C}$, the output tokens are $T_{i+1}\in\mathbb{R}^{L_0\times Ck^2}$, with $L_0=\lfloor \frac{H-k+2p}{s}+1\rfloor \times \lfloor \frac{W-k+2p}{s}+1\rfloor$
after soft split, the output tokens are fed into the next T2T process
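
A minimal sketch of the soft split with nn.Unfold (the kernel size, stride, and padding here are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn as nn

def soft_split(img, kernel_size=3, stride=2, padding=1):
    """Soft split SS(.): overlapping k x k patches flattened into tokens."""
    # img: (B, C, H, W) -> tokens: (B, L0, C * k^2)
    unfold = nn.Unfold(kernel_size=kernel_size, stride=stride, padding=padding)
    patches = unfold(img)              # (B, C * k^2, L0)
    return patches.transpose(1, 2)     # (B, L0, C * k^2)

x = torch.randn(1, 64, 56, 56)
tokens = soft_split(x)
print(tokens.shape)                    # torch.Size([1, 784, 576]), L0 = 28 * 28
```

Each output token concatenates a $k\times k$ neighborhood, so the token length shrinks while local context is aggregated into the channel dimension.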

T2T module

based on transformer block, with 2 extra components

  1. reshape the tokens into an $H\times W\times C$ image so that more local information can be learned later
  2. unfold the image into $L_0\times Ck^2$ tokens and capture details for more efficient modeling in the following transformer

for the input image $I_0$, only a soft split is applied at first to split it into tokens: $T_1=\mathrm{SS}(I_0)$
after the last T2T module, the output tokens $T_f$ have a fixed length, so the T2T-ViT backbone can model global relations on $T_f$
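
Putting the two components together, one T2T iteration can be sketched as below, assuming the imports and the ReStructurization and soft_split helpers from the earlier sketches (a rough illustration, not the official code):

```python
class T2TStage(nn.Module):
    """One T2T iteration: re-structurize tokens, then soft split with overlap."""
    def __init__(self, dim, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.restructurize = ReStructurization(dim)
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding

    def forward(self, tokens, h, w):
        img = self.restructurize(tokens, h, w)            # (B, C, h, w)
        tokens = soft_split(img, self.kernel_size,
                            self.stride, self.padding)    # (B, L0, C * k^2)
        h = (h - self.kernel_size + 2 * self.padding) // self.stride + 1
        w = (w - self.kernel_size + 2 * self.padding) // self.stride + 1
        return tokens, h, w                               # new length L0 = h * w
```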

T2T-ViT backbone

since many channels in vanilla ViT are invalid, aim to find an efficient backbone that reduces redundancy and improves feature richness

5 designs from CNNs to ViT

  1. dense connections, as in DenseNet
  2. deep-narrow vs. shallow-wide structure, as in Wide-ResNet
  3. channel attention, as in Squeeze-and-Excitation (SE) networks
  4. more split heads in the multi-head attention layer, as in ResNeXt
  5. Ghost operation, as in GhostNet

key findings

  1. a deep-narrow structure: decreasing the embedding dimension improves feature richness and reduces computation
  2. channel attention improves performance but is less effective than the deep-narrow architecture

design a deep-narrow structure with a small channel number (hidden dimension $d$) but more layers $b$
for the tokens $T_f$ from the last T2T module, concatenate a class token and add a sinusoidal position embedding:
$$T_{f_0}=[t_{cls}; T_f]+\mathrm{PE}, \quad \mathrm{PE}\in\mathbb{R}^{(L+1)\times d}$$
$$T_{f_i}=\mathrm{MLP}(\mathrm{MSA}(T_{f_{i-1}})), \quad i=1,2,\dots,b$$
$$Y=\mathrm{FC}(\mathrm{LN}(T_{f_b}))$$
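
A minimal PyTorch sketch of this backbone, following the three equations above (the class names, depth, and dimensions are illustrative, and nn.TransformerEncoder stands in for the model's own transformer blocks):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(length, dim):
    """Fixed sinusoidal position embedding of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class T2TViTBackbone(nn.Module):
    def __init__(self, num_tokens, dim=384, depth=14, num_heads=6, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.register_buffer("pos_embed", sinusoidal_pe(num_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 3,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)   # b = depth layers
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                # tokens: (B, L, d) from the T2T module
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed  # T_f0 = [t_cls; T_f] + PE
        x = self.blocks(x)                                     # T_fi = MLP(MSA(T_f{i-1}))
        return self.head(self.norm(x[:, 0]))                   # Y = FC(LN(.)) on class token
```

Classification here reads the class token, matching the $Y=\mathrm{FC}(\mathrm{LN}(\cdot))$ head in the notes.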

architecture variants


Structure details of T2T-ViT. T2T-ViT-14/19/24 have model sizes comparable to ResNet50/101/152. T2T-ViT-7/12 have model sizes comparable to MobileNetV1/V2. For the T2T transformer layer, we adopt a Transformer layer for T2T-ViTt-14 and a Performer layer for T2T-ViT-14 under limited GPU memory. For ViT, ‘S’ means Small, ‘B’ is Base and ‘L’ is Large. ‘ViT-S/16’ is a variant of the original ViT-B/16 with smaller MLP size and layer depth.

Experiment

T2T-ViT on ImageNet

dataset: ImageNet
data augmentation: mixup and cutmix, for both CNNs and ViTs
optimizer: AdamW, batch size 512 or 1024, 310 epochs, cosine lr decay
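
A minimal sketch of this optimizer setup; only the batch size, epoch count, and cosine schedule come from the notes, while the learning rate, weight decay, and the placeholder model are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 1000)  # stand-in for a T2T-ViT model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=310)   # cosine lr decay

for epoch in range(310):
    # ... one epoch over ImageNet with batch size 512 or 1024 ...
    scheduler.step()
```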


Comparison between T2T-ViT and ViT by training from scratch on ImageNet.


Comparison between our T2T-ViT and ResNet on ImageNet. T2T-ViTt-14: using a Transformer layer in the T2T module. T2T-ViT-14: using a Performer layer in the T2T module. “*” means we train the model with our training scheme for fair comparison.


Comparison between our lite T2T-ViT and MobileNets. Models with “-Distilled” are trained with a teacher model using the distillation method of DeiT.

fine-tuning in transfer learning
dataset CIFAR10, CIFAR100
optimizer SGD: 60 epochs, cosine lr decay


The results of fine-tuning the pretrained T2T-ViT to downstream datasets: CIFAR10 and CIFAR100.

T2T-ViT achieves higher performance than ViT with smaller model sizes on downstream datasets

from CNN to ViT


Transfer of some common designs from CNNs to ViT & T2T-ViT, including DenseNet, Wide-ResNet, the SE module, ResNeXt, and the Ghost operation. The same color indicates the corresponding transfer. All models are trained from scratch on ImageNet. “*” means we reproduce the model with our training scheme for fair comparison.

key findings

  1. dense connections hurt the performance of both ViT and T2T-ViT
  2. the deep-narrow structure benefits ViT
  3. the SE block improves both ViT and T2T-ViT
  4. the ResNeXt structure has little effect on ViT and T2T-ViT
    little performance improvement but larger GPU memory occupation, which is unnecessary
  5. the Ghost operation further compresses the model and reduces the MACs of T2T-ViT

ablation study


Ablation study results on T2T module, Deep-Narrow(DN) structure.

T2T module
T2T-ViT-14-woT2T: the same T2T-ViT backbone but without the T2T module
T2T-ViTc-14: the T2T module is replaced by 3 conv layers with kernel sizes (7, 3, 3) and strides (4, 2, 2)

  1. the T2T module improves performance by 2.0%-2.2% on ImageNet with similar model size and MACs
  2. it is better than conv layers because it models both global relations and the structure information of images

deep-narrow structure
T2T-ViT-d768-4: a shallow-wide structure with a hidden dimension of 768 and 4 layers, with model size and MACs similar to T2T-ViT-14
after replacing the deep-narrow structure with the shallow-wide one, accuracy decreases by 2.7% on ImageNet
the deep-narrow structure is crucial for T2T-ViT
