[2101] [ICCV 2021] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

T2T-ViT is an improved model over ViT that uses a Tokens-to-Token (T2T) module to address ViT's weakness in modeling local information. The T2T module applies a soft split and re-structurization process to progressively encode the local structure of the image while reducing the token length. Experiments show that T2T-ViT, when trained from scratch on ImageNet, outperforms both ViT and ResNet, and still works well with Performer layers under limited GPU memory. In addition, T2T-ViT's deep-narrow architecture design proves more effective than the conventional shallow-wide one, particularly in terms of feature richness and computational efficiency.


paper
code

Introduction

main limitations of ViT

  1. the straightforward tokenization of input images by hard split makes ViT unable to model local information, so it requires more training samples than CNNs to achieve similar performance
  2. the attention backbone of ViT is not as well-designed as CNNs for vision tasks; it contains redundancy, which leads to limited feature richness and more difficult training


Feature visualization of ResNet50, ViT-L/16 and T2T-ViT-24 trained on ImageNet. Green boxes highlight learned low-level structure features such as edges and lines; red boxes highlight invalid feature maps with zero or too large values. Note the feature maps visualized here for ViT and T2T-ViT are not attention maps, but image features reshaped from tokens. For better visualization, we scale the input image to size 1024x1024 or 2048x2048.

ResNet captures the desired local structure (edges, lines, textures, etc.) progressively from the bottom layer (conv1) to the middle layer (conv25)
in ViT, structure information is poorly modeled, while global relations (e.g., the whole dog) are captured by all attention blocks
note that ViT ignores local structure when directly splitting images into tokens of fixed length

many channels in ViT have zero values
note that the backbone of ViT is not as efficient as ResNets and offers limited feature richness when training samples are insufficient

main contributions of T2T-ViT

  1. propose a progressive tokenization module that aggregates neighboring Tokens-to-Token, which models the local structure information of surrounding tokens and reduces the token length iteratively
  2. borrow architecture designs from CNNs to build the transformer layers and improve feature richness, and find that a deep-narrow architecture design brings much better performance to ViT


Comparison between T2T-ViT with ViT, ResNets and MobileNets when trained from scratch on ImageNet. Left: performance curve of MACs vs. top-1 accuracy. Right: performance curve of model size vs. top-1 accuracy.

Method

model architecture
  1. layer-wise T2T module: model local information of images, reduce length of tokens progressively
  2. efficient T2T-ViT backbone: draw global attention relation on tokens from T2T module


The overall network architecture of T2T-ViT. In the T2T module, the input image is first soft split into patches, and then unfolded as a sequence of tokens $T_0$. The length of tokens is reduced progressively in the T2T module (we use 2 iterations here and output $T_f$). Then the T2T-ViT backbone takes the fixed-length tokens as input and outputs the predictions.

token-to-token (T2T)

aim to overcome limitation of simple tokenization in ViT
progressively structurize an image into tokens and model local structure information, so the token length is reduced iteratively


Illustration of the T2T process. The tokens $T_i$ are re-structurized into an image $I_i$ after transformation and reshaping; then $I_i$ is split with overlapping into tokens $T_{i+1}$ again. Specifically, as shown in the pink panel, the four tokens (1, 2, 4, 5) of the input $I_i$ are concatenated to form one token in $T_{i+1}$. The T2T transformer can be a normal Transformer layer or another efficient transformer such as a Performer layer under limited GPU memory.

re-structurization

given a sequence of tokens $T_i$ from the preceding transformer layer, transform it to $T_i'$ with a self-attention block:
$$T_i' = \mathrm{MLP}(\mathrm{MSA}(T_i))$$
where $T_i \in \mathbb{R}^{L\times C}$ and $T_i' \in \mathbb{R}^{L\times C}$
the tokens $T_i'$ are then reshaped as an image in the spatial dimension:
$$I_i = \mathrm{Reshape}(T_i')$$
where $\mathrm{Reshape}(\cdot)$ re-organizes $T_i' \in \mathbb{R}^{L\times C}$ into $I_i \in \mathbb{R}^{H\times W\times C}$, with $L = H\times W$
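
As a concrete illustration, here is a minimal PyTorch sketch of the re-structurization step (not the official implementation; the class name, head count, and MLP ratio are illustrative): a standard transformer layer computes $T_i' = \mathrm{MLP}(\mathrm{MSA}(T_i))$, and the tokens are then reshaped back into a spatial map. Note that the sketch keeps PyTorch's channels-first layout, whereas the notes write $I_i \in \mathbb{R}^{H\times W\times C}$.

```python
import torch
import torch.nn as nn

class ReStructurization(nn.Module):
    """Transformer layer followed by reshaping tokens back to an image."""
    def __init__(self, dim, num_heads=1, mlp_ratio=1.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, tokens, h, w):
        # tokens: (B, L, C) with L = h * w
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x)[0]             # MSA with residual
        tokens = tokens + self.mlp(self.norm2(tokens))      # T_i' = MLP(MSA(T_i))
        B, L, C = tokens.shape
        return tokens.transpose(1, 2).reshape(B, C, h, w)   # I_i (channels-first)
```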

soft split

model local structure information and reduce the length of tokens
similar to a convolution operation, but without convolution filters

to avoid information loss when generating tokens from the re-structurized image, the image is split into overlapping patches

  1. each patch is correlated with its surrounding patches, establishing a prior that there should be stronger correlations between surrounding tokens.
  2. the tokens in each split patch are concatenated into one token (red and blue boxes), so regional information is aggregated from surrounding pixels and patches.

$$T_{i+1} = \mathrm{SS}(I_i)$$
where $\mathrm{SS}(\cdot)$ is the soft split operation, implemented with nn.Unfold
in nn.Unfold, given a tensor $X\in\mathbb{R}^{B\times C\times H\times W}$, a $k\times k$ kernel slides over $X$ to extract each patch $X_1\in\mathbb{R}^{C\times k\times k}$, which is then flattened into $X_1'\in\mathbb{R}^{Ck^2}$
the output tensor is $Y\in\mathbb{R}^{B\times Ck^2\times H_0 W_0}$, with $H_0=\lfloor \frac{H-k+2p}{s}+1\rfloor$ and $W_0=\lfloor \frac{W-k+2p}{s}+1\rfloor$ (kernel size $k$, padding $p$, stride $s$)
similarly, given $I_i\in\mathbb{R}^{H\times W\times C}$, the output tokens are $T_{i+1}\in\mathbb{R}^{L_0\times Ck^2}$, with $L_0=\lfloor \frac{H-k+2p}{s}+1\rfloor \times \lfloor \frac{W-k+2p}{s}+1\rfloor$
after soft split, the output tokens are fed into the next T2T process
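
A minimal sketch of the soft split with nn.Unfold (the kernel size, stride, and padding here are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn as nn

def soft_split(img, kernel_size=3, stride=2, padding=1):
    """Soft split SS(.): overlapping k x k patches flattened into tokens."""
    # img: (B, C, H, W) -> tokens: (B, L0, C * k^2)
    unfold = nn.Unfold(kernel_size=kernel_size, stride=stride, padding=padding)
    patches = unfold(img)              # (B, C * k^2, L0)
    return patches.transpose(1, 2)     # (B, L0, C * k^2)

x = torch.randn(1, 64, 56, 56)
tokens = soft_split(x)
print(tokens.shape)                    # torch.Size([1, 784, 576]), L0 = 28 * 28
```

Each output token concatenates a $k\times k$ neighborhood, so the token length shrinks while local context is aggregated into the channel dimension.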

T2T module

based on transformer block, with 2 extra components

  1. reshape the tokens into an $H\times W\times C$ image so that more local information can be learned later
  2. unfold the image into $L_0\times Ck^2$ tokens and capture details for more efficient modeling in the following transformer

for the input image $I_0$, only a soft split is applied at first to split it into tokens: $T_1=\mathrm{SS}(I_0)$
after the last T2T module, the output tokens $T_f$ have a fixed length, so the T2T-ViT backbone can model global relations on $T_f$
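
Putting the two components together, one T2T iteration can be sketched as below, assuming the imports and the ReStructurization and soft_split helpers from the earlier sketches (a rough illustration, not the official code):

```python
class T2TStage(nn.Module):
    """One T2T iteration: re-structurize tokens, then soft split with overlap."""
    def __init__(self, dim, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.restructurize = ReStructurization(dim)
        self.kernel_size, self.stride, self.padding = kernel_size, stride, padding

    def forward(self, tokens, h, w):
        img = self.restructurize(tokens, h, w)            # (B, C, h, w)
        tokens = soft_split(img, self.kernel_size,
                            self.stride, self.padding)    # (B, L0, C * k^2)
        h = (h - self.kernel_size + 2 * self.padding) // self.stride + 1
        w = (w - self.kernel_size + 2 * self.padding) // self.stride + 1
        return tokens, h, w                               # new length L0 = h * w
```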

T2T-ViT backbone

since many channels in vanilla ViT are invalid, aim to find an efficient backbone that reduces redundancy and improves feature richness

5 designs from CNNs to ViT

  1. dense connections, as in DenseNet
  2. deep-narrow vs. shallow-wide structure, as in Wide-ResNet
  3. channel attention, as in Squeeze-and-Excitation (SE) networks
  4. more split heads in the multi-head attention layer, as in ResNeXt
  5. Ghost operation, as in GhostNet

key findings

  1. a deep-narrow structure: decreasing the embedding dimension improves feature richness and reduces computation
  2. channel attention improves performance but is less effective than the deep-narrow architecture

design a deep-narrow structure with a small channel number (hidden dimension $d$) but more layers $b$
for the tokens $T_f$ from the last T2T module, concatenate a class token and add a sinusoidal position embedding:
$$T_{f_0}=[t_{cls}; T_f]+\mathrm{PE}, \quad \mathrm{PE}\in\mathbb{R}^{(L+1)\times d}$$
$$T_{f_i}=\mathrm{MLP}(\mathrm{MSA}(T_{f_{i-1}})), \quad i=1,2,\dots,b$$
$$Y=\mathrm{FC}(\mathrm{LN}(T_{f_b}))$$
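
A minimal PyTorch sketch of this backbone, following the three equations above (the class names, depth, and dimensions are illustrative, and nn.TransformerEncoder stands in for the model's own transformer blocks):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(length, dim):
    """Fixed sinusoidal position embedding of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class T2TViTBackbone(nn.Module):
    def __init__(self, num_tokens, dim=384, depth=14, num_heads=6, num_classes=1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.register_buffer("pos_embed", sinusoidal_pe(num_tokens + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 3,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)   # b = depth layers
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                # tokens: (B, L, d) from the T2T module
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed  # T_f0 = [t_cls; T_f] + PE
        x = self.blocks(x)                                     # T_fi = MLP(MSA(T_f{i-1}))
        return self.head(self.norm(x[:, 0]))                   # Y = FC(LN(.)) on class token
```

Classification here reads the class token, matching the $Y=\mathrm{FC}(\mathrm{LN}(\cdot))$ head in the notes.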

architecture variants


Structure details of T2T-ViT. T2T-ViT-14/19/24 have model sizes comparable to ResNet50/101/152. T2T-ViT-7/12 have model sizes comparable to MobileNetV1/V2. For the T2T transformer layer, we adopt a Transformer layer for T2T-ViTt-14 and a Performer layer for T2T-ViT-14 under limited GPU memory. For ViT, ‘S’ means Small, ‘B’ is Base and ‘L’ is Large. ‘ViT-S/16’ is a variant of the original ViT-B/16 with smaller MLP size and layer depth.

Experiment

T2T-ViT on ImageNet

dataset: ImageNet
data augmentation: mixup and cutmix, for both CNNs and ViTs
optimizer: AdamW, batch size 512 or 1024, 310 epochs, cosine lr decay
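
A minimal sketch of this optimizer setup; only the batch size, epoch count, and cosine schedule come from the notes, while the learning rate, weight decay, and the placeholder model are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.Linear(384, 1000)  # stand-in for a T2T-ViT model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=310)   # cosine lr decay

for epoch in range(310):
    # ... one epoch over ImageNet with batch size 512 or 1024 ...
    scheduler.step()
```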


Comparison between T2T-ViT and ViT by training from scratch on ImageNet.


Comparison between our T2T-ViT and ResNet on ImageNet. T2T-ViTt-14: using a Transformer layer in the T2T module. T2T-ViT-14: using a Performer layer in the T2T module. “*” means we train the model with our training scheme for fair comparison.


Comparison between our lite T2T-ViT and MobileNets. Models with “-Distilled” are trained with a teacher model using the distillation method of DeiT.

fine-tuning in transfer learning
dataset CIFAR10, CIFAR100
optimizer SGD: 60 epochs, cosine lr decay


The results of fine-tuning the pretrained T2T-ViT to downstream datasets: CIFAR10 and CIFAR100.

T2T-ViT achieves higher performance than ViT with smaller model sizes on downstream datasets

from CNN to ViT


Transfer of some common designs from CNNs to ViT & T2T-ViT, including DenseNet, Wide-ResNet, the SE module, ResNeXt, and the Ghost operation. The same color indicates the corresponding transfer. All models are trained from scratch on ImageNet. “*” means we reproduce the model with our training scheme for fair comparison.

key findings

  1. dense connections hurt the performance of both ViT and T2T-ViT
  2. the deep-narrow structure benefits ViT
  3. the SE block improves both ViT and T2T-ViT
  4. the ResNeXt structure has little effect on ViT and T2T-ViT
    little performance improvement but larger GPU memory occupation, which is unnecessary
  5. the Ghost operation further compresses the model and reduces the MACs of T2T-ViT

ablation study


Ablation study results on T2T module, Deep-Narrow(DN) structure.

T2T module
T2T-ViT-14-woT2T: the same T2T-ViT backbone but without the T2T module
T2T-ViTc-14: the T2T module is replaced by 3 conv layers with kernel sizes (7, 3, 3) and strides (4, 2, 2)

  1. the T2T module improves performance by 2.0%-2.2% on ImageNet with similar model size and MACs
  2. it is better than conv layers because it models both global relations and the structure information of images

deep-narrow structure
T2T-ViT-d768-4: a shallow-wide structure with a hidden dimension of 768 and 4 layers, with model size and MACs similar to T2T-ViT-14
after replacing the deep-narrow structure with the shallow-wide one, accuracy decreases by 2.7% on ImageNet
the deep-narrow structure is crucial for T2T-ViT
