[CLIP] Learning Transferable Visual Models From Natural Language Supervision

        Pre-training a network on 400 million image/text pairs to match images with their captions learns SOTA image representations. The pre-trained model can then be used for zero-shot transfer to downstream tasks.

1、Network architecture

        1)simplified version of ConVIRT

        2)linear projection to map from each encoder's representation to the multi-modal embedding space

        3)image encoder

                -> ResNet

                        use the antialiased rect-2 blur pooling from Zhang (2019)

                        replace global average pooling with attention pooling (a single layer of "transformer-style" multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image); see the sketch after this list

                -> Vision Transformer (ViT)

                        add an additional layer normalization to the combined patch and position embeddings before the transformer

                        slightly different initialization scheme
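
        A minimal sketch of the attention-pooling head described above, assuming PyTorch; the class name, input shape, and omission of details such as positional embeddings are simplifications of mine, not CLIP's exact code:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """One "transformer-style" multi-head QKV attention layer that replaces
    global average pooling: the query is the global average-pooled feature,
    and the keys/values are the spatial features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H*W, dim) -- flattened spatial feature map from the ResNet
        query = x.mean(dim=1, keepdim=True)   # (batch, 1, dim) global average pool
        pooled, _ = self.attn(query, x, x)    # query attends over all positions
        return pooled.squeeze(1)              # (batch, dim) pooled image feature
```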

        4)text encoder

                -> Transformer

                        architecture modifications from GPT-2 (Radford et al., 2019)

                        63M-parameter 12 layer 512-wide model with 8 attention heads

                        lower-cased byte pair encoding (BPE) representation of the text with a 49152 vocab size

                        the max sequence length was capped at 76

                        the text sequence is bracketed with [SOS] and [EOS] tokens

                        the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space (sketched below)
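
        A minimal sketch of that last step, assuming PyTorch; hidden, eos_idx, ln, and proj are illustrative names for the top-layer activations, the [EOS] positions, the layer norm, and the learned projection matrix:

```python
import torch
import torch.nn as nn

def text_features(hidden: torch.Tensor, eos_idx: torch.Tensor,
                  ln: nn.LayerNorm, proj: torch.Tensor) -> torch.Tensor:
    # hidden:  (batch, seq_len, width) activations of the highest transformer layer
    # eos_idx: (batch,) position of the [EOS] token in each sequence
    eos_state = hidden[torch.arange(hidden.size(0)), eos_idx]  # (batch, width)
    # layer-normalize, then linearly project into the multi-modal embedding space
    return ln(eos_state) @ proj                                # (batch, embed_dim)
```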

        5)scale

                -> image encoder

                        equally increase the width, depth, and resolution of the model

                -> text encoder

                        only scale the width of the model to be proportional to the calculated increase in width of the ResNet, do not scale the depth at all

                        * CLIP's performance is less sensitive to the capacity of the text encoder

2、Data

        1)400 million (image, text) pairs from Internet

        2)many of the (image, text) pairs are only a single sentence

3、Training

        1)Contrastive Language-Image Pre-training (CLIP)

        2)the model predicts which text as a whole is paired with an image, not the exact words of that text

        3)given a batch of N (image, text) pairs, predict which of the N x N possible (image, text) pairings actually occurred; N = 32,768

        4)jointly train an image encoder and text encoder

        5)maximize the cosine similarity of the N real pairs and minimize the cosine similarity of the N^{2} - N incorrect pairs (loss sketched after this list)

        6)train from scratch

        7)data augmentation

                random square crop from resized images

        8)learnable temperature parameter \tau (controls the range of the logits in the softmax)
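
        Putting points 3), 5), and 8) together, a sketch of the symmetric contrastive loss, following the numpy-style pseudocode in the paper (encoders omitted; written in PyTorch here):

```python
import numpy as np
import torch
import torch.nn.functional as F

# learnable temperature, initialized as in the paper (tau = 0.07)
logit_scale = torch.nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: (N, d) projected embeddings for the N pairs in the batch
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # pairwise cosine similarities, scaled by the exponentiated temperature
    logits = logit_scale.exp() * image_emb @ text_emb.t()        # (N, N)
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = real pairs
    # symmetric cross-entropy: image-to-text and text-to-image
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```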

4、Advantages

        No fixed softmax classifier is needed to make predictions, so the model can be applied to zero-shot tasks more flexibly (see the sketch below).
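
        A sketch of how this zero-shot use works: class names are turned into text prompts, and the prediction is the class whose text embedding is most similar to the image embedding. encode_image, encode_text, and tokenize are assumed interfaces (and the prompt template a common choice), not a specific library API:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image, class_names, encode_image, encode_text, tokenize):
    # one prompt per candidate class
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(tokenize(prompts)), dim=-1)  # (C, d)
    img_emb = F.normalize(encode_image(image), dim=-1)              # (1, d)
    sims = img_emb @ text_emb.t()                                   # cosine similarity per class
    return class_names[sims.argmax(dim=-1).item()]
```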
