[CLIP] Learning Transferable Visual Models From Natural Language Supervision

        Pre-training a network on 400 million image/text pairs to match images with their captions learns SOTA image representations. The pre-trained model can then be used for zero-shot transfer to downstream tasks.

1、Network architecture

        1)simplified version of ConVIRT

        2)linear projection to map from each encoder's representation to the multi-modal embedding space

        3)image encoder

                -> ResNet

                        use the antialiased rect-2 blur pooling from Zhang (2019)

                        replace global average pooling with attention pooling (a single layer of "transformer-style" multi-head QKV attention, where the query is conditioned on the global average-pooled representation of the image); see the sketch after this list

                -> Vision Transformer (ViT)

                        add an additional layer normalization to the combined patch and position embeddings before the transformer

                        slightly different initialization scheme
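
        A minimal sketch of the attention-pooling head described above, assuming PyTorch; the class name, input shape, and omission of details such as positional embeddings are simplifications of mine, not CLIP's exact code:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """One "transformer-style" multi-head QKV attention layer that replaces
    global average pooling: the query is the global average-pooled feature,
    and the keys/values are the spatial features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H*W, dim) -- flattened spatial feature map from the ResNet
        query = x.mean(dim=1, keepdim=True)   # (batch, 1, dim) global average pool
        pooled, _ = self.attn(query, x, x)    # query attends over all positions
        return pooled.squeeze(1)              # (batch, dim) pooled image feature
```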

        4)text encoder

                -> Transformer

                        architecture modifications from GPT-2 (Radford et al., 2019)

                        63M-parameter 12 layer 512-wide model with 8 attention heads

                        lower-cased byte pair encoding (BPE) representation of the text with a 49152 vocab size

                        the max sequence length was capped at 76

                        the text sequence is bracketed with [SOS] and [EOS] tokens

                        the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space (sketched below)
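
        A minimal sketch of that last step, assuming PyTorch; hidden, eos_idx, ln, and proj are illustrative names for the top-layer activations, the [EOS] positions, the layer norm, and the learned projection matrix:

```python
import torch
import torch.nn as nn

def text_features(hidden: torch.Tensor, eos_idx: torch.Tensor,
                  ln: nn.LayerNorm, proj: torch.Tensor) -> torch.Tensor:
    # hidden:  (batch, seq_len, width) activations of the highest transformer layer
    # eos_idx: (batch,) position of the [EOS] token in each sequence
    eos_state = hidden[torch.arange(hidden.size(0)), eos_idx]  # (batch, width)
    # layer-normalize, then linearly project into the multi-modal embedding space
    return ln(eos_state) @ proj                                # (batch, embed_dim)
```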

        5)scale

                -> image encoder

                        equally increase the width, depth, and resolution of the model

                -> text encoder

                        only scale the width of the model to be proportional to the calculated increase in width of the ResNet, do not scale the depth at all

                        * CLIP's performance is less sensitive to the capacity of the text encoder

2、Data

        1)400 million (image, text) pairs from Internet

        2)many of the (image, text) pairs are only a single sentence

3、Training

        1)Contrastive Language-Image Pre-training (CLIP)

        2)the model predicts which text as a whole is paired with an image, not the exact words of that text

        3)given a batch of N (image, text) pairs, predict which of the N x N possible (image, text) pairings actually occurred; N = 32,768

        4)jointly train an image encoder and text encoder

        5)maximize the cosine similarity of the N real pairs and minimize the cosine similarity of the N^{2} - N incorrect pairs (loss sketched after this list)

        6)train from scratch

        7)data augmentation

                random square crop from resized images

        8)learnable temperature parameter \tau (controls the range of the logits in the softmax)
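
        Putting points 3), 5), and 8) together, a sketch of the symmetric contrastive loss, following the numpy-style pseudocode in the paper (encoders omitted; written in PyTorch here):

```python
import numpy as np
import torch
import torch.nn.functional as F

# learnable temperature, initialized as in the paper (tau = 0.07)
logit_scale = torch.nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # image_emb, text_emb: (N, d) projected embeddings for the N pairs in the batch
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # pairwise cosine similarities, scaled by the exponentiated temperature
    logits = logit_scale.exp() * image_emb @ text_emb.t()        # (N, N)
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal = real pairs
    # symmetric cross-entropy: image-to-text and text-to-image
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```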

4、Advantages

        No fixed softmax classifier is needed to make predictions, so the model can be applied to zero-shot tasks more flexibly (see the sketch below).
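
        A sketch of how this zero-shot use works: class names are turned into text prompts, and the prediction is the class whose text embedding is most similar to the image embedding. encode_image, encode_text, and tokenize are assumed interfaces (and the prompt template a common choice), not a specific library API:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image, class_names, encode_image, encode_text, tokenize):
    # one prompt per candidate class
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(encode_text(tokenize(prompts)), dim=-1)  # (C, d)
    img_emb = F.normalize(encode_image(image), dim=-1)              # (1, d)
    sims = img_emb @ text_emb.t()                                   # cosine similarity per class
    return class_names[sims.argmax(dim=-1).item()]
```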
