[ICML 2021] Learning Transferable Visual Models From Natural Language Supervision

Paper link: Learning Transferable Visual Models From Natural Language Supervision

Code: GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pre-Training), predicting the most relevant text snippet given an image

The English notes are typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is written as personal notes, so read with caution.

Contents

1. Thoughts

2. Paragraph-by-Paragraph Reading of the Paper

2.1. Abstract

2.2. Introduction and Motivating Work

2.3. Approach

2.3.1. Natural Language Supervision

2.3.2. Creating a Sufficiently Large Dataset

2.3.3. Selecting an Efficient Pre-Training Method

2.3.4. Choosing and Scaling a Model

2.3.5. Training

2.4. Experiments

2.4.1. Zero-Shot Transfer

1. Thoughts

(1)I read the longer arXiv version; the conference version itself is shorter

(2)Did I accidentally run into a classic hyperparameter-tuning / massive-compute training paper?! Either way, it is not something a single person can reproduce, so I recommend only looking at the model architecture

2. Paragraph-by-Paragraph Reading of the Paper

2.1. Abstract

        ①Limitation of existing pre-training works: they require large amounts of labelled data

        ②CLIP learns from (image, text) pairs and transfers to downstream tasks in a zero-shot way

2.2. Introduction and Motivating Work

        ①Dataset: includes 400 million (image, text) pairs

        ②Framework of CLIP:

(A quick overview here; if you only want the model framework, you do not need to read further. (1) Training: the inputs are paired text snippets and images, which pass through two separate encoders to become text embeddings $T_i$ and image embeddings $I_i$; a similarity matrix is computed over all pairs of embeddings. The loss is a cross-entropy; my blind guess is that it pushes the diagonal entries of the similarity matrix (successful matches) up, i.e. reduces $1-I_iT_i$, and pushes the non-matching (white-block) values $I_iT_j$ down. The loss can be roughly simplified to $\mathcal{L}=\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\left|1-I_iT_i\right|+\left|I_iT_j\right|\right)$; the smaller the better, and what gets trained is the pair of encoders. (2) Test-set labels: each label is wrapped in a prompt and fed into the trained text encoder, turning unseen classes into a set of candidate text embeddings. (3) Each test image is passed through the trained image encoder to get an embedding, its similarity with all the text embeddings is computed, and the most similar label is taken as the prediction.)
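To make steps (2) and (3) concrete, here is a minimal, self-contained sketch of zero-shot classification. The encoders and the tokenizer are random placeholders (the names, dimensions, and prompt string are my own assumptions), so only the prompt-encode-compare logic is meaningful:

```python
import torch
import torch.nn.functional as F

# Placeholder encoders: in real CLIP these are the trained image/text encoders.
def image_encoder(images):           # images: [B, 3, H, W]
    return torch.randn(images.shape[0], 512)

def text_encoder(token_ids):         # token_ids: [K, L]
    return torch.randn(token_ids.shape[0], 512)

# (2) Turn class labels into text embeddings via a prompt.
labels = ["dog", "cat", "plane"]
prompts = [f"a photo of a {label}" for label in labels]
token_ids = torch.randint(0, 49152, (len(prompts), 77))    # stand-in for the tokenizer
T = F.normalize(text_encoder(token_ids), dim=-1)           # [K, 512]

# (3) Embed test images and pick the most similar label.
images = torch.randn(4, 3, 224, 224)
I = F.normalize(image_encoder(images), dim=-1)             # [B, 512]
similarity = I @ T.t()                                     # cosine similarities, [B, K]
pred = similarity.argmax(dim=-1)
print([labels[p] for p in pred])
```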

2.3. Approach

2.3.1. Natural Language Supervision

        ①Learn image representations from natural language supervision

2.3.2. Creating a Sufficiently Large Dataset

        ①Existing works mainly rely on MS-COCO, Visual Genome, and YFCC100M, which are either small or have noisy metadata

        ②Therefore, they build a new dataset, WebImageText (WIT)

2.3.3. Selecting an Efficient Pre-Training Method

        ①Given the massive amount of data and training compute involved, they aim to improve training efficiency

        ②Learning efficiency of different methods (predicting the exact caption words vs. a bag-of-words objective vs. the contrastive objective; the contrastive objective learns fastest):

        ③They focus on matching images with whole texts rather than predicting the text word by word, by increasing the cosine similarity of matching pairs and reducing the similarity of non-matching pairs

        ④Pseudocode of CLIP (the paper's figure is not reproduced here; a sketch follows):
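The paper's Figure 3 gives Numpy-like pseudocode for this objective. Below is a runnable PyTorch sketch of the same symmetric cross-entropy idea; the tiny linear "encoders", feature widths, and batch size are placeholders rather than the real architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d_e = 8, 512                                   # toy batch size and embedding width
image_encoder = nn.Linear(2048, d_e)              # stand-in for ResNet/ViT features -> embedding
text_encoder = nn.Linear(768, d_e)                # stand-in for Transformer features -> embedding
logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable temperature in log-space (log(1/0.07))

image_features = torch.randn(n, 2048)             # fake pre-extracted image features
text_features = torch.randn(n, 768)               # fake pre-extracted text features

# joint multimodal embedding, L2-normalized
I_e = F.normalize(image_encoder(image_features), dim=-1)    # [n, d_e]
T_e = F.normalize(text_encoder(text_features), dim=-1)      # [n, d_e]

# scaled pairwise cosine similarities
logits = logit_scale.exp() * I_e @ T_e.t()                  # [n, n]

# symmetric cross-entropy: row i should match column i
labels = torch.arange(n)
loss_i = F.cross_entropy(logits, labels)          # image -> text direction
loss_t = F.cross_entropy(logits.t(), labels)      # text -> image direction
loss = (loss_i + loss_t) / 2
loss.backward()
```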
 

2.3.4. Choosing and Scaling a Model

        ①Base architecture of the image encoder: ResNet-50, modified with the ResNet-D improvements and antialiased rect-2 blur pooling; the global average pooling layer is replaced with an attention pooling layer (a sketch of the idea follows this list). They also train Vision Transformer (ViT) variants

        ②Text encoder: a 12-layer, 512-wide Transformer with 8 attention heads

        ③Text markers: the text is bracketed with [SOS] and [EOS] tokens (e.g. "I eat burger" → "[SOS] I eat burger [EOS]"), and the top-layer activation at [EOS] is used as the text representation
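Regarding the attention pooling mentioned in ①: the paper describes a single "transformer-style" multi-head QKV attention step whose query is conditioned on the global average-pooled feature map. A rough sketch of that idea (the dimensions and the use of nn.MultiheadAttention are my own simplifications, not the official implementation, which also adds positional embeddings):

```python
import torch
import torch.nn as nn

class AttentionPool2d(nn.Module):
    """Replace global average pooling with one multi-head QKV attention step.

    The query is the mean-pooled feature map; keys/values are all spatial positions.
    """
    def __init__(self, embed_dim: int = 2048, num_heads: int = 8, out_dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(embed_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [B, C, H, W]
        tokens = x.flatten(2).transpose(1, 2)               # [B, H*W, C]
        query = tokens.mean(dim=1, keepdim=True)            # global average pool as the query
        pooled, _ = self.attn(query, tokens, tokens)        # [B, 1, C]
        return self.proj(pooled.squeeze(1))                 # [B, out_dim]

pool = AttentionPool2d()
features = torch.randn(2, 2048, 7, 7)                       # e.g. a ResNet-50 final feature map
print(pool(features).shape)                                 # torch.Size([2, 512])
```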

2.3.5. Training

        ①For this part I recommend just reading the original paper; it is mainly hyperparameter settings (32 training epochs, Adam with decoupled weight decay, a cosine learning-rate schedule, a very large minibatch of 32,768, mixed-precision training, etc.)

2.4. Experiments

2.4.1. Zero-Shot Transfer

(1)Motivation

        ①Motivation: zero-shot transfer (encouragement aside, why would this alone guarantee good performance?) (the authors' reasoning: many popular computer-vision datasets were created by the research community mainly as benchmarks to guide the development of generic image classification methods, rather than to measure performance on a specific task)

(2)Using CLIP for Zero-Shot Transfer

        ①During pre-training, each image effectively has to pick its paired text out of the 32,768 candidates in its minibatch; zero-shot classification reuses this matching ability, with prompted class names as the candidate texts

(3)Initial Comparison to Visual N-Grams

        ①Performance:

(4)Prompt Engineering and Ensembling

        ①Using only a single class word as the image description can be ambiguous (polysemy)

        ②To align test-time text with the training text distribution (and to lengthen it), they wrap each class name in a prompt template such as "A photo of a {label}."; ensembling multiple prompt templates helps further (see the sketch below)
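A hedged sketch of prompt engineering plus ensembling: each class name is inserted into several templates, and the resulting text embeddings are averaged and re-normalized into a single classifier weight per class. The template list and the placeholder text encoder below are illustrative, not the paper's actual 80 ImageNet templates:

```python
import torch
import torch.nn.functional as F

# Placeholder text encoder; in real CLIP this is the trained Transformer.
def text_encoder(prompts):                       # list[str] -> [len(prompts), 512]
    return torch.randn(len(prompts), 512)

templates = [                                    # illustrative subset of prompt templates
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a photo of the small {}.",
]
labels = ["dog", "cat", "plane"]

classifier = []
for label in labels:
    prompts = [t.format(label) for t in templates]
    emb = F.normalize(text_encoder(prompts), dim=-1)    # one embedding per template
    mean_emb = F.normalize(emb.mean(dim=0), dim=-1)     # ensemble: average, then re-normalize
    classifier.append(mean_emb)
classifier = torch.stack(classifier)                    # [num_classes, 512]

# Zero-shot prediction for one image embedding (placeholder here).
image_emb = F.normalize(torch.randn(512), dim=0)
scores = classifier @ image_emb                         # cosine similarity per class
print(labels[scores.argmax()])
```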

(5)Analysis of Zero-Shot CLIP Performance

        ①Zero-Shot CLIP vs. Linear Probe on ResNet-50 on different datasets:

        ②Few-shot comparison:

I will not take notes on the rest; it is all experiments showing CLIP's performance, which you can read yourself if interested. I skimmed it and there do not seem to be any interpretability figures.
