Paper: https://arxiv.org/pdf/2103.00020
Code: https://github.com/OpenAI/CLIP
Learning Transferable Visual Models From Natural Language Supervision
Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
1. Introduction and Motivating Work
Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset-specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset-specific training data.
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.
Over 20 years ago Mori et al. (1999) explored improving content-based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data-efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.
While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.
This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.
A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.

Figure 1. Summary of the approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.
2. Approach
2.1. Natural Language Supervision
At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.
We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).
Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.
2.2. Creating a Sufficiently Large Dataset
Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.
A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
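To make the collection recipe above concrete (keep a pair if its text contains one of roughly 500,000 queries, capping each query at 20,000 pairs for approximate class balancing), here is a minimal sketch of such a filtering pass. The paper does not publish its pipeline; the data source, query list, and helper names below are illustrative assumptions, not the actual WIT construction code.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # approximate class-balancing cap described in the text

def collect_pairs(candidate_pairs, queries):
    """Sketch of query-balanced (image, text) pair collection.

    candidate_pairs: iterable of (image_url, text) tuples from public web sources (hypothetical).
    queries: set of ~500,000 query strings.
    """
    per_query_count = defaultdict(int)
    dataset = []
    for image_url, text in candidate_pairs:
        text_lower = text.lower()
        # keep the pair if its text contains at least one query whose quota is not yet exhausted
        for q in queries:
            if q in text_lower and per_query_count[q] < MAX_PAIRS_PER_QUERY:
                per_query_count[q] += 1
                dataset.append((image_url, text))
                break
    return dataset
```

A production pipeline would index text by query rather than scanning all queries per pair; the point here is only the per-query quota that yields the approximate class balance.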
2.3. Selecting an Efficient Pre-Training Method
State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.
Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Figure 2. CLIP is much more efficient at zero-shot transfer than the image captioning baseline. Although highly expressive, transformer-based language models are relatively weak at zero-shot ImageNet classification: they learn 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP further improves efficiency by another 4x.
Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ real pairs in the batch while minimizing the cosine similarity of the embeddings of the $N^2 - N$ incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.
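The pseudocode figure itself is not reproduced in this post. As a stand-in, below is a minimal, runnable PyTorch sketch of the symmetric contrastive loss described in the paragraph above; the function and argument names are illustrative, not OpenAI's exact implementation.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over cosine similarities for a batch of N pairs.

    image_features: [N, d] output of the image encoder's linear projection.
    text_features:  [N, d] output of the text encoder's linear projection.
    logit_scale:    scalar tensor, exp of the learned temperature (see Section 2.5).
    """
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [N, N] matrix of scaled pairwise similarities; the N real pairs sit on
    # the diagonal, the N^2 - N incorrect pairings off the diagonal
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, labels)  # image -> text direction
    loss_t = F.cross_entropy(logits_per_text, labels)   # text -> image direction
    return (loss_i + loss_t) / 2
```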
Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder's representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function $t_u$ from Zhang et al. (2020), which samples a single sentence uniformly at random from the text, since many of the (image, text) pairs in CLIP's pre-training dataset are only a single sentence. We also simplify the image transformation function $t_v$. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, $\tau$, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
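For concreteness, a log-parameterized temperature of this kind might look as follows in PyTorch. This is a sketch consistent with the description here and with the initialization (0.07) and clipping (scale at most 100) values given in Section 2.5; class and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

class TemperatureScale(nn.Module):
    """Temperature tau optimized directly as a log-parameterized multiplicative scalar."""

    def __init__(self, init_tau=0.07, max_scale=100.0):
        super().__init__()
        # store log(1 / tau) so the learned multiplier exp(.) is always positive
        self.log_logit_scale = nn.Parameter(torch.tensor(math.log(1.0 / init_tau)))
        self.max_scale = max_scale  # logits are never scaled by more than 100

    def forward(self, cosine_similarities):
        scale = self.log_logit_scale.exp().clamp(max=self.max_scale)
        return scale * cosine_similarities
```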
2.4. Choosing and Scaling a Model
We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNetD improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
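A rough sketch of the attention-pooling idea, a single layer of multi-head QKV attention whose query is conditioned on the global average-pooled feature map, is shown below. It illustrates the mechanism as described; the released implementation differs in details (e.g., positional embeddings), so treat this as an approximation.

```python
import torch
import torch.nn as nn

class AttentionPool2d(nn.Module):
    """Replace global average pooling with one layer of multi-head attention.

    The global average-pooled feature acts as the query; the spatial feature
    map provides the keys and values.
    """

    def __init__(self, embed_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, feature_map):                       # feature_map: [B, C, H, W]
        b, c, h, w = feature_map.shape
        tokens = feature_map.flatten(2).transpose(1, 2)   # [B, H*W, C] spatial tokens
        query = tokens.mean(dim=1, keepdim=True)          # [B, 1, C] global average pool
        pooled, _ = self.attn(query, tokens, tokens)      # attend over spatial positions
        return pooled.squeeze(1)                          # [B, C] pooled representation
```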
The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
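As a sketch of the feature-extraction step just described (take the top-layer activation at the [EOS] token, layer-normalize it, and linearly project it into the joint embedding space), assuming a causal text Transformer with the interface shown in the docstring; the token-id convention and argument names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

def text_features(token_ids, transformer, ln_final, text_projection, eos_token_id):
    """Return [B, d_embed] text embeddings from [B, L] BPE token ids.

    transformer     : masked self-attention text Transformer, [B, L] -> [B, L, width] (assumed interface)
    ln_final        : nn.LayerNorm(width) applied to the top-layer activations
    text_projection : [width, d_embed] learned linear projection into the joint space
    """
    hidden = transformer(token_ids)                 # [B, L, width]
    hidden = ln_final(hidden)                       # layer normalization
    # index of the first [EOS] token in each sequence
    eos_pos = (token_ids == eos_token_id).int().argmax(dim=-1)
    feats = hidden[torch.arange(hidden.shape[0]), eos_pos]   # [B, width]
    return feats @ text_projection                  # [B, d_embed]
```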
While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP's performance to be less sensitive to the capacity of the text encoder.
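One way to read the "equal allocation" baseline, under the common assumption that convolutional FLOPs grow roughly linearly in depth and quadratically in width and input resolution, is sketched below. The paper does not spell out the exact allocation rule used for the RN50x4/x16/x64 models, so this is purely an illustration under that stated assumption.

```python
def scale_resnet(base_depth, base_width, base_resolution, compute_multiplier):
    """Split a target compute multiplier equally across depth, width, and resolution.

    Assumes FLOPs ~ depth * width^2 * resolution^2, so each dimension receives a
    compute_multiplier**(1/3) share of the extra compute (an illustrative
    assumption, not the paper's stated formula).
    """
    share = compute_multiplier ** (1.0 / 3.0)
    depth = round(base_depth * share)                    # FLOPs scale linearly with depth
    width = round(base_width * share ** 0.5)             # quadratically with width
    resolution = round(base_resolution * share ** 0.5)   # quadratically with resolution
    return depth, width, resolution

def scale_text_width(base_text_width, base_image_width, new_image_width):
    """Text encoder: only width is scaled, in proportion to the image encoder width."""
    return round(base_text_width * new_image_width / base_image_width)
```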
2.5. Training
We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter $\tau$ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
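A condensed sketch of the optimizer setup described above: decoupled weight decay excluded from gains and biases, a cosine learning-rate schedule, and a clamp on the learned logit scale. The learning rate, weight decay, and step counts shown are placeholder values, since the text above does not give them; only the clamp at 100 comes from the description.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, lr=5e-4, weight_decay=0.2, total_steps=100_000):
    """AdamW with decoupled weight decay applied only to weights that are not gains or biases."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # gains (norm scales), biases, and other 1-D parameters skip weight decay
        (no_decay if p.ndim < 2 or name.endswith(".bias") else decay).append(p)
    optimizer = AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,  # placeholder; the excerpt does not state the learning rate
    )
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine decay
    return optimizer, scheduler

def clamp_logit_scale(logit_scale, max_scale=100.0):
    """Keep exp(logit_scale) <= 100 after each step, as noted for training stability."""
    with torch.no_grad():
        logit_scale.clamp_(max=math.log(max_scale))
```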
3. Experiments
3.1. Zero-Shot Transfer
3.1.1. MOTIVATION
In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets. We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008). While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems. In this view, a dataset evaluates performance on a task on a specific distribution. However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task. While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures. It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008). On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP’s robustness to distribution shift and domain generalization rather than task generalization. Please see Section 3.3 for analysis focused on this.
To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above. It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP. Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5- grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image. In order to perform zero-shot transfer, they first convert the text of each of the dataset’s class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.
Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages. While GPT-1 (Radford et al., 2018) focused on pretraining as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption. This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.
3.1.2. USING CLIP FOR ZERO-SHOT TRANSFER
CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability.
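A minimal example of reusing this capability with the released code (https://github.com/OpenAI/CLIP): embed candidate class names with the text encoder (Figure 1 describes this as synthesizing a zero-shot linear classifier from class names or descriptions), embed the image, and pick the class whose text embedding has the highest scaled cosine similarity. The prompt template, class names, and image path below are illustrative.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "airplane"]            # illustrative target classes
prompts = [f"a photo of a {c}" for c in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # cosine similarities act as the logits of a synthesized zero-shot classifier
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```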
