Paper Reading Series Index
Table of Contents
- Paper Reading Series Index
- Abstract
- 1. Introduction and Motivating Work
- 2. Approach
- 3. Experiments
- 4. Comparison to Human Performance
- 5. Data Overlap Analysis
- 6. Limitations
- 7. Broader Impacts
- 8. Related Work
- 9. Conclusion
Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
1. Introduction and Motivating Work
- Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset-specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset-specific training data.
- These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.
- Over 20 years ago Mori et al. (1999) explored improving content-based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data-efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.
- While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.
- This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.这一工作路线代表了当前实用主义的中间立场,介于从有限数量的监督“金标签”学习和从实际上无限数量的原始文本学习之间。然而,这并非没有妥协。这两项工程都经过精心设计,并在工艺上有所限制,他们的监理班数分别为1000班和18291班。自然语言能够通过其通用性来表达并因此管理更广泛的视觉概念集合。这两种方法还使用静态softmax分类器来执行预测,并且缺乏用于动态输出的机制。这严重地削弱了它们的灵活性,限制了它们的“零射击”能力。
- A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent-accuracy supervised ImageNet models, which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.
2. Approach
2.1. Natural Language Supervision
- At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea; however, terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.
- We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).
- Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language, which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.
2.2. Creating a Sufficiently Large Dataset
Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet. A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
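To make the class-balancing step concrete, here is a hypothetical sketch of capping the number of (image, text) pairs kept per query. The collection pipeline itself is not public, so the data layout (a stream of (query, image_url, text) tuples from an upstream search step) and the function name are assumptions for illustration.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # cap mentioned in Section 2.2


def balance_by_query(candidate_pairs):
    """Keep at most MAX_PAIRS_PER_QUERY (image_url, text) pairs per query.

    `candidate_pairs` is assumed to be an iterable of
    (query, image_url, text) tuples; this is only a sketch, not the
    actual WIT construction code.
    """
    kept = []
    counts = defaultdict(int)
    for query, image_url, text in candidate_pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((image_url, text))
    return kept
```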
2.3. Selecting an Efficient Pre-Training Method
State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU-years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.
Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.
Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).
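The paper points to Figure 3 for numpy-style pseudocode of this core objective. As a reading aid, here is a minimal PyTorch sketch of the same symmetric loss over the N × N cosine-similarity matrix; the function name `clip_loss`, the argument layout, and the `logit_scale` handling are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn.functional as F


def clip_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    image_features, text_features: (N, d) embeddings from the two encoders.
    logit_scale: learned log-parameterized temperature (see Section 2.5).
    """
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix, scaled by the learned temperature
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # the matching pairs lie on the diagonal
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, labels)
    loss_texts = F.cross_entropy(logits_per_text, labels)
    return (loss_images + loss_texts) / 2
```

Averaging the image-to-text and text-to-image cross-entropies is what makes the loss symmetric: each image must pick out its text among N candidates, and each text must pick out its image.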
Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020), which samples a single sentence uniformly at random from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar, to avoid having to tune it as a hyper-parameter.
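A minimal sketch of the log-parameterized temperature, assuming the 0.07 initialization quoted in Section 2.5: τ enters the loss only as a multiplicative scale on the logits, and storing log(1/τ) keeps the effective scale positive while letting the optimizer treat it like any other weight. The class name `LogitScale` is illustrative, not from the released code.

```python
import numpy as np
import torch
import torch.nn as nn


class LogitScale(nn.Module):
    """Learned temperature stored as a log-parameterized scalar."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # optimizing log(1/tau) keeps the effective scale positive
        self.log_scale = nn.Parameter(torch.tensor(np.log(1.0 / init_temperature)))

    def forward(self) -> torch.Tensor:
        # multiply the cosine-similarity logits by this value
        return self.log_scale.exp()
```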
2.4. Choosing and Scaling a Model
We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.
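A rough sketch of the attention-pooling idea as described above: the flattened spatial feature map serves as keys and values, and a single query built from the global average-pooled representation attends over it. The use of `nn.MultiheadAttention` and the tensor shapes are assumptions for illustration; the released implementation differs in detail.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Single-layer multi-head QKV attention pooling over a CNN feature map."""

    def __init__(self, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (N, C, H, W) -> sequence of H*W tokens, shape (H*W, N, C)
        n, c, h, w = feature_map.shape
        tokens = feature_map.flatten(2).permute(2, 0, 1)
        # the query is the global average-pooled representation, shape (1, N, C)
        query = tokens.mean(dim=0, keepdim=True)
        pooled, _ = self.attn(query, tokens, tokens)
        return pooled.squeeze(0)  # (N, C) pooled image feature
```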
The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter, 12-layer, 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
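To make the feature-extraction step concrete, here is a sketch under the assumption that the transformer returns an (N, L, width) activation tensor and that the position of each sequence's [EOS] token is known; `ln_final` and `text_projection` are illustrative names, not the released module names.

```python
import torch
import torch.nn as nn


def text_features(hidden_states: torch.Tensor,
                  eos_positions: torch.Tensor,
                  ln_final: nn.LayerNorm,
                  text_projection: torch.Tensor) -> torch.Tensor:
    """hidden_states: (N, L, width) activations from the text transformer.
    eos_positions: (N,) index of the [EOS] token in each sequence.
    text_projection: (width, embed_dim) linear map into the shared space.
    """
    x = ln_final(hidden_states)                       # layer-normalize
    batch = torch.arange(x.shape[0], device=x.device)
    eos_states = x[batch, eos_positions]              # (N, width) at [EOS]
    return eos_states @ text_projection               # (N, embed_dim)
```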
While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019), which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.
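One way to read “allocating additional compute equally”: each of depth, width, and resolution receives an equal share of the FLOP increase. Under the common approximation that ResNet FLOPs scale as depth × width² × resolution², that reading gives the factors computed below; this is an interpretation for illustration, not an allocation rule stated in the paper.

```python
def equal_compute_scaling(compute_multiplier: float):
    """Split a k-fold FLOP increase equally across depth, width, resolution.

    Assumes FLOPs ~ depth * width**2 * resolution**2, so each dimension is
    given a k**(1/3) share of the budget. Illustrative only.
    """
    share = compute_multiplier ** (1.0 / 3.0)
    depth_mult = share            # FLOPs are linear in depth
    width_mult = share ** 0.5     # FLOPs are quadratic in width
    resolution_mult = share ** 0.5
    return depth_mult, width_mult, resolution_mult


# e.g. a 4x compute budget: roughly 1.59x depth, 1.26x width and resolution
print(equal_compute_scaling(4.0))
```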
2.5. Training
We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded, with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance, similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model, which we found to perform best.
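A compressed sketch of the optimizer setup described above: decoupled weight decay applied only to weights that are not gains or biases, a cosine learning-rate schedule, and the learned logit scale clamped so the logits are never scaled by more than 100. The concrete learning rate, weight decay, and step count are placeholders, not values from the paper's hyper-parameter search.

```python
import math
import torch


def build_optimizer(model, lr=5e-4, weight_decay=0.2, total_steps=100_000):
    """AdamW with weight decay only on >=2-D weights, plus a cosine schedule.

    lr, weight_decay, and total_steps are placeholder values; the paper only
    states that hyper-parameters were found by grid/random/manual search.
    """
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # 1-D parameters (gains, biases) are excluded from weight decay
        (no_decay if p.ndim < 2 else decay).append(p)
    optimizer = torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler


def clamp_logit_scale(log_scale: torch.Tensor) -> None:
    # keep the multiplicative scale on the logits <= 100, as described above
    with torch.no_grad():
        log_scale.clamp_(max=math.log(100.0))
```

In a training loop, `clamp_logit_scale` would be called after each optimizer step on the log-parameterized temperature from Section 2.3.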