Learning Transferable Visual Models From Natural Language Supervision (CLIP Paper)

Paper: https://arxiv.org/pdf/2103.00020
Code: https://github.com/OpenAI/CLIP

Learning Transferable Visual Models From Natural Language Supervision


Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.

 These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

 A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model’s capability. These results have significant policy and ethical implications, which we consider in Section 7.

Figure 1. Summary of the approach. While standard image models jointly train an image feature extractor and a linear classifier to predict a label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset's classes.

2. Approach

2.1. Natural Language Supervision

 At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.

 We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

 Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn’t “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.

2.2. Creating a Sufficiently Large Dataset

 Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries. We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
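
As a concrete illustration of the balancing step described above, the sketch below keeps at most 20,000 pairs per query. The matching rule, data structures, and function name are hypothetical simplifications for illustration, not the actual collection pipeline:

```python
from collections import defaultdict

def balance_pairs(pairs, queries, max_per_query=20_000):
    """Keep up to `max_per_query` (image, text) pairs for each matched query.

    pairs:   iterable of (image_url, text) tuples
    queries: collection of query strings (the paper uses 500,000 of them)
    """
    counts = defaultdict(int)
    kept = []
    for image_url, text in pairs:
        lowered = text.lower()
        # Naive substring matching; the real pipeline is not specified in this detail.
        matched = next((q for q in queries if q in lowered), None)
        if matched is not None and counts[matched] < max_per_query:
            counts[matched] += 1
            kept.append((image_url, text))
    return kept
```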

2.3. Selecting an Efficient Pre-Training Method

 State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Figure 2. CLIP is much more efficient at zero-shot transfer than our image caption baseline. Although highly expressive, we found that transformer-based language models are relatively weak at zero-shot ImageNet classification. As shown, they learn 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for CLIP's contrastive objective improves efficiency a further 4x.

 Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N^2 − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.
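
Since the figure itself is not reproduced in this post, here is a small runnable PyTorch sketch in the spirit of that pseudocode, with random tensors standing in for the projected encoder outputs; it is a paraphrase, not the exact figure:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # image_features, text_features: [N, d] encoder outputs after the linear
    # projection into the joint multi-modal embedding space.
    image_features = F.normalize(image_features, dim=-1)  # L2-normalize
    text_features = F.normalize(text_features, dim=-1)

    # N x N scaled pairwise cosine similarities.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # The i-th image belongs with the i-th text, so the targets are the diagonal.
    targets = torch.arange(image_features.shape[0], device=image_features.device)

    # Symmetric cross entropy over the two softmax directions.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# Example call with random stand-in features for a batch of 8 pairs.
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
loss = clip_loss(imgs, txts, logit_scale=1 / 0.07)
```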

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020) which samples a single sentence at uniform from the text since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
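
A minimal sketch of two of these details, the bias-free linear projection into the joint embedding space and the temperature kept as a log-parameterized scalar (the 0.07 initialization is the value given later in Section 2.5; module and dimension names are illustrative):

```python
import numpy as np
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        # A plain linear projection; no non-linear projection head.
        self.proj = nn.Linear(in_dim, embed_dim, bias=False)
        # logit_scale = 1 / tau, stored in log space so it stays positive and
        # can be optimized directly rather than tuned as a hyper-parameter.
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    def forward(self, features):
        return self.proj(features), self.logit_scale.exp()
```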

2.4. Choosing and Scaling a Model

 We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNetD improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.
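
A hedged sketch of the read-out just described: take the top-layer activation at the [EOS] token, layer-normalize it, and linearly project it into the multi-modal embedding space. The argument names and shapes are assumptions for illustration, not the released implementation:

```python
import torch

def pool_text_features(hidden_states, token_ids, eos_id, ln_final, text_projection):
    # hidden_states:   [batch, seq_len, width] top-layer transformer activations
    # token_ids:       [batch, seq_len] BPE token ids, used to locate [EOS]
    # ln_final:        an nn.LayerNorm(width) module
    # text_projection: a [width, embed_dim] projection matrix
    eos_pos = (token_ids == eos_id).int().argmax(dim=-1)   # first [EOS] per sequence
    pooled = hidden_states[torch.arange(hidden_states.shape[0]), eos_pos]
    pooled = ln_final(pooled)                               # layer normalization
    return pooled @ text_projection                         # [batch, embed_dim]
```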

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP’s performance to be less sensitive to the capacity of the text encoder.

2.5. Training

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
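
The sketch below pulls together a few of these details: decoupled weight decay that skips gains and biases, a cosine learning-rate schedule, and clamping the learned logit scale at 100. The tiny stand-in model, step count, and the particular learning rate, betas, and weight-decay values are placeholders rather than the paper's exact settings:

```python
import math
import torch
import torch.nn as nn

# Stand-in model; in practice this would be the CLIP image and text encoders.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
model.logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))

decay, no_decay = [], []
for p in model.parameters():
    # Heuristic: 1-D parameters (gains, biases, the logit scale) get no weight decay.
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=5e-4, betas=(0.9, 0.98), eps=1e-6)

total_steps = 10_000  # placeholder; the paper trains for 32 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

# Clamp the learned logit scale so it cannot exceed 100; during training this
# would be applied after every optimizer step.
with torch.no_grad():
    model.logit_scale.clamp_(max=math.log(100))
```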

3. Experiments

3.1. Zero-Shot Transfer

3.1.1. MOTIVATION
In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets. We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008). While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems. In this view, a dataset evaluates performance on a task on a specific distribution. However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task. While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures. It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008). On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP’s robustness to distribution shift and domain generalization rather than task generalization. Please see Section 3.3 for analysis focused on this.

 To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above. It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP. Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5- grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image. In order to perform zero-shot transfer, they first convert the text of each of the dataset’s class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.

 Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages. While GPT-1 (Radford et al., 2018) focused on pretraining as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption. This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.

3.1.2. USING CLIP FOR ZERO-SHOT TRANSFER

 CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ , and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Lei Ba et al. (2015) first introduced a zero-shot image classifier of this form while the idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013). Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions. For zero-shot evaluation, we cache the zero-shot classifier once it has been computed by the text encoder and reuse it for all subsequent predictions. This allows the cost of generating it to be amortized across all the predictions in a dataset.
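
With the released CLIP package, the recipe above looks roughly like the sketch below; the image path and class names are placeholders, and the fixed 100x scaling stands in for the learned temperature:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "airplane"]  # a dataset's class names
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)

with torch.no_grad():
    text_features = model.encode_text(prompts)         # computed once and cached
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities, normalized into a distribution over the classes.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```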

3.1.3. INITIAL COMPARISON TO VISUAL N-GRAMS

In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept 11.5% to 76.2% and matches the performance of the original ResNet-50 despite using none of the 1.28 million crowd-labeled training examples available for this dataset. Additionally, the top-5 accuracy of CLIP models is noticeably higher than their top-1, and this model has a 95% top-5 accuracy, matching Inception-V4 (Szegedy et al., 2016). The ability to match the performance of a strong, fully supervised baseline in a zero-shot setting suggests CLIP is a significant step towards flexible and practical zero-shot computer vision classifiers. As mentioned above, the comparison to Visual N-Grams is meant for contextualizing the performance of CLIP and should not be interpreted as a direct methods comparison between CLIP and Visual N-Grams as many performance relevant differences between the two systems were not controlled for. For instance, we train on a dataset that is 10x larger, use a vision model that requires nearly 100x more compute per prediction, likely used over 1000x their training compute, and use a transformer-based model which did not exist when Visual N-Grams was published. As a closer comparison, we trained a CLIP ResNet-50 on the same YFCC100M dataset that Visual N-Grams was trained on and found it matched their reported ImageNet performance within a V100 GPU day. This baseline was also trained from scratch instead of being initialized from pre-trained ImageNet weights as in Visual N-Grams.

Table 1. Comparing CLIP to prior zero-shot transfer image classification results. CLIP improves performance on all three datasets by a large amount. This improvement reflects the many differences in the four years since the development of Visual N-Grams (Li et al., 2017).

CLIP also outperforms Visual N-Grams on the other 2 reported datasets. On aYahoo, CLIP achieves a 95% reduction in the number of errors, and on SUN, CLIP more than doubles the accuracy of Visual N-Grams. To conduct a more comprehensive analysis and stress test, we implement a much larger evaluation suite detailed in Appendix A. In total we expand from the 3 datasets reported in Visual N-Grams to include over 30 datasets and compare to over 50 existing computer vision systems to contextualize results.

3.1.4. PROMPT ENGINEERING AND ENSEMBLING

Most standard image classification datasets treat the information naming or describing classes, which enables natural language based zero-shot transfer, as an afterthought. The vast majority of datasets annotate images with just a numeric id of the label and contain a file mapping these ids back to their names in English. Some datasets, such as Flowers102 and GTSRB, don’t appear to include this mapping at all in their released versions, preventing zero-shot transfer entirely. For many datasets, we observed these labels may be chosen somewhat haphazardly and do not anticipate issues related to zero-shot transfer which relies on task description in order to transfer successfully.

 A common issue is polysemy. When the name of a class is the only information provided to CLIP’s text encoder it is unable to differentiate which word sense is meant due to the lack of context. In some cases multiple meanings of the same word might be included as different classes in the same dataset! This happens in ImageNet which contains both construction cranes and cranes that fly. Another example is found in classes of the Oxford-IIIT Pet dataset where the word boxer is, from context, clearly referring to a breed of dog, but to a text encoder lacking context could just as likely refer to a type of athlete.

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found the prompt template "A photo of a {label}." to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text. For instance, just using this prompt improves accuracy on ImageNet by 1.3%.
Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non-exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using "A photo of a {label}, a type of pet." to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of "a satellite photo of a {label}.".

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as "A photo of a big {label}." and "A photo of a small {label}.". We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions. We’ve observed ensembling across many generated zero-shot classifiers to reliably improve performance and use it for the majority of datasets. On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above. When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%. In Figure 4 we visualize how prompt engineering and ensembling change the performance of a set of CLIP models compared to the contextless baseline approach of directly embedding the class name as done in Li et al. (2017).
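
A sketch of this embedding-space ensembling with the released CLIP package; the three templates and class names are illustrative stand-ins for the 80 ImageNet prompts:

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

templates = ["a photo of a {}.", "a photo of a big {}.", "a photo of a small {}."]
class_names = ["crane", "boxer", "tench"]

with torch.no_grad():
    weights = []
    for name in class_names:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb /= emb.norm(dim=-1, keepdim=True)
        mean_emb = emb.mean(dim=0)                   # average over the templates
        weights.append(mean_emb / mean_emb.norm())   # re-normalize the average
    zeroshot_classifier = torch.stack(weights, dim=1)  # [embed_dim, n_classes]
```

Scoring an image then reduces to `image_features @ zeroshot_classifier`, so the averaged text embeddings are computed once and amortized over all predictions.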

3.1.5. ANALYSIS OF ZERO-SHOT CLIP PERFORMANCE

 Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportunity to gain a better understanding of this type of model. In this section, we conduct a study of various properties of CLIP’s zero-shot classifiers. As a first question, we look simply at how well zero-shot classifiers perform. To contextualize this, we compare to the performance of a simple off-the-shelf baseline: fitting a fully supervised, regularized, logistic regression classifier on the features of the canonical ResNet-50. In Figure 5 we show this comparison across 27 datasets. Please see Appendix A for details of datasets and setup.

Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27-dataset evaluation suite, the zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.

Zero-shot CLIP outperforms this baseline slightly more often than not and wins on 16 of the 27 datasets. Looking at individual datasets reveals some interesting behavior. On fine-grained classification tasks, we observe a wide spread in performance. On two of these datasets, Stanford Cars and Food101, zero-shot CLIP outperforms logistic regression on ResNet-50 features by over 20% while on two others, Flowers102 and FGVCAircraft, zero-shot CLIP underperforms by over 10%. On OxfordPets and Birdsnap, performance is much closer. We suspect these differences are primarily due to varying amounts of per-task supervision between WIT and ImageNet. On “general” object classification datasets such as ImageNet, CIFAR10/100, STL10, and PascalVOC2007 performance is relatively similar with a slight advantage for zero-shot CLIP in all cases. On STL10, CLIP achieves 99.3% overall which appears to be a new state of the art despite not using any training examples. Zero-shot CLIP significantly outperforms a ResNet-50 on two datasets measuring action recognition in videos. On Kinetics700, CLIP outperforms a ResNet-50 by 14.5%. Zero-shot CLIP also outperforms a ResNet-50’s features by 7.7% on UCF101. We speculate this is due to natural language providing wider supervision for visual concepts involving verbs, compared to the noun-centric object supervision in ImageNet.

 Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as German traffic sign recognition (GTSRB), recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).

 While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself. While it is intuitive to expect zero-shot to underperform one-shot, we instead find that zero-shot CLIP matches the performance of 4-shot logistic regression on the same feature space. This is likely due to an important difference between the zero-shot and few-shot approach. First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.

Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best result of a 16-shot linear classifier across publicly available models. The best-performing models from BiT-M and SimCLRv2 are highlighted. Light gray lines are other models in the evaluation suite. The analysis uses the 20 datasets with at least 16 examples per class.

A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP’s zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was “just” the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with flexibility of few-shot learning is a promising direction for future work.
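
One way to write down this idea is a few-shot objective whose weight penalty pulls toward the zero-shot weights rather than toward zero; a minimal sketch, with an illustrative penalty strength:

```python
import torch
import torch.nn.functional as F

def few_shot_loss(W, features, labels, W_zeroshot, lam=1.0):
    # W, W_zeroshot: [n_classes, embed_dim]; features: [n, embed_dim]; labels: [n]
    logits = features @ W.t()
    ce = F.cross_entropy(logits, labels)
    prior_penalty = lam * (W - W_zeroshot).pow(2).sum()  # L2 pull toward the prior
    return ce + prior_penalty
```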

 When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We are certain that a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.

In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting. In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation of the performance of a 1, 2, 4, 8, 16-shot (when possible), and a fully supervised linear classifier trained on each dataset. We find that zero-shot transfer can have widely varying efficiency per dataset from less than 1 labeled example per class to 184. Two datasets, Flowers102 and EuroSAT, underperform one-shot models. Half of the datasets require less than 5 examples per class with a median of 5.4. However, the mean estimated data efficiency is 20.8 examples per class. This is due to the 20% of datasets where supervised classifiers require many labeled examples per class in order to match performance. On ImageNet, zero-shot CLIP matches the performance of a 16-shot linear classifier trained on the same feature space.
 除了研究零样本CLIP和小样本逻辑回归的平均性能外,我们还考察了各个数据集上的表现。在图7中,我们展示了同一特征空间上的逻辑回归分类器为匹配零样本CLIP性能所需每类标注样本数量的估算值。由于零样本CLIP本身也是线性分类器,该结果估算出此场景下零样本迁移的有效数据效率。为避免训练数千个线性分类器,我们基于1、2、4、8、16样本(可能情况下)及全监督线性分类器在每个数据集上的性能表现,采用对数线性插值法估算有效数据效率。研究发现零样本迁移的效率因数据集差异巨大:每类所需标注样本从不足1个到184个不等。Flowers102和EuroSAT两个数据集表现逊于单样本模型。半数数据集每类所需样本少于5个,中位数为5.4。但平均估算数据效率达每类20.8个样本,这是由于20%的数据集中监督分类器需要大量标注样本才能匹配性能。在ImageNet上,零样本CLIP的性能等同于同一特征空间训练的16样本线性分类器。
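作为补充,下面用一小段示意代码说明上文“对数线性插值估算有效数据效率”的计算方式。这只是按文中描述写的最小草图,并非论文原始脚本;其中的样本数与准确率均为虚构的演示数值。

```python
# 示意代码:用对数线性插值估算"匹配零样本性能所需的每类标注样本数"(非官方实现)
import numpy as np

def estimate_data_efficiency(shots, accs, zero_shot_acc):
    """shots: 如 [1, 2, 4, 8, 16, N_full];accs: 对应的线性分类器准确率。
    在 log2(shots) 轴上做线性插值,反解出达到 zero_shot_acc 所需的样本数。"""
    log_shots = np.log2(np.asarray(shots, dtype=float))
    accs = np.asarray(accs, dtype=float)
    # np.interp 要求插值的 x 轴(这里是准确率)单调递增;实际曲线偶有非单调,仅作示意
    matched_log_shot = np.interp(zero_shot_acc, accs, log_shots)
    return 2.0 ** matched_log_shot

# 虚构示例:某数据集 1/2/4/8/16-shot 与全监督(假设每类 1000 样本)的准确率
print(estimate_data_efficiency([1, 2, 4, 8, 16, 1000],
                               [0.40, 0.48, 0.55, 0.61, 0.66, 0.80],
                               zero_shot_acc=0.58))   # 约 5.7 个样本/类
```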

图7. 零样本迁移的数据效率差异显著。通过计算在相同CLIP特征空间上,线性分类器需要每个类别多少标注样本才能达到零样本分类器的性能,可以量化零样本迁移的有效性。数值基于1、2、4、8、16样本及全监督结果的半对数线性插值估算。性能差异范围极大——在两个数据集上仍低于单样本分类器,而在另一个案例中则相当于每个类别需要184个标注样本才能达到同等效果。

 If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve. In Figure 8 we compare CLIP’s zero-shot performance with fully supervised linear classifiers across datasets. The dashed, y = x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent. For most datasets, the performance of zero-shot classifiers still underperforms fully supervised classifiers by 10% to 25%, suggesting that there is still plenty of headroom for improving CLIP’s task-learning and zero-shot transfer capabilities.
 如果我们假设评估数据集足够大,使得在其上训练的线性分类器参数能得到良好估计,那么由于CLIP的零样本分类器也是线性分类器,全监督分类器的性能就大致设定了零样本迁移所能达到的上限。在图8中,我们将CLIP的零样本性能与各数据集上的全监督线性分类器进行对比。虚线 y = x 代表“理想”零样本分类器,其性能与对应的全监督分类器相当。对于大多数数据集,零样本分类器的性能仍比全监督分类器低10%到25%,这表明CLIP的任务学习和零样本迁移能力仍有很大提升空间。

图8. 零样本性能与线性探针性能相关但多数情况下仍次优。跨数据集对比零样本与线性探针性能显示强相关性,零样本性能普遍低10至25个百分点。仅在5个数据集中零样本性能接近线性探针性能(差异≤3个百分点)。

 There is a positive correlation of 0.82 (p-value < 10^-6) between zero-shot performance and fully supervised performance, suggesting that CLIP is relatively consistent at connecting underlying representation and task learning to zero-shot transfer. However, zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%. This suggests that CLIP may be more effective at zero-shot transfer for tasks where its underlying representations are also high quality. The slope of a linear regression model predicting zero-shot performance as a function of fully supervised performance estimates that for every 1% improvement in fully supervised performance, zero-shot performance improves by 1.28%. However, the 95th-percentile confidence intervals still include values of less than 1 (0.93-1.79).
 零样本性能与全监督性能之间存在0.82的正相关性(p值<10^-6),这表明CLIP在连接底层表征与任务学习至零样本迁移方面具有相对一致性。然而,零样本CLIP仅在5个数据集(STL10、CIFAR10、Food101、OxfordPets和Caltech101)上接近全监督性能。在这5个数据集中,零样本准确率和全监督准确率均超过90%。这表明对于底层表征质量较高的任务,CLIP的零样本迁移可能更有效。通过线性回归模型预测零样本性能与全监督性能的函数关系,其斜率表明:全监督性能每提升1%,零样本性能将提升1.28%。但95%置信区间仍包含小于1的值(0.93-1.79)。
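下面给出一个示意性的计算草图,说明这里的相关系数、回归斜率以及置信区间可以如何估计。区间用自举法(bootstrap)近似,并非论文作者的原始统计脚本;函数名与参数均为示例。

```python
# 示意代码:零样本与全监督性能的相关性及回归斜率(置信区间用自举法估计)
import numpy as np
from scipy import stats

def slope_with_bootstrap_ci(x_full, y_zero, n_boot=10000, alpha=0.05, seed=0):
    x, y = np.asarray(x_full, dtype=float), np.asarray(y_zero, dtype=float)
    r, p = stats.pearsonr(x, y)                       # 相关系数及其 p 值
    slope = np.polyfit(x, y, 1)[0]                    # 线性回归斜率
    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):                           # 有放回重采样,估计斜率分布
        idx = rng.integers(0, len(x), len(x))
        boots.append(np.polyfit(x[idx], y[idx], 1)[0])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return r, p, slope, (lo, hi)
```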

 Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size (Hestness et al., 2017; Kaplan et al., 2020). The GPT family of models has so far demonstrated consistent improvements in zero-shot performance across a 1000x increase in training compute. In Figure 9, we check whether the zero-shot performance of CLIP follows a similar scaling pattern. We plot the average error rate of the 5 ResNet CLIP models across 39 evaluations on 36 different datasets and find that a similar log-log linear scaling trend holds for CLIP across a 44x increase in model compute. While the overall trend is smooth, we found that performance on individual evaluations can be much noisier. We are unsure whether this is caused by high variance between individual training runs on sub-tasks (as documented in D’Amour et al. (2020)) masking a steadily improving trend or whether performance is actually non-monotonic as a function of compute on some tasks.
 过去几年,深度学习系统的实证研究表明,性能可预测为训练计算量和数据集规模等重要变量的函数(Hestness等人,2017;Kaplan等人,2020)。GPT系列模型迄今已证明,在训练计算量提升1000倍的情况下,其零样本性能始终呈现一致性提升。图9中,我们验证了CLIP的零样本性能是否遵循类似扩展规律。通过绘制5个ResNet CLIP模型在36个不同数据集上的39次评估平均错误率,发现CLIP在模型计算量增长44倍时仍保持对数-对数线性扩展趋势。虽然整体趋势平稳,但个别评估任务的性能波动显著。我们不确定这是由于子任务单独训练运行间的高方差(如D’Amour等人2020年所述)掩盖了稳定改进趋势,还是某些任务性能确实随计算量呈现非单调变化。
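为说明上文“对数-对数线性扩展趋势”的含义,这里附一个极简的拟合示意:在 log-log 坐标下对“平均错误率 vs 计算量”做一次线性回归。数值为虚构示例,仅用于演示做法。

```python
# 示意代码:在 log-log 坐标下拟合"平均错误率 ~ 计算量"的线性趋势(非官方脚本,数值虚构)
import numpy as np

def fit_loglog_trend(compute_gflops, error_rates):
    """返回 log-log 空间中的斜率与截距: log(err) ≈ a * log(compute) + b"""
    a, b = np.polyfit(np.log(compute_gflops), np.log(error_rates), 1)
    return a, b

a, b = fit_loglog_trend([6, 15, 30, 60, 260], [0.42, 0.38, 0.35, 0.32, 0.27])
print(f"slope={a:.3f}")   # 负斜率表示错误率随计算量按幂律下降
```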

图9. 零样本CLIP性能随模型计算量平滑扩展。通过对36个不同数据集的39次评估,平均零样本错误率在涵盖5种不同CLIP模型的44倍计算量范围内,呈现出对数-对数线性趋势。浅色线条显示单个评估结果,表明尽管整体趋势平缓,但具体性能存在显著差异。

3.2. Representation Learning 表征学习

 While we have extensively analyzed the task-learning capabilities of CLIP through zero-shot transfer in the previous section, it is more common to study the representation learning capabilities of a model. There exist many ways to evaluate the quality of representations as well as disagreements over what properties an “ideal” representation should have (Locatello et al., 2020). Fitting a linear classifier on a representation extracted from the model and measuring its performance on various datasets is a common approach. An alternative is measuring the performance of end-to-end fine-tuning of the model. This increases flexibility, and prior work has convincingly demonstrated that fine-tuning outperforms linear classification on most image classification datasets (Kornblith et al., 2019; Zhai et al., 2019). While the high performance of fine-tuning motivates its study for practical reasons, we still opt for linear classifier based evaluation for several reasons. Our work is focused on developing a high-performing task and dataset-agnostic pre-training approach. Fine-tuning, because it adapts representations to each dataset during the fine-tuning phase, can compensate for and potentially mask failures to learn general and robust representations during the pre-training phase. Linear classifiers, because of their limited flexibility, instead highlight these failures and provide clear feedback during development. For CLIP, training supervised linear classifiers has the added benefit of being very similar to the approach used for its zero-shot classifiers which enables extensive comparisons and analysis in Section 3.1. Finally, we aim to compare CLIP to a comprehensive set of existing models across many tasks. Studying 66 different models on 27 different datasets requires tuning 1782 different evaluations. Fine-tuning opens up a much larger design and hyperparameter space, which makes it difficult to fairly evaluate and computationally expensive to compare a diverse set of techniques as discussed in other large scale empirical studies (Lucic et al., 2018; Choi et al., 2019). By comparison, linear classifiers require minimal hyper-parameter tuning and have standardized implementations and evaluation procedures. Please see Appendix A for further details on evaluation.
 虽然我们在上一节通过零样本迁移深入分析了CLIP的任务学习能力,但研究模型的表征学习能力更为常见。评估表征质量存在多种方法,且对于“理想”表征应具备哪些特性也存在争议(Locatello等,2020)。常见做法是在模型提取的表征上拟合线性分类器,并测量其在各数据集上的性能。另一种方法是测量模型端到端微调的性能。这种方式灵活性更高,先前研究已令人信服地证明:在大多数图像分类数据集上,微调性能优于线性分类(Kornblith等,2019;Zhai等,2019)。尽管微调的高性能出于实用考虑值得研究,我们仍选择基于线性分类器的评估,原因如下:我们的工作重点是开发高性能、与任务及数据集无关的预训练方法。由于微调阶段会针对每个数据集调整表征,可能补偿并掩盖预训练阶段学习通用鲁棒表征的失败案例。线性分类器因其有限灵活性,反而能凸显这些失败案例并在开发阶段提供清晰反馈。对CLIP而言,训练有监督线性分类器还有个额外优势——其方法与零样本分类器高度相似,便于在第3.1节开展广泛比较分析。最后,我们旨在将CLIP与大量现有模型进行多任务综合对比。在27个不同数据集上研究66个模型,需要对1782次不同评估分别调参。微调会开启更大的设计和超参空间,如其他大规模实证研究所述(Lucic等,2018;Choi等,2019),这将导致公平评估困难且计算成本激增。相较之下,线性分类器只需极少的超参调整,并具备标准化实现和评估流程。评估细节详见附录A。
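下面附一段线性探针(linear probe)评估流程的最小示意代码。其中 `clip.load`、`encode_image` 等接口来自 openai/clip 开源仓库;`train_loader`、`test_loader` 以及正则强度 C 的取值只是示例性假设,论文附录A的实际做法是在验证集上扫描正则强度。

```python
# 示意代码:在冻结的 CLIP 图像特征上训练线性分类器(linear probe),超参仅为占位
import clip, torch
import numpy as np
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(dataloader):
    """dataloader 产出 (images, labels);返回冻结 CLIP 图像特征与标签。"""
    feats, labels = [], []
    with torch.no_grad():
        for images, ys in dataloader:
            feats.append(model.encode_image(images.to(device)).cpu().numpy())
            labels.append(ys.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# train_feats, train_labels = extract_features(train_loader)
# test_feats,  test_labels  = extract_features(test_loader)
# clf = LogisticRegression(C=0.316, max_iter=1000).fit(train_feats, train_labels)
# print("linear probe acc:", clf.score(test_feats, test_labels))
```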

 Figure 10 summarizes our findings. To minimize selection effects that could raise concerns of confirmation or reporting bias, we first study performance on the 12 dataset evaluation suite from Kornblith et al. (2019). While small CLIP models such as a ResNet-50 and ResNet-101 outperform other ResNets trained on ImageNet-1K (BiT-S and the originals), they underperform ResNets trained on ImageNet-21K (BiT-M). These small CLIP models also underperform models in the EfficientNet family with similar compute requirements. However, models trained with CLIP scale very well and the largest model we trained (ResNet-50x64) slightly outperforms the best performing existing model (a Noisy Student EfficientNet-L2) on both overall score and compute efficiency. We also find that CLIP vision transformers are about 3x more compute efficient than CLIP ResNets, which allows us to reach higher overall performance within our compute budget. These results qualitatively replicate the findings of Dosovitskiy et al. (2020) which reported that vision transformers are more compute efficient than convnets when trained on sufficiently large datasets. Our best overall model is a ViT-L/14 that is fine-tuned at a higher resolution of 336 pixels on our dataset for 1 additional epoch. This model outperforms the best existing model across this evaluation suite by an average of 2.6%.
 图10总结了我们的研究结果。为减少选择性偏差可能引发的验证性偏见或报告偏差问题,我们首先分析了Kornblith等人(2019)提出的12个数据集评估套件上的表现。虽然ResNet-50和ResNet-101等小型CLIP模型优于在ImageNet-1K上训练的其他ResNet(BiT-S及原版模型),但不及在ImageNet-21K上训练的ResNet(BiT-M)。这些小型CLIP模型的表现也逊色于计算需求相似的EfficientNet系列模型。然而,采用CLIP训练的模型展现出优异的扩展性——我们训练的最大模型(ResNet-50x64)在综合评分和计算效率上均略微优于现有最佳模型(Noisy Student EfficientNet-L2)。我们还发现CLIP视觉Transformer的计算效率约为CLIP ResNet的3倍,这使得我们能在既定计算预算内实现更高的综合性能。这些结果定性复现了Dosovitskiy等人(2020)的发现:当在足够大规模数据集上训练时,视觉Transformer比卷积网络具有更高计算效率。我们最终的最佳模型是ViT-L/14,该模型在我们的数据集上以336像素更高分辨率微调了1个额外训练周期,在此评估套件中平均超越现有最佳模型2.6%。

图10:CLIP模型与先进计算机视觉模型的线性探测性能对比,包括EfficientNet(Tan & Le, 2019; Xie et al., 2020)、MoCo(Chen et al., 2020d)、Instagram预训练的ResNeXt模型(Mahajan et al., 2018; Touvron et al., 2019)、BiT(Kolesnikov et al., 2019)、ViT(Dosovitskiy et al., 2020)、SimCLRv2(Chen et al., 2020c)、BYOL(Grill et al., 2020)以及原始ResNet模型(He et al., 2016b)。(左图)分数为Kornblith等人(2019)研究的12个数据集的平均得分。(右图)分数为涵盖更广泛分布特征的27个数据集的平均得分。虚线表示采用高于预训练分辨率进行微调或评估的模型。具体得分请参见表10,各数据集性能曲线见图20。

 As Figure 21 qualitatively shows, CLIP models learn a wider set of tasks than has previously been demonstrated in a single computer vision model trained end-to-end from random initialization. These tasks include geo-localization, optical character recognition, facial emotion recognition, and action recognition. None of these tasks are measured in the evaluation suite of Kornblith et al. (2019). This could be argued to be a form of selection bias in Kornblith et al. (2019)'s study towards tasks that overlap with ImageNet. To address this, we also measure performance on a broader 27 dataset evaluation suite. This evaluation suite, detailed in Appendix A includes datasets representing the aforementioned tasks, German Traffic Signs Recognition Benchmark (Stallkamp et al., 2011), as well as several other datasets adapted from VTAB (Zhai et al., 2019).
 如图21定性所示,CLIP模型学习到的任务范围比以往从随机初始化端到端训练的单一计算机视觉模型所展示的更广泛。这些任务包括地理定位、光学字符识别、面部情绪识别和动作识别。Kornblith等人(2019)的评估套件中均未测量这些任务。可以说这是Kornblith等人(2019)研究中对与ImageNet重叠任务的一种选择偏差。为解决这个问题,我们还测量了更广泛的27个数据集评估套件中的性能。该评估套件(详见附录A)包含代表上述任务的数据集、德国交通标志识别基准(Stallkamp等,2011)以及从VTAB(Zhai等,2019)改编的若干其他数据集。

 On this broader evaluation suite, the benefits of CLIP are more clear. All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency. The improvement in average score of the best model over previous systems increases from 2.6% to 5%. We also find that self-supervised systems do noticeably better on our broader evaluation suite. For instance, while SimCLRv2 still underperforms BiT-M on average on the 12 datasets of Kornblith et al. (2019), SimCLRv2 outperforms BiT-M on our 27 dataset evaluation suite. These findings suggest continuing to expand task diversity and coverage in order to better understand the “general” performance of systems. We suspect additional evaluation efforts along the lines of VTAB to be valuable.
 在这个更广泛的评估体系中,CLIP的优势更为明显。所有CLIP模型——无论规模大小——在计算效率方面都优于所有被评估系统。最佳模型的平均得分相较于先前系统的提升幅度从2.6%扩大到5%。我们还发现自监督系统在我们扩大的评估体系中表现显著更优。例如,虽然在Kornblith等人(2019)的12个数据集上SimCLRv2平均表现仍逊于BiT-M,但在我们27个数据集的评估体系中SimCLRv2超越了BiT-M。这些发现表明,持续扩展任务多样性和覆盖范围有助于更好地理解系统的"通用"性能。我们推测沿着VTAB思路开展更多评估工作将具有重要价值。

 In addition to the aggregate analysis above, we visualize per-dataset differences in the performance of the best CLIP model and the best model in our evaluation suite across all 27 datasets in Figure 11. CLIP outperforms the Noisy Student EfficientNet-L2 on 21 of the 27 datasets. CLIP improves the most on tasks which require OCR (SST2 and HatefulMemes), geo-localization and scene recognition (Country211, SUN397), and activity recognition in videos (Kinetics700 and UCF101). In addition CLIP also does much better on fine-grained car and traffic sign recognition (Stanford Cars and GTSRB). This may reflect a problem with overly narrow supervision in ImageNet. A result such as the 14.7% improvement on GTSRB could be indicative of an issue with ImageNet-1K, which has only a single label for all traffic and street signs. This could encourage a supervised representation to collapse intra-class details and hurt accuracy on a fine-grained downstream task. As mentioned, CLIP still underperforms the EfficientNet on several datasets. Unsurprisingly, the dataset that the EfficientNet does best relative to CLIP on is the one it was trained on: ImageNet. The EfficientNet also slightly outperforms CLIP on low-resolution datasets such as CIFAR10 and CIFAR100. We suspect this is at least partly due to the lack of scale-based data augmentation in CLIP. The EfficientNet also does slightly better on PatchCamelyon and CLEVRCounts, datasets where overall performance is still low for both approaches.
 除上述总体分析外,我们还在图11中将最佳CLIP模型与评估套件中最佳模型在全部27个数据集上的性能差异进行了可视化呈现。CLIP在27个数据集中的21个上超越了Noisy Student EfficientNet-L2模型。其在需要OCR(SST2和HatefulMemes)、地理定位与场景识别(Country211、SUN397)以及视频行为识别(Kinetics700和UCF101)的任务上表现提升最为显著。此外,CLIP在细粒度车辆与交通标志识别(Stanford Cars和GTSRB)任务中也展现出更大优势。这可能反映出ImageNet监督信号过于狭窄的问题——例如GTSRB数据集中14.7%的性能提升,或许揭示了ImageNet-1K将所有交通及道路标志归为单一标签的缺陷,这种设置易使监督式表征压缩类内细节,损害下游细粒度任务的精度。如前所述,CLIP在部分数据集上仍逊色于EfficientNet。不出所料,EfficientNet相对CLIP优势最大的数据集正是其训练基准ImageNet。在CIFAR10和CIFAR100等低分辨率数据集上,EfficientNet也略胜一筹,我们推测这至少部分源于CLIP缺乏基于尺度的数据增强策略。此外,在PatchCamelyon和CLEVRCounts这两个两种方法整体表现都还较低的数据集上,EfficientNet也显现出微弱优势。

图11. CLIP的特征在多种数据集上的表现优于最佳ImageNet模型的特征。在27个数据集中,使用CLIP特征训练线性分类器的效果有21个优于使用Noisy Student EfficientNet-L2模型。

3.3. Robustness to Natural Distribution Shift 对自然分布偏移的鲁棒性

 In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set (He et al., 2015). However, research in the subsequent years has repeatedly found that these models still make many simple mistakes (Dodge & Karam, 2017; Geirhos et al., 2018; Alcorn et al., 2019), and new benchmarks testing these systems have often found their performance to be much lower than both their ImageNet accuracy and human accuracy (Recht et al., 2019; Barbu et al., 2019). What explains this discrepancy? Various ideas have been suggested and studied (Ilyas et al., 2019; Geirhos et al., 2020). A common theme of proposed explanations is that deep learning models are exceedingly adept at finding correlations and patterns which hold across their training dataset and thus improve in-distribution performance. However many of these correlations and patterns are actually spurious and do not hold for other distributions and result in large drops in performance on other datasets.
 2015年有研究宣布,深度学习模型在ImageNet测试集上的表现超越了人类(He等人,2015)。但后续多年研究发现,这些模型仍会犯许多简单错误(Dodge & Karam,2017;Geirhos等,2018;Alcorn等,2019),新基准测试往往显示其表现远低于ImageNet准确率和人类准确率(Recht等,2019;Barbu等,2019)。如何解释这种差异?学界提出并研究了多种观点(Ilyas等,2019;Geirhos等,2020)。现有解释的共性在于:深度学习模型极其擅长捕捉训练数据集中存在的相关性模式,从而提升分布内性能。但这些相关性多数实际上只是伪关联,并不适用于其他数据分布,导致模型在其他数据集上性能骤降。

 We caution that, to date, most of these studies limit their evaluation to models trained on ImageNet. Recalling the topic of discussion, it may be a mistake to generalize too far from these initial findings. To what degree are these failures attributable to deep learning, ImageNet, or some combination of the two? CLIP models, which are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance, are an opportunity to investigate this question from a different angle.
 我们提醒,迄今为止,这些研究多数仅评估了基于ImageNet训练的模型。回顾讨论主题,若从这些初步发现过度推演可能会产生误判。这些缺陷在多大程度上归因于深度学习技术、ImageNet数据集或二者的某种结合?CLIP模型通过海量数据集的自然语言监督进行训练,具备出色的零样本性能,为此问题提供了全新的研究视角。
 Taori et al. (2020) is a recent comprehensive study moving towards quantifying and understanding these behaviors for ImageNet models. Taori et al. (2020) study how the performance of ImageNet models changes when evaluated on natural distribution shifts. They measure performance on a set of 7 distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet Sketch (Wang et al., 2019), Youtube-BB and ImageNet-Vid (Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (Hendrycks et al., 2019), and ImageNet Rendition (Hendrycks et al., 2020a). They distinguish these datasets, which all consist of novel images collected from a variety of sources, from synthetic distribution shifts such as ImageNet-C (Hendrycks & Dietterich, 2019), Stylized ImageNet (Geirhos et al., 2018), or adversarial attacks (Goodfellow et al., 2014) which are created by perturbing existing images in various ways. They propose this distinction in part because they find that while several techniques have been demonstrated to improve performance on synthetic distribution shifts, they often fail to yield consistent improvements on natural distributions.
 Taori等人(2020)的最新综合研究致力于量化并理解ImageNet模型的这些行为。该研究探讨了ImageNet模型在自然分布偏移下的性能变化,通过评估7个分布偏移数据集进行测量:ImageNetV2(Recht等人,2019)、ImageNet Sketch(Wang等人,2019)、Youtube-BB与ImageNet-Vid(Shankar等人,2019)、ObjectNet(Barbu等人,2019)、ImageNet Adversarial(Hendrycks等人,2019)以及ImageNet Rendition(Hendrycks等人,2020a)。研究者将这类来自多元渠道的真实图像数据集,与通过扰动现有图像生成的合成分布偏移(如ImageNet-C(Hendrycks & Dietterich,2019)、风格化ImageNet(Geirhos等人,2018)或对抗攻击(Goodfellow等人,2014))加以区分。这种区分源于其发现:虽然已有技术能提升模型在合成偏移上的表现,但这些改进往往无法稳定迁移至自然分布偏移场景。

 Across these collected datasets, the accuracy of ImageNet models drop well below the expectation set by the ImageNet validation set. For the following summary discussion we report average accuracy across all 7 natural distribution shift datasets and average accuracy across the corresponding class subsets of ImageNet unless otherwise specified. Additionally, for Youtube-BB and ImageNet-Vid, which have two different evaluation settings, we use the average of pm-0 and pm-10 accuracy.
 在这些收集的数据集中,ImageNet模型的准确率远低于ImageNet验证集设定的预期值。在后续的总结讨论中,除非另有说明,我们将报告所有7个自然分布偏移数据集的平均准确率,以及ImageNet相应类别子集的平均准确率。此外,对于Youtube-BB和ImageNet-Vid这两个具有不同评估设置的数据集,我们采用pm-0和pm-10准确率的平均值。

 A ResNet-101 makes 5 times as many mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set. Encouragingly however, Taori et al. (2020) find that accuracy under distribution shift increases predictably with ImageNet accuracy and is well modeled as a linear function of logit-transformed accuracy. Taori et al. (2020) use this finding to propose that robustness analysis should distinguish between effective and relative robustness. Effective robustness measures improvements in accuracy under distribution shift above what is predicted by the documented relationship between in-distribution and out-of-distribution accuracy. Relative robustness captures any improvement in out-of-distribution accuracy. Taori et al. (2020) argue that robustness techniques should aim to improve both effective robustness and relative robustness.
 ResNet-101在这些自然分布偏移下的评估错误率是ImageNet验证集的5倍。然而令人鼓舞的是,Taori等人(2020)发现分布偏移下的准确率会随ImageNet准确率呈可预测性增长,且通过logit转换后的准确率线性函数可以很好建模。基于此发现,Taori等人(2020)提出鲁棒性分析应区分有效鲁棒性和相对鲁棒性:有效鲁棒性衡量分布偏移下准确率的提升幅度是否超越域内与域外准确率既定关联的预测值;相对鲁棒性则捕捉域外准确率的任何改进。Taori等人(2020)主张鲁棒性技术应同时致力于提升有效鲁棒性和相对鲁棒性。
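按照上述定义,下面给出“有效鲁棒性”计算方式的一个示意草图:先在 logit 空间拟合一批标准 ImageNet 模型的“分布内-分布外”准确率基线,再看待测模型超出基线预测的部分。这只是按文中描述写的草图,并非 Taori 等人的原始代码;函数与变量名均为示例。

```python
# 示意代码:按 logit 变换下的线性基线计算"有效鲁棒性"(非原作者代码)
import numpy as np

def logit(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def effective_robustness(imagenet_accs, shift_accs, model_in_acc, model_shift_acc):
    """imagenet_accs / shift_accs: 一批标准 ImageNet 模型的分布内/分布外准确率,
    用它们在 logit 空间拟合基线;返回待测模型高出基线预测的部分(百分点)。"""
    a, b = np.polyfit(logit(imagenet_accs), logit(shift_accs), 1)
    predicted_logit = a * logit(model_in_acc) + b
    predicted_acc = 1 / (1 + np.exp(-predicted_logit))      # 逆 logit 变换回准确率
    return 100 * (model_shift_acc - predicted_acc)
```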

图12. 与基于ImageNet预训练的模型相比,CLIP的特征对任务迁移表现出更强的鲁棒性。在两种数据集划分方式下,基于CLIP模型表征训练的线性探测器的迁移得分都高于ImageNet性能相近的其他模型。这表明基于ImageNet训练的模型表征存在一定程度的任务过拟合现象。

 Almost all models studied in Taori et al. (2020) are trained or fine-tuned on the ImageNet dataset. Returning to the discussion in the introduction to this section - is training or adapting to the ImageNet dataset distribution the cause of the observed robustness gap? Intuitively, a zero-shot model should not be able to exploit spurious correlations or patterns that hold only on a specific distribution, since it is not trained on that distribution. Thus it is reasonable to expect zero-shot models to have much higher effective robustness. In Figure 13, we compare the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.
 Taori等人(2020)研究的几乎所有模型都在ImageNet数据集上进行了训练或微调。回到本节引言中的讨论——对ImageNet数据分布的训练或适配是否就是观察到的鲁棒性差距的根源?从直觉上说,零样本模型不应该能够利用仅在特定数据分布上存在的虚假相关性或模式,因为它并未在该分布上进行训练。因此,我们有理由预期零样本模型具有更高的有效鲁棒性。在图13中,我们将零样本CLIP与现有ImageNet模型在自然分布偏移下的性能进行了对比。所有零样本CLIP模型都大幅提升了有效鲁棒性,并将ImageNet准确率与分布偏移下准确率之间的差距缩小了最高达75%。

图13. 零样本CLIP模型比标准ImageNet模型对分布偏移具有更强的鲁棒性。(左图)理想鲁棒模型(虚线所示)在ImageNet数据分布和其他自然图像分布上表现同样出色。零样本CLIP模型将这种"鲁棒性差距"缩小了高达75%。对数几率转换值的线性拟合显示为自举法估计的95%置信区间。(右图)以香蕉类别为例可视化7个自然分布偏移数据集中的5个共享类别分布偏移情况。最佳零样本CLIP模型ViT-L/14@336px与在ImageNet验证集上表现相同的ResNet-101模型进行性能对比。

 While these results show that zero-shot models can be much more robust, they do not necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or use of natural language supervision could also result in much more robust models regardless of whether they are zero-shot or fine-tuned. As an initial experiment to potentially begin narrowing this down, we also measure how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2 regularized logistic regression classifier fit to CLIP features on the ImageNet training set. We visualize how performance changes from the zero-shot classifier in Figure 14. Although adapting CLIP to the ImageNet distribution increases its ImageNet accuracy by 9.2% to 85.4% overall, and ties the accuracy of the 2018 SOTA from Mahajan et al. (2018), average accuracy under distribution shift slightly decreases.
 尽管这些结果表明零样本模型可以具备更强的鲁棒性,但并不意味着ImageNet上的监督学习必然导致鲁棒性差距。CLIP的其他特性——例如其庞大且多样化的预训练数据集,或自然语言监督机制的使用——都可能造就更高鲁棒性的模型,无论采用零样本还是微调模式。作为缩小研究范围的初步实验,我们还测量了CLIP模型通过L2正则化逻辑回归分类器适应ImageNet数据分布后的性能变化(该分类器基于ImageNet训练集的CLIP特征训练)。图14展示了从零样本分类器到适应后的性能演变。虽然适应ImageNet分布使CLIP的ImageNet准确率整体提升9.2%、达到85.4%,追平了Mahajan等人2018年提出的SOTA水平,但其在分布偏移情况下的平均准确率却略有下降。

图14. 虽然针对ImageNet的有监督适配将ImageNet准确率提升了9.2%,但略微降低了平均鲁棒性。(左)相比Taori等人(2020)采用单一静态零样本ImageNet分类器并聚合相似类别预测的方法,为每个数据集定制零样本CLIP分类器可提升鲁棒性。经ImageNet适配的CLIP模型与先前最佳ImageNet模型具有相近的有效鲁棒性。(右)两种鲁棒性干预措施在各数据集准确率的具体变化。适配ImageNet显著提高了ImageNetV2的准确率,但牺牲了其他若干数据分布的准确率。特定数据集的零样本分类器可大幅提升准确率,但仅适用于少数包含不完全匹配ImageNet类别体系的数据集。

 It is surprising to see a 9.2% increase in accuracy, which corresponds to roughly 3 years of improvement in SOTA, fail to translate into any improvement in average performance under distribution shift. We also break down the differences between zero-shot accuracy and linear classifier accuracy per dataset in Figure 14 and find performance still increases significantly on one dataset, ImageNetV2. ImageNetV2 closely followed the creation process of the original ImageNet dataset which suggests that gains in accuracy from supervised adaptation are closely concentrated around the ImageNet distribution. Performance decreases by 4.7% on ImageNet-R, 3.8% on ObjectNet, 2.8% on ImageNet Sketch, and 1.9% on ImageNet-A. The change in accuracy on the two other datasets, Youtube-BB and ImageNet Vid, is insignificant.
 令人惊讶的是,精度提高了9.2%,这相当于SOTA(最先进技术)大约3年的改进,却未能在分布偏移下转化为平均性能的任何提升。我们在图14中还分解了每个数据集的零样本精度与线性分类器精度之间的差异,并发现性能仍在一个数据集ImageNetV2上显著提升。ImageNetV2紧密遵循了原始ImageNet数据集的创建过程,这表明监督适应带来的精度提升主要集中在ImageNet分布附近。性能在ImageNet-R上下降了4.7%,在ObjectNet上下降了3.8%,在ImageNet Sketch上下降了2.8%,在ImageNet-A上下降了1.9%。在另外两个数据集Youtube-BB和ImageNet Vid上,精度变化不明显。

 How is it possible to improve accuracy by 9.2% on the ImageNet dataset with little to no increase in accuracy under distribution shift? Is the gain primarily from “exploiting spurious correlations”? Is this behavior unique to some combination of CLIP, the ImageNet dataset, and the distribution shifts studied, or a more general phenomenon? Does it hold for end-to-end finetuning as well as linear classifiers? We do not have confident answers to these questions at this time. Prior work has also pre-trained models on distributions other than ImageNet, but it is common to study and release models only after they have been fine-tuned to ImageNet. As a step towards understanding whether pre-trained zero-shot models consistently have higher effective robustness than fine-tuned models, we encourage the authors of Mahajan et al. (2018), Kolesnikov et al. (2019), and Dosovitskiy et al. (2020) to, if possible, study these questions on their models as well.
 如何在ImageNet数据集上提升9.2%准确率,而数据分布偏移时准确率几乎不增长?这种提升主要来自“利用虚假相关性”吗?这种现象是CLIP模型、ImageNet数据集与所研究分布偏移的特有组合,还是更普遍规律?该结论是否适用于端到端微调与线性分类器?目前我们尚无明确答案。先前研究虽在非ImageNet分布上预训练模型,但通常只在微调至ImageNet后才进行研究发布。为探究预训练零样本模型是否始终比微调模型具有更高有效鲁棒性,我们建议Mahajan等人(2018)、Kolesnikov等人(2019)和Dosovitskiy等人(2020)的作者若条件允许,也能在其模型上研究这些问题。

 We also investigate another robustness intervention enabled by flexible zero-shot natural-language-based image classifiers. The target classes across the 7 transfer datasets are not always perfectly aligned with those of ImageNet. Two datasets, Youtube-BB and ImageNet-Vid, consist of superclasses of ImageNet. This presents a problem when trying to use the fixed 1000-way classifier of an ImageNet model to make predictions. Taori et al. (2020) handle this by max-pooling predictions across all sub-classes according to the ImageNet class hierarchy. Sometimes this mapping is much less than perfect. For the person class in Youtube-BB, predictions are made by pooling over the ImageNet classes for a baseball player, a bridegroom, and a scuba diver. With CLIP we can instead generate a custom zero-shot classifier for each dataset directly based on its class names. In Figure 14 we see that this improves average effective robustness by 5% but is concentrated in large improvements on only a few datasets. Curiously, accuracy on ObjectNet also increases by 2.3%. Although the dataset was designed to closely overlap with ImageNet classes, using the names provided for each class by ObjectNet’s creators still helps a small amount compared to using ImageNet class names and pooling predictions when necessary.
 我们还研究了另一种稳健性干预措施,这种措施通过灵活的零样本自然语言图像分类器实现。7个迁移数据集中的目标类别并不总是与ImageNet完美对应。其中Youtube-BB和ImageNet-Vid两个数据集包含ImageNet的超类。当尝试使用ImageNet模型的固定1000路分类器进行预测时,这会带来问题。Taori等人(2020)的处理方法是根据ImageNet类别层次结构在所有子类上进行最大池化预测。但这种映射有时远非完美。例如对于Youtube-BB中的“人”类别,预测是通过对棒球运动员、新郎和潜水员等ImageNet类别进行池化得出的。而使用CLIP时,我们可以直接基于每个数据集的类别名称生成定制化的零样本分类器。图14显示这种方法将平均有效稳健性提高了5%,但改进主要集中在少数数据集上。有趣的是,ObjectNet的准确率也提高了2.3%。尽管该数据集设计时与ImageNet类别高度重合,但使用ObjectNet创建者提供的类别名称相比使用ImageNet类别名称(必要时进行池化预测)仍能带来小幅提升。
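下面给出按数据集类别名直接构建零样本分类器的示意代码。`clip.load`、`clip.tokenize`、`encode_text`、`encode_image` 是 openai/clip 仓库公开的接口;提示模板这里只用了单一的 "a photo of a {}.",论文实际使用了更精细的提示集成,此处仅作流程示意。

```python
# 示意代码:直接用目标数据集的类别名构建零样本分类器(提示模板为简化版)
import clip, torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def build_zeroshot_classifier(class_names, template="a photo of a {}."):
    texts = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        weights = model.encode_text(texts)
        weights = weights / weights.norm(dim=-1, keepdim=True)   # 归一化后即为线性分类器权重
    return weights

def predict(images, zeroshot_weights):
    with torch.no_grad():
        feats = model.encode_image(images.to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return (100.0 * feats @ zeroshot_weights.T).argmax(dim=-1)

# 例如 Youtube-BB 的 "person" 类可以直接写成类别名,
# 而不必像固定 1000 路分类器那样对多个 ImageNet 子类的预测做池化:
# weights = build_zeroshot_classifier(["person", "car", "dog"])
```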

 While zero-shot CLIP improves effective robustness, Figure 14 shows that the benefit is almost entirely gone in a fully supervised setting. To better understand this difference, we investigate how effective robustness changes on the continuum from zero-shot to fully supervised. In Figure 15 we visualize the performance of 0-shot, 1-shot, 2-shot, 4-shot …, 128-shot, and fully supervised logistic regression classifiers on the best CLIP model’s features. We see that while few-shot models also show higher effective robustness than existing models, this benefit fades as in-distribution performance increases with more training data and is mostly, though not entirely, gone for the fully supervised model. Additionally, zero-shot CLIP is notably more robust than a few-shot model with equivalent ImageNet performance. Across our experiments, high effective robustness seems to result from minimizing the amount of distribution specific training data a model has access to, but this comes at a cost of reducing dataset-specific performance.
 尽管零样本CLIP提升了有效鲁棒性,但图14显示这种优势在完全监督场景下几乎完全消失。为了更好地理解这一差异,我们研究了从零样本到完全监督的连续过程中有效鲁棒性的变化规律。在图15中,我们可视化呈现了最佳CLIP模型特征上零样本、单样本、双样本、四样本直至128样本及完全监督逻辑回归分类器的性能表现。研究发现:虽然小样本模型同样展现出比现有模型更高的有效鲁棒性,但随着训练数据增加导致分布内性能提升时,这种优势会逐渐减弱——对于完全监督模型而言,该优势虽未完全消失但已大幅衰减。值得注意的是,在ImageNet性能相当的情况下,零样本CLIP的鲁棒性显著优于小样本模型。所有实验结果表明:要获得较高的有效鲁棒性,关键在于最小化模型接触到的分布特定训练数据量,但代价是牺牲针对特定数据集的性能表现。

图15. 少量样本的CLIP相较于现有ImageNet模型仍然提升了有效鲁棒性,但弱于零样本CLIP。减少用于适配的ImageNet训练数据量会以降低相对鲁棒性为代价提升有效鲁棒性。如图7先前所示,16样本逻辑回归CLIP在ImageNet上表现与零样本CLIP相当,但鲁棒性更差。

 Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and fewshot benchmarking on broad evaluation suites (as advocated by Yogatama et al. (2019) and Linzen (2020)) promotes the development of more robust systems and provides a more accurate assessment of performance. We are curious to see if the same results hold for zero-shot models in the field of NLP such as the GPT family. While Hendrycks et al. (2020b) has reported that pre-training improves relative robustness on sentiment analysis, Miller et al. (2020)'s study of the robustness of question answering models under natural distribution shift finds, similar to Taori et al. (2020), little evidence of effective robustness improvements to date.
 综合来看,这些研究结果表明:当前大规模任务与数据集无关的预训练趋势,结合面向广义评估套件的零样本/小样本基准测试转型(如Yogatama等人(2019)和Linzen(2020)所倡导的),既能促进更鲁棒系统的开发,又能提供更精准的性能评估。我们很好奇GPT家族等NLP领域的零样本模型是否具有相同特性。尽管Hendrycks等人(2020b)发现预训练提升了情感分析的相对鲁棒性,但Miller等人(2020)对自然分布偏移下问答模型鲁棒性的研究——与Taori等人(2020)的结论相似——迄今尚未发现有效鲁棒性提升的有力证据。

4. Comparison to Human Performance 与人类表现对比

 How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform in similar evaluation settings to CLIP, we evaluated humans on one of our tasks. We wanted to get a sense of how strong human zero-shot performance is at these tasks, and how much human performance is improved if they are shown one or two image samples. This can help us to compare task difficulty for humans and CLIP, and identify correlations and differences between them.
 CLIP与人类表现及人类学习方式相比如何?为了更好地理解人类在与CLIP类似的评估环境中的表现,我们在其中一项任务中对人类进行了评估。我们想了解人类在这些任务中的零样本表现有多强,以及当他们看到一两张示例图片时表现能提升多少。这有助于我们比较任务对人类和CLIP的难度,并识别两者之间的关联与差异。

 We had five different humans look at each of 3669 images in the test split of the Oxford-IIIT Pets dataset (Parkhi et al., 2012) and select which of the 37 cat or dog breeds best matched the image (or ‘I don’t know’ if they were completely uncertain). In the zero-shot case the humans were given no examples of the breeds and asked to label them to the best of their ability without an internet search. In the one-shot experiment the humans were given one sample image of each breed and in the two-shot experiment they were given two sample images of each breed.
 我们让五位不同人员查看Oxford-IIIT Pets数据集测试集(Parkhi等人,2012)中的3669张图像,并选择与图像最匹配的37种猫狗品种(若完全无法确定则选择“我不知道”)。在零样本实验中,受试者未获得任何品种示例图片,仅凭自身认知进行标注且不可网络搜索。单样本实验则提供每个品种的一张示例图片,双样本实验则提供每个品种的两张示例图片。

 One possible concern was that the human workers were not sufficiently motivated in the zero-shot task. High human accuracy of 94% on the STL-10 dataset (Coates et al., 2011) and 97-100% accuracy on the subset of attention check images increased our trust in the human workers.
 一个可能的担忧是,人类工作者在零样本任务中缺乏足够动力。但他们在STL-10数据集上94%的高准确率(Coates等人,2011年)以及对注意力检查图像子集97%-100%的准确率,增强了我们对人类工作者的信任。

 Interestingly, humans went from a performance average of 54% to 76% with just one training example per class, and the marginal gain from an additional training example is minimal. The gain in accuracy going from zero to one shot is almost entirely on images that humans were uncertain about. This suggests that humans “know what they don’t know” and are able to update their priors on the images they are most uncertain in based on a single example. Given this, it seems that while CLIP is a promising training strategy for zero-shot performance (Figure 5) and does well on tests of natural distribution shift (Figure 13), there is a large difference between how humans learn from a few examples and the few-shot methods in this paper.
 有趣的是,人类在每个类别仅接受一次训练示例的情况下,表现准确率就从54%提升至76%,而增加额外训练示例带来的边际收益微乎其微。从零样本到单样本的准确率提升,几乎完全体现在人类最初难以判断的图像上。这表明人类“清楚自己的认知盲区”,并能基于单个示例更新其最不确定图像的先验知识。由此可见,虽然CLIP在零样本性能(图5)和自然分布偏移测试(图13)中表现优异,但人类从少量样本中学习的方式与本文提出的少样本学习方法存在显著差异。

 This suggests that there are still algorithmic improvements waiting to be made to decrease the gap between machine and human sample efficiency, as noted by Lake et al. (2016) and others. Because these few-shot evaluations of CLIP don’t make effective use of prior knowledge and the humans do, we speculate that finding a method to properly integrate prior knowledge into few-shot learning is an important step in algorithmic improvements to CLIP. To our knowledge, using a linear classifier on top of the features of a high-quality pre-trained model is near state-of-the-art for few-shot learning (Tian et al., 2020), which suggests that there is a gap between the best few-shot machine learning methods and human few-shot learning.
 这表明,正如Lake等人(2016年)及其他研究者所指出的,仍需进行算法改进以缩小机器与人类在样本效率上的差距。由于CLIP的这些少样本评估未能有效利用先验知识,而人类却能做到,我们推测找到一个将先验知识恰当整合到少样本学习中的方法,将是CLIP算法改进的重要一步。据我们所知,在高质量预训练模型的特征上使用线性分类器已接近少样本学习的最先进水平(Tian等人,2020年),这表明当前最优的机器学习少样本方法与人类少样本学习之间仍存在差距。

 If we plot human accuracy vs CLIP’s zero-shot accuracy (Figure 16), we see that the hardest problems for CLIP are also hard for humans. To the extent that errors are consistent, our hypothesis is that this is due to at least two factors: noise in the dataset (including mislabeled images) and out-of-distribution images being hard for both humans and models.
 如果我们绘制人类准确率与CLIP零样本准确率的对比图(图16),会发现CLIP最难以解决的问题对人类同样具有挑战性。从错误一致性的角度来看,我们的假设认为这至少源于两个因素:数据集的噪声(包括错误标注的图像)以及超出分布范围的图像——这些对人类和模型而言都难以处理。

图 16. CLIP 最难解决的问题往往也是人类最难解决的问题。在此我们根据 CLIP 预测正确标签的概率,对图像分类难度进行排序。

5. Data Overlap Analysis 数据重叠分析

 A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is important to investigate since, in a worst-case scenario, a complete copy of an evaluation dataset could leak into the pre-training dataset and invalidate the evaluation as a meaningful test of generalization. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis. Adding a new evaluation would require an expensive re-train or risk reporting an un-quantified benefit due to overlap.
 在超大规模互联网数据集上进行预训练时,一个值得关注的问题是可能无意中与下游评估数据发生重叠。这项研究至关重要,因为在最坏情况下,评估数据集的完整副本可能混入预训练数据,导致评估结果无法真实反映模型的泛化能力。防范措施之一是预先识别并删除所有重复数据。虽然这种方法能确保报告真实的保留测试性能,但需要提前知晓模型可能涉及的所有评估数据。这会限制基准测试和分析的范围,其弊端在于:若新增评估指标,要么需要代价高昂的模型重新训练,要么可能因数据重叠而报告无法量化的性能提升。

 Instead, we document how much overlap occurs and how performance changes due to these overlaps. To do this, we use the following procedure:
 相反,我们记录了重叠发生的程度以及这些重叠对性能的影响。为此,我们采用了以下程序:

  1. For each evaluation dataset, we run a duplicate detector (see Appendix C) on its examples. We then manually inspect the found nearest neighbors and set a per dataset threshold to keep high precision while maximizing recall. Using this threshold, we then create two new subsets, Overlap, which contains all examples which have a similarity to a training example above the threshold, and Clean, which contains all examples that are below this threshold. We denote the unaltered full dataset All for reference. From this we first record the degree of data contamination as the ratio of the number of examples in Overlap to the size of All.

1) 对于每个评估数据集,我们对其样本运行重复检测器(参见附录C)。随后通过人工检查发现的最近邻样本,为每个数据集设定阈值,在保持高精度的同时最大化召回率。基于该阈值,我们创建两个新子集:Overlap(包含与训练样本相似度超过阈值的所有样本)和Clean(包含相似度低于阈值的所有样本)。原始完整数据集记为All作为参照。我们首先通过计算Overlap样本数量与All规模的比值来记录数据污染程度。

  2. We then compute the zero-shot accuracy of CLIP RN50x64 on the three splits and report All - Clean as our main metric. This is the difference in accuracy due to contamination. When positive it is our estimate of how much the overall reported accuracy on the dataset was inflated by over-fitting to overlapping data.

2) 我们随后计算了CLIP RN50x64模型在三个数据划分上的零样本准确率,并将“总集准确率减去清洁集准确率”(All - Clean)作为主要指标。该差值反映了数据污染导致的准确率差异。当结果为正值时,它表示由于模型对重叠数据的过拟合导致整个数据集的报告准确率被夸大的程度。

  3. The amount of overlap is often small so we also run a binomial significance test where we use the accuracy on Clean as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. We also calculate 99.5% Clopper-Pearson confidence intervals on Dirty as another check.

3) 重叠部分通常较小,因此我们还运行了一项二项式显著性检验:使用Clean子集的准确率作为零假设,并计算Overlap子集的单尾(更大)p值。同时,我们计算了Dirty子集准确率的99.5%克洛佩尔-皮尔逊(Clopper-Pearson)置信区间作为另一项验证指标。
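下面是上述第2、3步统计检验的一个示意实现:单尾二项检验用 `scipy.stats.binomtest`,99.5% 的 Clopper-Pearson 区间用 Beta 分布的分位数给出。这只是按文中描述写的草图,示例数字为虚构,并非论文原始分析脚本。

```python
# 示意代码:Overlap 子集上的单尾二项检验与 Clopper-Pearson 置信区间
from scipy.stats import binomtest, beta

def overlap_significance(n_correct_overlap, n_overlap, clean_acc, alpha=0.005):
    # 零假设:Overlap 子集的准确率等于 Clean 子集的准确率;备择假设:更高
    p_value = binomtest(n_correct_overlap, n_overlap, p=clean_acc,
                        alternative="greater").pvalue
    # Clopper-Pearson(精确)置信区间;alpha=0.005 对应正文的 99.5% 置信水平
    k, n = n_correct_overlap, n_overlap
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return p_value, (lower, upper)

# 虚构示例:重叠子集 100 张中答对 90 张,Clean 子集准确率 0.85
# print(overlap_significance(n_correct_overlap=90, n_overlap=100, clean_acc=0.85))
```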

 A summary of this analysis is presented in Figure 17. Out of 35 datasets studied, 9 datasets have no detected overlap at all. Most of these datasets are synthetic or specialized making them unlikely to be posted as normal images on the internet (for instance MNIST, CLEVR, and GTSRB) or are guaranteed to have no overlap due to containing novel data from after the date our dataset was created (ObjectNet and Hateful Memes). This demonstrates our detector has a low-false positive rate which is important as false positives would under-estimate the effect of contamination in our analysis. There is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1% with only 7 datasets above this threshold. Of these, only 2 are statistically significant after Bonferroni correction. The max detected improvement is only 0.6% on Birdsnap which has the second largest overlap at 12.1%. The largest overlap is for Country211 at 21.5%. This is due to it being constructed out of YFCC100M, which our pre-training dataset contains a filtered subset of. Despite this large overlap there is only a 0.2% increase in accuracy on Country211. This may be because the training text accompanying an example is often not related to the specific task a downstream eval measures. Country211 measures geo-localization ability, but inspecting the training text for these duplicates showed they often do not mention the location of the image.
 图17展示了该分析的总结。在研究的35个数据集中,有9个数据集完全未检测到重叠。这些数据集大多为合成数据或专用数据(例如MNIST、CLEVR和GTSRB),因此不太可能作为普通图像发布在互联网上;或因包含我们数据集创建日期后的新数据(如ObjectNet和Hateful Memes)而被确保无重叠。这表明我们的检测器具有较低误报率——这点至关重要,因为误报会低估分析中数据污染的影响。检测结果显示中位重叠率为2.2%,平均重叠率为3.2%。由于重叠量较小,总体准确率变化通常不超过0.1%,仅有7个数据集超出此阈值。经Bonferroni校正后,其中仅2个数据集具有统计学显著性。最大检测到的改进仅出现在Birdsnap数据集(重叠率第二高,为12.1%),准确率提升0.6%。而重叠率最高的是Country211数据集(21.5%),因其构建自YFCC100M——我们的预训练数据集包含其过滤子集。尽管重叠率很高,Country211的准确率仅提升0.2%,这可能因为训练文本常与下游评估任务不相关:Country211评估地理定位能力,但检查重复样本的训练文本发现,它们往往未提及图像位置。


图17:数据重叠检测对准确率的显著提升有限。(左)虽然部分数据集的零样本准确率在重叠样本与纯净样本间存在±20%的显著差异,但35个数据集中仅有5个的99.5% Clopper-Pearson置信区间排除了0%准确率差异的可能性,其中2个数据集在重叠数据上表现更差。(右)由于检测到的重叠样本占比通常不足10%,整体测试准确率因重叠带来的最大增益仅为Birdsnap数据集的0.6%。同样,单侧二项式检验显示仅有6个数据集的准确率提升具有统计显著性。

 We are aware of two potential concerns with our analysis. First our detector is not perfect. While it achieves near 100% accuracy on its proxy training task and manual inspection + threshold tuning results in very high precision with good recall among the found nearest-neighbors, we can not tractably check its recall across 400 million examples. Another potential confounder of our analysis is that the underlying data distribution may shift between the Overlap and Clean subsets. For example, on Kinetics-700 many “overlaps” are in fact all black transition frames. This explains why Kinetics-700 has an apparent 20% accuracy drop on Overlap. We suspect more subtle distribution shifts likely exist. One possibility we noticed on CIFAR-100 is that, due to the very low resolution of its images, many duplicates were false positives of small objects such as birds or planes. Changes in accuracy could instead be due to changes in the class distribution or difficulty of the duplicates. Unfortunately, these distribution and difficulty shifts could also mask the effects of over-fitting.
 我们认识到分析中存在的两个潜在问题。首先,我们的检测器并不完美。虽然它在代理训练任务上实现了接近100%的准确率,且通过人工检查与阈值调整能在发现的最近邻样本中保持高精确度和良好召回率,但我们无法切实验证其在四亿个样本中的召回表现。另一个潜在干扰因素是基础数据分布在Overlap与Clean子集之间可能存在偏移。例如在Kinetics-700数据集中,许多“重叠样本”实际上是全黑过渡帧,这解释了为何该数据集在Overlap上会出现20%的准确率骤降。我们推测还存在更微妙的数据分布偏移现象——在CIFAR-100中我们注意到,由于其图像分辨率极低,许多重复样本实际上是鸟类或飞机等小物体的误判案例。准确率变化也可能源自类分布变化或重复样本的识别难度差异。遗憾的是,这些数据分布与难度偏移也可能掩盖过拟合效应的真实表现。

 However, these results closely follow the findings of similar duplicate analysis in previous work on large scale pretraining. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates and found minimal changes in overall performance. Importantly, Kolesnikov et al. (2019) also compared the alternative de-duplication strategy discussed in the introduction to this section with the approach we settled on and observed little difference between the two approaches.
 然而,这些结果与先前关于大规模预训练工作的类似重复分析结果高度吻合。Mahajan等人(2018年)和Kolesnikov等人(2019年)发现了相似的重叠率,并发现整体性能变化微乎其微。值得注意的是,Kolesnikov等人(2019年)还将本节引言中讨论的替代去重策略与我们最终采用的方法进行了比较,发现两种方法之间差异甚微。

6. Limitations 局限性

 There are still many limitations to CLIP. While several of these are discussed as part of analysis in various sections, we summarize and collect them here.
 CLIP仍存在许多局限性。尽管我们在不同章节的分析中已讨论了其中一些问题,但在此我们予以汇总整理。
 On datasets with training splits, the performance of zero-shot CLIP is on average competitive with the simple supervised baseline of a linear classifier on top of ResNet-50 features. On most of these datasets, the performance of this baseline is now well below the overall state of the art. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, we estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware. Further research into improving upon the computational and data efficiency of CLIP will be necessary.
 在使用训练集划分的数据集上,零样本CLIP模型的平均性能与基于ResNet-50特征的线性分类器这一简单监督基线模型相当。在大多数此类数据集中,该基线模型的性能现已远低于整体最先进水平。要提升CLIP的任务学习和迁移能力仍需大量工作。虽然模型规模的扩大迄今持续带来性能提升,并为持续改进指明方向,但我们预估零样本CLIP要达到整体最先进性能仍需约1000倍的计算量增长。这在当前硬件条件下是无法实现的训练规模。未来有必要进一步研究如何提升CLIP的计算效率和数据效率。

 Analysis in Section 3.1 found that CLIP’s zero-shot performance is still quite weak on several kinds of tasks. When compared to task-specific models, the performance of CLIP is poor on several types of fine-grained classification such as differentiating models of cars, species of flowers, and variants of aircraft. CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally for novel tasks which are unlikely to be included in CLIP’s pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP’s performance can be near random. We are confident that there are still many, many, tasks where CLIP’s zero-shot performance is near chance level.
 第3.1节的分析发现,CLIP在多种任务上的零样本性能仍然较弱。与专用任务模型相比,CLIP在细粒度分类任务(如区分汽车型号、花卉品种和飞机类型)上表现欠佳。该模型在处理更抽象和系统化的任务(例如统计图像中的物体数量)时也存在困难。最后,对于CLIP预训练数据集中不太可能包含的新颖任务(例如判断照片中最近车辆的距离),CLIP的表现近乎随机。我们确信,仍有大量任务的零样本性能接近随机猜测水平。

 While zero-shot CLIP generalizes well to many natural image distributions as investigated in Section 3.3, we’ve observed that zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it. An illustrative example occurs for the task of OCR as reported in Appendix E. CLIP learns a high quality semantic OCR representation that performs well on digitally rendered text, which is common in its pre-training dataset, as evidenced by performance on Rendered SST2. However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST. An embarrassingly simple baseline of logistic regression on raw pixels outperforms zero-shot CLIP. Both semantic and near-duplicate nearest-neighbor retrieval verify that there are almost no images that resemble MNIST digits in our pre-training dataset. This suggests CLIP does little to address the underlying problem of brittle generalization of deep learning models. Instead CLIP tries to circumvent the problem and hopes that by training on such a large and varied dataset that all data will be effectively in-distribution. This is a naive assumption that, as MNIST demonstrates, is easy to violate.
  如第3.3节所述,虽然零样本CLIP在许多自然图像分布上表现出良好的泛化能力,但我们观察到它对真正超出分布范围的数据仍然泛化不佳。附录E报告的OCR任务就是一个典型例子:CLIP学习了高质量的语义OCR表征,在数字化渲染文本(其预训练数据集中常见,如Rendered SST2上的表现所示)上效果良好,但在MNIST手写数字数据集上仅达到88%准确率——甚至不如原始像素逻辑回归这种极其简单的基线方法。通过语义检索和近似重复检索均证实,我们的预训练数据集中几乎不存在类似MNIST数字的图像。这表明CLIP并未解决深度学习模型脆弱泛化的根本问题,而是试图通过海量多样化数据的训练,寄望于所有数据都能有效落入分布范围。正如MNIST所证明的,这种天真的假设很容易被打破。
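为直观说明上文提到的“原始像素逻辑回归”这一基线有多简单,这里给出一个示意脚本。数据加载方式用 torchvision 只是假设性的实现选择,注释中的准确率数字为常见经验值而非论文报告值。

```python
# 示意代码:MNIST 原始像素逻辑回归基线(加载方式与超参均为示例)
from torchvision.datasets import MNIST
from sklearn.linear_model import LogisticRegression

train = MNIST(root="./data", train=True,  download=True)
test  = MNIST(root="./data", train=False, download=True)

X_tr = train.data.numpy().reshape(len(train), -1) / 255.0   # [60000, 784] 像素展平并归一化
X_te = test.data.numpy().reshape(len(test), -1) / 255.0
y_tr, y_te = train.targets.numpy(), test.targets.numpy()

clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)       # 多分类 softmax 回归
print("raw-pixel LR acc:", clf.score(X_te, y_te))            # 经验上约 0.92,高于正文中零样本 CLIP 的 88%
```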

 Although CLIP can flexibly generate zero-shot classifiers for a wide variety of tasks and datasets, CLIP is still limited to choosing from only those concepts in a given zero-shot classifier. This is a significant restriction compared to a truly flexible approach like image captioning which could generate novel outputs. Unfortunately, as described in Section 2.3 we found the computational efficiency of the image caption baseline we tried to be much lower than CLIP. A simple idea worth trying is joint training of a contrastive and generative objective with the hope of combining the efficiency of CLIP with the flexibility of a caption model. As another alternative, search could be performed at inference time over many natural language explanations of a given image, similar to the approach proposed in Learning with Latent Language (Andreas et al., 2017).
 尽管CLIP能灵活生成适用于各类任务与数据集的零样本分类器,但其选择范围仍仅限于给定零样本分类器中预设的概念类别。与图像描述这类能生成新颖输出的真正灵活方法相比,这种限制尤为明显。然而如第2.3节所述,我们发现尝试的图像描述基线模型在计算效率上远低于CLIP。一个值得尝试的简单思路是对比目标与生成目标进行联合训练,以期结合CLIP的高效性与描述模型的灵活性。另一种替代方案是借鉴Andreas等人(2017)在《Learning with Latent Language》中提出的方法,在推理时基于给定图像的多种自然语言解释进行搜索。

 CLIP also does not address the poor data efficiency of deep learning. Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.
 CLIP同样没有解决深度学习数据效率低下的问题,而是通过采用可扩展至数亿训练样本的监督信号源进行弥补。若以每秒展示一张图像的速度呈现CLIP模型训练期间所见的全部图像,完成32个训练周期中128亿张图像的遍历将耗时405年。鉴于自监督学习(Henaff, 2020; Chen等, 2020c)与自训练方法(Lee; Xie等, 2020)已被证实能提升标准监督学习的数据效率,将CLIP与这些方法结合是一个颇具前景的研究方向。

 Our methodology has several significant limitations. Despite our focus on zero-shot transfer, we repeatedly queried performance on full validation sets to guide the development of CLIP. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we have reported results on Kornblith et al. (2019)'s 12 dataset evaluation suite as a standardized collection, our main results use a somewhat haphazardly assembled collection of 27 datasets that is undeniably co-adapted with the development and capabilities of CLIP. Creating a new benchmark of tasks designed explicitly to evaluate broad zero-shot transfer capabilities, rather than re-using existing supervised datasets, would help address these issues.
 我们的方法存在若干显著局限性。尽管我们关注零样本迁移,但在开发CLIP过程中反复使用完整验证集来查询性能。这些验证集通常包含数千个样本,这与真实零样本场景不符。半监督学习领域也曾提出类似问题(Oliver等人,2018)。另一个潜在问题在于评估数据集的选择:虽然我们报告了Kornblith等人(2019)12个数据集标准化评估套件的结果,但主要实验结果使用的是随意组装的27个数据集,这些数据不可避免地与CLIP的开发过程及能力存在协同适应。若能创建专为评估广义零样本迁移能力而设计的新基准任务(而非复用现有监督数据集),将有助于解决这些问题。

 CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models (Bhargava & Forsyth, 2019). We refer readers to Section 7 for detailed analysis and quantification of these behaviors for CLIP as well as discussion of potential mitigation strategies.
 CLIP模型是通过互联网上的图文配对数据进行训练的。这些未经筛选和整理的图文对导致CLIP模型习得了诸多社会偏见。此前在图像描述模型中就曾发现过类似现象(Bhargava & Forsyth, 2019)。关于CLIP模型中这些行为的具体分析与量化评估,以及潜在缓解策略的讨论,我们建议读者参阅第7章节内容。

 While we have emphasized throughout this work that specifying image classifiers through natural language is a flexible and general interface, it has its own limitations. Many complex tasks and visual concepts can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. In our work, we fall back to fitting linear classifiers on top of CLIP’s features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting. As discussed in Section 4, this is notably different from human performance which shows a large increase from a zero to a one shot setting. Future work is needed to develop methods that combine CLIP’s strong zero-shot performance with efficient few-shot learning.
 虽然我们在工作中始终强调,通过自然语言指定图像分类器是一种灵活通用的接口,但它也存在自身的局限性。许多复杂任务和视觉概念仅靠文本难以精确描述。虽然实际训练样本的作用毋庸置疑,但CLIP并未直接针对小样本学习性能进行优化。我们的解决方案是在CLIP特征之上拟合线性分类器,这导致从零样本过渡到小样本时会出现违反直觉的性能下降现象。如第4节所述,这与人类表现形成鲜明对比——人类在从零样本转为单样本时通常表现大幅提升。未来研究需要开发新方法,将CLIP强大的零样本性能与高效的小样本学习相结合。

7. Broader Impacts 更广泛的影响

 CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. One can give it images of cats and dogs and ask it to classify cats, or give it images taken in a department store and ask it to classify shoplifters–a task with significant social implications and for which AI may be unfit. Like any image classification system, CLIP’s performance and fitness for purpose need to be evaluated, and its broader impacts analyzed in context. CLIP also introduces a capability that will magnify and alter such issues: CLIP makes it possible to easily create your own classes for categorization (to ‘roll your own classifier’) without a need for re-training. This capability introduces challenges similar to those found in characterizing other, large-scale generative models like GPT-3 (Brown et al., 2020); models that exhibit non-trivial zero-shot (or fewshot) generalization can have a vast range of capabilities, many of which are made clear only after testing for them.
 CLIP因其能够执行任意图像分类任务而具有广泛的应用能力。它可以接受猫狗图片进行猫科分类,也可以分析商场监控画面来识别扒手——这项具有重大社会影响的任务可能并不适合AI处理。与其他图像分类系统一样,需要评估CLIP的性能是否符合使用目的,并在具体情境中分析其更广泛的影响。该模型还带来了一项会放大并改变此类问题的新能力:用户无需重新训练就能轻松创建自定义分类类别(即"定制专属分类器")。这种能力带来的挑战类似于GPT-3等大型生成模型的特性表征问题(Brown等人,2020年);当模型展现出卓越的零样本(或少样本)泛化能力时,其潜在功能范围可能极其广泛,许多能力只有经过专项测试才能显现。

 Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.
 我们对CLIP在零样本设置下的研究表明,该模型在图像检索或搜索等广泛适用性任务中展现出巨大潜力。例如,它可以根据文本在数据库中查找相关图像,或根据图像查找相关文本。此外,由于引导CLIP转向定制化应用仅需极少甚至无需额外数据或训练,这种相对便捷性有望催生当今难以预见的各种创新应用场景,正如过去几年大型语言模型所展现的发展轨迹。

 In addition to the more than 30 datasets studied in earlier sections of this paper, we evaluate CLIP’s performance on the FairFace benchmark and undertake exploratory bias probes. We then characterize the model’s performance in a downstream task, surveillance, and discuss its usefulness as compared with other available systems. Many of CLIP’s capabilities are omni-use in nature (e.g. OCR can be used to make scanned documents searchable, to power screen reading technologies, or to read license plates). Several of the capabilities measured, from action recognition, object classification, and geo-localization, to facial emotion recognition, can be used in surveillance. Given its social implications, we address this domain of use specifically in the Surveillance section.
 除本文前文研究的30多个数据集外,我们还在FairFace基准上评估了CLIP的性能,并进行了探索性偏见探测。随后我们以监控这一下游任务为例表征模型性能,并与其他现有系统对比讨论其适用性。CLIP的许多能力本质上具有通用性(例如OCR可用于扫描文档检索、驱动屏幕阅读技术或读取车牌)。从行为识别、物体分类、地理定位到面部情绪识别,测评的多种能力均可用于监控领域。鉴于其社会影响,我们将在"监控应用"章节专门探讨这一使用场景。

 We have also sought to characterize the social biases inherent to the model. Our bias tests represent our initial efforts to probe aspects of how the model responds in different scenarios, and are by nature limited in scope. CLIP and models like it will need to be analyzed in relation to their specific deployments to understand how bias manifests and identify potential interventions. Further community exploration will be required to develop broader, more contextual, and more robust testing schemes so that AI developers can better characterize biases in general purpose computer vision models.
 我们还试图描述模型中固有的社会偏见。我们的偏见测试代表了我们探索模型在不同场景中反应方式的初步尝试,其范围本质上有限。CLIP及类似模型需要结合具体应用场景进行分析,以理解偏见如何显现并确定可能的干预措施。未来需要更广泛的社群探索来开发更全面、更贴合情境且更鲁棒的测试方案,以便AI开发者能更好地描述通用计算机视觉模型中的偏见特征。

7.1. Bias 偏见

 Algorithmic decisions, training data, and choices about how classes are defined and taxonomized (which we refer to informally as “class design”) can all contribute to and amplify social biases and inequalities resulting from the use of AI systems (Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000). Class design is particularly relevant to models like CLIP, since any developer can define a class and the model will provide some result.
 算法决策、训练数据以及类别定义和分类方式的选择(我们非正式地称之为"类别设计")都可能加剧并放大人工智能系统使用所造成的社会偏见和不平等现象(Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000)。类别设计对于像CLIP这样的模型尤为重要——因为任何开发者都可以定义一个类别,而模型总会给出某种结果。

 In this section, we provide preliminary analysis of some of the biases in CLIP, using bias probes inspired by those outlined in Buolamwini & Gebru (2018) and Kärkkäinen & Joo (2019). We also conduct exploratory bias research intended to find specific examples of biases in the model, similar to that conducted by Solaiman et al. (2019).
 在这一部分,我们运用Buolamwini & Gebru (2018)以及Kärkkäinen & Joo (2019)提出的偏差探测方法,对CLIP模型中的部分偏见进行初步分析。同时,我们参照Solaiman等人(2019)的研究方法,开展探索性偏见研究,旨在发现该模型中存在的具体偏见案例。

 We start by analyzing the performance of Zero-Shot CLIP on the face image dataset FairFace (Kärkkäinen & Joo, 2019) as an initial bias probe, then probe the model further to surface additional biases and sources of biases, including class design.
 我们首先分析Zero-Shot CLIP在人脸图像数据集FairFace(Kärkkäinen & Joo,2019)上的表现作为初始偏差探测,然后进一步探究模型以揭示更多偏差及其来源,包括类别设计。

 We evaluated two versions of CLIP on the FairFace dataset: a zero-shot CLIP model (“ZS CLIP”), and a logistic regression classifier fitted to FairFace’s dataset on top of CLIP’s features (“LR CLIP”). We find that LR CLIP gets higher accuracy on the FairFace dataset than both the ResNext-101 32x48d Instagram model (“Linear Probe Instagram”) (Mahajan et al., 2018) and FairFace’s own model on most of the classification tests we ran. ZS CLIP’s performance varies by category and is worse than that of FairFace’s model for a few categories, and better for others. (See Table 3 and Table 4).
 我们在FairFace数据集上评估了CLIP的两种版本:零样本CLIP模型(“ZS CLIP”),以及在CLIP特征基础上针对FairFace数据集拟合的逻辑回归分类器(“LR CLIP”)。研究发现,在我们进行的大多数分类测试中,LR CLIP在FairFace数据集上的准确率都高于ResNext-101 32x48d Instagram模型(“Linear Probe Instagram”)(Mahajan等人,2018)和FairFace自身模型7。ZS CLIP的表现因类别而异,在某些类别上表现不如FairFace模型,而在其他类别上表现更优。(参见表3和表4)
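
The paper does not include code for this probe, but a minimal sketch of how an “LR CLIP”-style classifier could be fitted is shown below. It assumes the `clip` package from the repository linked above together with scikit-learn; `load_fairface_images` is a hypothetical helper (FairFace itself must be obtained separately), and the hyperparameters are illustrative rather than the ones used in the paper.

```python
# Sketch of an "LR CLIP"-style probe: freeze CLIP's image encoder, extract
# features, and fit a logistic regression classifier on top of them.
# `load_fairface_images` is a hypothetical helper returning (PIL images, labels).
import torch
import clip
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def extract_features(images):
    """Encode a list of PIL images with the frozen CLIP image encoder."""
    feats = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0).to(device)
            f = model.encode_image(x)
            feats.append(f / f.norm(dim=-1, keepdim=True))  # L2-normalize
    return torch.cat(feats).float().cpu().numpy()

# Hypothetical loaders; the FairFace data itself must be obtained separately.
train_images, train_labels = load_fairface_images(split="train")
val_images, val_labels = load_fairface_images(split="val")

X_train, X_val = extract_features(train_images), extract_features(val_images)

# "LR CLIP": logistic regression fitted to a FairFace attribute (e.g. gender)
# on top of frozen CLIP features; C is an illustrative value, not a tuned one.
probe = LogisticRegression(max_iter=1000, C=1.0)
probe.fit(X_train, train_labels)
print("validation accuracy:", probe.score(X_val, val_labels))
```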


表3. FairFace类别“白人”图像在种族、性别和年龄分类上的准确率百分比


表4. FairFace类别(“黑人”、“印度人”、“东亚人”、“东南亚人”、“中东人”和“拉丁裔”,合并为FairFace类别“非白人”)中图像的种族、性别和年龄分类准确率百分比

 Additionally, we test the performance of the LR CLIP and ZS CLIP models across intersectional race and gender categories as they are defined in the FairFace dataset. We find that model performance on gender classification is above 95% for all race categories. Table 5 summarizes these results.
 此外,我们测试了LR CLIP和ZS CLIP模型在FairFace数据集中定义的跨种族和性别类别上的表现。我们发现,所有种族类别下的性别分类模型准确率均超过95%。表5总结了这些结果。


表5. 按FairFace种族类别划分的图像性别分类准确率百分比

 While LR CLIP achieves higher accuracy than the Linear Probe Instagram model on the FairFace benchmark dataset for gender, race and age classification of images by intersectional categories, accuracy on benchmarks offers only one approximation of algorithmic fairness, as Raji et al. (2020) have shown, and often fails as a meaningful measure of fairness in real world contexts. Even if a model has both higher accuracy and lower disparities in performance on different sub-groups, this does not mean it will have lower disparities in impact (Scheuerman et al., 2019). For example, higher performance on underrepresented groups might be used by a company to justify their use of facial recognition, and to then deploy it in ways that affect demographic groups disproportionately. Our use of facial classification benchmarks to probe for biases is not intended to imply that facial classification is an unproblematic task, nor to endorse the use of race, age, or gender classification in deployed contexts.
 虽然LR CLIP模型在FairFace基准数据集上对图像按交叉类别(性别、种族和年龄)分类时,其准确率高于线性探测Instagram模型,但正如Raji等人(2020年)所证明的,基准测试的准确率仅能作为算法公平性的一个近似指标,在现实场景中往往无法成为衡量公平性的有效标准。即使某个模型在不同子群体中表现出更高的准确率和更低的性能差异,这也不意味着其实际影响差异会更小(Scheuerman等人,2019年)。例如,企业可能利用对少数群体更高的识别准确率,来合理化其人脸识别技术的使用,继而以不成比例影响特定人口群体的方式部署该技术。我们使用面部分类基准来检测偏见,并不意味着面部分类本身不存在问题,也不代表我们支持在实际应用中实施种族、年龄或性别分类。

 We also probed the model using classification terms with high potential to cause representational harm, focusing on denigration harms in particular (Crawford, 2017). We carried out an experiment in which the ZS CLIP model was required to classify 10,000 images from the FairFace dataset. In addition to the FairFace classes, we added in the following classes: ‘animal’, ‘gorilla’, ‘chimpanzee’, ‘orangutan’, ‘thief’, ‘criminal’ and ‘suspicious person’. The goal of this experiment was to check if harms of denigration disproportionately impact certain demographic subgroups.
 我们还使用极有可能造成表征性伤害的分类术语对模型进行了测试,尤其关注贬低性伤害(Crawford,2017年)。我们进行了一项实验,要求零样本CLIP模型对FairFace数据集中的10,000张图像进行分类。除了FairFace原有的分类类别外,我们还添加了以下类别:“动物”、“大猩猩”、“黑猩猩”、“猩猩”、“小偷”、“罪犯”和“可疑人员”。该实验的目的是检验贬低性伤害是否会不成比例地影响某些人口统计子群体。
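
To make the class-design aspect of this probe concrete, the sketch below runs zero-shot classification over a developer-defined class list that mixes FairFace-style categories with the additional probe classes. It assumes the `clip` package from the linked repository; the prompt template, the abridged class list, and the image path are illustrative assumptions rather than the exact setup used in the experiment.

```python
# Sketch of a zero-shot probe with developer-defined classes: the class list
# mixes FairFace-style categories with the additional probe classes, and the
# model returns a prediction for whatever classes are supplied.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = [
    "white man", "white woman", "black man", "black woman",  # FairFace-style classes (abridged)
    "animal", "gorilla", "chimpanzee", "orangutan",           # non-human probe classes
    "thief", "criminal", "suspicious person",                 # crime-related probe classes
]
text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)  # illustrative template
image = preprocess(Image.open("example_face.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

print("predicted class:", classes[probs.argmax().item()])
```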

 We found that 4.9% (confidence intervals between 4.6% and 5.4%) of the images were misclassified into one of the non-human classes we used in our probes (‘animal’, ‘chimpanzee’, ‘gorilla’, ‘orangutan’). Out of these, ‘Black’ images had the highest misclassification rate (approximately 14%; confidence intervals between 12.6% and 16.4%) while all other races had misclassification rates under 8%. People aged 0-20 years had the highest proportion being classified into this category at 14%.
 我们发现4.9%(置信区间为4.6%至5.4%)的图像被错误分类到我们探测使用的非人类类别中(“动物”、“黑猩猩”、“大猩猩”、“猩猩”)。其中,“黑人”图像的误分类率最高(约14%;置信区间为12.6%至16.4%),而其他种族的误分类率均低于8%。0-20岁年龄段人群被划分到该类别中的比例最高,达到14%。

 We also found that 16.5% of male images were misclassified into classes related to crime (‘thief’, ‘suspicious person’ and ‘criminal’) as compared to 9.8% of female images. Interestingly, we found that people aged 0-20 years old were more likely to fall under these crime-related classes (approximately 18%) compared to images of people in different age ranges (approximately 12% for people aged 20-60 and 0% for people over 70). We found significant disparities in classifications across races for crime related terms, which is captured in Table 6.
 我们还发现,16.5%的男性图像被错误分类到与犯罪相关的类别(“小偷”、“可疑人员”和“罪犯”),而女性图像的这一比例为9.8%。有趣的是,0-20岁人群更有可能被归入这些与犯罪相关的类别(约18%),而其他年龄段的比例较低(20-60岁约为12%,70岁以上则为0%)。我们还发现,不同种族在与犯罪相关的术语分类上存在显著差异,如表6所示。


表6. 按FairFace种族分类划分的图像被归类为犯罪相关和非人类类别的百分比。标签集包含男性和女性各7个FairFace种族类别(共14个),以及3个犯罪相关类别和4个非人类类别。

 Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category ‘child’ added to the categories. Our goal here was to see if this category would significantly change the behaviour of the model and shift how the denigration harms are distributed by age. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behaviour the model may exhibit, while also raising overarching questions about the use of face images to automatically classify people along such lines (y Arcas et al., 2017).
 鉴于我们观察到20岁以下人群最易被归类至犯罪相关和非人类动物类别,我们对相同分类的图像进行了重新分类,并在类别中新增了"儿童"一项。此举旨在探究该类别是否会显著改变模型行为,以及如何影响不同年龄段的贬损性归类分布。研究发现,此举大幅减少了20岁以下人群被归类至犯罪相关或非人类动物类别的图像数量(见表7)。这表明类别设计可能成为决定模型性能及其潜在偏见的关键因素,同时也引发关于使用面部图像自动进行此类分类的根本性质疑(参见Arcas等人,2017年研究)。


表7. 按FairFace年龄分类划分的犯罪相关及非人类类别图像百分比,展示使用默认标签集与添加“儿童”标签集的对比结果。默认标签集包含7个FairFace种族类别(男女各7项,共14项),3个犯罪相关类别和4个非人类类别。

 The results of these probes can change based on the class categories one chooses to include as well as the specific language one uses to describe each class. Poor class design can lead to poor real world performance; this concern is particularly relevant to a model like CLIP, given how easily developers can design their own classes.
 这些探测结果会因选择的类别分类方式及各类别的具体语言描述而改变。糟糕的类别设计会导致实际应用效果不佳;对于像CLIP这样的模型而言尤为关键,因为开发者可以极其自由地设计自己的类别体系。

 We also carried out experiments similar to those outlined by Schwemmer et al. (2020) to test how CLIP treated images of men and women differently using images of Members of Congress. As part of these experiments, we studied how certain additional design decisions such as deciding thresholds for labels can impact the labels output by CLIP and how biases manifest.
 我们还进行了与Schwemmer等人(2020年)所述类似的实验,通过国会议员的图像来测试CLIP如何处理男性和女性图像的不同。在这些实验中,我们还研究了某些额外的设计决策(如决定标签的阈值)如何影响CLIP输出的标签以及偏见如何显现。

 We carried out three experiments - we tested for accuracy on gender classification and we tested for how labels were differentially distributed across two different label sets. For our first label set, we used a label set of 300 occupations and for our second label set we used a combined set of labels that Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision returned for all the images.
 我们进行了三项实验:测试性别分类的准确性,并检验标签在两种不同标签集中的分布差异。对于第一个标签集,我们使用了包含300种职业的标签集;第二个标签集则整合了Google Cloud Vision、Amazon Rekognition和Microsoft Azure Computer Vision对所有图像返回的标签集合。

 We first simply looked into gender prediction performance of the model on the images of Members of Congress, in order to check whether the model correctly recognized men as men and women as women given an image of a person who appeared to be in an official setting/position of power. We found that the model got 100% accuracy on the images. This is slightly better performance than the model’s performance on the FairFace dataset. We hypothesize that one of the reasons for this is that all the images in the Members of Congress dataset were high-quality and clear, with the people clearly centered, unlike those in the FairFace dataset.
 我们首先简单测试了模型对美国国会议员照片的性别预测性能,以验证该模型能否正确识别官方场合/权力职位人士的性别。我们发现模型对这些图像的准确率达到100%,略优于其在FairFace数据集上的表现。我们推测原因之一在于国会议员数据集中的所有照片都具备高质量、清晰度高且人物居中构图的特点,这与FairFace数据集形成鲜明对比。

 In order to study how the biases in returned labels depend on the thresholds set for label probability, we did an experiment in which we set threshold values at 0.5% and 4.0%. We found that the lower threshold led to lower quality of labels. However, even the differing distributions of labels under this threshold can hold signals for bias. For example, we find that under the 0.5% threshold labels such as ‘nanny’ and ‘housekeeper’ start appearing for women whereas labels such as ‘prisoner’ and ‘mobster’ start appearing for men. This points to gendered associations similar to those that have previously been found for occupations (Schwemmer et al., 2020; Nosek et al., 2002; Bolukbasi et al., 2016).
 为了研究返回标签中的偏见如何依赖于标签概率的阈值设置,我们进行了一项实验,将阈值分别设定为0.5%和4.0%。我们发现较低的阈值会导致标签质量下降。然而,即便在该阈值下,标签的不同分布仍能反映出偏见信号。例如,我们发现在0.5%的阈值下,“保姆”和“管家”等标签开始出现在女性身上,而“囚犯”和“暴徒”等标签则开始出现在男性身上。这表明存在与先前职业研究中发现的类似性别关联现象(Schwemmer等,2020;Nosek等,2002;Bolukbasi等,2016)。
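
A minimal sketch of how such probability thresholds can be applied to CLIP's zero-shot outputs is shown below, assuming the `clip` package from the linked repository. The label subset, prompt template, and image path are illustrative; only the 0.5% and 4% cutoffs follow the experiment described above.

```python
# Sketch: keep every label whose zero-shot probability clears a threshold,
# rather than only the argmax. Lowering the cutoff admits noisier labels,
# which is where the gendered associations discussed above begin to surface.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["lawmaker", "legislator", "congressman", "nanny", "housekeeper",
          "prisoner", "mobster", "executive", "doctor", "newscaster"]  # illustrative subset
text = clip.tokenize([f"a photo of a {l}" for l in labels]).to(device)
image = preprocess(Image.open("member_of_congress.jpg")).unsqueeze(0).to(device)  # placeholder

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

def labels_above(threshold):
    """Return (label, probability) pairs whose probability exceeds the threshold."""
    return [(l, round(p.item(), 4)) for l, p in zip(labels, probs) if p.item() >= threshold]

print("labels at the 4% threshold:  ", labels_above(0.04))
print("labels at the 0.5% threshold:", labels_above(0.005))
```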

 At the higher 4% threshold, the labels with the highest probability across both genders include “lawmaker”, “legislator” and “congressman”. However, the presence of these biases amongst lower probability labels nonetheless points to larger questions about what ‘sufficiently’ safe behaviour may look like for deploying such systems.
 在4%的较高阈值下,两性中概率最高的标签包括"lawmaker"(立法者)、“legislator”(立法委员)和"congressman"(国会议员)。然而,这些偏见在较低概率标签中的存在,仍然引发出关于部署此类系统时"足够"安全行为可能意味着什么的更大问题。

 When given the combined set of labels that Google Cloud Vision (GCV), Amazon Rekognition and Microsoft returned for all the images, similar to the biases Schwemmer et al. (2020) found in GCV systems, we found our system also disproportionately attached labels to do with hair and appearance in general to women more than men. For example, labels such as ‘brown hair’, ‘blonde’ and ‘blond’ appeared significantly more often for women. Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men such as ‘executive’ and ‘doctor’. Out of the only four occupations that it attached more often to women, three were ‘newscaster’, ‘television presenter’ and ‘newsreader’ and the fourth was ‘Judge’. This is again similar to the biases found in GCV and points to historical gendered differences (Schwemmer et al., 2020).
 Interestingly, when we lowered the threshold to 0.5% for this set of labels, we found that the labels disproportionately describing men also shifted to appearance oriented words such as ‘suit’, ‘tie’ and ‘necktie’ (Figure 18). Many occupation oriented words such as ‘military person’ and ‘executive’ - which were not used to describe images of women at the higher 4% threshold - were used for both men and women at the lower 0.5% threshold, which could have caused the change in labels for men. The reverse was not true. Descriptive words used to describe women were still uncommon amongst men.
 当我们汇总谷歌云视觉(GCV)、亚马逊Rekognition和微软对所有图像返回的标签集合时,与Schwemmer等人(2020)在GCV系统中发现的偏见类似,我们的系统也过度地将与头发和外貌相关的标签附加给女性而非男性。例如,“棕发”、“金发”等标签在女性图像中出现的频率显著更高。此外,CLIP模型还过度地将描述高地位职业的标签(如“高管”、“医生”)附加给男性。在仅有的四个更常附加给女性的职业标签中,三个是“新闻主播”、“电视主持人”和“新闻播报员”,第四个是“法官”。这与GCV中发现的偏见再次相似,指向了历史遗留的性别差异(Schwemmer等,2020)。
 值得注意的是,当我们将这一组标签的阈值降低至0.5%时,发现过度描述男性的标签也转向了外貌导向的词汇,如“西装”、“领带”(图18)。许多职业导向词汇(如“军人”、“高管”,在4%较高阈值时未用于描述女性图像)在0.5%较低阈值时同时用于男女图像,这可能导致男性相关标签的变化。反之则不然:用于描述女性的特征词汇在男性图像中仍然罕见。


图18. 当输入来自Google Cloud Vision、Amazon Rekognition和Microsoft Azure Computer Vision联合返回的标签集时,CLIP在美国国会成员图像上的表现。通过设定0.5%的阈值并使用χ²检验,识别出针对男性和女性最具性别差异的20个标签。标签按绝对频率排序。柱状图表示特定标签按性别划分的图像占比。

 Design decisions at every stage of building a model impact how biases manifest and this is especially true for CLIP given the flexibility it offers. In addition to choices about training data and model architecture, decisions about things like class designs and thresholding values can alter the labels a model outputs and as a result heighten or lower certain kinds of harm, such as those described by Crawford (2017). People designing and developing models and AI systems have considerable power. Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.
 模型构建各阶段的设计决策都会影响偏差的表现方式,对于CLIP这种高灵活性的模型尤为如此。除训练数据和模型架构的选择外,类别设计和阈值设定等决策会改变模型输出的标签,从而加剧或减轻特定类型的危害——正如Crawford(2017)所论述的那些风险。模型和AI系统的设计开发者掌握着重要权力:诸如类别设计等决策不仅决定着模型性能,更决定着模型偏见在何时以何种形式显现。

 These experiments are not comprehensive. They illustrate potential issues stemming from class design and other sources of bias, and are intended to spark inquiry.
 这些实验并不全面。它们揭示了可能源自类别设计及其他偏见来源的潜在问题,旨在激发进一步探讨。

7.2. Surveillance

 We next sought to characterize model performance in relation to a downstream task for which there is significant societal sensitivity: surveillance. Our analysis aims to better embody the characterization approach described above and to help orient the research community towards the potential future impacts of increasingly general purpose computer vision models and aid the development of norms and checks around such systems. Our inclusion of surveillance is not intended to indicate enthusiasm for this domain - rather, we think surveillance is an important domain to try to make predictions about given its societal implications (Zuboff, 2015; Browne, 2015).
 我们接下来试图描述模型在与一项具有重大社会敏感性的下游任务——监控——相关的性能。我们的分析旨在更好地体现上述特征化方法,并帮助研究界关注日益通用的计算机视觉模型对未来可能产生的影响,同时协助制定围绕此类系统的规范和检查机制。我们将监控纳入研究并非表明对这一领域的热情——相反,我们认为鉴于其社会影响(Zuboff, 2015; Browne, 2015),监控是一个值得尝试预测的重要领域。

 We measure the model’s performance on classification of images from CCTV cameras and zero-shot celebrity identification. We first tested model performance on low-resolution images captured from surveillance cameras (e.g. CCTV cameras). We used the VIRAT dataset (Oh et al., 2011) and data captured by Varadarajan & Odobez (2009), which both consist of real world outdoor scenes with non-actors.
 我们测量了模型在监控摄像头图像分类和零样本名人识别任务中的表现。首先测试了模型在监控摄像头(如闭路电视)拍摄的低分辨率图像上的性能。我们使用了VIRAT数据集(Oh等人,2011)以及Varadarajan和Odobez(2009)采集的数据,这两者都包含现实世界户外场景的非演员图像。

 Given CLIP’s flexible class construction, we tested 515 surveillance images captured from 12 different video sequences on self-constructed general classes for coarse and fine grained classification. Coarse classification required the model to correctly identify the main subject of the image (i.e. determine if the image was a picture of an empty parking lot, school campus, etc.). For fine-grained classification, the model had to choose between two options constructed to determine if the model could identify the presence/absence of smaller features in the image such as a person standing in the corner.
 由于CLIP灵活的类别构建能力,我们在自建通用类别上测试了从12段不同视频序列中采集的515张监控图像,用于粗粒度与细粒度分类。粗分类要求模型正确识别图像主体内容(即判断图像是否为空停车场、校园场景等)。细分类则需模型在构造的二元选项中选择,以检测其能否识别图像中细微特征的存在与否(如角落是否站立着行人)。

 For coarse classification, we constructed the classes by hand-captioning the images ourselves to describe the contents of the image, and there were always at least 6 options for the model to choose from. Additionally, we carried out a ‘stress test’ where the class set included at least one more caption for something that was ‘close’ to the image (for example, ‘parking lot with white car’ vs. ‘parking lot with red car’). We found that the model had a top-1 accuracy of 91.8% on the CCTV images for the initial evaluation. The accuracy dropped significantly to 51.1% for the second evaluation, with the model incorrectly choosing the ‘close’ answer 40.7% of the time.
 在粗分类阶段,我们通过手工标注图像内容来构建类别体系,每张图像至少提供6个候选选项供模型选择。此外,我们设置了“压力测试”环节:在类别集合中刻意包含与图像内容“近似”的干扰选项(例如“停着白色汽车的停车场” vs “停着红色汽车的停车场”)。模型在初始评估中对监控视频图像的Top-1准确率达91.8%,但在二次评估中准确率骤降至51.1%,其中40.7%的情况下模型错误选择了“近似”干扰项。
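
The ‘stress test’ amounts to adding a deliberately ‘close’ caption to the class set. Below is a minimal sketch under that reading, again assuming the `clip` package; the captions and the frame path are invented for illustration and are not taken from the VIRAT evaluation itself.

```python
# Sketch of the coarse-classification "stress test": hand-written captions act
# as the classes, and one deliberately "close" caption is added so the model
# must distinguish a small detail (here, the colour of a parked car).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

captions = [
    "an empty parking lot",
    "a parking lot with a white car",
    "a parking lot with a red car",            # the 'close' distractor
    "a school campus with students walking",
    "a loading dock behind a warehouse",
    "a street intersection with light traffic",
]
text = clip.tokenize(captions).to(device)
image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)  # placeholder frame

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

print("top-1 caption:", captions[probs.argmax().item()])
```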

 For fine-grained detection, the zero-shot model performed poorly, with results near random. Note that this experiment was targeted only towards detecting the presence or absence of small objects in image sequences.
 在细粒度检测任务中,零样本模型表现不佳,结果接近随机水平。请注意,本实验仅针对图像序列中是否存在小物体进行检测。

 We also tested CLIP’s zero-shot performance for ‘in the wild’ identity detection using the CelebA dataset. We did this to evaluate the model’s performance for identity detection using just the publicly available data it was pre-trained on. While we tested this on a dataset of celebrities who have a larger number of images on the internet, we hypothesize that the number of images in the pre-training data needed for the model to associate faces with names will keep decreasing as models get more powerful (see Table 8), which has significant societal implications (Garvie, 2019). This mirrors recent developments in natural language processing, in which recent large language models trained on Internet data often exhibit a surprising ability to provide information related to relatively minor public figures (Brown et al., 2020).
 我们还使用CelebA数据集测试了CLIP在“真实场景”身份识别任务中的零样本性能。此举旨在评估模型仅利用预训练阶段接触过的公开数据进行身份识别的能力。虽然我们在网络图片数量较多的名人数据集上进行了测试,但我们推测随着模型性能提升(参见表8),将人脸与姓名关联所需的预训练图片数量将持续减少,这一趋势将带来重大社会影响(Garvie, 2019)。该现象与自然语言处理领域的最新发展相呼应:基于互联网数据训练的大型语言模型往往能惊人地提供与相对小众的公众人物相关的信息(Brown等, 2020)。


表8. CelebA零样本Top-1身份识别准确率

 We found that the model had 59.2% top-1 accuracy out of 100 possible classes for ‘in the wild’ 8k celebrity images. However, this performance dropped to 43.3% when we increased our class sizes to 1k celebrity names. This performance is not competitive when compared to production level models such as Google’s Celebrity Recognition (Google). However, what makes these results noteworthy is that this analysis was done using only zero-shot identification capabilities based on names inferred from pre-training data - we didn’t use any additional task-specific dataset, and so the (relatively) strong results further indicate that before deploying multimodal models, people will need to carefully study them for behaviors in a given context and domain.
 我们发现该模型在8千张“真实场景”名人图像(100个候选类别)上的top-1准确率为59.2%。但当我们将类别数量增加到1千个名人姓名时,准确率降至43.3%。与谷歌名人识别(Google Celebrity Recognition)等生产级模型相比,这个表现并不具备竞争力。然而值得关注的是,这些结果仅基于从预训练数据中推断出的姓名进行零样本识别——我们没有使用任何额外的任务特定数据集,因此(相对)较强的结果进一步表明:在部署多模态模型前,人们需要仔细研究其在特定上下文和领域中的行为表现。
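
A sketch of how such a zero-shot identity probe can be framed is shown below, assuming the `clip` package from the linked repository. The placeholder names, the prompt template, and the image path are illustrative; in the experiment above the class set was 100 and then 1,000 real celebrity names.

```python
# Sketch of zero-shot identity detection: the class set is just a list of
# names, and the model ranks them against a face image. Enlarging the list
# (100 -> 1,000 names in the experiment above) makes the task harder.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

names = ["Celebrity A", "Celebrity B", "Celebrity C"]  # placeholder names
text = clip.tokenize([f"a photo of {n}" for n in names]).to(device)  # illustrative template
image = preprocess(Image.open("celeba_example.jpg")).unsqueeze(0).to(device)  # placeholder

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

top1 = probs.argmax().item()
print(f"zero-shot top-1 identity guess: {names[top1]} ({probs[top1].item():.2%})")
```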

 CLIP offers significant benefit for tasks that have relatively little data given its zero-shot capabilities. However, large datasets and high performing supervised models exist for many in-demand surveillance tasks such as facial recognition. As a result, CLIP’s comparative appeal for such uses is low. Additionally, CLIP is not designed for common surveillance-relevant tasks like object detection and semantic segmentation. This means it has limited use for certain surveillance tasks when models that are designed with these uses in mind such as Detectron2 (Wu et al., 2019) are widely available.
 CLIP因其零样本能力在数据相对匮乏的任务中展现出显著优势。然而,对于人脸识别等众多高需求监控任务而言,目前已有大量数据集和高性能监督模型存在。因此,CLIP在这类应用中的相对吸引力较低。此外,CLIP并非为物体检测和语义分割等常见监控相关任务而设计。这意味着当市场上广泛存在专门针对这类用途设计的模型(如Detectron2)时,CLIP在某些监控任务中的应用价值较为有限。

 However, CLIP does unlock a certain aspect of usability given how it removes the need for training data. Thus, CLIP and similar models could enable bespoke, niche surveillance use cases for which no well-tailored models or datasets exist, and could lower the skill requirements to build such applications. As our experiments show, ZS CLIP displays nontrivial, but not exceptional, performance on a few surveillance relevant tasks today.
 然而,由于CLIP消除了对训练数据的需求,它确实解锁了某种可用性维度。因此,CLIP及类似模型能够为那些缺乏定制化模型或数据集的特定监控场景提供解决方案,同时降低构建此类应用的技术门槛。我们的实验表明,零样本CLIP目前在部分监控相关任务中表现出可观但不突出的性能。

7.3. Future Work

 This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions.
 这项初步分析旨在说明通用计算机视觉模型带来的一些挑战,并揭示其存在的偏见和影响。我们希望这项工作能推动未来研究进一步表征此类模型的能力、缺陷和偏见,我们也期待与研究界就这些问题展开积极探讨。

 We believe one good step forward is community exploration to further characterize the capabilities of models like CLIP and - crucially - identify application areas where they have promising performance and areas where they may have reduced performance. This process of characterization can help researchers increase the likelihood models are used beneficially by:
 我们认为,社区探索是向前迈出的重要一步,它有助于进一步描述CLIP等模型的能力,更重要的是,找出它们表现有前景的应用领域以及表现可能欠佳的地方。这一特性描述的过程可以通过以下方式帮助研究人员提高模型被有益利用的可能性:

  • Identifying potentially beneficial downstream uses of models early in the research process, enabling other researchers to think about applications. 在研究过程中早期识别模型潜在有益的后续用途,以便其他研究人员思考其应用。

  • Surfacing tasks with significant sensitivity and a large set of societal stakeholders, which may call for intervention by policymakers. 揭示具有高度敏感性且涉及大量社会利益相关者的任务,这类任务可能需要政策制定者介入。

  • Better characterizing biases in models, alerting other researchers to areas of concern and areas for interventions. 更好地描述模型中的偏差,提醒其他研究人员关注需要干预的问题领域和方向。

  • Creating suites of tests to evaluate systems like CLIP on, so we can better characterize model capabilities earlier in the development cycle. 开发测试套件来评估像CLIP这样的系统,以便我们能在开发周期的早期更好地描述模型能力。

  • Identifying potential failure modes and areas for further work. 识别潜在的失效模式及需进一步研究的领域。

 We plan to contribute to this work, and hope this analysis provides some motivating examples for subsequent research.
 我们计划为这项工作做出贡献,并希望这个分析能为后续研究提供一些激励性的范例。

8. Related Work

 Any model that leverages written, spoken, signed or any other form of human language as part of its training signal is arguably using natural language as a source of supervision. This is an admittedly extremely broad area and covers most work in the field of distributional semantics including topic models (Blei et al., 2003), word, sentence, and paragraph vectors (Mikolov et al., 2013; Kiros et al., 2015; Le & Mikolov, 2014), and language models (Bengio et al., 2003). It also includes much of the broader field of NLP that deals with predicting or modeling sequences of natural language in some way. Work in NLP intentionally leveraging natural language supervision in the form of explanations, feedback, instructions, and advice for tasks such as classification (as opposed to the commonly used representation of supervision as a set of arbitrarily encoded discrete category labels) has been explored in many creative and advanced ways. Dialog based learning (Weston, 2016; Li et al., 2016; Hancock et al., 2019) develops techniques to learn from interactive natural language feedback in dialog. Several papers have leveraged semantic parsing to convert natural language explanations into features (Srivastava et al., 2017) or additional training labels (Hancock et al., 2018). More recently, ExpBERT (Murty et al., 2020) uses feature representations produced by conditioning a deep contextual language model on natural language explanations and descriptions of relations to improve performance on the task of relation extraction.
 任何利用书面语、口语、手语或其他人类语言形式作为训练信号组成部分的模型,都可以被认为是以自然语言作为监督来源。这无疑是个极其宽泛的领域,涵盖了分布语义学领域的大部分研究工作,包括主题模型(Blei等,2003)、词向量/句向量/段落向量(Mikolov等,2013;Kiros等,2015;Le与Mikolov,2014)以及语言模型(Bengio等,2003)。该范畴同样包含更广泛的自然语言处理领域中那些以某种方式预测或建模自然语言序列的研究。
在自然语言处理研究中,许多创新而先进的方法探索了如何有意地利用自然语言形式的监督——包括解释、反馈、指令和建议(而非传统上常用的离散类别标签编码方式)来执行分类等任务。基于对话的学习(Weston,2016;Li等,2016;Hancock等,2019)开发了从对话式自然语言反馈中获取知识的技术。多篇论文采用语义解析将自然语言解释转化为特征(Srivastava等,2017)或附加训练标签(Hancock等,2018)。最新的ExpBERT模型(Murty等,2020)通过将深度上下文语言模型与自然语言解释及关系描述相结合生成特征表示,从而提升关系抽取任务的性能。

 CLIP is an example of using natural language as a training signal for learning about a domain other than language. In this context, the earliest use of the term natural language supervision that we are aware of is the work of Ramanathan et al. (2013) which showed that natural language descriptions could be used along side other sources of supervision to improve performance on the task of video event understanding. However, as mentioned in the introduction and approach section, methods of leveraging natural language descriptions in computer vision well predate the use of this specific term, especially for image retrieval (Mori et al., 1999) and object classification (Wang et al., 2009). Other early work leveraged tags (but not natural language) associated with images for the task of semantic segmentation (Barnard et al., 2003). More recently, He & Peng (2017) and Liang et al. (2020) demonstrated using natural language descriptions and explanations to improve fine-grained visual classification of birds. Others have investigated how grounded language can be used to improve visual representations and classifiers on the ShapeWorld dataset (Kuhnle & Copestake, 2017; Andreas et al., 2017; Mu et al., 2019). Finally, techniques which combine natural language with reinforcement learning environments (Narasimhan et al., 2015) have demonstrated exciting emergent behaviors such as systematically accomplishing zero-shot tasks (Hill et al., 2019).
 CLIP是使用自然语言作为训练信号来学习非语言领域的典型范例。在这一背景下,我们所知最早使用"自然语言监督"术语的是Ramanathan等人(2013)的研究,该研究表明自然语言描述可以与其他监督源结合使用,以提高视频事件理解任务的性能。然而,正如引言和方法部分所述,在计算机视觉领域利用自然语言描述的方法远早于这一特定术语的使用,特别是在图像检索(Mori等人,1999)和物体分类(Wang等人,2009)方面。其他早期研究利用与图像相关的标签(而非自然语言)进行语义分割任务(Barnard等人,2003)。最近,He&Peng(2017)以及Liang等人(2020)证明了使用自然语言描述和解释可以改进鸟类的细粒度视觉分类。另有研究者探讨如何利用接地语言在ShapeWorld数据集上改进视觉表示和分类器(Kuhnle&Copestake,2017;Andreas等人,2017;Mu等人,2019)。最后,将自然语言与强化学习环境相结合的技术(Narasimhan等人,2015)已展现出令人兴奋的新兴行为,例如系统性完成零样本任务(Hill等人,2019)。

 CLIP’s pre-training task optimizes for text-image retrieval. This area of research dates back to the mid-90s, with the previously mentioned Mori et al. (1999) as representative of early work. While initial efforts focused primarily on predictive objectives, over time research shifted towards learning joint multi-modal embedding spaces with techniques like kernel Canonical Correlation Analysis and various ranking objectives (Weston et al., 2010; Socher & Fei-Fei, 2010; Hodosh et al., 2013). Over time work explored many combinations of training objective, transfer, and more expressive models and steadily improved performance (Frome et al., 2013; Socher et al., 2014; Karpathy et al., 2014; Kiros et al., 2014; Faghri et al., 2017).
 CLIP的预训练任务优化了文本-图像检索性能。这一研究领域可追溯至90年代中期,早期代表工作包括前文提及的Mori等人(1999)的研究。初期研究主要聚焦预测性目标,随着时间推移逐渐转向学习联合多模态嵌入空间,采用核典型相关分析等技术及各类排序目标(Weston等人,2010;Socher与Fei-Fei,2010;Hodosh等人,2013)。后续研究不断探索训练目标组合、迁移学习及更具表现力的模型,性能持续提升(Frome等人,2013;Socher等人,2014;Karpathy等人,2014;Kiros等人,2014;Faghri等人,2017)。

 Other work has leveraged natural language supervision for domains other than images. Stroud et al. (2020) explores large scale representation learning by training a system to pair descriptive text with videos instead of images. Several works have explored using dense spoken natural language supervision for videos (Miech et al., 2019; 2020b). When considered together with CLIP, these works suggest that large scale natural language supervision is a promising way to learn high quality perceptual systems for many domains. Alayrac et al. (2020) extended this line of work to an additional modality by adding raw audio as an additional supervision source and demonstrated benefits from combining all three sources of supervision.
 其他研究也探索了自然语言监督在图像以外领域的应用。Stroud等人(2020)通过训练系统将描述性文本与视频(而非图像)配对,探索了大规模表征学习。多项研究尝试使用密集的口语自然语言作为视频的监督信号(Miech等人,2019;2020b)。结合CLIP来看,这些研究表明大规模自然语言监督是构建多领域高质量感知系统的有效途径。Alayrac等人(2020)通过引入原始音频作为额外监督源,将这一研究方向拓展至新的模态,并证实了三种监督源联合使用的优势。

 As part of our work on CLIP we also construct a new dataset of image-text pairs. Modern work on image-text retrieval has relied on a set of crowd-sourced sentence level image caption evaluation datasets like Pascal1K (Rashtchian et al., 2010), Flickr8K (Hodosh et al., 2013), and Flickr30K (Young et al., 2014). However, these datasets are still relatively small and limit achievable performance. Several methods have been proposed to create larger datasets automatically with Ordonez et al. (2011) as a notable early example. In the deep learning era, Mithun et al. (2018) demonstrated an additional set of (image, text) pairs collected from the internet could improve retrieval performance and several new automatically constructed datasets such as Conceptual Captions (Sharma et al., 2018), LAIT (Qi et al., 2020), and OCR-CC (Yang et al., 2020) have been created. However, these datasets still use significantly more aggressive filtering or are designed for a specific task such as OCR and as a result are still much smaller than WIT with between 1 and 10 million training examples.
 在我们开展CLIP研究的过程中,还构建了一个全新的图文对数据集。当代图文检索领域的研究主要依赖于众包句子级图像描述评估数据集,如Pascal1K(Rashtchian等人,2010)、Flickr8K(Hodosh等人,2013)和Flickr30K(Young等人,2014)。然而这些数据集的规模仍然较小,限制了模型性能的提升。已有若干自动化构建大规模数据集的方法被提出,其中Ordonez等人(2011)的研究是早期典型代表。进入深度学习时代后,Mithun等人(2018)证实从互联网收集的额外图文对能够提升检索性能,随后产生了Conceptual Captions(Sharma等人,2018)、LAIT(Qi等人,2020)和OCR-CC(Yang等人,2020)等新型自动化构建数据集。但这类数据集要么采用更为严苛的过滤机制,要么专为OCR等特定任务设计,其训练样本量仍维持在100万至1000万区间,远小于WIT数据集的规模。

 A related idea to CLIP is webly supervised learning. This line of work queries image search engines to build image datasets by querying for terms and uses the queries as the labels for the returned images (Fergus et al., 2005). Classifiers trained on these large but noisily labeled datasets can be competitive with those trained on smaller carefully labeled datasets. These image-query pairs are also often used to improve performance on standard datasets as additional training data (Chen & Gupta, 2015). CLIP also uses search queries as part of its dataset creation process. However CLIP only uses full text sequences co-occurring with images as supervision rather than just the queries, which are often only a single word or short n-gram. We also restrict this step in CLIP to text only querying for sub-string matches while most webly supervised work uses standard image search engines which have their own complex retrieval and filtering pipelines that often involve computer vision systems. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has a notably similar ambition and goal as CLIP.
 与CLIP相关的另一个概念是网络监督学习(webly supervised learning)。这类研究通过查询图像搜索引擎构建图像数据集,将搜索词作为返回图像的标签(Fergus等人,2005)。在这些标注噪声较大但规模庞大的数据集上训练的分类器,其性能可以与较小规模但精心标注数据集训练的模型相媲美。这些图像-查询对也常被用作额外训练数据来提升标准数据集的性能表现(Chen & Gupta,2015)。CLIP同样将搜索查询作为数据集创建环节的一部分,但仅使用与图像共现的完整文本序列作为监督信号,而非通常仅为单个单词或短n-gram的查询词。此外,CLIP在此步骤中仅限于纯文本的子字符串匹配查询,而多数网络监督研究采用标准图像搜索引擎——这些引擎自带复杂的检索过滤流程,往往涉及计算机视觉系统。在该领域研究中,《Learning Everything about Anything: Webly-Supervised Visual Concept Learning》(Divvala等人,2014)提出的目标与CLIP有着显著的相似性。

 Finally, CLIP is related to a recent burst of activity on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020). This line of work focuses on richly connecting vision and language in order to solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment. These approaches leverage impressively engineered models which combine 3 (or more) pre-trained subsystems, typically an image feature model, a region proposal / object detection model, and a pre-trained masked language model such as BERT. These systems are then jointly fine-tuned via various training objectives on image-text pairs and applied to the aforementioned tasks and achieve impressive results. CLIP is instead focused on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work.
 最后,CLIP与近期涌现的视觉-语言联合模型研究(Lu等人,2019;Tan与Bansal,2019;Chen等人,2019;Li等人,2020b;Yu等人,2020)存在关联。这类研究旨在深度融合视觉与语言以解决复杂下游任务,例如视觉问答、视觉常识推理或多模态蕴含。这些方法采用了精心设计的模型架构,通常结合三个(或更多)预训练子系统——包括图像特征模型、区域提议/目标检测模型,以及BERT等预训练掩码语言模型。这些系统通过多样化的训练目标在图文对上联合微调,应用于前述任务时表现出色。CLIP则专注于从自然语言监督信号中从头学习视觉模型,并未通过联合注意力模型对两个领域进行密集连接。在CLIP模型中,图像与文本领域的唯一交互发生在学习到的联合嵌入空间中的单个点积运算。我们期待看到CLIP与这类研究方法产生融合创新。
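
To make that single point of interaction concrete, the sketch below (assuming the `clip` package from the linked repository) scores image-text pairs as a dot product of L2-normalized embeddings; there is no joint attention between the two encoders, and the image path is a placeholder.

```python
# The only image-text interaction in CLIP: encode each modality separately,
# L2-normalize, and take a dot product in the shared embedding space.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)

img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# One dot product per (image, text) pair; no cross-attention between the towers.
similarity = img_emb @ txt_emb.T
print(similarity)
```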

9. Conclusion

 We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pretraining. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models although there is still room for much improvement.
 我们研究了能否将自然语言处理中任务无关的互联网级预训练成功经验迁移到其他领域。研究发现,采用这一范式会导致计算机视觉领域出现类似的行为特征,并探讨了这类研究的社会影响。为了优化训练目标,CLIP模型在预训练过程中学会了执行多种任务。这种任务学习能力可以通过自然语言提示进行调用,从而实现针对现有数据集的零样本迁移。当达到足够规模时,这种方法的性能可与专门任务的监督式模型相媲美,尽管仍有很大的改进空间。

ACKNOWLEDGMENTS

 We’d like to thank the millions of people involved in creating the data CLIP is trained on. We’d also like to thank Susan Zhang for her work on image conditional language models while at OpenAI, Ishaan Gulrajani for catching an error in the pseudocode, and Irene Solaiman, Miles Brundage, and Gillian Hadfield for their thoughtful feedback on the broader impacts section of the paper. We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used. Finally, we’d also like to thank the developers of the many software packages used throughout this project including, but not limited to, Numpy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011).
 我们要感谢参与创建CLIP训练数据的数百万人。同时,感谢苏珊·张在OpenAI任职期间对图像条件语言模型的研究工作,感谢伊尚·古拉贾尼发现伪代码中的错误,以及感谢艾琳·索拉曼、迈尔斯·布伦戴奇和吉莉安·哈德菲尔德对论文社会影响部分提出的深刻建议。还要特别感谢OpenAI加速计算和超级计算团队为该项目提供的核心软硬件基础设施支持。最后,衷心感谢本项目中使用的众多开源软件开发者,包括但不限于:NumPy(Harris等人,2020)、SciPy(Virtanen等人,2020)、ftfy(Speer,2019)、TensorFlow(Abadi等人,2016)、PyTorch(Paszke等人,2019)、pandas(pandas开发团队,2020)以及scikit-learn(Pedregosa等人,2011)。
