Momentum Contrast for Unsupervised Visual Representation Learning (Contrastive Learning: MoCo Paper + Code)

Original paper: Momentum Contrast for Unsupervised Visual Representation Learning
Code reproduction: see the Code section at the end of this post

Abstract

We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.

1. Introduction

Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT and BERT. But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based. Computer vision, in contrast, further concerns dictionary building, as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).

Several recent studies present promising results on unsupervised visual representation learning using approaches related to the contrastive loss. Though driven by various motivations, these methods can be thought of as building dynamic dictionaries. The “keys” (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss.

From this perspective, we hypothesize that it is desirable to build dictionaries that are: (i) large and (ii) consistent as they evolve during training. Intuitively, a larger dictionary may better sample the underlying continuous, high-dimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent. However, existing methods that use contrastive losses can be limited in one of these two aspects (discussed later in context).

We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (Figure 1). We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.

Figure 1

MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks. In this paper, we follow a simple instance discrimination task: a query matches a key if they are encoded views (e.g., different crops) of the same image. Using this pretext task, MoCo shows competitive results under the common protocol of linear classification in the ImageNet dataset.

A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning. We show that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins. In these experiments, we explore MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion-image scale, and relatively uncurated scenario. These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet supervised pre-training in several applications.

2. Related Work

Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions. The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation. Loss functions can often be investigated independently of pretext tasks. MoCo focuses on the loss function aspect. Next we discuss related studies with respect to these two aspects.

Loss functions. A common way of defining a loss function is to measure the difference between a model’s prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions, color bins) by cross-entropy or margin-based losses. Other alternatives, as described next, are also possible.

Contrastive losses measure the similarities of sample pairs in a representation space. Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network. Contrastive learning is at the core of several recent works on unsupervised learning, which we elaborate on later in context (Sec. 3.1).

Adversarial losses [24] measure the difference between probability distributions. It is a widely successful technique for unsupervised data generation. Adversarial methods for representation learning are explored in [15, 16]. There are relations (see [24]) between generative adversarial networks and noise-contrastive estimation (NCE).

Pretext tasks. A wide range of pretext tasks have been proposed. Examples include recovering the input under some corruption, e.g., denoising auto-encoders [58], context autoencoders [48], or cross-channel auto-encoders (colorization) [64, 65]. Some pretext tasks form pseudo-labels by, e.g., transformations of a single (“exemplar”) image [17], patch orderings [13, 45], tracking [59] or segmenting objects [47] in videos, or clustering features [3, 4].

Contrastive learning vs. pretext tasks. Various pretext tasks can be based on some form of contrastive loss functions. The instance discrimination method [61] is related to the exemplar-based task [17] and NCE [28]. The pretext task in contrastive predictive coding (CPC) [46] is a form of context auto-encoding [48], and in contrastive multiview coding (CMC) [56] it is related to colorization [64].

3. Method

3.1. Contrastive Learning as Dictionary Look-up

Contrastive learning, and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next. Consider an encoded query q and a set of encoded samples {k0, k1, k2, ...} that are the keys of a dictionary. Assume that there is a single key (denoted as k+) in the dictionary that q matches. A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE [46], is considered in this paper:

\mathcal{L}_{q} = -\log \frac{\exp(q \cdot k_{+} / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)} \qquad (1)

where τ is a temperature hyper-parameter [61]. The sum is over one positive and K negative samples. Intuitively, this loss is the log loss of a (K+1)-way softmax-based classifier that tries to classify q as k+. Contrastive loss functions can also be based on other forms [29, 59, 61, 36], such as margin-based losses and variants of NCE losses.

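To make the (K+1)-way softmax interpretation concrete, here is a minimal PyTorch sketch of this loss, assuming L2-normalized queries q of shape (N, C), their positive keys k_pos of shape (N, C), and a dictionary of negatives queue of shape (C, K); the function name and tensor layout are illustrative, not the paper's reference implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(q, k_pos, queue, tau=0.07):
    # positive logits: one similarity per query, shape (N, 1)
    l_pos = torch.einsum("nc,nc->n", q, k_pos).unsqueeze(-1)
    # negative logits against every key in the dictionary, shape (N, K)
    l_neg = torch.einsum("nc,ck->nk", q, queue)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    # the positive key sits at index 0 for every query, so InfoNCE is exactly
    # the cross-entropy of a (K+1)-way softmax classifier
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)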

The contrastive loss serves as an unsupervised objective function for training the encoder networks that represent the queries and keys [29]. In general, the query representation is q = f_q(x^q), where f_q is an encoder network and x^q is a query sample (likewise, k = f_k(x^k)). Their instantiations depend on the specific pretext task. The input x^q and x^k can be images [29, 61, 63], patches [46], or context consisting of a set of patches [46]. The networks f_q and f_k can be identical [29, 59, 63], partially shared [46, 36, 2], or different [56].

3.2. Momentum Contrast

From the above perspective, contrastive learning is a way of building a discrete dictionary on high-dimensional continuous inputs such as images. The dictionary is dynamic in the sense that the keys are randomly sampled, and that the key encoder evolves during training. Our hypothesis is that good features can be learned by a large dictionary that covers a rich set of negative samples, while the encoder for the dictionary keys is kept as consistent as possible despite its evolution. Based on this motivation, we present Momentum Contrast as described next.

Dictionary as a queue. At the core of our approach is maintaining the dictionary as a queue of data samples. This allows us to reuse the encoded keys from the immediate preceding mini-batches. The introduction of a queue decouples the dictionary size from the mini-batch size. Our dictionary size can be much larger than a typical mini-batch size, and can be flexibly and independently set as a hyper-parameter.

The samples in the dictionary are progressively replaced. The current mini-batch is enqueued to the dictionary, and the oldest mini-batch in the queue is removed. The dictionary always represents a sampled subset of all data, while the extra computation of maintaining this dictionary is manageable. Moreover, removing the oldest mini-batch can be beneficial, because its encoded keys are the most outdated and thus the least consistent with the newest ones.

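A minimal sketch of this progressive replacement, assuming the dictionary is stored as a (C, K) tensor of keys plus a running write pointer; the names queue and queue_ptr and the multiple-of-batch-size assumption are illustrative simplifications.

import torch

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    # queue:     (C, K) tensor holding K encoded keys as columns
    # queue_ptr: 1-element long tensor, current write position
    # keys:      (N, C) encoded keys of the current mini-batch
    n, k = keys.shape[0], queue.shape[1]
    ptr = int(queue_ptr)
    assert k % n == 0, "for simplicity, assume the dictionary size is a multiple of the batch size"
    # enqueue the newest keys by overwriting the oldest ones (dequeue)
    queue[:, ptr:ptr + n] = keys.T
    queue_ptr[0] = (ptr + n) % k  # advance the pointer, wrapping around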

Momentum update. Using a queue can make the dictionary large, but it also makes it intractable to update the key encoder by back-propagation (the gradient should propagate to all samples in the queue). A naïve solution is to copy the key encoder fk from the query encoder fq, ignoring this gradient. But this solution yields poor results in experiments (Sec. 4.1). We hypothesize that such failure is caused by the rapidly changing encoder that reduces the key representations' consistency. We propose a momentum update to address this issue. Formally, denoting the parameters of fk as θk and those of fq as θq, we update θk by:

\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (2)

Here m ∈ [0, 1) is a momentum coefficient. Only the parameters θq are updated by back-propagation. The momentum update in Eqn.(2) makes θk evolve more smoothly than θq. As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small. In experiments, a relatively large momentum (e.g., m = 0.999, our default) works much better than a smaller value (e.g., m = 0.9), suggesting that a slowly evolving key encoder is core to making use of a queue.

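In code, Eqn.(2) is a plain exponential moving average over the key encoder's parameters; a sketch assuming encoder_q and encoder_k are two modules with identical architecture (the helper name is illustrative):

import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; only theta_q receives gradients
    for param_q, param_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        param_k.data.mul_(m).add_(param_q.data, alpha=1.0 - m)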

Relations to previous mechanisms. MoCo is a general mechanism for using contrastive losses. We compare it with two existing general mechanisms in Figure 2. They exhibit different properties on the dictionary size and consistency.

Figure 2. Conceptual comparison of three contrastive loss mechanisms (empirical comparisons are in Figure 3 and Table 3). Here we illustrate one pair of query and key. The three mechanisms differ in how the keys are maintained and how the key encoder is updated. (a): the encoders for computing the query and key representations are updated end-to-end by back-propagation (the two encoders can be different). (b): the key representations are sampled from a memory bank [61]. (c): MoCo encodes the new keys on-the-fly by a momentum-updated encoder, and maintains a queue of keys (not illustrated in this figure).

The end-to-end update by back-propagation is a natural mechanism (e.g., [29, 46, 36, 63, 2, 35], Figure 2a). It uses samples in the current mini-batch as the dictionary, so the keys are consistently encoded (by the same set of encoder parameters). But the dictionary size is coupled with the mini-batch size, limited by the GPU memory size. It is also challenged by large mini-batch optimization [25]. Some recent methods [46, 36, 2] are based on pretext tasks driven by local positions, where the dictionary size can be made larger by multiple positions. But these pretext tasks may require special network designs such as patchifying the input [46] or customizing the receptive field size [2], which may complicate the transfer of these networks to downstream tasks.

Another mechanism is the memory bank approach proposed by [61] (Figure 2b). A memory bank consists of the representations of all samples in the dataset. The dictionary for each mini-batch is randomly sampled from the memory bank with no back-propagation, so it can support a large dictionary size. However, the representation of a sample in the memory bank was updated when it was last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch and thus are less consistent. A momentum update is adopted on the memory bank in [61]. Its momentum update is on the representations of the same sample, not the encoder. This momentum update is irrelevant to our method, because MoCo does not keep track of every sample. Moreover, our method is more memory-efficient and can be trained on billion-scale data, which can be intractable for a memory bank. Sec. 4 empirically compares these three mechanisms.

3.3. Pretext Task

Contrastive learning can drive a variety of pretext tasks. As the focus of this paper is not on designing a new pretext task, we use a simple one mainly following the instance discrimination task in [61], to which some recent works [63, 2] are related.

Following [61], we consider a query and a key as a positive pair if they originate from the same image, and otherwise as a negative sample pair. Following [63, 2], we take two random “views” of the same image under random data augmentation to form a positive pair. The queries and keys are respectively encoded by their encoders, fq and fk. The encoder can be any convolutional neural network [39].

Algorithm 1 provides the pseudo-code of MoCo for this pretext task. For the current mini-batch, we encode the queries and their corresponding keys, which form the positive sample pairs. The negative samples are from the queue.

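In the spirit of the paper's Algorithm 1, a hedged sketch of one training iteration that wires together the helpers sketched earlier (info_nce_loss, momentum_update, dequeue_and_enqueue are the illustrative names introduced above; aug stands for the random augmentation, and shuffling BN is omitted from this single-process sketch):

import torch
import torch.nn.functional as F

def train_step(x, encoder_q, encoder_k, queue, queue_ptr, optimizer, aug, m=0.999, tau=0.07):
    # two random augmented views of the same images form the positive pairs
    x_q, x_k = aug(x), aug(x)

    q = F.normalize(encoder_q(x_q), dim=1)      # queries: gradients flow through encoder_q
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)  # keys: no gradient to the key encoder

    loss = info_nce_loss(q, k, queue, tau)      # negatives come from the queue
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # update encoder_q only

    momentum_update(encoder_q, encoder_k, m)    # then slowly update encoder_k (Eqn. 2)
    dequeue_and_enqueue(queue, queue_ptr, k)    # maintain the dictionary
    return loss.item()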

Technical details. We adopt a ResNet as the encoder, whose last fully-connected layer (after global average pooling) has a fixed-dimensional output (128-D). This output vector is normalized by its L2-norm. This is the representation of the query or key. The temperature τ in Eqn.(1) is set as 0.07 [61]. The data augmentation setting follows [61]: a 224×224-pixel crop is taken from a randomly resized image, and then undergoes random color jittering, random horizontal flip, and random grayscale conversion, all available in PyTorch’s torchvision package.

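A sketch of this augmentation pipeline using torchvision; the exact jitter and grayscale parameters below are illustrative assumptions rather than the paper's tuned values:

from torchvision import transforms

moco_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224),                # 224x224 crop from a randomly resized image
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),       # random color jittering (assumed strengths)
    transforms.RandomGrayscale(p=0.2),                # random grayscale conversion (assumed probability)
    transforms.RandomHorizontalFlip(),                # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])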

Shuffling BN. Our encoders fq and fk both have Batch Normalization (BN) [37] as in the standard ResNet [33]. In experiments, we found that using BN prevents the model from learning good representations, as similarly reported in [35] (which avoids using BN). The model appears to “cheat” the pretext task and easily finds a low-loss solution. This is possibly because the intra-batch communication among samples (caused by BN) leaks information.

We resolve this problem by shuffling BN. We train with multiple GPUs and perform BN on the samples independently for each GPU (as done in common practice). For the key encoder fk, we shuffle the sample order in the current mini-batch before distributing it among GPUs (and shuffle back after encoding); the sample order of the mini-batch for the query encoder fq is not altered. This ensures the batch statistics used to compute a query and its positive key come from two different subsets. This effectively tackles the cheating issue and allows training to benefit from BN.

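At its core, shuffling BN is an index permutation applied before the key encoder and inverted afterwards; the simplified sketch below keeps only that idea and omits the gather/scatter of the batch across GPUs that the multi-GPU implementation needs:

import torch

@torch.no_grad()
def encode_keys_with_shuffled_bn(encoder_k, x_k):
    # shuffle the sample order so that, once the batch is split across GPUs,
    # each GPU's BN statistics are computed on a different subset of samples
    # than the one seen by the query encoder
    idx_shuffle = torch.randperm(x_k.shape[0], device=x_k.device)
    idx_unshuffle = torch.argsort(idx_shuffle)  # inverse permutation
    k = encoder_k(x_k[idx_shuffle])             # encode in shuffled order
    return k[idx_unshuffle]                     # restore the original order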

We use shuffled BN in both our method and its end-to-end ablation counterpart (Figure 2a). It is irrelevant to the memory bank counterpart (Figure 2b), which does not suffer from this issue because the positive keys are from different mini-batches in the past.

Experiments

Todo

Discussion and Conclusion

Our method has shown positive results of unsupervised learning in a variety of computer vision tasks and datasets. A few open questions are worth discussing. MoCo’s improvement from IN-1M to IG-1B is consistently noticeable but relatively small, suggesting that the larger-scale data may not be fully exploited. We hope an advanced pretext task will improve this. Beyond the simple instance discrimination task [61], it is possible to adopt MoCo for pretext tasks like masked auto-encoding, e.g., in language [12] and in vision [46]. We hope MoCo will be useful with other pretext tasks that involve contrastive learning.

Code

import copy

import torch.nn as nn

# ProjectionHead (an MLP projection head) and deactivate_requires_grad are assumed
# to be defined elsewhere, e.g. imported from a self-supervised-learning helper
# library such as lightly; they are not defined in this snippet.

class MoCo(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        # query encoder: backbone + projection head
        self.backbone = backbone
        self.projection_head = ProjectionHead(2048, 2048, 512)

        # key encoder: momentum copies of the query encoder
        self.backbone_momentum = copy.deepcopy(self.backbone)
        self.projection_head_momentum = copy.deepcopy(self.projection_head)

        # the key encoder is never updated by back-propagation,
        # only by the momentum (EMA) update
        deactivate_requires_grad(self.backbone_momentum)
        deactivate_requires_grad(self.projection_head_momentum)

    def forward(self, x):
        # encode a query with the query encoder (gradients flow through here)
        query = self.backbone(x).flatten(start_dim=1)
        query = self.projection_head(query)
        return query

    def forward_momentum(self, x):
        # encode a key with the momentum encoder; detach so no gradient is kept
        key = self.backbone_momentum(x).flatten(start_dim=1)
        key = self.projection_head_momentum(key).detach()
        return key
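A hedged usage sketch of the class above, assuming a ResNet-50 backbone with its classification head removed (so it outputs 2048-d features) and the momentum_update helper sketched earlier; in a real run the two inputs would be two augmented views of the same images:

import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer

model = MoCo(backbone)

x_q = torch.randn(8, 3, 224, 224)  # view fed to the query encoder
x_k = torch.randn(8, 3, 224, 224)  # view fed to the momentum (key) encoder

query = model(x_q)                 # query embeddings, with gradients
key = model.forward_momentum(x_k)  # key embeddings, detached

# after each optimizer step, slowly move the momentum encoder toward the query encoder
momentum_update(model.backbone, model.backbone_momentum, m=0.999)
momentum_update(model.projection_head, model.projection_head_momentum, m=0.999)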