Noise Contrastive Estimation (NCE), Negative Sampling (NEG), and InfoNCE

This post walks through two techniques: NCE (Noise-Contrastive Estimation) and InfoNCE. NCE sidesteps the full softmax computation by contrasting true samples against randomly drawn noise samples, which makes it effective for training word vectors over very large vocabularies. InfoNCE is used for unsupervised representation learning in Contrastive Predictive Coding, where features are learned by predicting the relationship between a context and a target.

To summarize my own understanding:

NCE samples negatives from a noise distribution (typically based on word frequencies) and takes that distribution into account in its objective; NEG drops that correction and treats the task as plain logistic regression over true and noise pairs. InfoNCE is a way to learn (encode) feature representations of high-dimensional data through an unsupervised task, where the usual strategy is to predict future or missing information from the context.
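To make that difference concrete, here is a toy sketch (the scores and the noise distribution `q` are made up, not tied to any particular library): NCE shifts each logit by -log(k * Q(w)) before the logistic loss, while NEG feeds the raw scores straight into the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_xent(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

# Toy scores s(w, c): index 0 is the true word, the rest are k noise draws,
# with a made-up noise distribution Q over those words.
k = 3
scores = np.array([2.0, 0.1, -0.5, 0.3])
q = np.array([0.2, 0.5, 0.2, 0.1])          # Q(w) for each candidate
y = np.array([1.0, 0.0, 0.0, 0.0])          # 1 = real pair, 0 = noise pair

# NCE: shift each logit by -log(k * Q(w)) before the logistic loss.
nce_loss = binary_xent(sigmoid(scores - np.log(k * q)), y)

# NEG: drop the correction and use the raw scores directly.
neg_loss = binary_xent(sigmoid(scores), y)

print(nce_loss, neg_loss)
```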

The following is reposted from Zhihu:

In TensorFlow, the implementations of both of these losses are remarkably concise, although there are plenty of details hiding inside. Here we look at the two loss implementations mainly from the algorithmic point of view.


def nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1,
             sampled_values=None, remove_accidental_hits=False, partition_strategy="mod",
             name="nce_loss"):
  # Draw negative samples and compute one logit per candidate class
  # (the true class plus num_sampled noise classes); subtract_log_q=True
  # subtracts the log expected count of each candidate under the noise
  # distribution, which is the NCE correction.
  logits, labels = _compute_sampled_logits(
      weights=weights, biases=biases, labels=labels, inputs=inputs,
      num_sampled=num_sampled, num_classes=num_classes, num_true=num_true,
      sampled_values=sampled_values, subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy, name=name)

  # Binary logistic regression: label 1 for the true class, 0 for sampled classes.
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")

  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)

The NCE implementation consists of just these three parts. As the names suggest, _compute_sampled_logits handles the sampling and builds the logits, sigmoid_cross_entropy_with_logits performs logistic regression, i.e. computes the binary cross-entropy loss, and _sum_rows sums the losses across rows.
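For orientation, a minimal usage sketch of the public tf.nn.nce_loss wrapper (TF2 eager mode; the sizes and names such as nce_weights are made up for illustration):

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
vocab_size, embed_dim, batch_size, num_sampled = 10_000, 128, 64, 5

# Output-side parameters: one weight row and one bias per class (word).
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# Hidden-layer activations and integer class labels for one batch.
inputs = tf.random.normal([batch_size, embed_dim])                              # [batch_size, dim]
labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)  # [batch_size, num_true]

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=labels, inputs=inputs,
                   num_sampled=num_sampled, num_classes=vocab_size))
```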

Let's go through these three functions from simple to complex, peeling them layer by layer.

  • First, the code of _sum_rows is very simple. Start with the comments: the argument of _sum_rows is the sampled loss, a [batch_size, 1 + num_sampled] matrix. _sum_rows simply builds a ones tensor, multiplies it against the sampled loss, and reshapes the result into a [batch_size] vector; each element of the vector is the sum of the true loss and the sampled losses.
  • From the above, together with what we already know, we can infer that sigmoid_cross_entropy_with_logits computes the element-wise logistic loss from the logits and labels (a small sketch of both steps follows this list).
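Here is a small NumPy sketch of those two steps (plain stand-ins for the TensorFlow ops, assuming one true pair plus two sampled negatives per example):

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(labels, logits):
    # Element-wise binary cross-entropy in the numerically stable form used by TF:
    # max(x, 0) - x*z + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def sum_rows(x):
    # Same trick as TF's _sum_rows: matmul with a ones vector, i.e. a row-wise sum.
    ones = np.ones((x.shape[1], 1))
    return (x @ ones).reshape(-1)

# [batch_size, 1 + num_sampled]: column 0 is the true pair, the rest are negatives.
logits = np.array([[2.0, -1.0, 0.5],
                   [1.5, 0.3, -2.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])

per_pair_loss = sigmoid_cross_entropy_with_logits(labels, logits)   # [2, 3]
print(sum_rows(per_pair_loss))  # one loss per example: true loss + sampled losses
```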

 

The issue

There are some issues with learning the word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting the class. That is, such a network is just a "standard" multinomial (multi-class) classifier. And this network must have as many output neurons as there are classes. When classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) and then used in [C], [D], and [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from corrupted ones, the classifier will ultimately learn the word vectors.
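A minimal sketch of this objective under the usual skip-gram-with-negative-sampling assumptions (the vectors and the number of negatives k are made up; scores are plain dot products):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 100, 5                      # embedding size, number of negatives per true pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One true pair: a center word vector and the output vector of a word from its context.
center_vec = rng.normal(size=dim)
context_vec = rng.normal(size=dim)

# k corrupted pairs: the same center word with output vectors of randomly drawn words.
noise_vecs = rng.normal(size=(k, dim))

# Binary logistic loss: push the true pair's score up, the noise pairs' scores down.
loss = (-np.log(sigmoid(context_vec @ center_vec))
        - np.sum(np.log(sigmoid(-(noise_vecs @ center_vec)))))
print(loss)
```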

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.

Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, the unigram distribution raised to the 3/4 power, which favours less frequent words being drawn more often than their raw frequency would suggest.
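A small sketch of that noise distribution, assuming toy word counts; raising the unigram counts to the 3/4 power gives rare words relatively more probability mass:

```python
import numpy as np

# Hypothetical unigram counts for a toy vocabulary.
counts = np.array([1000, 100, 10, 1], dtype=float)

unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()            # word2vec's noise distribution: U(w)^(3/4) / Z

print(unigram)   # rare words get very little mass
print(smoothed)  # rare words get relatively more mass than under the raw unigram

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)  # draw 5 negative word ids
print(negatives)
```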

References

[A] Smith & Eisner (2005). Contrastive estimation: Training log-linear models on unlabeled data.

[B] Gutmann & Hyvärinen (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.

[C] Collobert & Weston (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning.

[D] Mnih & Teh (2012). A fast and simple algorithm for training neural probabilistic language models.

[E] Mnih & Kavukcuoglu (2013). Learning word embeddings efficiently with noise-contrastive estimation.

From: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne

The following is reposted from the Zhihu question "Why does negative sampling in word2vec achieve the same effect as softmax?" (word2vec中的负例采样为什么可以得到和softmax一样的效果? - 知乎)

About InfoNCE:

InfoNCE was proposed in the paper Representation Learning with Contrastive Predictive Coding. I will not go into CPC itself in detail here; the focus is on how the idea of NCE was borrowed to derive InfoNCE and how it is used in CPC. If you are not yet familiar with CPC, see my article "Understanding CPC (Contrastive Predictive Coding)".

In short, CPC (Contrastive Predictive Coding) learns (encodes) feature representations of high-dimensional data through an unsupervised task, and the usual unsupervised strategy is to predict future or missing information from the context; NLP has already used this idea to learn word representations.
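A minimal NumPy sketch of the InfoNCE loss for a single context (the score function here is a plain dot product, which is a simplification; CPC itself uses a log-bilinear score):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_negatives = 64, 7

context = rng.normal(size=dim)                       # context representation c_t
positive = rng.normal(size=dim)                      # encoding of the true future sample
negatives = rng.normal(size=(num_negatives, dim))    # encodings drawn from other sequences

# Scores f(x, c); here just dot products.
scores = np.concatenate(([positive @ context], negatives @ context))

# InfoNCE = categorical cross-entropy of identifying the positive among 1 + N candidates.
m = scores.max()
log_softmax = scores - (m + np.log(np.sum(np.exp(scores - m))))
loss = -log_softmax[0]
print(loss)
```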

Partially reposted from the Zhihu article "Noise Contrastive Estimation, Past and Present: from NCE to InfoNCE" (Noise Contrastive Estimation 前世今生——从 NCE 到 InfoNCE - 知乎)

### Categories and Common Types of Negative Sampling Algorithms

Negative sampling is an important technique widely used in natural language processing, recommender systems, and graph neural networks. Its design ranges from very simple to quite sophisticated, depending on the needs of the application.

#### 1. Random Negative Sampling

Random negative sampling is the most basic approach: negatives are drawn at random from the candidate set, using a uniform or some other probability distribution[^1]. Its advantages are simple implementation and low computational cost, but in some settings it fails to surface hard negatives, which can hurt model performance.

#### 2. Noise Contrastive Estimation (NCE)

Noise contrastive estimation is a more elaborate negative sampling strategy: it optimizes an objective that distinguishes real data points from generated noise samples[^3]. Compared with plain random sampling, NCE provides a more precise learning signal and suits tasks that demand higher accuracy.

#### 3. Adaptive Negative Sampling

Adaptive negative sampling dynamically adjusts which items are used as negative instances during training, based on the current state of the model[^1]. Such methods usually take into account factors like the distance between positive and negative samples or the prediction error rate, which speeds up convergence and also improves the final result.

#### 4. Graph-based Negative Sampling

For domain-specific problems such as link prediction in social network analysis or bioinformatics, negative sampling mechanisms tailored to graph-structured data are needed[^4]. The on-the-fly convolutions used in GraphSAGE and PinSage are a typical example: by building local subgraphs they avoid the huge cost of global matrix operations.

#### 5. Hierarchical Softmax

Hierarchical softmax offers an alternative way to handle very large vocabularies: instead of treating every word independently, the words are organized into a tree structure so that lookups become efficient[^3]. It still has limitations to keep in mind, for example it is hard to parallelize.

In summary, the application setting determines which form of negative sampling is appropriate, and as deep learning theory and practice keep advancing, more innovative solutions will likely be proposed to address the shortcomings of existing frameworks.

```python
import numpy as np

def random_negative_sampling(vocab_size, num_negatives=5):
    """Basic random negative sampling: draw negative indices uniformly at random."""
    return np.random.choice(range(vocab_size), size=num_negatives)

# Example call
vocab_size = 10000
neg_samples = random_negative_sampling(vocab_size=vocab_size, num_negatives=10)
print(neg_samples[:10])  # print the first ten sampled negative-example indices
```