Noise Contrastive Estimation (NCE), Negative Sampling (NEG), and InfoNCE

This post walks through two techniques: NCE (Noise-Contrastive Estimation) and InfoNCE. NCE sidesteps the full softmax computation by contrasting true samples against randomly drawn noise samples, which makes it effective for training word vectors over very large vocabularies. InfoNCE is used for unsupervised representation learning in Contrastive Predictive Coding, where features are learned by predicting the relationship between a context and a target.

To summarize my own understanding:

NCE samples negatives from a noise distribution (typically based on word frequencies) and takes that distribution into account in its objective; NEG drops that correction and treats the task as plain logistic regression over true and noise pairs. InfoNCE is a way to learn (encode) feature representations of high-dimensional data through an unsupervised task, where the usual strategy is to predict future or missing information from the context.
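To make that difference concrete, here is a toy sketch (the scores and the noise distribution `q` are made up, not tied to any particular library): NCE shifts each logit by -log(k * Q(w)) before the logistic loss, while NEG feeds the raw scores straight into the sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_xent(p, y):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

# Toy scores s(w, c): index 0 is the true word, the rest are k noise draws,
# with a made-up noise distribution Q over those words.
k = 3
scores = np.array([2.0, 0.1, -0.5, 0.3])
q = np.array([0.2, 0.5, 0.2, 0.1])          # Q(w) for each candidate
y = np.array([1.0, 0.0, 0.0, 0.0])          # 1 = real pair, 0 = noise pair

# NCE: shift each logit by -log(k * Q(w)) before the logistic loss.
nce_loss = binary_xent(sigmoid(scores - np.log(k * q)), y)

# NEG: drop the correction and use the raw scores directly.
neg_loss = binary_xent(sigmoid(scores), y)

print(nce_loss, neg_loss)
```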

The following is reposted from Zhihu:

In TensorFlow, the implementations of both of these losses are remarkably concise, although there are plenty of details hiding inside. Here we look at the two loss implementations mainly from the algorithmic point of view.


def nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1,
             sampled_values=None, remove_accidental_hits=False, partition_strategy="mod",
             name="nce_loss"):
  # Draw negative samples and compute one logit per candidate class
  # (the true class plus num_sampled noise classes); subtract_log_q=True
  # subtracts the log expected count of each candidate under the noise
  # distribution, which is the NCE correction.
  logits, labels = _compute_sampled_logits(
      weights=weights, biases=biases, labels=labels, inputs=inputs,
      num_sampled=num_sampled, num_classes=num_classes, num_true=num_true,
      sampled_values=sampled_values, subtract_log_q=True,
      remove_accidental_hits=remove_accidental_hits,
      partition_strategy=partition_strategy, name=name)

  # Binary logistic regression: label 1 for the true class, 0 for sampled classes.
  sampled_losses = sigmoid_cross_entropy_with_logits(
      labels=labels, logits=logits, name="sampled_losses")

  # sampled_losses is batch_size x {true_loss, sampled_losses...}
  # We sum out true and sampled losses.
  return _sum_rows(sampled_losses)

The NCE implementation consists of just these three parts. As the names suggest, _compute_sampled_logits handles the sampling and builds the logits, sigmoid_cross_entropy_with_logits performs logistic regression, i.e. computes the binary cross-entropy loss, and _sum_rows sums the losses across rows.
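For orientation, a minimal usage sketch of the public tf.nn.nce_loss wrapper (TF2 eager mode; the sizes and names such as nce_weights are made up for illustration):

```python
import tensorflow as tf

# Hypothetical sizes for illustration.
vocab_size, embed_dim, batch_size, num_sampled = 10_000, 128, 64, 5

# Output-side parameters: one weight row and one bias per class (word).
nce_weights = tf.Variable(tf.random.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

# Hidden-layer activations and integer class labels for one batch.
inputs = tf.random.normal([batch_size, embed_dim])                              # [batch_size, dim]
labels = tf.random.uniform([batch_size, 1], maxval=vocab_size, dtype=tf.int64)  # [batch_size, num_true]

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=labels, inputs=inputs,
                   num_sampled=num_sampled, num_classes=vocab_size))
```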

Let's go through these three functions from simple to complex, peeling them layer by layer.

  • First, the code of _sum_rows is very simple. Start with the comments: the argument of _sum_rows is the sampled loss, a [batch_size, 1 + num_sampled] matrix. _sum_rows simply builds a ones tensor, multiplies it against the sampled loss, and reshapes the result into a [batch_size] vector; each element of the vector is the sum of the true loss and the sampled losses.
  • From the above, together with what we already know, we can infer that sigmoid_cross_entropy_with_logits computes the element-wise logistic loss from the logits and labels (a small sketch of both steps follows this list).
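Here is a small NumPy sketch of those two steps (plain stand-ins for the TensorFlow ops, assuming one true pair plus two sampled negatives per example):

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(labels, logits):
    # Element-wise binary cross-entropy in the numerically stable form used by TF:
    # max(x, 0) - x*z + log(1 + exp(-|x|))
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def sum_rows(x):
    # Same trick as TF's _sum_rows: matmul with a ones vector, i.e. a row-wise sum.
    ones = np.ones((x.shape[1], 1))
    return (x @ ones).reshape(-1)

# [batch_size, 1 + num_sampled]: column 0 is the true pair, the rest are negatives.
logits = np.array([[2.0, -1.0, 0.5],
                   [1.5, 0.3, -2.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])

per_pair_loss = sigmoid_cross_entropy_with_logits(labels, logits)   # [2, 3]
print(sum_rows(per_pair_loss))  # one loss per example: true loss + sampled losses
```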

 

The issue

There are some issues with learning the word vectors using a "standard" neural network. In this approach, the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).

Predicting the next word is like predicting the class. That is, such a network is just a "standard" multinomial (multi-class) classifier. And this network must have as many output neurons as there are classes. When classes are actual words, the number of neurons is, well, huge.

A "standard" neural network is usually trained with a cross-entropy cost function which requires the values of the output neurons to represent probabilities - which means that the output "scores" computed by the network for each class have to be normalized, converted into actual probabilities for each class. This normalization step is achieved by means of the softmax function. Softmax is very costly when applied to a huge output layer.

The (a) solution

In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (reformulated by [B]) and then used in [C], [D], and [E] to learn word embeddings from unlabelled natural language text.

The basic idea is to convert a multinomial classification problem (which is what predicting the next word amounts to) into a binary classification problem. That is, instead of using softmax to estimate a true probability distribution over the output word, binary logistic regression (binary classification) is used.

For each training sample, the enhanced (optimized) classifier is fed a true pair (a center word and another word that appears in its context) and k randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from corrupted ones, the classifier will ultimately learn the word vectors.
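A minimal sketch of this objective under the usual skip-gram-with-negative-sampling assumptions (the vectors and the number of negatives k are made up; scores are plain dot products):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 100, 5                      # embedding size, number of negatives per true pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One true pair: a center word vector and the output vector of a word from its context.
center_vec = rng.normal(size=dim)
context_vec = rng.normal(size=dim)

# k corrupted pairs: the same center word with output vectors of randomly drawn words.
noise_vecs = rng.normal(size=(k, dim))

# Binary logistic loss: push the true pair's score up, the noise pairs' scores down.
loss = (-np.log(sigmoid(context_vec @ center_vec))
        - np.sum(np.log(sigmoid(-(noise_vecs @ center_vec)))))
print(loss)
```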

This is important: instead of predicting the next word (the "standard" training technique), the optimized classifier simply predicts whether a pair of words is good or bad.

Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution, the unigram distribution raised to the 3/4 power, which favours less frequent words being drawn more often than their raw frequency would suggest.
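A small sketch of that noise distribution, assuming toy word counts; raising the unigram counts to the 3/4 power gives rare words relatively more probability mass:

```python
import numpy as np

# Hypothetical unigram counts for a toy vocabulary.
counts = np.array([1000, 100, 10, 1], dtype=float)

unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()            # word2vec's noise distribution: U(w)^(3/4) / Z

print(unigram)   # rare words get very little mass
print(smoothed)  # rare words get relatively more mass than under the raw unigram

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)  # draw 5 negative word ids
print(negatives)
```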

References

[A] Smith & Eisner (2005). Contrastive estimation: Training log-linear models on unlabeled data.

[B] Gutmann & Hyvärinen (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.

[C] Collobert & Weston (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning.

[D] Mnih & Teh (2012). A fast and simple algorithm for training neural probabilistic language models.

[E] Mnih & Kavukcuoglu (2013). Learning word embeddings efficiently with noise-contrastive estimation.

From: https://stats.stackexchange.com/questions/244616/how-sampling-works-in-word2vec-can-someone-please-make-me-understand-nce-and-ne

The following is reposted from the Zhihu question "Why does negative sampling in word2vec achieve the same effect as softmax?" (word2vec中的负例采样为什么可以得到和softmax一样的效果? - 知乎)

About InfoNCE:

InfoNCE was proposed in the paper Representation Learning with Contrastive Predictive Coding. I will not go into CPC itself in detail here; the focus is on how the idea of NCE was borrowed to derive InfoNCE and how it is used in CPC. If you are not yet familiar with CPC, see my article "Understanding CPC (Contrastive Predictive Coding)".

In short, CPC (Contrastive Predictive Coding) learns (encodes) feature representations of high-dimensional data through an unsupervised task, and the usual unsupervised strategy is to predict future or missing information from the context; NLP has already used this idea to learn word representations.
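A minimal NumPy sketch of the InfoNCE loss for a single context (the score function here is a plain dot product, which is a simplification; CPC itself uses a log-bilinear score):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_negatives = 64, 7

context = rng.normal(size=dim)                       # context representation c_t
positive = rng.normal(size=dim)                      # encoding of the true future sample
negatives = rng.normal(size=(num_negatives, dim))    # encodings drawn from other sequences

# Scores f(x, c); here just dot products.
scores = np.concatenate(([positive @ context], negatives @ context))

# InfoNCE = categorical cross-entropy of identifying the positive among 1 + N candidates.
m = scores.max()
log_softmax = scores - (m + np.log(np.sum(np.exp(scores - m))))
loss = -log_softmax[0]
print(loss)
```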

Partially reposted from the Zhihu article "Noise Contrastive Estimation, Past and Present: from NCE to InfoNCE" (Noise Contrastive Estimation 前世今生——从 NCE 到 InfoNCE - 知乎)

### Categories and Common Types of Negative Sampling Algorithms

Negative sampling is an important technique widely used in natural language processing, recommender systems, and graph neural networks. Its design ranges from very simple to quite sophisticated, depending on the needs of the application.

#### 1. Random Negative Sampling

Random negative sampling is the most basic approach: negatives are drawn at random from the candidate set, using a uniform or some other probability distribution[^1]. Its advantages are simple implementation and low computational cost, but in some settings it fails to surface hard negatives, which can hurt model performance.

#### 2. Noise Contrastive Estimation (NCE)

Noise contrastive estimation is a more elaborate negative sampling strategy: it optimizes an objective that distinguishes real data points from generated noise samples[^3]. Compared with plain random sampling, NCE provides a more precise learning signal and suits tasks that demand higher accuracy.

#### 3. Adaptive Negative Sampling

Adaptive negative sampling dynamically adjusts which items are used as negative instances during training, based on the current state of the model[^1]. Such methods usually take into account factors like the distance between positive and negative samples or the prediction error rate, which speeds up convergence and also improves the final result.

#### 4. Graph-based Negative Sampling

For domain-specific problems such as link prediction in social network analysis or bioinformatics, negative sampling mechanisms tailored to graph-structured data are needed[^4]. The on-the-fly convolutions used in GraphSAGE and PinSage are a typical example: by building local subgraphs they avoid the huge cost of global matrix operations.

#### 5. Hierarchical Softmax

Hierarchical softmax offers an alternative way to handle very large vocabularies: instead of treating every word independently, the words are organized into a tree structure so that lookups become efficient[^3]. It still has limitations to keep in mind, for example it is hard to parallelize.

In summary, the application setting determines which form of negative sampling is appropriate, and as deep learning theory and practice keep advancing, more innovative solutions will likely be proposed to address the shortcomings of existing frameworks.

```python
import numpy as np

def random_negative_sampling(vocab_size, num_negatives=5):
    """Basic random negative sampling: draw negative indices uniformly at random."""
    return np.random.choice(range(vocab_size), size=num_negatives)

# Example call
vocab_size = 10000
neg_samples = random_negative_sampling(vocab_size=vocab_size, num_negatives=10)
print(neg_samples[:10])  # print the first ten sampled negative-example indices
```