Noise Contrastive Estimation

The concept of entropy appears frequently in statistical machine learning, so before introducing NCE and InfoNCE we briefly review entropy and related notions. Information content measures the amount of uncertainty, and entropy can be viewed as the expectation of the information content. Shannon's definitions: for a random variable $X$, the Shannon information is $I(X) = -\log P(X)$, and the Shannon entropy is the expectation of the Shannon information, $H(X) = E(I(X)) = \sum_{x} P(x)I(x) = -\sum_{x} P(x)\log P(x)$. Entropy is maximized when the random variable is uniformly distributed. Shannon's source coding theorem shows that the entropy is a lower bound on the number of bits needed to transmit the state of the random variable.
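
As a quick numerical illustration, here is a minimal NumPy sketch (using natural logarithms, so the values are in nats; the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # treat 0 * log 0 as 0
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform: log 4 ≈ 1.386, the maximum
print(entropy([0.70, 0.10, 0.10, 0.10]))      # more peaked, less uncertainty: ≈ 0.940
```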

For multi-dimensional random variables, the joint entropy is defined as:

$$H(X,Y) = E(I(X,Y)) = \sum_{x,y}P(x,y)I(x,y) = -\sum_{x,y}P(x,y)\log P(x,y)$$

Conditional entropy:

$$H(Y|X) = E_{x}\big(H(Y|X=x)\big) = -\sum_{x} p(x)\sum_{y}p(y|x)\log p(y|x)$$

$$= -\sum_{x}\sum_{y}p(x)p(y|x)\log p(y|x)$$

$$= -\sum_{x,y}p(x,y)\log p(y|x)$$

From the definition we can derive the following property of conditional entropy:

$$H(Y|X) = -\sum_{x,y}p(x,y)\log p(y|x)$$

$$= -\sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)}$$

$$= -\sum_{x,y}p(x,y)\log p(x,y) + \sum_{x,y}p(x,y)\log p(x)$$

$$= -\sum_{x,y}p(x,y)\log p(x,y) + \sum_{x}\log p(x)\sum_{y}p(x,y)$$

$$= H(X,Y) + \sum_{x}p(x)\log p(x)$$

$$= H(X,Y) - H(X)$$
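
A minimal NumPy check of this identity on a small, made-up joint distribution:

```python
import numpy as np

# A small joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)

def H(p):
    """Entropy of a probability vector, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_joint = H(p_xy.ravel())                       # H(X, Y)
H_marg = H(p_x)                                 # H(X)
p_y_given_x = p_xy / p_x[:, None]               # p(y|x)
H_cond = -np.sum(p_xy * np.log(p_y_given_x))    # H(Y|X) from the definition

print(np.isclose(H_cond, H_joint - H_marg))     # True: H(Y|X) = H(X,Y) - H(X)
```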

Relative entropy (also called KL divergence):

If $p(x)$ and $q(x)$ are two probability distributions of a discrete random variable, the relative entropy of $p$ with respect to $q$ is

$$KL(p\|q) = E_{x\sim p(x)}\Big(\log\frac{p(x)}{q(x)}\Big) = \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

  • If $p$ and $q$ are identical, then $KL(p\|q) = KL(q\|p) = 0$
  • In general $KL(p\|q) \ne KL(q\|p)$
  • $KL(p\|q) \ge 0$

Proof:

$$KL(p\|q) = E_{x\sim p(x)}\Big(\log\frac{p(x)}{q(x)}\Big)$$

$$= \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

$$= -\sum_{x}p(x)\log\frac{q(x)}{p(x)}$$

$$= -E_{x\sim p(x)}\Big(\log\frac{q(x)}{p(x)}\Big)$$

$$\ge -\log E_{x\sim p(x)}\Big(\frac{q(x)}{p(x)}\Big) \quad\text{(Jensen's inequality)}$$

$$= -\log\sum_{x}p(x)\frac{q(x)}{p(x)}$$

$$= -\log\sum_x q(x) = 0$$
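
A minimal NumPy sketch illustrating the asymmetry and non-negativity of the KL divergence (the two distributions are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(kl(p, q), kl(q, p))   # both >= 0 and generally unequal: KL is not symmetric
print(kl(p, p))             # 0.0 when the two distributions coincide
```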

Cross entropy:

$$H(p,q) = -\sum_x p(x)\log q(x)$$

Cross entropy has the following property:

$$KL(p\|q) = \sum_{x}p(x)\log\frac{p(x)}{q(x)}$$

$$= \sum_x p(x)\log p(x) - \sum_x p(x)\log q(x)$$

$$= -\sum_x p(x)\log q(x) - \Big(-\sum_x p(x)\log p(x)\Big)$$

$$= H(p,q) - H(p)$$

$$\le H(p,q)$$

Therefore minimizing the cross entropy $H(p,q)$ minimizes an upper bound on the KL divergence. In machine learning we can regard $p(x)$ as the true data distribution and $q(x)$ as the distribution defined by the model, whose parameters are to be estimated (learned). A common estimation (learning) strategy is to minimize the cross entropy; it can be shown that, with the samples fixed, minimizing the cross entropy is equivalent to minimizing the relative entropy (KL divergence), and also equivalent to maximizing the likelihood.

In practice the true data distribution $p(x)$ is unknown, so during learning we approximate it with samples $\{x_i, i=1,2,\ldots,n\}$ drawn from the population, hoping that the learned distribution $q(x)$ matches the sample distribution.

Under the minimum cross-entropy strategy, the loss function is defined as:

$$Loss(\theta) = H(p,q) = -\sum_x p(x)\log q(x;\theta) = -E_{x\sim p(x)}\big(\log q(x;\theta)\big)$$

$$\approx -\frac{1}{n}\sum_{i=1}^n \log q(x_i;\theta)$$

$$\theta^* = \arg\min_{\theta} -\frac{1}{n}\sum_{i=1}^n \log q(x_i;\theta)$$

Under the maximum likelihood strategy, the log-likelihood of the samples is:

$$L(\theta) = \sum_{i=1}^n \log q(x_i;\theta)$$

$$\theta_{MLE} = \arg\max_{\theta}\sum_{i=1}^n \log q(x_i;\theta)$$

It follows that minimum cross-entropy estimation is equivalent to maximum likelihood estimation.
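
A minimal NumPy sketch of this equivalence: the average negative log-likelihood of the samples equals the cross entropy between the empirical distribution and the model (the toy categorical model and sample size are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical model q(x; theta) = softmax(theta) over 4 outcomes.
theta = rng.normal(size=4)
q = np.exp(theta) / np.exp(theta).sum()

# Samples from the (in practice unknown) true distribution p.
samples = rng.choice(4, size=1000, p=[0.1, 0.4, 0.3, 0.2])

# Average negative log-likelihood  -(1/n) sum_i log q(x_i; theta) ...
nll = -np.mean(np.log(q[samples]))
# ... equals the cross entropy H(p_hat, q), with p_hat the empirical distribution.
p_hat = np.bincount(samples, minlength=4) / len(samples)
cross_entropy = -np.sum(p_hat * np.log(q))

print(np.isclose(nll, cross_entropy))   # True: minimizing one minimizes the other
```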

Mutual information:

The mutual information of two random variables $X, Y$ is defined as

$$I(X;Y) = E_{x,y\sim p(x,y)}\Big(\log\frac{p(x,y)}{p(x)p(y)}\Big) = \sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$

Mutual information satisfies:

  • Symmetry: $I(X;Y) = I(Y;X)$
  • Non-negativity: $I(X;Y) \ge 0$, with $I(X;Y) = 0$ when $X$ and $Y$ are independent

Proof:

$$I(X;Y) = E_{x,y\sim p(x,y)}\Big(\log\frac{p(x,y)}{p(x)p(y)}\Big)$$

$$= -E_{x,y\sim p(x,y)}\Big(\log\frac{p(x)p(y)}{p(x,y)}\Big)$$

$$\ge -\log E_{x,y\sim p(x,y)}\Big(\frac{p(x)p(y)}{p(x,y)}\Big) \quad\text{(Jensen's inequality)}$$

$$= -\log\sum_{x,y}p(x,y)\frac{p(x)p(y)}{p(x,y)}$$

$$= -\log 1 = 0$$
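
A minimal NumPy sketch computing mutual information directly from its definition, on a made-up joint distribution:

```python
import numpy as np

# A joint distribution p(x, y) under which X and Y are clearly dependent.
p_xy = np.array([[0.30, 0.05],
                 [0.05, 0.60]])
p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(x), shape (2, 1)
p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(y), shape (1, 2)

# I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(mi)                                   # > 0 because X and Y are dependent

# If X and Y were independent, p(x,y) = p(x)p(y) and the log-ratio is 0 everywhere.
p_indep = p_x * p_y
print(np.sum(p_indep * np.log(p_indep / (p_x * p_y))))   # 0.0
```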

Noise Contrastive Estimation

A language model models the conditional probability of a word $w$ given a context $c$:

$$p(w|c) = p_{\theta}(w|c) = \frac{\exp(s_{\theta}(w,c))}{\sum_{w'\in V}\exp(s_{\theta}(w',c))}\quad\quad (1)$$

Below we consider the problem of estimating the parameters $\theta$ given the context $c$.

Using maximum likelihood estimation, the log-likelihood of the samples is:

$$L(\theta) = \sum_{i=1}^n \log p_{\theta}(w_i|c)$$

$$= \sum_{i=1}^n \log\exp(s_{\theta}(w_i,c)) - \sum_{i=1}^n \log\sum_{w'\in V}\exp(s_{\theta}(w',c))$$

$$= \sum_{i=1}^n s_{\theta}(w_i,c) - n\log\sum_{w'\in V}\exp(s_{\theta}(w',c))$$

where $n$ is the sample size.
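
To make the cost of the normalizer concrete, here is a minimal NumPy sketch of this log-likelihood for a toy log-bilinear score $s_\theta(w,c) = e_w^\top h_c$; the score function, embedding matrix, and sizes are illustrative assumptions, not the setup of any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 128                         # vocabulary size, embedding dimension
E = rng.normal(scale=0.01, size=(V, d))    # output word embeddings (the parameters theta)
h_c = rng.normal(size=d)                   # a fixed context representation for c
words = rng.integers(0, V, size=32)        # observed words w_1..w_n for this context

scores = E @ h_c                           # s_theta(w, c) for every w in V: O(|V| d) work
# log Z(c) = log sum_w exp(s_theta(w, c)), computed with the log-sum-exp trick
log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
log_lik = np.sum(scores[words]) - len(words) * log_Z
print(log_lik)
```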

Maximum likelihood estimation:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)$$

We solve for $\hat{\theta}_{MLE}$ with gradient-based optimization; the gradient is:

$$\frac{\partial L(\theta)}{\partial \theta}=\sum_{i=1}^n\frac{\partial s_{\theta}(w_i,c)}{\partial \theta} - n\sum_{w'\in V}\frac{\exp(s_{\theta}(w',c))}{\sum_{w''\in V}\exp(s_{\theta}(w'',c))}\frac{\partial s_{\theta}(w',c)}{\partial \theta}$$

$$=\sum_{i=1}^n\frac{\partial s_{\theta}(w_i,c)}{\partial \theta} - n\sum_{w'\in V}p_{\theta}(w'|c)\frac{\partial s_{\theta}(w',c)}{\partial \theta}$$

The computation splits into two parts: the first part involves only the individual training samples $(w_i,c)$, while the second requires a sum over the entire vocabulary $V$. When the vocabulary is large this is expensive, so several approximations have been proposed:

  • Compute the second part by sampling: draw samples from $p_{\theta}(w|c)$ and approximate it with $\frac{1}{k}\sum_{i=1}^k\frac{\partial s_{\theta}(w_i,c)}{\partial\theta}$. In practice, sampling from $p_{\theta}(w|c)$ itself is complicated, so another distribution $q(w|c)$ is used as the proposal distribution; this approach is called sampled softmax.

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^n\frac{\partial s_{\theta}(w_i,c)}{\partial \theta} - n\sum_{w'\in V}p_{\theta}(w'|c)\frac{\partial s_{\theta}(w',c)}{\partial \theta}$$

$$=\sum_{i=1}^n\frac{\partial s_{\theta}(w_i,c)}{\partial \theta} - n\,E_{w'\sim p_{\theta}(w|c)}\Big[\frac{\partial s_{\theta}(w',c)}{\partial \theta}\Big]$$

$$\approx \sum_{i=1}^n\frac{\partial s_{\theta}(w_i,c)}{\partial \theta} - n\,\frac{1}{k}\sum_{i=1}^k\frac{\partial s_{\theta}(w'_i,c)}{\partial \theta}$$

  • Noise Contrastive Estimation (NCE)

NCE treats the training samples $(w_i,c)$ as positive samples, labeled $D = 1$, and draws negative samples, labeled $D = 0$, from a known, fixed noise distribution $q(w)$ that does not depend on $c$. With $k$ negative samples drawn per positive sample ($kn$ in total), in the new sample set:

$$p(D = 1) = \frac{n}{n + kn}$$

$$p(D = 0) = 1 - p(D = 1) = \frac{kn}{n + kn}$$

$$p(w|D = 1, c) = p_{\theta}(w|c)$$

$$p(w|D = 0, c) = q(w)$$

$$p(D = 1 | w, c) = \frac{p(D = 1)\,p(w | D = 1,c)}{p(D = 0)\,p(w|D = 0,c) + p(D = 1)\,p(w|D = 1,c)}$$

$$=\frac{p_{\theta}(w|c)}{p_{\theta}(w|c) + kq(w)}$$

$$p(D = 0 | w, c) = 1 - p(D = 1|w, c) = \frac{p(D = 0)\,p(w|D = 0, c)}{p(D = 1)\,p(w|D = 1, c) + p(D = 0)\,p(w|D = 0, c)}$$

$$=\frac{kq(w)}{p_{\theta}(w|c) + kq(w)}$$

The original samples together with the randomly drawn negative samples form a new sample set of size $n + kn$. Applying maximum likelihood estimation again, the log-likelihood of the new samples is:

$$L^c(\theta) = \sum_{i = 1}^{n}\log p_{\theta}(D = 1|w_i,c) + \sum_{i=n+1}^{n+kn}\log p_{\theta}(D = 0|w_i,c)$$

where

$$p_{\theta}(w|c) = \frac{\exp(s_{\theta}(w,c))}{\sum_{w'\in V}\exp(s_{\theta}(w',c))} = \frac{u_{\theta}(w,c)}{Z(c)}$$

$$p(D = 1|w,c) = \frac{p_{\theta}(w|c)}{p_{\theta}(w|c) + kq(w)}$$

$$p(D = 0|w,c) = \frac{kq(w)}{p_{\theta}(w|c) + kq(w)}$$

Note that this likelihood still requires computing the normalizing denominator $Z(c) = \sum_{w'\in V}\exp(s_{\theta}(w',c))$, which remains expensive, so we modify the model as follows:

$$Z(c) = \theta^c$$

$$p_{\theta}(w|c) = \exp(s_{\theta^0}(w,c))/\theta^c = p_{\theta^0}(w|c)/\theta^c$$

The parameters of the new model consist of two parts, $\theta = \{\theta^0, \theta^c\}$:

$p_{\theta^0}(w|c)$ is the unnormalized model, $p_{\theta^0}(w|c) = u_{\theta^0}(w,c) = \exp(s_{\theta^0}(w,c))$, while $\theta^c$ plays the role of the normalizer $Z(c)$.

$$L^c(\theta) = \sum_{i = 1}^{n}\log p_{\theta}(D = 1|w_i,c) + \sum_{i=n+1}^{n+kn}\log p_{\theta}(D = 0|w_i,c)$$

$$=\sum_{i = 1}^{n}\log\frac{p_{\theta}(w_i|c)}{p_{\theta}(w_i|c) + kq(w_i)} + \sum_{j=1}^{kn}\log\frac{kq(\bar w_j)}{p_{\theta}(\bar w_j|c) + kq(\bar w_j)}$$

where $\bar w_j \sim q(w)$ denote the $kn$ noise samples.

The authors found experimentally that simply fixing $\theta^c = 1$ also works well, so only the parameters $\theta^0$ remain to be estimated.
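
Below is a minimal NumPy sketch of this NCE objective with $\theta^c$ fixed to 1, so the unnormalized score $u_{\theta^0}(w,c) = \exp(s_{\theta^0}(w,c))$ is used directly in place of $p_\theta(w|c)$; the log-bilinear score, the uniform noise distribution $q(w)$, and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 50_000, 128, 5                      # vocab size, embedding dim, negatives per positive
E = rng.normal(scale=0.01, size=(V, d))       # word embeddings, the theta^0 parameters
h_c = rng.normal(size=d)                      # context representation for a fixed context c
pos = rng.integers(0, V, size=32)             # observed (positive) words, D = 1
neg = rng.integers(0, V, size=(len(pos), k))  # noise words drawn from q(w) = 1/V, D = 0
q = 1.0 / V                                   # uniform noise probability q(w)

def u(words):
    """Unnormalized model score u(w, c) = exp(s_theta(w, c)), used with theta^c = 1."""
    return np.exp(E[words] @ h_c)

# Binary classification log-likelihood L^c(theta):
#   positives contribute log u / (u + k q(w)), negatives contribute log kq(w) / (u + k q(w)).
obj = (np.log(u(pos) / (u(pos) + k * q)).sum()
       + np.log((k * q) / (u(neg) + k * q)).sum())
print(obj)   # maximize this objective (or minimize its negative) with respect to E
```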

InfoNCE


  • Representation learning: learn good representations through the task of predicting the future.

  • The authors argue that directly modeling the conditional probability $p(x|c)$ is not the best way to extract the information shared by $x$ and $c$.

  • Instead they propose modeling the density ratio between $x$ and $c$, $\frac{p(x|c)}{p(x)}$: $f_k(x_{t+k},c_t) \propto \frac{p(x_{t+k}|c_t)}{p(x_{t+k})}$. Increasing this density ratio increases the mutual information between $x$ and $c$: $I(X;C) = \sum_x\sum_c p(x,c)\log\frac{p(x|c)}{p(x)}$

  • In practice the model used is $f_k(x_{t+k},c_t) = \exp(z^T_{t+k}W_k c_t)$, with $z_t = g_{enc}(x_t)$ and $c_t = g_{ar}(z_{\le t})$

  • Given a training batch $X=\{x_1,x_2,\ldots,x_N\}$ containing one sample drawn from $p(x_{t+k}|c_t)$, which serves as the positive sample, and $N-1$ samples drawn from $p(x_{t+k})$, which serve as negative samples, the InfoNCE loss is defined as: $L_N = -E\Big[\log\frac{f_k(x_{t+k},c_t)}{\sum_{x_j\in X}f_k(x_j, c_t)}\Big]$

  • $I(x_{t+k},c_t) \ge \log N - L_N$

Proof:

Writing $X_{neg}$ for the $N-1$ negative samples and substituting the optimal critic $f_k(x,c_t) = \frac{p(x|c_t)}{p(x)}$ in the third step:

$$L_N = -E\Big[\log \frac{f_k(x_{t+k},c_t)}{f_k(x_{t+k},c_t) + \sum_{x_j\in X_{neg}}f_k(x_j,c_t)}\Big]$$

$$=E\Big[\log \frac{f_k(x_{t+k},c_t) + \sum_{x_j\in X_{neg}}f_k(x_{j},c_t)}{f_k(x_{t+k},c_t)}\Big]$$

$$=E\log\Big[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}\sum_{x_j\in X_{neg}}\frac{p(x_j|c_t)}{p(x_j)}\Big]$$

$$\approx E\log\Big[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}(N-1)\,E_{x_j\sim p(x_j)}\frac{p(x_j|c_t)}{p(x_j)}\Big]$$

$$=E\log\Big[1 + \frac{p(x_{t+k})}{p(x_{t+k}|c_t)}(N-1)\Big]$$

where the last equality uses $E_{x_j\sim p(x_j)}\big[\frac{p(x_j|c_t)}{p(x_j)}\big] = \sum_{x_j}p(x_j|c_t) = 1$. Continuing,

$$\ge E\log\Big[\frac{p(x_{t+k})}{p(x_{t+k}|c_t)}N\Big]$$

$$=-I(x_{t+k},c_t) + \log N$$

Therefore minimizing $L_N$ is equivalent to maximizing a lower bound on the mutual information $I(x_{t+k},c_t)$.
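
A minimal NumPy sketch of the InfoNCE loss for a single prediction step, using the log-bilinear critic $f_k(x,c) = \exp(z^\top W_k c)$ from above; the encoders are replaced by random vectors and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dz, dc = 8, 64, 32                     # batch size N, dims of z_{t+k} and c_t
Z = rng.normal(size=(N, dz))              # z_{t+k} for each sequence in the batch
C = rng.normal(size=(N, dc))              # c_t for each sequence in the batch
W_k = rng.normal(scale=0.1, size=(dz, dc))

# log f_k(x_j, c_i) = z_j^T W_k c_i for every pair (j, i) in the batch.
logits = Z @ W_k @ C.T                    # shape (N, N); the diagonal holds the positive pairs

# L_N = -E[ log f_k(x_i, c_i) / sum_j f_k(x_j, c_i) ]: a softmax cross-entropy over the
# batch, where the "correct class" for context c_i is its own future sample z_i.
logits = logits - logits.max(axis=0, keepdims=True)              # numerical stability
log_softmax = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
L_N = -np.mean(np.diag(log_softmax))

print(L_N, "log N =", np.log(N))          # I(x_{t+k}, c_t) >= log N - L_N
```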
