word2vec based on Negative Sampling
CBOW
The meaning of negative sampling: for a center word w and its context Context(w), given Context(w) we want the probability of producing w to be as large as possible, and the probability of producing any other word to be as small as possible. Here w serves as the positive sample, and words other than w serve as negative samples. We do not use every other word in the vocabulary, however; instead we draw neg words from the vocabulary as negative samples, using the negative sampling algorithm (sketched below).
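The sampling algorithm itself is worth a sketch. In the original word2vec, negatives are drawn from the unigram distribution raised to the 3/4 power, which up-weights rare words. A minimal Python illustration (the function names and the rejection loop are my own choices, not the exact table-based implementation in the word2vec C code):

```python
import numpy as np

def unigram_probs(word_counts, power=0.75):
    # word2vec draws negatives from the unigram distribution
    # raised to the 3/4 power, which up-weights rare words.
    probs = np.asarray(word_counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(probs, positive_idx, neg, rng):
    # Draw `neg` word indices, rejecting the positive word itself.
    samples = []
    while len(samples) < neg:
        idx = int(rng.choice(len(probs), p=probs))
        if idx != positive_idx:
            samples.append(idx)
    return samples

rng = np.random.default_rng(0)
probs = unigram_probs([50, 30, 10, 5, 5])   # toy word counts
print(sample_negatives(probs, positive_idx=0, neg=3, rng=rng))
```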
Logistic regression is used to separate the positive and negative samples described above:
$$P(u \mid Context(w)) = \sigma(X_w^T\theta^u), \quad u \in \{w\} \cup NEG(w)$$
$X_w$ is the sum of the word vectors of the words in Context(w).
The meaning of $\sigma$ here: $\sigma(X_w^T\theta^u)$ is the probability that the word $u$ is the positive sample. Accordingly, when $u$ is in fact a negative sample, the probability of labeling it correctly (as a negative sample) is
$$1 - \sigma(X_w^T\theta^u)$$
Then, for a given input $u$, its probability under the model is:
$$P(u \mid Context(w)) = \begin{cases} \sigma(X_w^T\theta^u), & u = w \\ 1 - \sigma(X_w^T\theta^u), & u \neq w \end{cases}$$
$$= \sigma(X_w^T\theta^u)^{y^u}\,\big(1-\sigma(X_w^T\theta^u)\big)^{1-y^u}$$
(where $y^u = 1$ when $u = w$, and $y^u = 0$ otherwise)
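As a quick numeric check that the case form and the exponent form agree (a toy sketch; all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x_w = np.array([0.2, -0.1, 0.4])      # X_w: sum of context word vectors
theta_u = np.array([0.3, 0.5, -0.2])  # theta^u for some word u
p = sigmoid(x_w @ theta_u)

for y_u in (1, 0):  # y^u = 1 if u = w, else 0
    case_form = p if y_u == 1 else 1.0 - p
    unified = p ** y_u * (1.0 - p) ** (1 - y_u)
    assert np.isclose(case_form, unified)
```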
For a given corpus $C$, consider:
$$L = \prod_{w \in C}\prod_{u \in \{w\} \cup NEG(w)} P(u \mid Context(w)) = \prod_{w \in C}\Big[\sigma(X_w^T\theta^w)\prod_{u \in NEG(w)}\big(1-\sigma(X_w^T\theta^u)\big)\Big] \qquad (1)$$
$$= \prod_{w \in C}\prod_{u \in \{w\} \cup NEG(w)} \sigma(X_w^T\theta^u)^{y^u}\big(1-\sigma(X_w^T\theta^u)\big)^{1-y^u} \qquad (2)$$
The larger this value, the better.
Taking the logarithm of $L$ (and still calling it $L$):
$$L = \sum_{w \in C}\sum_{u \in \{w\} \cup NEG(w)} \Big\{ y^u \log\sigma(X_w^T\theta^u) + (1-y^u)\log\big(1-\sigma(X_w^T\theta^u)\big) \Big\}$$
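Each $(Context(w), u)$ pair therefore contributes one binary cross-entropy term. A sketch of the inner sum for a single center word (the names and the matrix layout are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_one_word(x_w, thetas, labels):
    # Inner sum of L for one center word: `thetas` stacks theta^u for
    # u in {w} ∪ NEG(w); `labels` holds the corresponding y^u values.
    p = sigmoid(thetas @ x_w)
    return float(np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

x_w = np.array([0.2, -0.1, 0.4])
thetas = np.array([[0.3, 0.5, -0.2],    # theta^w (positive, y^u = 1)
                   [0.1, -0.4, 0.2],    # negative samples (y^u = 0)
                   [-0.3, 0.2, 0.1]])
labels = np.array([1, 0, 0])
print(log_likelihood_one_word(x_w, thetas, labels))
```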
Taking partial derivatives of $L$:
$$\frac{\partial L}{\partial \theta^u} = \big[y^u - \sigma(X_w^T\theta^u)\big]X_w$$
$$\frac{\partial L}{\partial X_w} = \sum_{u \in \{w\} \cup NEG(w)}\big[y^u - \sigma(X_w^T\theta^u)\big]\theta^u$$
The update rules are then:
$$\theta^u = \theta^u + \eta\big[y^u - \sigma(X_w^T\theta^u)\big]X_w$$
$$V(\hat w) = V(\hat w) + \eta \sum_{u \in \{w\} \cup NEG(w)}\big[y^u - \sigma(X_w^T\theta^u)\big]\theta^u, \quad \hat w \in Context(w)$$
$V(\hat w)$ denotes the word vector of each individual word in Context(w): every context word vector is updated separately, not their sum $X_w$.
(Note: $\theta^u$ would be better written as $\theta^u_w$.)
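Putting the gradients and both update rules together, one CBOW training step might look like the following sketch (assuming `in_vecs` holds the input word vectors $V(\cdot)$ and `out_vecs` the parameters $\theta^u$, with `eta` the learning rate $\eta$; all names are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_step(in_vecs, out_vecs, context_idxs, w_idx, neg_idxs, eta):
    # One SGD step for a single (Context(w), w) training pair.
    x_w = in_vecs[context_idxs].sum(axis=0)       # X_w: sum of context vectors
    grad_x = np.zeros_like(x_w)                   # accumulates dL/dX_w
    for u, y_u in [(w_idx, 1)] + [(z, 0) for z in neg_idxs]:
        g = y_u - sigmoid(x_w @ out_vecs[u])      # y^u - sigma(X_w^T theta^u)
        grad_x += g * out_vecs[u]                 # one term of dL/dX_w
        out_vecs[u] += eta * g * x_w              # theta^u update
    for c in context_idxs:                        # update each V(w_hat)
        in_vecs[c] += eta * grad_x                # individually, not X_w
```

Note the final loop: every context word vector receives the gradient individually, matching the remark above. Accumulating `grad_x` before updating `out_vecs[u]` mirrors the order used in the original word2vec C code.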
Skip-gram
Given $w$, we want the probability of generating $Context(w)$. How is $Context(w)$ predicted? Based on $w$, we compute the probability of each word $u$ in $Context(w)$ and take the product of those probabilities. For each positive sample $u$, the negative sampling algorithm draws neg negative samples $NEG(u)$.
Using binary logistic regression, we want the probability of the positive sample, $P(u \mid w)$, to be as large as possible, and the probability of each negative sample, $P(z \mid w)$ for $z \in NEG(u)$, to be as small as possible, i.e., we want $1 - P(z \mid w)$ to be as large as possible.
Let
$$g(w) = \prod_{u \in Context(w)}\Big\{ P(u \mid w)\cdot\prod_{z \in NEG(u)}\big(1-P(z \mid w)\big) \Big\} = \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)}\sigma(V(w)^T\theta_w^z)^{y^z}\big(1-\sigma(V(w)^T\theta_w^z)\big)^{1-y^z}$$
where $y^z = 1$ when $z = u$, and $y^z = 0$ otherwise.
Define:
$$L = \log\prod_{w \in C} \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)}\sigma(V(w)^T\theta_w^z)^{y^z}\big(1-\sigma(V(w)^T\theta_w^z)\big)^{1-y^z}$$
Then take partial derivatives, exactly as in the CBOW case.
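Carrying out that derivation (with $V(w)$ playing the role of $X_w$ and $\theta_w^z$ the role of $\theta^u$) gives
$$\frac{\partial L}{\partial \theta_w^z} = \big[y^z - \sigma(V(w)^T\theta_w^z)\big]V(w), \qquad \frac{\partial L}{\partial V(w)} = \sum_{u \in Context(w)}\sum_{z \in \{u\} \cup NEG(u)}\big[y^z - \sigma(V(w)^T\theta_w^z)\big]\theta_w^z$$
with the corresponding SGD updates. A sketch of one skip-gram training step under the same assumptions as the CBOW sketch above (all names are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_ns_step(in_vecs, out_vecs, w_idx, context_idxs, negs_per_u, eta):
    # One SGD step for a single center word w.
    # negs_per_u[i] holds NEG(u) for the i-th context word u.
    v_w = in_vecs[w_idx].copy()                   # V(w)
    grad_v = np.zeros_like(v_w)                   # accumulates dL/dV(w)
    for u, negs in zip(context_idxs, negs_per_u):
        for z, y_z in [(u, 1)] + [(n, 0) for n in negs]:
            g = y_z - sigmoid(v_w @ out_vecs[z])  # y^z - sigma(V(w)^T theta_w^z)
            grad_v += g * out_vecs[z]             # one term of dL/dV(w)
            out_vecs[z] += eta * g * v_w          # theta_w^z update
    in_vecs[w_idx] += eta * grad_v                # V(w) update
```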
References:
word2vec 中的数学原理详解(五)基于 Negative Sampling 的模型
word2vec原理(三) 基于Negative Sampling的模型