3.3 Skip-Gram Model
Another approach is to create a model that uses the center word to generate (predict) the surrounding context words; this is the Skip-Gram model. The setup is largely the same as in CBOW, but we essentially swap our x and y: the one-hot vector of the center word is now the input, and the context words are the outputs.
How does it work?
1. We generate the one-hot input vector $x \in \mathbb{R}^{|V|}$ for the center word.
2. We get our embedded word vector for the center word: $v_c = Vx \in \mathbb{R}^{n}$, where $V$ is the input word matrix.
3. Generate a score vector $z = Uv_c \in \mathbb{R}^{|V|}$, where $U$ is the output word matrix. Since the dot product of similar vectors is higher, training pushes similar words close to each other in order to achieve a high score.
4. Turn the scores into probabilities $\hat{y} = \text{softmax}(z) \in \mathbb{R}^{|V|}$. Note that $\hat{y}_{c-m}, \dots, \hat{y}_{c-1}, \hat{y}_{c+1}, \dots, \hat{y}_{c+m}$ are the probabilities of observing each context word.
5. We want the generated probabilities $\hat{y}$ to match the true probabilities, i.e. the one-hot vectors $y^{(c-m)}, \dots, y^{(c-1)}, y^{(c+1)}, \dots, y^{(c+m)}$ of the actual context words (a small numerical sketch of these steps follows the list).
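To make these steps concrete, here is a minimal NumPy sketch of the forward pass. The dimensions, the random matrices `V_in` and `U_out`, and the chosen center index are toy assumptions for illustration only, not values from the text.

```python
import numpy as np

# Toy Skip-Gram forward pass (steps 1-4 above); all sizes and values are made up.
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
V_in = rng.normal(size=(embed_dim, vocab_size))   # input word matrix V
U_out = rng.normal(size=(vocab_size, embed_dim))  # output word matrix U

center = 3                                 # index of the center word
x = np.zeros(vocab_size)
x[center] = 1.0                            # step 1: one-hot input vector
v_c = V_in @ x                             # step 2: embedded center word v_c = Vx
z = U_out @ v_c                            # step 3: score vector z = U v_c
y_hat = np.exp(z - z.max())                # step 4: softmax (shifted for stability)
y_hat /= y_hat.sum()
print(y_hat)                               # probabilities over the whole vocabulary
```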
As in CBOW, we minimize the negative log likelihood of the context words given the center word. Note that

$$J = -\sum_{\substack{j=0 \\ j \neq m}}^{2m} \log P(u_{c-m+j} \mid v_c) = \sum_{\substack{j=0 \\ j \neq m}}^{2m} H(\hat{y},\, y_{c-m+j}),$$

where $H(\hat{y}, y_{c-m+j})$ is the cross-entropy between the probability vector $\hat{y}$ and the one-hot vector $y_{c-m+j}$.
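Because each target $y_{c-m+j}$ is one-hot, each cross-entropy term reduces to $-\log \hat{y}_{c-m+j}$, so the loss is just a sum of negative log probabilities over the context positions. A tiny sketch with made-up probabilities and context indices:

```python
import numpy as np

# Cross-entropy against a one-hot target keeps a single term: -log(y_hat[o]).
# The probabilities and context indices below are toy values for illustration.
y_hat = np.array([0.05, 0.20, 0.10, 0.40, 0.15, 0.10])  # softmax output over a toy vocab
context = [1, 2, 4, 5]                                   # indices of the 2m context words
J = -sum(np.log(y_hat[o]) for o in context)              # sum of the 2m cross-entropy terms
print(J)
```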
Skip-Gram treats each context word equally: the model computes the probability of each word appearing in the context independently of its distance to the center word.
Drawback
Loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all $|V|$ scores. To get around this, a simple idea is to approximate the normalization instead of computing it exactly. One such method is called Negative Sampling.
Negative Sampling
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We “sample” from a noise distribution $P_n(w)$ whose probabilities match the ordering of word frequencies in the vocabulary.
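As a sketch of what sampling from such a noise distribution could look like in practice (the word counts and the number of negatives K below are made-up assumptions):

```python
import numpy as np

# Draw K negative word indices from a noise distribution P_n(w) built from
# (hypothetical) corpus word counts, so frequent words are sampled more often.
rng = np.random.default_rng(0)
counts = np.array([50.0, 30.0, 10.0, 5.0, 3.0, 2.0])  # toy word counts
p_noise = counts / counts.sum()                       # unigram noise distribution

K = 5                                                 # negatives per training pair
negatives = rng.choice(len(counts), size=K, p=p_noise)
print(negatives)
```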
While negative sampling is built on top of the Skip-Gram model (or CBOW), it in fact optimizes a different objective.
Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let $P(D = 1 \mid w, c)$ denote the probability that $(w, c)$ came from the corpus data; correspondingly, $P(D = 0 \mid w, c)$ is the probability that it did not. We model $P(D = 1 \mid w, c)$ with the sigmoid function:

$$P(D = 1 \mid w, c, \theta) = \sigma(u_w^\top v_c) = \frac{1}{1 + e^{-u_w^\top v_c}}$$
Now we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and to maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach over these two probabilities. (Here $\theta$ denotes the parameters of the model, which in our case are the word vectors $u$ and $v$.)

$$\theta = \arg\max_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c, \theta)$$
Note that maximizing the likelihood is the same as minimizing the negative log likelihood. Using $P(D = 0 \mid w, c, \theta) = 1 - \sigma(u_w^\top v_c) = \sigma(-u_w^\top v_c)$, this gives

$$J = -\sum_{(w,c) \in D} \log \sigma(u_w^\top v_c) - \sum_{(w,c) \in \tilde{D}} \log \sigma(-u_w^\top v_c)$$
Note that $\tilde{D}$ is a “false” or “negative” corpus containing unnatural sentences like “stock boil fish is toy” that should have a low probability of ever occurring. We can generate $\tilde{D}$ on the fly by randomly sampling these negatives from the word bank.
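A minimal sketch of this corpus-level objective, assuming toy embeddings and hand-picked positive/negative index pairs (none of these values come from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy embeddings: U holds "output" vectors u_w, V_in holds "input" vectors v_c.
rng = np.random.default_rng(1)
vocab_size, dim = 6, 4
U = rng.normal(size=(vocab_size, dim))
V_in = rng.normal(size=(vocab_size, dim))

D = [(0, 1), (2, 3)]        # hypothetical true (word, context) index pairs
D_tilde = [(0, 4), (2, 5)]  # hypothetical sampled "negative" pairs

J = -sum(np.log(sigmoid(U[w] @ V_in[c])) for w, c in D) \
    - sum(np.log(sigmoid(-(U[w] @ V_in[c]))) for w, c in D_tilde)
print(J)
```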
For Skip-Gram, our new objective function for observing the context word $c - m + j$ given the center word $c$ would be

$$-\log \sigma(u_{c-m+j}^\top v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^\top v_c)$$

For CBOW, our new objective function for observing the center word $u_c$ given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \dots + v_{c+m}}{2m}$ would be

$$-\log \sigma(u_c^\top \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^\top \hat{v})$$
In the above formulation, $\{\tilde{u}_k \mid k = 1 \dots K\}$ are sampled from $P_n(w)$. There is much discussion about what makes the best approximation; what seems to work best is the unigram distribution raised to the power of 3/4. Why 3/4? Here’s an example that might help gain some intuition:

is: $0.9^{3/4} = 0.92$
Constitution: $0.09^{3/4} = 0.16$
bombastic: $0.01^{3/4} = 0.032$

“bombastic” is now 3x more likely to be sampled, while “is” only went up marginally.
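A quick numerical sketch of this reweighting; the final renormalization into an actual distribution is an added detail for illustration, not from the text:

```python
import numpy as np

# Raise the example's unigram probabilities to the 3/4 power.
words = ["is", "Constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])
raised = unigram ** 0.75

for w, p, r in zip(words, unigram, raised):
    print(f"{w:>12}: {p:.2f} -> {r:.3f}  ({r / p:.1f}x)")

# Renormalizing gives the actual noise distribution P_n(w); rare words keep
# the biggest relative boost, which is the point of the 3/4 exponent.
p_noise = raised / raised.sum()
print(p_noise)
```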