Efficient estimation of word representations in vector space

Sharp tools make good work. (工欲善其事，必先利其器)

Today I’ll explore the word vectors presented by Mikolov et al. in the paper “Efficient estimation of word representations in vector space”. The paper proposes two novel model architectures for learning vector representations of words that significantly improve the quality of the word vectors at a lower computational cost. The resulting vectors are evaluated on a word similarity task using a word-offset technique, in which simple algebraic operations are performed on the word vectors.

In this paper, Mikolov et al. give a short summary of previously proposed model architectures, including the well-known NNLM and RNNLM, and propose two new log-linear models called CBOW and Skip-gram.

The CBOW model is similar to the feedforward NNLM, except that the non-linear hidden layer is removed and the projection layer is shared for all words. Its objective is to use words from both the history and the future to correctly classify the current (middle) word. Unlike a standard bag-of-words model, it uses a continuous distributed representation of the context.

The Skip-gram model is similar to CBOW, but instead of predicting the current word from its context, it tries to maximize classification of a word based on another word in the same sentence. More precisely, each current word is fed into a log-linear classifier with a continuous projection layer, and the output is used to predict words within a certain range before and after the current word. Increasing this range improves the quality of the resulting word vectors, but it also increases the computational cost.
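To make the two training objectives concrete, here is a minimal sketch (my own, not from the paper) of how training pairs can be built from a tokenized sentence: CBOW pairs the 2C surrounding word indices with the center word, while Skip-gram pairs the center word with each surrounding word individually. The window size C and the toy indices are assumptions for illustration.

# Minimal sketch of building training pairs (illustrative only).
C = 2  # context window size on each side (assumed value)

def build_pairs(token_ids):
    cbow_pairs, skipgram_pairs = [], []
    for i in range(C, len(token_ids) - C):
        context = token_ids[i - C:i] + token_ids[i + 1:i + C + 1]  # the 2C surrounding words
        center = token_ids[i]
        cbow_pairs.append((context, center))                 # CBOW: context -> center word
        skipgram_pairs.extend((center, w) for w in context)  # Skip-gram: center word -> each context word
    return cbow_pairs, skipgram_pairs

# Toy example: a sentence already encoded as word indices
cbow_pairs, skipgram_pairs = build_pairs([5, 12, 7, 3, 9, 4, 8])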

The architectures of the two models are shown below.
[Figure: CBOW and Skip-gram model architectures]

Below is my PyTorch code implementing the Skip-gram and CBOW models.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters (example values; set these to match your corpus)
MAX_VOCAB_SIZE = 30000             # vocabulary size
EMBEDDING_SIZE = 100               # dimensionality of the word vectors
INIT_RANGE = 0.5 / EMBEDDING_SIZE  # range for uniform weight initialisation
C = 2                              # context window size on each side


class Skip_gram(nn.Module):
    def __init__(self):
        super(Skip_gram, self).__init__()
        # Input embedding: one EMBEDDING_SIZE-dimensional vector per vocabulary word
        self.embedding = nn.Embedding(MAX_VOCAB_SIZE, EMBEDDING_SIZE)
        self.embedding.weight.data.uniform_(-INIT_RANGE, INIT_RANGE)

        # Output projection back onto the vocabulary
        self.outLayer = nn.Linear(EMBEDDING_SIZE, MAX_VOCAB_SIZE)

    def forward(self, X):
        # X: (B,) tensor of center-word indices
        embedded = self.embedding(X)          # B x EMBEDDING_SIZE
        output = self.outLayer(embedded)      # B x MAX_VOCAB_SIZE
        # Log-probabilities over the vocabulary; pair with nn.NLLLoss during training
        return F.log_softmax(output, dim=-1)


class CBOW(nn.Module):
    def __init__(self):
        super(CBOW, self).__init__()
        # Shared projection layer: the same embedding matrix is used for every context position
        self.embedding = nn.Embedding(MAX_VOCAB_SIZE, EMBEDDING_SIZE)
        self.embedding.weight.data.uniform_(-INIT_RANGE, INIT_RANGE)

        self.outLayer = nn.Linear(EMBEDDING_SIZE, MAX_VOCAB_SIZE)

    def forward(self, X):
        # X: (B, 2C) tensor of context-word indices
        embedded = self.embedding(X)          # B x 2C x EMBEDDING_SIZE
        embedded = embedded.sum(1) / (2 * C)  # average the context vectors -> B x EMBEDDING_SIZE
        output = self.outLayer(embedded)      # B x MAX_VOCAB_SIZE
        return F.log_softmax(output, dim=-1)
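For completeness, here is a rough sketch of how the Skip_gram model above could be trained on the skipgram_pairs built earlier. The optimizer, learning rate, and single-example batches are my own assumptions; the paper itself relies on tricks such as hierarchical softmax to avoid computing the full softmax over the vocabulary.

# Rough training-loop sketch (assumed setup, not the paper's training procedure).
model = Skip_gram()
optimizer = torch.optim.SGD(model.parameters(), lr=0.025)
criterion = nn.NLLLoss()  # expects the log-probabilities returned by forward()

for epoch in range(5):
    for center, context in skipgram_pairs:
        center_t = torch.tensor([center])    # batch of size 1
        context_t = torch.tensor([context])
        log_probs = model(center_t)          # 1 x MAX_VOCAB_SIZE
        loss = criterion(log_probs, context_t)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()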

To compare the quality of different versions of word vectors, previous papers typically show a table of example words together with their most similar words, which readers judge intuitively. However, there can be many different kinds of similarity between words; for example, big is similar to bigger in the same sense that small is similar to smaller. Mikolov et al. therefore ask how to find a word that is similar to small in the same sense that biggest is similar to big. The question can be answered by simply computing the vector X = vector(“biggest”) - vector(“big”) + vector(“small”) and searching the vector space for the word closest to X by cosine distance. When the word vectors are well trained, this method finds the correct answer (smallest).
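As an illustration, here is a small sketch of that vector-offset search over the embedding matrix learned above, using cosine similarity; the word2idx and idx2word vocabulary mappings are assumed to exist from whatever preprocessing produced the training data.

# Sketch of the vector-offset analogy search (word2idx / idx2word are assumed
# vocabulary mappings from the preprocessing step).
def analogy(a, b, c, model, word2idx, idx2word, topk=5):
    E = model.embedding.weight.data                       # MAX_VOCAB_SIZE x EMBEDDING_SIZE
    x = E[word2idx[a]] - E[word2idx[b]] + E[word2idx[c]]  # e.g. vector("biggest") - vector("big") + vector("small")
    sims = F.cosine_similarity(x.unsqueeze(0), E)         # cosine similarity to every word vector
    best = sims.topk(topk + 3).indices.tolist()           # a few extra so the query words can be skipped
    return [idx2word[i] for i in best if idx2word[i] not in (a, b, c)][:topk]

# analogy("biggest", "big", "small", model, word2idx, idx2word)
# should rank "smallest" near the top when the vectors are well trained.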

From my perspective, the most valuable contribution of this paper is the two novel, computationally efficient model architectures for obtaining high-quality word vectors. Many existing NLP applications, such as machine translation, information retrieval, and question answering systems, can benefit from these architectures, and they may enable other future applications yet to be invented.
