2019-CS224n-Assignment1

License: CC 4.0 BY-SA

Tags: #NLP #CS224N #word-vectors #co-occurrence-matrix #word2vec

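This post explores what word vectors are and how they are built, covering the count-based co-occurrence matrix and the prediction-based word2vec approach. It works through the code to show how word vectors are generated, looks at how they are used in NLP tasks, and examines the bias they can carry.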

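Last winter I worked through the 2017 edition of cs224n and did its three assignments, which used TensorFlow. This year cs224n has been released again, now with five assignments that use PyTorch. The lecturer is still Manning. I really like him: his lectures are lively and fun, and he is rather endearing too, haha.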

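Assignment 1 (click to download) is about exploring word vectors. Using two approaches, the count-based co-occurrence matrix and the prediction-based word2vec, it computes word similarity and studies properties such as synonyms and antonyms. Working through them at the code level makes them easier to understand and remember.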

The assignment is an ipynb file, so it has to be opened with Jupyter; see chaibubble's post on how to open ipynb files.

Note: Python version >= 3.5

Word vectors

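Word vectors are a basic building block of downstream NLP tasks (such as question answering, text generation, and translation), and their quality can strongly affect downstream performance. Here we explore two kinds of word vectors: those derived from a co-occurrence matrix and those produced by word2vec.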

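Terminology: "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to encoding a word in a low-dimensional space. "Conceptually, it means embedding a high-dimensional space, with one dimension per word in the vocabulary, into a continuous vector space of much lower dimension, where each word or phrase is mapped to a vector of real numbers." (Wikipedia)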
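As a rough illustration of that definition, the sketch below treats each word as a one-hot vector whose dimension equals the vocabulary size and maps it to a dense vector by matrix multiplication. The toy vocabulary, the target dimension of 2, and the random matrix `E` are all made up for illustration; real embeddings are computed or learned from data.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one has tens of thousands of words.
vocab = ["all", "glitters", "gold", "is", "not", "that"]
word2ind = {w: i for i, w in enumerate(vocab)}

# Made-up embedding matrix: one 2-dimensional row per word.
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 2))

def one_hot(word):
    """High-dimensional representation: a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2ind[word]] = 1.0
    return v

def embed(word):
    """Multiplying a one-hot vector by E selects the matching row of E."""
    return one_hot(word) @ E

print(one_hot("gold").shape)  # dimension = vocabulary size
print(embed("gold").shape)    # dimension = 2
```

In practice the multiplication is replaced by a direct row lookup, `E[word2ind[word]]`, which gives the same vector.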

Part 1: Count-based word vectors

Most word vector models start from the following idea:

You shall know a word by the company it keeps (Firth, J. R. 1957:11)

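Many word vector implementations are driven by the idea that similar words, i.e. (near) synonyms, appear in similar contexts. Here we introduce one strategy built on that idea, the co-occurrence matrix (for more information, see here or here).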

The goal of this part is to take a corpus and compute a word vector for every word in it from the co-occurrence matrix. The steps are:

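Compute the set of distinct words in the corpus
Compute the co-occurrence matrix
Reduce the dimensionality with SVD
Analyze the word vectors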
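The four steps above can be sketched end to end on a toy corpus. This is a minimal illustration, not the assignment's solution: the corpus, the window size of 1, the target dimension `k = 2`, and the use of plain `np.linalg.svd` are all assumptions made here, and the assignment itself may use a different SVD routine.

```python
import numpy as np

# Hypothetical toy corpus; each document is a list of tokens.
corpus = [
    ["START", "all", "that", "glitters", "is", "not", "gold", "END"],
    ["START", "all", "is", "well", "that", "ends", "well", "END"],
]
window_size = 1  # small window so the counts are easy to check by hand

# Step 1: sorted vocabulary and word-to-index mapping.
words = sorted({w for doc in corpus for w in doc})
word2ind = {w: i for i, w in enumerate(words)}

# Step 2: count every word within window_size positions of each center word.
M = np.zeros((len(words), len(words)), dtype=np.int64)
for doc in corpus:
    for p, center in enumerate(doc):
        lo = max(0, p - window_size)
        hi = min(len(doc), p + window_size + 1)
        for q in range(lo, hi):
            if q != p:
                M[word2ind[center], word2ind[doc[q]]] += 1

# Step 3: keep the top-k singular directions as k-dimensional word vectors.
k = 2
U, S, Vt = np.linalg.svd(M.astype(float), full_matrices=False)
M_reduced = U[:, :k] * S[:k]

# Step 4: compare words, here with cosine similarity.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(M_reduced.shape)
print(cosine(M_reduced[word2ind["all"]], M_reduced[word2ind["that"]]))
```

Each row of `M_reduced` is the vector for the word at the same index in `words`, so `word2ind` is all that is needed to look one up.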

Question 1.1: Implement distinct_words

Compute the sorted list of distinct words in the corpus and how many there are.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    # Flatten the documents, deduplicate with a set, then sort.
    corpus_words = sorted({w for sent in corpus for w in sent})
    num_corpus_words = len(corpus_words)

    # ------------------

    return corpus_words, num_corpus_words
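A quick way to check the function is to run it on a tiny corpus whose answer can be worked out by hand. The snippet below restates the function compactly so that it runs on its own; the two-sentence `test_corpus` is a made-up example.

```python
def distinct_words(corpus):
    # Compact restatement of the function above so this snippet is self-contained.
    corpus_words = sorted({w for doc in corpus for w in doc})
    return corpus_words, len(corpus_words)

# Hypothetical two-document corpus.
test_corpus = [
    "START All that glitters isn't gold END".split(" "),
    "START All's well that ends well END".split(" "),
]

words, num_words = distinct_words(test_corpus)
print(words)
print(num_words)  # 10
```

Note that Python's `sorted` orders strings by code point, so capitalized tokens such as `START` and `END` come before every lowercase word.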

Question 1.2: Implement compute_co_occurrence_matrix

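Compute the co-occurrence matrix for a given corpus. Specifically, for each word w, count how many times every other word appears within window_size words before or after it.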
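To make the windowing concrete, the sketch below lists the context of a single position using slicing. The helper `context_words` is a hypothetical name introduced here for illustration; it is not part of the assignment.

```python
def context_words(doc, p, window_size):
    """Return the words within window_size positions before and after position p."""
    before = doc[max(0, p - window_size):p]   # clipped at the start of the document
    after = doc[p + 1:p + 1 + window_size]    # slicing clips at the end automatically
    return before + after

doc = "START All that glitters is not gold END".split()

print(context_words(doc, 1, 4))
# ['START', 'that', 'glitters', 'is', 'not']
```

Words near the edges simply get a shorter context: `START` at position 0 has nothing before it, so its window holds only the four words that follow.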

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(num_words, num_words), dtype=np.int32)
    for i in range(num_words):
        word2Ind[words[i]] = i
    
    for sent in corpus:
        for p in range(len(sent)):
            ci = word2Ind[sent[p]]
            
            # preceding
            for w in sent[max(0, p - window_size):p]:
                wi = word2Ind[w]
                M[ci][wi] += 1
            
            # subsequent
            for w in sent[p +