Word Vectors from Co-occurrence Matrices and Prediction: Comparing GloVe with a Custom Method

Article link: https://blog.youkuaiyun.com/weixin_46866349/article/details/137237774

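This article shows how to build word vectors from a co-occurrence matrix, how to use pre-trained GloVe word vectors, how to measure the similarity between words with cosine similarity, and how word vectors can be applied intuitively in NLP tasks.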

Word Vectors

1. Import libraries

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)

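The terms "word vectors" and "word embeddings" are often used interchangeably.
Word vectors are a basic building block for downstream NLP tasks such as question answering, text generation, and translation, so it helps to build some intuition for their strengths and weaknesses.
Here you will explore two kinds of word vectors: those derived from co-occurrence matrices and off-the-shelf GloVe vectors. gensim is a library that can load pre-trained word vectors, and nltk can load a variety of corpora; this example uses the reuters corpus (Reuters business and financial news).
Below is an example of a co-occurrence matrix with window size n = 1. For each word w (token) in a document, we count how often every word within the n words to its left and the n words to its right, i.e. positions [w-n, w+n], appears together with w.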

Document 1: "all that glitters is not gold"
Document 2: "all is well that ends well"

*	`<START>`	all	that	glitters	is	not	gold	well	ends	`<END>`
`<START>`	0	2	0	0	0	0	0	0	0	0
all	2	0	1	0	1	0	0	0	0	0
that	0	1	0	1	0	0	0	1	1	0
glitters	0	0	1	0	1	0	0	0	0	0
is	0	1	0	1	0	1	0	1	0	0
not	0	0	0	0	1	0	1	0	0	0
gold	0	0	0	0	0	1	0	0	0	1
well	0	0	1	0	1	0	0	0	1	1
ends	0	0	1	0	0	0	0	1	0	0
`<END>`	0	0	0	0	0	0	1	1	0	0

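Here all and <START> co-occur 2 times. The matrix is symmetric and sparse, and its size is V × V, where V is the number of distinct words in the corpus. Note: in NLP we often add <START> and <END> tokens to mark the beginning and end of a sentence, paragraph, or document. In this case we imagine <START> and <END> tokens wrapping each document, e.g. "<START> All that glitters is not gold <END>", and we include these tokens in the co-occurrence counts.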
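
The counts in the table can be reproduced with a few lines of plain Python. This is a minimal sketch, not the function used later in this article: the name toy_cooccurrence and its window_size parameter are illustrative assumptions, and rows and columns follow sorted vocabulary order rather than the column order of the table above.

```python
START_TOKEN = '<START>'
END_TOKEN = '<END>'

def toy_cooccurrence(docs, window_size=1):
    """Count how often each pair of tokens appears within window_size positions of each other."""
    # A sorted vocabulary gives every token a stable row/column index.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for pos, center in enumerate(doc):
            # Clip the window at the document boundaries.
            lo = max(0, pos - window_size)
            hi = min(len(doc), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:  # a token is not its own context
                    matrix[index[center]][index[doc[ctx_pos]]] += 1
    return matrix, index

docs = [
    [START_TOKEN] + "all that glitters is not gold".split() + [END_TOKEN],
    [START_TOKEN] + "all is well that ends well".split() + [END_TOKEN],
]
M, idx = toy_cooccurrence(docs)
print(M[idx['all']][idx[START_TOKEN]])  # prints 2
```

Because every window is scanned from both of its ends, the resulting matrix comes out symmetric without any extra step.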

2. Read corpus and calculate co-occurrence matrices

2-1 read_corpus

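Here we use the Reuters (business and financial news) corpus. The corpus consists of 10,788 news documents totaling 1.3 million words. The documents span 90 categories and are split into train and test. For details, see https://www.nltk.org/book/ch02.html. We provide a read_corpus function below that pulls out only the articles in the "gold" category (news about gold, mining, and so on). The function also adds <START> and <END> tokens to each document and lowercases the words. You do not need to do any other preprocessing.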

def read_corpus(category="gold"):
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]
reuters_corpus = read_corpus()
print(reuters_corpus[1])
"""
['<START>', 'belgium', 'to', 'issue', 'gold', 'warrants', ',', 'sources', 'say', 'belgium', 'plans',
 'to', 'issue', 'swiss', 'franc', 'warrants', 'to', 'buy', 'gold', ',', 'with', 'credit', 'suisse',
 'as', 'lead', 'manager', ',', 'market', 'sources', 'said', '.', 'no', 'confirmation', 'or',
 'further', 'details', 'were', 'immediately', 'available', '.', '<END>']
"""

2-2 vocabulary

Iterate over the corpus reuters_corpus, collect every distinct word, sort the words, and return the vocabulary together with its length.

def distinct_words(corpus):
    corpus_words = []
    n_corpus_words = -1
    for doc in corpus:
        corpus_words += doc
    corpus_words = list(set(corpus_words))
    corpus_words.sort()
    n_corpus_words = len(corpus_words)
    return corpus_wor