Key points on word vectors:
1. Efficient Estimation of Word Representations in Vector Space
- vector("King") - vector("Man") + vector("Woman") is close to vector("Queen") (see the training sketch after this list).
- Early methods for building word vectors include LSA and LDA.
- The NNLM is computationally expensive because of its hidden and projection layers, so this paper proposes models without a hidden layer (possibly less precise than a full neural network, but much more efficient); the RNNLM has no projection layer.
- In CBOW, the weight matrix between the input and projection layers is shared across all words, so every word is projected to the same position (their vectors are averaged).
- X = vector("biggest") - vector("big") + vector("small"); France is to Paris as Germany is to Berlin.
- Useful for NLP tasks such as machine translation, information retrieval, question answering, sentiment analysis, and paraphrase detection.
- Which scenarios suit CBOW and Skip-gram, respectively?
Skip-gram performs better on semantic questions, CBOW on syntactic questions.
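A minimal training-and-analogy sketch, assuming gensim (4.x API) is installed; the toy corpus and all hyperparameter values are illustrative, not from the paper:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real run needs a large corpus for meaningful vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks"],
    ["the", "woman", "walks"],
]

# sg=1 selects Skip-gram (better on semantic questions per the paper);
# sg=0 would select CBOW (better on syntactic questions).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Analogy arithmetic: vector("king") - vector("man") + vector("woman").
# On a large corpus this tends to rank "queen" highly; on this toy data
# the output is only a demonstration of the query mechanics.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```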
2. Distributed Representations of Words and Phrases and their Compositionality
- Adopts Noise Contrastive Estimation (NCE) and proposes negative sampling (NEG), both simpler than hierarchical softmax (HS). NEG is a simplification of NCE; the key difference: NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples.
- Improves vector quality and training speed (subsampling of frequent words during training).
- Supports phrase representations (treat the phrases as individual tokens during training): vec("Russia") + vec("river") is close to vec("Volga River"), and vec("Germany") + vec("capital") is close to vec("Berlin") (see the gensim configuration sketch after this list).
- Skip-gram maximizes the average log probability, where p(w_{t+j} | w_t) would require a (full) softmax that is too expensive to compute; hierarchical softmax reduces the cost from evaluating W output nodes to about log2(W), via a binary tree representation, random walk, and a Huffman tree.
- Subsampling: the more frequent a word, the higher its probability of being discarded (not sampled), since words like "a" and "the" carry little information; this speeds up training and also yields more accurate word vectors (see the discard-probability sketch after this list).
- On the analogical reasoning task, NEG is more accurate than HS, and even slightly better than NCE.
- The non-linear RNN beats the linear Skip-gram on the analogical reasoning task (when the training set is larger).
- All of the above applies to CBOW as well; because the models are simple and computationally efficient, they scale to larger datasets and therefore perform better; different problems have different optimal hyperparameters.
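The subsampling rule can be stated exactly: the paper discards each word w_i with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the word's relative frequency and t is a chosen threshold (around 1e-5 in the paper). A minimal Python sketch:

```python
import math

def discard_prob(freq: float, t: float = 1e-5) -> float:
    """Probability of discarding a word during subsampling.

    freq: the word's relative frequency in the corpus (count / total tokens).
    t: threshold; the paper reports values around 1e-5 work well.
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Frequent words are dropped often, rare words almost never:
print(discard_prob(0.05))   # very frequent word (e.g. "the") -> ~0.986
print(discard_prob(1e-6))   # rare word -> 0.0
```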
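And a hedged configuration sketch showing the NEG/HS switch, the subsampling threshold, and phrase tokens, using gensim's Phrases and Word2Vec (4.x API assumed; corpus and parameter values are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Toy tokenized corpus; on real data, frequent collocations such as
# "new york" would be merged into single tokens like "new_york".
sentences = [
    ["new", "york", "has", "tall", "buildings"],
    ["new", "york", "is", "a", "big", "city"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

bigram = Phrases(sentences, min_count=1, threshold=10.0)
phrased = [bigram[s] for s in sentences]

# negative=5, hs=0 -> negative sampling with 5 noise words per target;
# negative=0, hs=1 would use hierarchical softmax instead.
# sample=1e-5 is the subsampling threshold t for frequent words.
model = Word2Vec(phrased, vector_size=100, window=5, min_count=1,
                 negative=5, hs=0, sample=1e-5)
```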
Is there an established, mature metric for evaluating word-vector quality?
"The quality of these representations is measured in a word similarity task."
"To measure quality of the word vectors, we define a comprehensive test set that contains 5 types of semantic questions, and 9 types of syntactic questions."
(the Semantic-Syntactic Word Relationship test set; see the evaluation sketch below)
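gensim bundles an evaluator for exactly this kind of analogy test set; a sketch reusing the trained model from the earlier Word2Vec example, and assuming Mikolov's published questions-words.txt analogy file has been downloaded (the local path is a placeholder):

```python
# Accuracy on the Semantic-Syntactic Word Relationship test set.
# evaluate_word_analogies returns an overall score plus per-section results.
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
print(f"overall accuracy: {score:.3f}")
for s in sections:
    print(s["section"], len(s["correct"]), len(s["incorrect"]))
```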
3. Distributed Representations of Sentences and Documents
- The goal and focus is training the Paragraph Vector (a feature extractor); machine learning needs fixed-length feature vectors, and compared with bag-of-words it has these advantages: it preserves word order and word semantics.
- The text can be of any length, ranging from sentences to documents.
- Useful for text representations, sentiment analysis tasks, and text classification tasks.
- Two model variants: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW); see the Doc2Vec sketch below.
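A minimal sketch of both variants with gensim's Doc2Vec (4.x API assumed; the documents and hyperparameters are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag; the learned tag vector is the Paragraph Vector.
docs = [
    TaggedDocument(words=["this", "movie", "was", "great"], tags=["doc0"]),
    TaggedDocument(words=["this", "movie", "was", "awful"], tags=["doc1"]),
]

# dm=1 -> PV-DM (Distributed Memory); dm=0 -> PV-DBOW.
pv_dm = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)

# Infer a fixed-length vector for unseen text (the feature-extractor use case).
vec = pv_dm.infer_vector(["a", "great", "movie"])
print(vec.shape)  # (50,)
```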