Key points on word vectors:
1. Efficient Estimation of Word Representations in Vector Space
- vector("King") - vector("Man") + vector("Woman") is close to vector("Queen") (see the training sketch after this list).
- Early methods for building word vectors include LSA and LDA.
- The NNLM is computationally expensive because of its hidden and projection layers, so this paper proposes models without a hidden layer (possibly less precise than a full neural network, but much more efficient); the RNNLM has no projection layer.
- In CBOW, the weight matrix between the input and projection layers is shared across all words, so every word is projected to the same position (their vectors are averaged).
- X = vector("biggest") - vector("big") + vector("small"); France is to Paris as Germany is to Berlin.
- Useful for NLP tasks such as machine translation, information retrieval, question answering, sentiment analysis, and paraphrase detection.
- Which scenarios suit CBOW and Skip-gram, respectively?
Skip-gram performs better on semantic questions, CBOW on syntactic questions.
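A minimal training-and-analogy sketch, assuming gensim (4.x API) is installed; the toy corpus and all hyperparameter values are illustrative, not from the paper:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real run needs a large corpus for meaningful vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks"],
    ["the", "woman", "walks"],
]

# sg=1 selects Skip-gram (better on semantic questions per the paper);
# sg=0 would select CBOW (better on syntactic questions).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

# Analogy arithmetic: vector("king") - vector("man") + vector("woman").
# On a large corpus this tends to rank "queen" highly; on this toy data
# the output is only a demonstration of the query mechanics.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```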
2. Distributed Representations of Words and Phrases and their Compositionality
- Adopts Noise Contrastive Estimation (NCE) and proposes negative sampling (NEG), both simpler than hierarchical softmax (HS). NEG is a simplification of NCE; the key difference: NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples.
- Improves vector quality and training speed (subsampling of frequent words during training).
- Supports phrase representations (treat the phrases as individual tokens during training): vec("Russia") + vec("river") is close to vec("Volga River"), and vec("Germany") + vec("capital") is close to vec("Berlin") (see the gensim configuration sketch after this list).
- Skip-gram maximizes the average log probability, where p(w_{t+j} | w_t) would require a (full) softmax that is too expensive to compute; hierarchical softmax reduces the cost from evaluating W output nodes to about log2(W), via a binary tree representation, random walk, and a Huffman tree.
- Subsampling: the more frequent a word, the higher its probability of being discarded (not sampled), since words like "a" and "the" carry little information; this speeds up training and also yields more accurate word vectors (see the discard-probability sketch after this list).
- On the analogical reasoning task, NEG is more accurate than HS, and even slightly better than NCE.
- The non-linear RNN beats the linear Skip-gram on the analogical reasoning task (when the training set is larger).
- All of the above applies to CBOW as well; because the models are simple and computationally efficient, they scale to larger datasets and therefore perform better; different problems have different optimal hyperparameters.
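The subsampling rule can be stated exactly: the paper discards each word w_i with probability P(w_i) = 1 - sqrt(t / f(w_i)), where f(w_i) is the word's relative frequency and t is a chosen threshold (around 1e-5 in the paper). A minimal Python sketch:

```python
import math

def discard_prob(freq: float, t: float = 1e-5) -> float:
    """Probability of discarding a word during subsampling.

    freq: the word's relative frequency in the corpus (count / total tokens).
    t: threshold; the paper reports values around 1e-5 work well.
    """
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Frequent words are dropped often, rare words almost never:
print(discard_prob(0.05))   # very frequent word (e.g. "the") -> ~0.986
print(discard_prob(1e-6))   # rare word -> 0.0
```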
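And a hedged configuration sketch showing the NEG/HS switch, the subsampling threshold, and phrase tokens, using gensim's Phrases and Word2Vec (4.x API assumed; corpus and parameter values are illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Toy tokenized corpus; on real data, frequent collocations such as
# "new york" would be merged into single tokens like "new_york".
sentences = [
    ["new", "york", "has", "tall", "buildings"],
    ["new", "york", "is", "a", "big", "city"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

bigram = Phrases(sentences, min_count=1, threshold=10.0)
phrased = [bigram[s] for s in sentences]

# negative=5, hs=0 -> negative sampling with 5 noise words per target;
# negative=0, hs=1 would use hierarchical softmax instead.
# sample=1e-5 is the subsampling threshold t for frequent words.
model = Word2Vec(phrased, vector_size=100, window=5, min_count=1,
                 negative=5, hs=0, sample=1e-5)
```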
Is there an established, mature metric for evaluating word-vector quality?
"The quality of these representations is measured in a word similarity task."
"To measure quality of the word vectors, we define a comprehensive test set that contains 5 types of semantic questions, and 9 types of syntactic questions."
(the Semantic-Syntactic Word Relationship test set; see the evaluation sketch below)
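gensim bundles an evaluator for exactly this kind of analogy test set; a sketch reusing the trained model from the earlier Word2Vec example, and assuming Mikolov's published questions-words.txt analogy file has been downloaded (the local path is a placeholder):

```python
# Accuracy on the Semantic-Syntactic Word Relationship test set.
# evaluate_word_analogies returns an overall score plus per-section results.
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
print(f"overall accuracy: {score:.3f}")
for s in sections:
    print(s["section"], len(s["correct"]), len(s["incorrect"]))
```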
3. Distributed Representations of Sentences and Documents
- The goal and focus is training the Paragraph Vector (a feature extractor); machine learning needs fixed-length feature vectors, and compared with bag-of-words it has these advantages: it preserves word order and word semantics.
- The text can be of any length, ranging from sentences to documents.
- Useful for text representations, sentiment analysis tasks, and text classification tasks.
- Two model variants: the Distributed Memory Model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words version of Paragraph Vector (PV-DBOW); see the Doc2Vec sketch below.
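A minimal sketch of both variants with gensim's Doc2Vec (4.x API assumed; the documents and hyperparameters are illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document gets a tag; the learned tag vector is the Paragraph Vector.
docs = [
    TaggedDocument(words=["this", "movie", "was", "great"], tags=["doc0"]),
    TaggedDocument(words=["this", "movie", "was", "awful"], tags=["doc1"]),
]

# dm=1 -> PV-DM (Distributed Memory); dm=0 -> PV-DBOW.
pv_dm = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)

# Infer a fixed-length vector for unseen text (the feature-extractor use case).
vec = pv_dm.infer_vector(["a", "great", "movie"])
print(vec.shape)  # (50,)
```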