Paper reading summary for From Word Embeddings to Document Distance
Paper Link: http://proceedings.mlr.press/v37/kusnerb15.html
This paper gives a new way of computing the distance between two text documents: the Word Mover’s Distance (WMD). Because WMD can be cast as a well-known transportation problem, it can be computed with existing solvers that are already very efficient. The method has no hyperparameters, is easy to implement, and achieves low classification error rates.
The authors use a word representation called word2vec, an embedding procedure that trains a shallow neural network to maximize the log probability of neighboring words. As a result, two words with similar meanings end up a small distance apart in the embedding space. The WMD between two documents is defined as the minimum cumulative distance that the words of one document need to travel to reach the words of the other document. This metric is an instance of the Earth Mover’s Distance, which is a well-studied transportation problem.
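To make the transportation view concrete, here is a minimal sketch that solves the problem as a linear program with SciPy's general-purpose `linprog` solver. It is not the specialized solver used in the paper. The function name `wmd`, its argument names, and the choice of Euclidean distance between embeddings as the per-word travel cost are assumptions made for illustration; each document is given as normalized word weights plus one embedding row per word.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(weights_a, weights_b, vecs_a, vecs_b):
    """Illustrative WMD: minimize sum_ij T[i, j] * cost[i, j]
    subject to sum_j T[i, j] = weights_a[i], sum_i T[i, j] = weights_b[j], T >= 0.
    weights_* are normalized word frequencies (each sums to 1),
    vecs_* hold one embedding row per word (hypothetical inputs)."""
    n, m = len(weights_a), len(weights_b)
    # cost[i, j] = Euclidean distance between word i of A and word j of B
    cost = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    # T is flattened row-major, so entry (i, j) sits at index i * m + j.
    # Row constraints: all mass of word i in A must leave it.
    rows = np.kron(np.eye(n), np.ones((1, m)))
    # Column constraints: word j in B must receive exactly its own mass.
    cols = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        cost.ravel(),
        A_eq=np.vstack([rows, cols]),
        b_eq=np.concatenate([weights_a, weights_b]),
        bounds=(0, None),
        method="highs",
    )
    return float(result.fun)

# Toy 2-D "embeddings" (made up for this example, not real word2vec vectors).
doc_a = np.array([[0.0, 0.0], [1.0, 0.0]])
doc_b = np.array([[0.0, 1.0], [1.0, 1.0]])
half = np.array([0.5, 0.5])
print(wmd(half, half, doc_a, doc_b))
```

In this toy case every word in the first document has a counterpart exactly one unit away, so the cheapest plan moves each word's mass straight across. A general LP solver like this is fine for small documents; for large vocabularies a dedicated Earth Mover's Distance solver would be the more practical choice.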
One traditional way of representing documents is Bag of Words (BOW), but it fails to capture the similarity between two sentences that have no words in common. One example the authors give is “Obama speaks to the media in Illinois” and “The President greets the press in Chicago”. These two sentences have essentially the same meaning but do not share a single content word. WMD is able to capture this closeness.
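The failure of BOW on this example is easy to reproduce. The sketch below builds word-count vectors for the two sentences and compares them with cosine similarity. The helper names `bow` and `cosine` and the three-word stop list are assumptions chosen for this illustration, not part of the paper.

```python
import math
from collections import Counter

# Tiny illustrative stop list (an assumption; real systems use longer lists).
STOP_WORDS = {"the", "to", "in"}

def bow(sentence):
    """Count the non-stop words of a sentence, ignoring case."""
    return Counter(w for w in sentence.lower().split() if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = bow("Obama speaks to the media in Illinois")
s2 = bow("The President greets the press in Chicago")
print(cosine(s1, s2))  # 0.0: no shared content words, so BOW sees no similarity
```

Because the two count vectors have no overlapping dimensions, any similarity measure built on BOW alone treats the sentences as unrelated, which is exactly the gap that a distance defined in embedding space is meant to close.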

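The WMD objective also has lower bounds that can be calculated quickly. This is very useful for early termination and pruning: when searching for the k-nearest neighbors, a candidate document whose lower bound already exceeds the distance to the current k-th nearest neighbor can be skipped without solving the full problem.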
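The paper describes two such bounds: the Word Centroid Distance (WCD), which compares the weighted mean embeddings of the two documents, and the Relaxed WMD (RWMD), which drops one of the two flow constraints so that every word simply sends all of its mass to its nearest word in the other document. The following is a minimal sketch of both, assuming the same hypothetical inputs as a direct WMD computation would take (normalized word weights and one embedding row per word); the function and argument names are illustrative.

```python
import numpy as np

def wcd(weights_a, weights_b, vecs_a, vecs_b):
    """Word Centroid Distance: distance between weighted mean embeddings."""
    return float(np.linalg.norm(weights_a @ vecs_a - weights_b @ vecs_b))

def rwmd(weights_a, weights_b, vecs_a, vecs_b):
    """Relaxed WMD: keep only one flow constraint at a time, so each word
    moves all of its mass to the closest word in the other document.
    Both relaxations are lower bounds, so the larger one is returned."""
    cost = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    a_to_b = weights_a @ cost.min(axis=1)  # only "outgoing mass" constraint kept
    b_to_a = weights_b @ cost.min(axis=0)  # only "incoming mass" constraint kept
    return float(max(a_to_b, b_to_a))

# Toy 2-D "embeddings" (made up for this example).
vecs_a = np.array([[0.0, 0.0]])
vecs_b = np.array([[3.0, 4.0], [-3.0, -4.0]])
w_a = np.array([1.0])
w_b = np.array([0.5, 0.5])
print(wcd(w_a, w_b, vecs_a, vecs_b), rwmd(w_a, w_b, vecs_a, vecs_b))
```

In this toy case the centroids coincide, so WCD is a very loose bound, while RWMD matches the true transport cost. That trade-off is typical: WCD is the cheaper of the two and useful for a first ordering of candidates, and RWMD is tighter and better suited to deciding whether a candidate can be pruned.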
The authors then use WMD for classification on 8 document data sets and compare the results with 7 baselines. On average, WMD has the lowest error. They also test different word embedding mechanisms and find that word2vec models show better performance.
This new metric gives more accurate results when searching for similar text documents. It achieves this by capturing the meaning behind each sentence rather than simply comparing the words or the frequency of words in the documents.
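Summary: This post introduces a new method for computing document distance, Word Mover's Distance (WMD). The method uses word2vec word embeddings and effectively captures semantic similarity between documents, especially when they are close in meaning but use different vocabulary. Experiments show that WMD achieves low error rates across a variety of document classification tasks.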