[NLP] Paper Notes: A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS

CristianoJason

Licensed under CC 4.0 BY-SA

Category: NLP. Tags: nlp

Article link: https://blog.youkuaiyun.com/CristianoJason/article/details/77995306

A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS

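I read this paper as one of the recommended readings while studying Stanford NLP&DL (cs224n). The slides mention that sentence embeddings help with tasks such as sentiment analysis that depend on, or can be treated as, sentence classification, so I picked this paper.
I knew very little about sentence embeddings beforehand; this is my first contact with the topic. If I had to compute a representation for a sentence, I would start from

sentence length

the words that make up the sentence

syntactic structure, i.e. the sentence's parse tree

and similar aspects, for example by concatenating word embeddings and syntax embeddings and feeding them into a neural network to train a sentence embedding.
After reading the model proposed in the paper, I see that it does use the words, as I expected, but the authors stress that the model depends only weakly on word order. That seems to suggest that sentence structure is not very important, a point I have not fully figured out yet.
In short, the paper designs an unsupervised sentence embedding method: it weights each word in a sentence and combines the weighted word vectors into a vector for the whole sentence. The resulting model performs well on textual similarity, textual entailment, and text classification tasks. Much of the paper is devoted to the proposed weighting scheme, which the notes below describe.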

Abstract

Characteristics

Unsupervised; weights the words; bag-of-words model; independent of word order.
Overview

Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD.
theoretical explanation

using a latent variable generative model for sentences, an extension of Arora et al. TACL’16.

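??? Question:
The model lets words appear even when they do not fit the context, and it gives very frequent words a high probability in every context (allow for words occurring out of context, as well as high probabilities for words like "and", "not" in all contexts).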
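
To make this smoothing concrete, here is how I read the paper's generative model. Given the discourse vector $c_s$ of a sentence $s$, a word $w$ with vector $v_w$ is emitted with probability

$$
\Pr[w \mid c_s] = \alpha\, p(w) + (1-\alpha)\,\frac{\exp(\langle \tilde{c}_s, v_w \rangle)}{Z_{\tilde{c}_s}}, \qquad \tilde{c}_s = \beta\, c_0 + (1-\beta)\, c_s, \quad c_0 \perp c_s
$$

where $p(w)$ is the unigram frequency of $w$, $Z_{\tilde{c}_s}$ is the normalizer, and $\alpha$, $\beta$ are scalar hyperparameters. The $\alpha\, p(w)$ term lets any word appear regardless of $c_s$, which is the "out of context" part. The common discourse vector $c_0$ raises the probability of words aligned with it (typically function words) in every sentence. Approximately maximizing the likelihood of $\tilde{c}_s$ under this model yields a weighted average of the $v_w$ with weights $a / (p(w) + a)$, where $a = (1-\alpha)/(\alpha Z)$. This is my summary of the model, not a full derivation.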

Introduction

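Origin

Earlier work modified standard word embeddings using PPDB and built a word averaging model on top of them to produce sentence embeddings. This works fairly well, but it depends on the modification step: directly averaging unmodified word vectors generally performs poorly.
Algorithm
1. Compute the weighted average of the word vectors in each sentence.
2. Common component removal: remove the projections of the average vectors on their first principal component.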
Weight of word $w$ (SIF, smooth inverse frequency):

$$\mathrm{weight}(w) = \frac{a}{a + p(w)}$$

where $a$ is a parameter and $p(w)$ is the word frequency of $w$.

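Advantages:
- $a$ can be tuned so that $weight(w)$ gives the best results;
- (shown experimentally) $p(w)$ estimated from corpora of different domains gives comparable results, so the weighting is not sensitive to the corpus used.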
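
The algorithm described above (weighted average, then common component removal) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' released code. The function names `sif_weights` and `sif_embeddings`, the toy vocabulary, the random word vectors, the made-up frequencies, and the default `a=1e-3` are all assumptions for illustration. I also assume the "first principal component" can be taken as the top right singular vector of the uncentered sentence matrix.

```python
import numpy as np


def sif_weights(word_prob, a=1e-3):
    """Return weight(w) = a / (a + p(w)) for every word with a known frequency."""
    return {w: a / (a + p) for w, p in word_prob.items()}


def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """Weighted average of word vectors, then remove the common component.

    sentences: list of token lists
    word_vecs: dict word -> 1-D array (hypothetical pretrained vectors)
    word_prob: dict word -> estimated unigram probability p(w)
    """
    weights = sif_weights(word_prob, a)
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))

    # Step 1: weighted average over words that have both a vector and p(w).
    for i, sent in enumerate(sentences):
        known = [w for w in sent if w in word_vecs and w in weights]
        if known:
            total = sum(weights[w] * np.asarray(word_vecs[w]) for w in known)
            emb[i] = total / len(known)

    # Step 2: subtract each row's projection onto the first principal
    # direction (top right singular vector of the uncentered matrix).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)


# Toy usage with made-up vectors and frequencies (illustration only).
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "dog", "ran"]
word_vecs = {w: rng.normal(size=5) for w in vocab}
word_prob = {"the": 0.05, "cat": 1e-4, "sat": 1e-4, "dog": 1e-4, "ran": 1e-4}
sentences = [
    ["the", "cat", "sat"],
    ["sat", "cat", "the"],  # same words as above, reversed order
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
embeddings = sif_embeddings(sentences, word_vecs, word_prob)
print(embeddings.shape)
```

With these toy frequencies the very common word "the" gets a much smaller weight than the rare words, so it contributes little to the average. The first two toy sentences contain the same words in reversed order and therefore receive the same embedding, which matches the bag-of-words character of the method.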

word embedding

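Approximate inference is performed on the latent variables of the Random Walk model.
- Random Walk is a generative model that produces the missing words in a text.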
Phrase/Sentence/Paragraph embeddings

Paraphrastic sentence embeddings are computed from word embeddings, and the word embeddings are updated based on paraphrase pairs; both the initialization and the training process are supervised.