[ACL 2014] Chen Li and Yang Liu. Improving Text Normalization via Unsupervised Model and Discriminative Reranking
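This paper is about normalizing informal text (changing non-standard words to standard ones) and proposes two methods to improve normalization performance. The first is an unsupervised method, Unsupervised Corpus-based Similarity for Normalization, which uses word vector representations to compute the semantic similarity between a non-standard token and the proper words in the vocabulary, then sorts the candidates in descending order of similarity to form a list of word pairs. The second is a reranking strategy, Reranking for System Combination, which combines the results from different systems. In the experiments, both word-level and sentence-level optimization strategies are used.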
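To make the ranking step of the first method concrete, here is a minimal sketch. It is not the authors' implementation: the toy vectors, the vocabulary, the function names (`cosine`, `rank_candidates`), and the choice of cosine similarity as the similarity measure are all assumptions made for illustration. In the paper, the vectors would be learned from a large Twitter corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(token, vectors, vocabulary, top_n=3):
    """Return (proper_word, similarity) pairs for a non-standard token,
    sorted by similarity in descending order."""
    token_vec = vectors[token]
    scored = [
        (word, cosine(token_vec, vectors[word]))
        for word in vocabulary
        if word in vectors and word != token
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical 3-dimensional vectors, made up for this example only.
VECTORS = {
    "2morrow": [0.90, 0.10, 0.00],
    "tomorrow": [0.88, 0.12, 0.02],
    "today": [0.60, 0.40, 0.10],
    "banana": [0.00, 0.10, 0.95],
}
VOCABULARY = ["tomorrow", "today", "banana"]

ranked = rank_candidates("2morrow", VECTORS, VOCABULARY)
for word, score in ranked:
    print(f"2morrow -> {word}: {score:.3f}")
```

With these toy vectors, "tomorrow" ranks first for "2morrow", followed by "today" and then "banana". Each (non-standard token, proper word) pair in the sorted output corresponds to one entry of the word-pair list described above.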
Method 1:
The experimental data sets are as follows:
The following data sets are used in our experiments.
We use Data 1 and Data 2 as test data, and
Data 3 as training data for all the supervised models.
• Data 1: 558 pairs of non-standard tokens and
standard words collected from 549 tweets in
2010 by (Han and Baldwin, 2011).
• Data 2: 3,962 pairs of non-standard tokens
and standard words collected from 6,160
tweets between 2009 and 2010 by (Liu et al.,
2011).
• Data 3: 2,333 unique pairs of non-standard
tokens and standard words, collected from
2,577 Twitter messages (selected from the
Edinburgh Twitter corpus) used in (Pennell
and Liu, 2011). We made some changes on
this data, removing the pairs that have more
than one proper words, and sentences that
only contain such pairs.
• Data 4: About 10 million Twitter messages
selected from the Edinburgh Twitter corpus
mentioned above, consisting of 3 million
unique tokens. This data is used by the unsupervised
method to create the mapping table,
and also for building the word-based language
model needed in sentence level normalization.
Comparison of results: