Paper reading notes - Topic - Normalization of non-standard words

[ACL 2014] Chen Li and Yang Liu. Improving Text Normalization via Unsupervised Model and Discriminative Reranking

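This paper is about normalizing informal text (changing non-standard words to standard ones) and proposes two methods to improve normalization performance. The first is an unsupervised method, Unsupervised Corpus-based Similarity for Normalization: using word vector representations, it computes the semantic similarity between each non-standard token and the proper words in the vocabulary, then sorts the resulting word pairs by similarity in descending order to form a list. The second is a reranking strategy, Reranking for System Combination, which combines the results from different systems. The experiments use both word-level and sentence-level optimization strategies.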

Method 1:
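
To make the idea behind the first method concrete, here is a minimal sketch of ranking vocabulary words by their similarity to a non-standard token. It assumes cosine similarity over word vectors; the notes above do not say which similarity measure or embedding model the paper uses, and the names `cosine`, `rank_candidates`, `TOY_VECTORS` and `VOCABULARY`, as well as the 3-dimensional vectors, are invented for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(token, vectors, vocabulary, top_k=3):
    """Return up to top_k (word, similarity) pairs, most similar first."""
    if token not in vectors:
        return []
    scored = [
        (word, cosine(vectors[token], vectors[word]))
        for word in vocabulary
        if word in vectors and word != token
    ]
    # Descending order, as in the ranked word-pair list described above.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real vectors would be learned from a large corpus.
TOY_VECTORS = {
    "tmrw": [0.9, 0.1, 0.0],
    "tomorrow": [1.0, 0.0, 0.0],
    "today": [0.6, 0.6, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
VOCABULARY = ["tomorrow", "today", "banana"]

if __name__ == "__main__":
    for word, score in rank_candidates("tmrw", TOY_VECTORS, VOCABULARY):
        print(f"{word}\t{score:.3f}")
```

With these toy vectors, "tomorrow" ranks first for "tmrw". Running this over every non-standard token in a corpus would give a mapping table of (non-standard token, proper word) pairs.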


The experimental data sets are as follows:

The following data sets are used in our experiments. We use Data 1 and Data 2 as test data, and Data 3 as training data for all the supervised models.
• Data 1: 558 pairs of non-standard tokens and standard words collected from 549 tweets in 2010 by (Han and Baldwin, 2011).
• Data 2: 3,962 pairs of non-standard tokens and standard words collected from 6,160 tweets between 2009 and 2010 by (Liu et al., 2011).
• Data 3: 2,333 unique pairs of non-standard tokens and standard words, collected from 2,577 Twitter messages (selected from the Edinburgh Twitter corpus) used in (Pennell and Liu, 2011). We made some changes on this data, removing the pairs that have more than one proper words, and sentences that only contain such pairs.
• Data 4: About 10 million twitter messages selected from the Edinburgh Twitter corpus mentioned above, consisting of 3 million unique tokens. This data is used by the unsupervised method to create the mapping table, and also for building the word-based language model needed in sentence level normalization.
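
As an illustration of how a word-based language model can support sentence-level normalization, the sketch below trains an add-one-smoothed bigram model on a tiny corpus and picks the candidate sentence with the highest log-probability. This is only an assumption about the general technique: the notes do not describe the paper's language model order, smoothing, or decoding procedure, and `TOY_CORPUS`, `TOY_CANDIDATES` and all function names are hypothetical.

```python
import itertools
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def log_prob(sentence_words, unigrams, bigrams):
    """Log-probability of a word sequence under add-one smoothing."""
    vocab_size = len(unigrams)
    words = ["<s>"] + list(sentence_words) + ["</s>"]
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log(
            (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        )
    return total


def best_normalization(tokens, candidates, unigrams, bigrams):
    """Try every combination of per-token candidates; keep the most probable.

    Tokens without an entry in `candidates` are kept unchanged.
    """
    options = [candidates.get(token, [token]) for token in tokens]
    return max(
        itertools.product(*options),
        key=lambda words: log_prob(words, unigrams, bigrams),
    )


# Toy data (assumption): stands in for a large tweet corpus and a mapping table.
TOY_CORPUS = [
    "see you tomorrow",
    "see you today",
    "you are great",
    "see you tomorrow morning",
]
TOY_CANDIDATES = {
    "c": ["see", "sea"],
    "u": ["you"],
    "tmrw": ["tomorrow", "tomato"],
}

if __name__ == "__main__":
    uni, bi = train_bigram_counts(TOY_CORPUS)
    print(" ".join(best_normalization(["c", "u", "tmrw"], TOY_CANDIDATES, uni, bi)))
```

Enumerating every combination is exponential in sentence length, so this only suits short inputs; a dynamic-programming search such as Viterbi would be the usual choice for real data.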

Comparison results:


