Chinese Word Segmentation

fancyerII

License: CC 4.0 BY-SA

Categories: Machine Learning, NLP. Tags: character, features, translation, performance, c, training

Article link: https://blog.youkuaiyun.com/fancyerII/article/details/3884371


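Chinese word segmentation is a necessary and important step for many NLP and IR tasks. What exactly counts as a “word”, however, is still debated. Take two of the segmentation standards from SIGHAN 2005: the annotation guidelines of Peking University's Institute of Computational Linguistics and those of the Penn Chinese Treebank (CTB). They differ in many places. The specifications are at: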

http://sighan.cs.uchicago.edu/bakeoff2005/data/pku_spec.pdf

http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf

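My personal view is that there were no words in the world to begin with; once enough people use something, it becomes a word. For example, the PKU guidelines say that “大哥” (eldest brother) should not be split, while “三哥” (third brother) should be. Likewise, “雷人” (shocking) was not a word two years ago, but by now it probably should be one.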

Exactly which strings are words and which are not is probably a matter for linguists.

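Optimizing Chinese Word Segmentation for Machine Translation Performance points out that different tasks call for different segmentation “granularities”. For IR, shorter words work better, while for ASR, longer words work better. For phrase-based machine translation, the authors propose their own granularity. I don't know much about machine translation and am not very interested in it, so I did not read the paper in detail, but I largely agree with its point.

Take IR, for example search engines. User queries are mostly nouns and verbs, because what users look for are concepts: nouns denote concepts, and verbs express the relations between nouns. (I have not done any survey; this is just a guess.) So for a search engine, segmenting nouns accurately may matter more. Also, segmentation is not especially critical for general web search, because the user's query goes through the same segmenter, so even if the text is segmented wrongly, the document is still likely to be retrieved. This may be why Google and Baidu do not put much effort into it.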

Enough rambling; let's look at the papers.

1, A CRF Chinese Word Segmenter for Sighan bakeoff 2005

Features

character identity n-grams, morphological and character reduplication features

a, character identity features: c0, c1, c-1, c-2, c0c1, c-1c0, c-1c1, c-2c-1, c2c0

b, morphological features

c-1+c0 unknown word feature, depending on whether c-1c0 appears in the training corpus as a bigram

c-1, c0, c1 individual character features: whether a character is an individual (single-character) word can be learned from the training corpus, that is, whether this character on its own always forms a word.

c-1 prefix feature, c0 suffix feature. How are the affix tables generated?

To construct a table containing affixes of unknown words, rather than using threshold-filtered affix tables in a separate unknown word model as was done in Gao et al. (2004), we first extracted rare words from a corpus and then collected the first and last characters to construct the prefix and suffix tables.

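My understanding is that strings that occur frequently are probably fixed words, while strings that occur rarely are more likely to be words built from a prefix or suffix. Of course, affixed words can also be frequent in the training corpus. Take 老: 老张 may occur often because many people are surnamed 张, while 老万 occurs rarely. So it is still possible to discover that 老 is a prefix. If 老 only ever appeared together with 张, there would be no need to treat 老 as a prefix.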

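c, reduplication features: c-1c0 (whether c-1 and c0 are the same character, as in 看看) and c-1c1 (whether c-1 and c1 are the same character, as in 看一看)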
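The affix-table construction described under (b) can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function name `build_affix_tables`, the `rare_max_count` threshold and the toy word list are hypothetical, and the notes above do not say how “rare” is actually defined.

```python
from collections import Counter


def build_affix_tables(words, rare_max_count=1):
    """Collect first/last characters of rare multi-character words.

    Assumption: a word is "rare" if it occurs at most `rare_max_count`
    times in the segmented corpus. The threshold is illustrative only.
    """
    counts = Counter(words)
    prefixes, suffixes = set(), set()
    for word, count in counts.items():
        if count <= rare_max_count and len(word) > 1:
            prefixes.add(word[0])   # first character -> prefix table
            suffixes.add(word[-1])  # last character -> suffix table
    return prefixes, suffixes


# Toy segmented corpus: 老张 is frequent, 老万 is rare.
corpus_words = ["老张", "老张", "老张", "老万", "我们", "我们", "看", "看"]
prefixes, suffixes = build_affix_tables(corpus_words)
```

With this toy corpus, 老 ends up in the prefix table only because of the rare word 老万, which matches the intuition in the paragraph above: the frequent word 老张 contributes nothing.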

Most features appeared in the first-order templates with a few of character identity features in the both zero-order and first-order templates. We also did normalization of punctuations due to the fact that Mandarin has a huge variety of punctuations.
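To make the character identity features (a) and the reduplication features (c) concrete, here is a rough sketch of how they might be extracted for one position in a sentence. The function names, the feature-name strings and the `<PAD>` boundary marker are my own assumptions, not something taken from the paper.

```python
PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def char_at(chars, i):
    """Return the character at index i, or PAD if i is out of range."""
    return chars[i] if 0 <= i < len(chars) else PAD


def extract_features(chars, i):
    """Character identity n-grams and reduplication features at position i."""

    def c(k):
        return char_at(chars, i + k)

    def same(a, b):
        # Two padding positions must not count as a reduplication.
        return c(a) == c(b) and c(a) != PAD

    return {
        # (a) character identity unigrams and bigrams
        "c0": c(0), "c1": c(1), "c-1": c(-1), "c-2": c(-2),
        "c0c1": c(0) + c(1),
        "c-1c0": c(-1) + c(0),
        "c-1c1": c(-1) + c(1),
        "c-2c-1": c(-2) + c(-1),
        "c2c0": c(2) + c(0),
        # (c) reduplication: 看看 and 看一看
        "redup_c-1_c0": same(-1, 0),
        "redup_c-1_c1": same(-1, 1),
    }


# Features for 一 in 看一看书: the characters on both sides are identical.
feats = extract_features(list("看一看书"), 1)
```

In a real CRF toolkit these values would typically be turned into indicator features per label (zero-order) or per label pair (first-order); the dictionary here only shows what the observation side looks like.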

2, Optimizing Chinese Word Segmentation for MT Performance

Different NLP applications have different needs for segmentation. Chinese information retrieval (IR) systems benefit from a segmentation that breaks compound words into shorter “words” (Peng et al., 2002), paralleling the IR gains from compound splitting in languages like German (Hollink et al., 2004), whereas automatic speech recognition (ASR) systems prefer having longer words in the speech lexicon (Gao et al., 2005).

In this paper, we investigated what segmentation properties can improve machine translation performance. First, we found that neither character-based nor a standard word segmentation standard are optimal for MT, and show that an intermediate granularity is much more effective. Using an already competitive CRF segmentation model, we directly optimize segmentation granularity for translation quality, and obtain an improvement of 0.73 BLEU point on MT05 over our lexicon-based segmentation baseline. Second, we augment our CRF model with lexicon and proper noun features in order to improve segmentation consistency, which provide a 0.32 BLEU point improvement.

3, A Dual-layer CRFs based Joint Decoding Method for Cascaded Segmentation and Labeling Tasks

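The idea is simple and practical: segmentation quality directly affects POS tagging, and POS tagging can in turn serve as evidence for segmentation. Even if a particular segmentation has a high probability, it may not be the best result if the probability of its POS tagging is very low. Another way to look at it is that POS tags carry some syntactic information. When a person segments text (think back to adding punctuation to classical Chinese in high school), they try the few most probable segmentations and then try to “understand” each one; if a segmentation cannot be “understood”, it is discarded. What does “understand” mean? That is a rather mystical notion, but here we can simply take it to mean that parsing the result yields a high confidence, that is, a high probability. Full syntactic parsing is too expensive, though, and POS tagging can be seen as carrying some of the syntactic structure information.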

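Of course, others have tried to model segmentation and tagging jointly within CRFs, for example DCRFs, but the drawback is that joint decoding is too expensive. There is also Semi-CRFs; that paper applies them to NER, and they are reportedly not much slower than ordinary CRFs, with a running time that is a constant multiple of that of ordinary CRFs. I don't know how well they would work for a POS tagger; maybe I will try it when I have time.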

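The approach in this paper is simple: decoding looks for the optimum of the joint probability

<S0,T0> = argmax P(S,T|C)

where C = c1c2...cn denotes the n characters of the sentence; S = s1s2...sn, with each si in {B, I}, indicates whether a character begins a word or continues one; and T = t1t2...tm (m <= n) is the POS tagging of the resulting “words”.

This probability is then decomposed as P(S,T|C) = P(T|S,C) P(S|C)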
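The B/I encoding of S used in the formulas above can be illustrated with a small helper that converts between tag sequences and word lists. This is just a sketch; the function names are my own and are not from the paper.

```python
def tags_to_words(chars, tags):
    """Turn per-character B/I tags into a list of words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # B starts a new word
        else:
            words[-1] += ch    # I continues the current word
    return words


def words_to_tags(words):
    """Inverse mapping: one B per word, followed by I for the rest."""
    tags = []
    for word in words:
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


words = tags_to_words(list("我们看书"), ["B", "I", "B", "B"])
tags = words_to_tags(words)
```

Here n = 4 characters yield m = 3 words, which is why T has at most as many elements as S.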

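Because a sentence has a great many possible segmentations S, finding the exact optimum is expensive. The authors therefore use an approximation: they take the N-best segmentations with the highest probability and restrict S to these N candidates, which makes the search space much smaller.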
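The N-best approximation amounts to reranking: for each candidate S, add log P(S|C) to log P(T|S,C) for its best tagging and keep the maximum. Below is a minimal sketch of that step. The function `joint_decode`, the toy tagger table and all the probabilities are made up for illustration; in the paper both scores would come from trained CRFs.

```python
import math


def joint_decode(nbest, tag_scorer):
    """Pick the (S, T) pair maximizing log P(S|C) + log P(T|S,C).

    nbest      -- list of (words, log P(S|C)) from the segmenter
    tag_scorer -- function: words -> (best tags, log P(T|S,C))
    """
    best = None
    for words, seg_logp in nbest:
        tags, tag_logp = tag_scorer(words)
        joint_logp = seg_logp + tag_logp
        if best is None or joint_logp > best[2]:
            best = (words, tags, joint_logp)
    return best


# Hypothetical stand-in for the POS-tagging CRF (numbers are invented).
TOY_TAGGER = {
    ("研究", "生命", "起源"): (["v", "n", "n"], math.log(0.5)),
    ("研究生", "命", "起源"): (["n", "n", "n"], math.log(0.1)),
}


def toy_tag_scorer(words):
    return TOY_TAGGER[tuple(words)]


# 2-best list for 研究生命起源; the segmenter alone prefers the wrong one.
nbest = [
    (["研究生", "命", "起源"], math.log(0.55)),
    (["研究", "生命", "起源"], math.log(0.45)),
]
words, tags, joint_logp = joint_decode(nbest, toy_tag_scorer)
```

In this toy example the segmentation with the lower P(S|C) wins, because its tagging is far more plausible (0.45 × 0.5 against 0.55 × 0.1). That is exactly the effect the joint model is after.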

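Training is likewise done in two steps: first train P(S|C), then train P(T|S,C). Both are modeled with CRFs.

What I find interesting about this paper is the features it uses.

Features of the segmentation model:

character features cn, n=[-2,2];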

cncn+1 n=[-2,1];

c-1c1