Splitting English Text into Sentences with NLTK Without Breaking on Abbreviation Periods

Licensed under CC 4.0 BY-SA

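When splitting English text into sentences, NLTK can cut in the wrong place because of the periods in abbreviations such as 'i.e.'. To fix this, use nltk.tokenize.punkt with a custom abbreviation list, making sure each abbreviation is listed without its trailing '.', so that sentences are split correctly.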

To get sentences out of an English corpus, we can split the text with regular expressions or with NLTK. For example, with NLTK:

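from nltk.tokenize import sent_tokenize

# Requires the Punkt data, e.g. a one-time nltk.download('punkt')
# ('punkt_tab' on recent NLTK releases).
document = ''  # put the English text to split here
sentences = sent_tokenize(document)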
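
For comparison, the regular-expression route mentioned above might look roughly like the following. This is only a minimal sketch: the helper name split_sentences_regex and the boundary pattern are assumptions made for illustration, not part of NLTK, and the rule (split after '.', '?' or '!' when followed by whitespace) is deliberately naive.

```python
import re

# Hypothetical helper, not part of NLTK.
# A boundary is any run of whitespace that directly follows '.', '?' or '!'.
_BOUNDARY = re.compile(r'(?<=[.?!])\s+')

def split_sentences_regex(text):
    """Naively split text into sentences at terminator + whitespace."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

sample = 'NLTK is a toolkit. Is it easy to use? Yes!'
print(split_sentences_regex(sample))
# ['NLTK is a toolkit.', 'Is it easy to use?', 'Yes!']
```

Because this rule treats every period followed by a space as a boundary, it has no way of recognizing abbreviations.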

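NLTK treats punctuation such as '.', '?' and '!' as sentence boundaries. When a sentence contains an abbreviation, however, this can produce a wrong split: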

sent_tokenize('fight among communists and anarchists (i.e. at a series of events named May Days).')

Output:
['fight among communists and anarchists (i.e.',
 'at a series of events named May Days).']

The sentence was split right after 'i.e.'. To avoid this, we need to use nltk.tokenize.punkt with a custom abbreviation list:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# List each abbreviation in lowercase and WITHOUT its trailing period.
abbreviation = ['i.e']
punkt_param.abbrev_types = set(abbreviation)
# Build a tokenizer from the custom parameters (no pretrained model is loaded).
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize('fight among communists and anarchists (i.e. at a series of events named May Days).')

Output:
['fight among communists and anarchists (i.e. at