一、Using the built-in tagger {no training required}
See the example code in Section 6.2.2 of 《自然语言处理Python进阶》.
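A minimal sketch of the built-in tagger (the sample sentence is just an illustration, and it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have already been fetched with nltk.download()):
import nltk
tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))                      # Penn Treebank tags, e.g. ('fox', 'NN')
print(nltk.pos_tag(tokens, tagset="universal"))  # coarse universal tags, e.g. ('fox', 'NOUN')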
二、Using the Brill tagger {rule-based and requires training}
Brill is a rule-based (transformation-based) tagger: it starts from a baseline tagger and then learns correction rules during training.
1. Example:
import nltk
import nltk.tag.brill
from nltk.corpus import brown
from nltk.corpus import treebank
# Merge the tagged_sents of several corpora.
treebank_tagged_sents = treebank.tagged_sents(tagset="universal") # categories can also be specified here.
brown_tagged_sents = brown.tagged_sents(tagset="universal") # If categories is not specified, the whole corpus is used; Brown has 57,340 tagged sentences.
all_tagged_sents = treebank_tagged_sents + brown_tagged_sents
# Split the data set.
size = int(len(all_tagged_sents) * 0.9)
print("train_data_size is {}".format(size))
train_sents = all_tagged_sents[:size]
dev_sents = all_tagged_sents[size:]
# Start training.
backoff = nltk.RegexpTagger([ # {Basic seed rules; in practice you would need more than these. Tags follow the universal tagset to match the corpora loaded above.}
    (r'^-?[0-9]+(\.[0-9]+)?$', 'NUM'),  # cardinal numbers
    (r'^(The|the)$', 'DET'),            # determiners (the definite article)
    (r'.*able$', 'ADJ'),                # adjectives
    (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
    (r'.*ly$', 'ADV'),                  # adverbs
    (r'.*s$', 'NOUN'),                  # plural nouns
    (r'.*ing$', 'VERB'),                # gerunds
    (r'.*ed$', 'VERB'),                 # past tense verbs
    (r'.*', 'NOUN')                     # nouns (default)
])
baseline_tagger = nltk.UnigramTagger(train_sents, backoff=backoff)
tt = nltk.tag.brill_trainer.BrillTaggerTrainer(baseline_tagger, nltk.tag.brill.brill24())
brill_tagger = tt.train(train_sents, max_rules=20, min_acc=0.99)
# Evaluate on the held-out data.
print(brill_tagger.evaluate(dev_sents))  # In recent NLTK releases evaluate() is deprecated in favour of accuracy().
# Test on raw sentences.
brown_sents = brown.sents(categories="news") # Returns [["word", "word", ..., "word"], [the next sentence], ...]
print(brown_sents[2007])
print(brill_tagger.tag(brown_sents[2007]))
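To see what Brill training actually produced, the learned transformation rules can be listed; rules() is part of NLTK's BrillTagger API, so this should work on the tagger trained above:
# Print the (at most max_rules=20) transformation rules learned on top of the
# baseline tagger, in the order they are applied.
for rule in brill_tagger.rules():
    print(rule)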
2. Code references:
- The POS tags of the "universal" tagset, and the mapping onto them, are listed here (a small mapping sketch follows this list): https://www.nltk.org/_modules/nltk/tag/mapping.html
- The POS tag set used in this link is older, but it still seems to be the more commonly used one: https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk
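As a small sketch of that mapping (map_tag and the 'en-ptb' source name are taken from the nltk.tag.mapping module linked above; exact import paths may differ between NLTK versions):
from nltk.tag import map_tag

# Map a few Penn Treebank tags onto the coarse universal tagset.
for ptb_tag in ["NN", "NNS", "VBD", "VBG", "JJ", "RB", "DT", "CD"]:
    print(ptb_tag, "->", map_tag("en-ptb", "universal", ptb_tag))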
三、Good blog posts on POS tagging:
https://blog.youkuaiyun.com/lyb3b3b/article/details/75117241
https://github.com/apachecn/nlp-py-2e-zh/blob/2a65e14e495506684017a310961d186cf5d3e818/docs/5.md
四、Recommended book:
《自然语言处理Python进阶》