Mastering Natural Language Processing with Python
Deepti Chopra (India)
Translated by Wang Wei
Chapter 10  Evaluating NLP Systems: Analyzing Performance
10.1 Key Points in Evaluating NLP Systems
Creating a gold-standard annotated corpus is a major undertaking, and an expensive one. It is produced by manually annotating the given test data; the tags obtained in this way are taken as the standard tags and can be used to represent a wide range of information.
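For instance, evaluating a tagger amounts to comparing its output with the gold-standard tags token by token. Here is a minimal sketch (not from the book; the one-sentence gold standard below is invented for illustration):

import nltk

# A hand-annotated one-sentence "gold standard" (illustrative only).
gold = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]

# A deliberately weak tagger: it answers 'NN' for every token.
tagger = nltk.DefaultTagger("NN")
predicted = tagger.tag([word for word, tag in gold])

# Accuracy = fraction of tokens whose predicted tag matches the gold tag.
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(accuracy)   # 1/3: only 'cat' is tagged correctly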
10.1.1 Evaluating NLP Tools (POS Taggers, Stemmers, and Morphological Analyzers)
Training a unigram tagger:
import nltk
from nltk.corpus import brown

# Tagged sentences serve as both training data and gold standard.
sentences = brown.tagged_sents(categories='news')
sent = brown.sents(categories='news')

unigram_tagger = nltk.UnigramTagger(sentences)
print(unigram_tagger.tag(sent[2008]))        # tag one untagged sentence
print(unigram_tagger.evaluate(sentences))    # accuracy against the gold standard
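Note that here the tagger is scored on the very sentences it was trained on, so the result is overly optimistic; the next example therefore holds out part of the corpus for testing.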
Training and testing the unigram tagger on separate data:
import nltk
from nltk.corpus import brown

sentences = brown.tagged_sents(categories='news')
sz = int(len(sentences) * 0.8)    # use 80% of the data for training
print(sz)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]

unigram_tagger = nltk.UnigramTagger(training_sents)
print(unigram_tagger.evaluate(testing_sents))   # accuracy on held-out data
Using a bigram tagger:
import nltk
from nltk.corpus import brown

sentences = brown.tagged_sents(categories='news')
sent = brown.sents(categories='news')
sz = int(len(sentences) * 0.8)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]

bigram_tagger = nltk.BigramTagger(training_sents)
print(bigram_tagger.tag(sent[2008]))   # a sentence seen during training
un_sent = sent[4203]                   # a sentence not seen during training
print(bigram_tagger.tag(un_sent))
print(bigram_tagger.evaluate(testing_sents))
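On held-out data the bigram tagger suffers from sparse data: many bigram contexts never occur in the training set, so the tagger cannot assign a tag and its accuracy drops sharply. Combining taggers with backoff, shown next, mitigates this problem.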
Implementing a combined tagger:
import nltk
from nltk.corpus import brown

sentences = brown.tagged_sents(categories='news')
sz = int(len(sentences) * 0.8)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]

# Chain the taggers: the bigram tagger backs off to the unigram tagger,
# which in turn backs off to a default tagger that always answers 'NNP'.
s0 = nltk.DefaultTagger('NNP')
s1 = nltk.UnigramTagger(training_sents, backoff=s0)
s2 = nltk.BigramTagger(training_sents, backoff=s1)
print(s2.evaluate(testing_sents))
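The chain can be extended with one more level of context. As a minimal sketch (not from the book, continuing the listing above), a trigram tagger is stacked on top of the same backoff chain:

# Continues the listing above: a trigram tagger that backs off to s2.
s3 = nltk.TrigramTagger(training_sents, backoff=s2)
print(s3.evaluate(testing_sents))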
Evaluating a chunk parser:
import nltk

# An empty grammar yields a chunker that never creates any chunks.
chunkparser = nltk.RegexpParser("")
print(nltk.chunk.accuracy(chunkparser, nltk.corpus.conll2000.chunked_sents(
    'train.txt', chunk_types=('NP',))))
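Because the grammar is empty, this parser leaves every token outside any chunk; the score it earns is therefore a baseline that any real chunk grammar should beat.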
Evaluating a naive chunk parser:
import nltk

# A deliberately naive NP grammar.
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(nltk.chunk.accuracy(cp, nltk.corpus.conll2000.chunked_sents(
    'train.txt', chunk_types=('NP',))))
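The tag pattern <[CDJNP].*> matches any tag whose name begins with C, D, J, or N (CD, DT, JJ, NN, and so on); comparing its score against the empty-grammar baseline above shows how much even this crude approximation of noun phrases recovers.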
Computing the conditional frequency distribution of chunked data:
import nltk

def chunked_tags(train):
    """Generate a list of the tags that usually appear inside chunks."""
    cfreqdist = nltk.ConditionalFreqDist()
    for t in train:
        for word, tag, chunktag in nltk.chunk.tree2conlltags(t):
            if chunktag == "O":
                cfreqdist[tag][False] += 1   # token falls outside every chunk
            else:
                cfreqdist[tag][True] += 1    # token falls inside a chunk
    # Keep the tags that occur inside chunks more often than outside.
    return [tag for tag in cfreqdist.conditions() if cfreqdist[tag].max() == True]

training_sents = nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=('NP',))
print(chunked_tags(training_sents))
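The tags returned this way, those found more often inside NP chunks than outside, are natural candidates for the tag pattern of a simple chunk grammar like the ones used above.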
Performing chunker evaluation:
import nltk

# Gold-standard chunk structure, written in bracket notation.
correct = nltk.chunk.tagstr2tree(
    "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
print(correct.flatten())

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)

grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"
chunk_parser = nltk.RegexpParser(grammar)

# The same sentence as a list of tagged tokens.
tagged_tok = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
              ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Score the guessed chunk structure against the gold standard.
chunkscore = nltk.chunk.ChunkScore()
guessed = cp.parse(correct.flatten())
chunkscore.score(correct, guessed)
print(chunkscore)
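Printing a ChunkScore gives a one-line summary; the object also exposes the individual metrics:

print(chunkscore.precision())   # fraction of guessed chunks that are correct
print(chunkscore.recall())      # fraction of gold chunks that were found
print(chunkscore.f_measure())   # harmonic mean of precision and recall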
Evaluating a unigram chunker and a bigram chunker:
import nltk

# Convert each chunked sentence into (POS tag, chunk tag) pairs, so that
# n-gram taggers can learn to predict IOB chunk tags from POS tags.
chunker_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(chtree)]
                for chtree in nltk.corpus.conll2000.chunked_sents(
                    'train.txt', chunk_types=('NP',))]
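The source text breaks off at this point. As a minimal sketch of how such an evaluation typically proceeds (the variable names below are assumptions, following the standard NLTK recipe), unigram and bigram taggers are trained on chunker_data and scored on test data prepared the same way:

# Train n-gram taggers that map POS tags to chunk tags.
unigram_chunker = nltk.UnigramTagger(chunker_data)
bigram_chunker = nltk.BigramTagger(chunker_data, backoff=unigram_chunker)

# Held-out data from the CoNLL-2000 test split, prepared the same way.
test_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(chtree)]
             for chtree in nltk.corpus.conll2000.chunked_sents(
                 'test.txt', chunk_types=('NP',))]

print(unigram_chunker.evaluate(test_data))   # chunk-tag accuracy, unigram
print(bigram_chunker.evaluate(test_data))    # chunk-tag accuracy, bigram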