Reading Notes on *Mastering Natural Language Processing with Python* (Deepti Chopra), Chapter 10: Evaluating NLP Systems

*Mastering Natural Language Processing with Python*

Deepti Chopra (India)
Translated by Wang Wei


Chapter 10 Evaluating NLP Systems: Analyzing Performance


10.1 Key points of evaluating NLP systems

Creating a gold-standard annotated corpus is a major undertaking, and an expensive one. It is done by annotating the given test data by hand; the tags obtained this way are treated as the standard against which system output is compared, and they can represent a wide range of linguistic information.
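The comparison itself boils down to token-level agreement between the system's tags and the gold tags. A minimal sketch of that idea (the tiny sentence and its tags below are made up for illustration, not taken from a real corpus):

```python
# Gold-standard tags come from manual annotation; predicted tags from a tagger.
gold = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"), ("down", "RP")]
predicted = [("the", "DT"), ("cat", "NN"), ("sat", "NN"), ("down", "RP")]

# Accuracy is the fraction of tokens whose predicted tag matches the gold tag.
correct = sum(1 for (w, g), (_, p) in zip(gold, predicted) if g == p)
accuracy = correct / len(gold)
print(accuracy)  # 3 of 4 tags match: 0.75
```

This is exactly the quantity that NLTK's tagger evaluation methods report, computed over a whole corpus instead of one sentence.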

10.1.1 Evaluating NLP tools (POS taggers, stemmers, and morphological analyzers)
Training a unigram tagger:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
sent = brown.sents(categories='news')
unigram_sent = nltk.UnigramTagger(sentences)
print(unigram_sent.tag(sent[2008]))
print(unigram_sent.evaluate(sentences))
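Note that the tagger above is evaluated on the very sentences it was trained on, which inflates the score. A toy illustration of why (the two hand-made training sentences stand in for the Brown corpus):

```python
import nltk

# Tiny hand-made training set, not the Brown corpus.
toy = [[("the", "DT"), ("dog", "NN")], [("a", "DT"), ("cat", "NN")]]
tagger = nltk.UnigramTagger(toy)

# Every test word was seen during training, so accuracy is perfect.
gold = [pair for sent in toy for pair in sent]
predicted = tagger.tag([w for w, _ in gold])
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(accuracy)  # 1.0
```

A held-out test set, as in the next example, avoids this memorization effect.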
Training and testing the unigram tagger on separate data:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
sz = int(len(sentences) * 0.8)
print(sz)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]
unigram_tagger = nltk.UnigramTagger(training_sents)
print(unigram_tagger.evaluate(testing_sents))
Using a bigram tagger:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
sent = brown.sents(categories='news')
sz = int(len(sentences) * 0.8)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]
bigram_tagger = nltk.BigramTagger(training_sents)
print(bigram_tagger.tag(sent[2008]))

un_sent = sent[4203]
print(bigram_tagger.tag(un_sent))

print(bigram_tagger.evaluate(testing_sents))
Implementing a combined tagger with backoff:
import nltk
from nltk.corpus import brown
sentences = brown.tagged_sents(categories='news')
sz = int(len(sentences) * 0.8)
training_sents = sentences[:sz]
testing_sents = sentences[sz:]
s0 = nltk.DefaultTagger('NNP')
s1 = nltk.UnigramTagger(training_sents, backoff=s0)
s2 = nltk.BigramTagger(training_sents, backoff=s1)
print(s2.evaluate(testing_sents))
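Once a combined tagger has been trained, it can be serialized so that later sessions do not have to retrain it. A hedged sketch using `pickle`, with a tiny hand-made training set standing in for the Brown corpus:

```python
import pickle
import nltk

# Tiny hand-made training set, not the Brown corpus.
toy_train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
             [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
t0 = nltk.DefaultTagger("NN")
t1 = nltk.UnigramTagger(toy_train, backoff=t0)

data = pickle.dumps(t1)        # serialize the trained tagger
restored = pickle.loads(data)  # reload it later without retraining
print(restored.tag(["the", "dog", "sleeps"]))
```

In practice the serialized bytes would be written to a file with `pickle.dump` and loaded back with `pickle.load`.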
Evaluating the chunk parser:
import nltk
chunkparser = nltk.RegexpParser("")
print(nltk.chunk.accuracy(chunkparser, nltk.corpus.conll2000.chunked_sents(
	'train.txt', chunk_types=('NP',))))
Evaluating a naive chunk parser:
import nltk
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(nltk.chunk.accuracy(cp, nltk.corpus.conll2000.chunked_sents(
	'train.txt', chunk_types=('NP',))))
Computing a conditional frequency distribution over the chunked data:
import nltk

def chunk_tags(train):
	"""Generate a list of the tags that appear inside chunks."""
	cfreqdist = nltk.ConditionalFreqDist()
	for t in train:
		for word, tag, chunktag in nltk.chunk.tree2conlltags(t):
			if chunktag == "O":
				cfreqdist[tag][False] += 1
			else:
				cfreqdist[tag][True] += 1
	return [tag for tag in cfreqdist.conditions() if cfreqdist[tag].max() == True]

training_sents = nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=('NP',))
print(chunk_tags(training_sents))
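The same idea can be run on a toy example without the CoNLL-2000 download: count, for each POS tag, how often it falls inside versus outside a chunk. The hand-made `(word, tag, chunktag)` triples below stand in for the output of `tree2conlltags`:

```python
import nltk

# Hand-made IOB rows: "O" marks tokens outside any chunk.
conll_rows = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O"),
              ("the", "DT", "B-NP"), ("mat", "NN", "I-NP")]

cfd = nltk.ConditionalFreqDist()
for word, tag, chunktag in conll_rows:
    cfd[tag][chunktag != "O"] += 1  # condition on the POS tag

# Tags whose most frequent outcome is True occur mostly inside chunks.
inside = [tag for tag in cfd.conditions() if cfd[tag].max() is True]
print(sorted(inside))  # ['DT', 'NN']; VBD occurs only outside chunks
```

This mirrors what `chunk_tags` computes over the full CoNLL-2000 training set.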
Performing chunker evaluation:
import nltk
correct = nltk.chunk.tagstr2tree("[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]")
print(correct.flatten())
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
grammar = r"NP: {<PRP|DT|POS|JJ|CD|N.*>+}"
chunk_parser = nltk.RegexpParser(grammar)
tagged_tok = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunkscore = nltk.chunk.ChunkScore()
guessed = cp.parse(correct.flatten())
chunkscore.score(correct, guessed)
print(chunkscore)
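Besides printing the summary, the individual metrics can be read off a `ChunkScore` object through its `precision()`, `recall()`, and `f_measure()` methods. A short sketch using the same bracketed-string helper with a smaller hand-made sentence:

```python
import nltk

# Gold-standard chunks, written in bracketed IOB-style notation.
correct = nltk.chunk.tagstr2tree("[ the/DT cat/NN ] sat/VBD [ the/DT mat/NN ]")
cp = nltk.RegexpParser(r"NP: {<DT><NN>}")
guessed = cp.parse(correct.flatten())

score = nltk.chunk.ChunkScore()
score.score(correct, guessed)
# Here the grammar recovers both NPs exactly, so all metrics are 1.0.
print(score.precision(), score.recall(), score.f_measure())
```

Precision is the fraction of guessed chunks that are correct, recall the fraction of gold chunks that were found, and the F-measure their harmonic mean.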
Evaluating the unigram chunker and the bigram chunker:
import nltk
chunker_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(chtree)]
			for chtree in nltk.corpus.conll2000.chunked_sents(
				'train.txt', chunk_types=('NP',))]