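1. Read in the sentences and record the word counts, the nonterminal-to-word mappings (unary rules), and the nonterminal-to-nonterminal mappings (binary rules).
2. Use the resulting counts file to determine which words are not RARE
1) Replace every word that occurs fewer than 5 times in the original file with _RARE_, matching terminals with patt = '"(\w+)"]':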
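    import re

    patt = r'"(\w+)"]'        # a terminal is a quoted word followed by ]
    l = input.readline()      # input: the open training file, one tree per line
    words = re.findall(patt, l)
    for word in words:
        if word not in self.notRareword:
            # rewrite "word"] as "_RARE_"]
            wordall = '"' + word + '"' + ']'
            l = l.replace(wordall, '"_RARE_"]')
    print(l, end='')          # the line already ends with a newline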
2) Compute the probability of each nonterminal generating the RARE word
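The probability in step 2) can be sketched as follows. This is a minimal illustration that assumes the usual maximum-likelihood estimate q(X -> w) = count(X -> w) / count(X); the notes do not spell out the formula. The function name `emission_prob`, the dictionaries, and the toy counts are all hypothetical.

```python
RARE = "_RARE_"

def emission_prob(unary_counts, nonterminal_counts, not_rare, X, word):
    """Return q(X -> word), mapping any word outside not_rare to _RARE_.

    Assumes the maximum-likelihood estimate count(X -> word) / count(X).
    """
    if word not in not_rare:
        word = RARE
    total = nonterminal_counts.get(X, 0)
    if total == 0:
        return 0.0
    return unary_counts.get((X, word), 0) / total

# Toy counts, made up for illustration only.
unary_counts = {("NOUN", "dog"): 6, ("NOUN", RARE): 2, ("VERB", "barks"): 5}
nonterminal_counts = {"NOUN": 8, "VERB": 5}
not_rare = {"dog", "barks"}

p_dog = emission_prob(unary_counts, nonterminal_counts, not_rare, "NOUN", "dog")
p_unseen = emission_prob(unary_counts, nonterminal_counts, not_rare, "NOUN", "zebra")
```

With these toy counts an unseen word such as "zebra" falls back to the NOUN -> _RARE_ count, so it gets a nonzero probability under NOUN instead of zeroing out every parse.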
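3. Dynamic programming to find the maximum-probability parse
1) Base case: pi(i,i,X) = probability that X generates the i-th word
2) Recursive case: pi(i,j,X) for 1<=i<j<=n, with split point s, i<=s<j
pi(i,j,X) = max over Y1, Y2, s of pi(i,s,Y1)*pi(s+1,j,Y2)*p(X,Y1,Y2), where p(X,Y1,Y2) is the probability of the rule X -> Y1 Y2
bp(i,j,X) = the (X,Y1,Y2,s) that attains this maximum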
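The recurrence above can be sketched as a small standalone function. This is only an illustration under stated assumptions: the name `cky`, the dictionaries `q_unary` and `q_binary`, and the toy grammar are hypothetical and are not taken from the original program.

```python
def cky(words, nonterminals, q_unary, q_binary):
    """Fill pi and bp for a sentence; spans are 1-indexed as in the notes.

    q_unary maps (X, word) to a probability, q_binary maps (X, Y1, Y2) to one.
    """
    n = len(words)
    pi, bp = {}, {}
    # Base case: pi(i, i, X) = probability that X generates the i-th word.
    for i in range(1, n + 1):
        for X in nonterminals:
            pi[(i, i, X)] = q_unary.get((X, words[i - 1]), 0.0)
    # Recursive case: widen the span one word at a time.
    for length in range(1, n):
        for i in range(1, n - length + 1):
            j = i + length
            for X in nonterminals:
                best, best_bp = 0.0, None
                for (A, Y1, Y2), p in q_binary.items():
                    if A != X:
                        continue
                    for s in range(i, j):
                        score = p * pi.get((i, s, Y1), 0.0) * pi.get((s + 1, j, Y2), 0.0)
                        if score > best:
                            best, best_bp = score, (X, Y1, Y2, s)
                pi[(i, j, X)] = best
                if best_bp is not None:
                    bp[(i, j, X)] = best_bp
    return pi, bp

# Toy grammar, made up for illustration only.
words = ["the", "dog", "barks"]
nonterminals = ["S", "NP", "VP", "DT", "NN"]
q_unary = {("DT", "the"): 1.0, ("NN", "dog"): 0.5, ("VP", "barks"): 0.25}
q_binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 0.8}

pi, bp = cky(words, nonterminals, q_unary, q_binary)
```

Scanning every binary rule for every X keeps the sketch short; indexing the rules by their left-hand side would avoid the repeated filtering on larger grammars.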
4. Output
    def print(self, i, j, X, words):
        # Rebuild the tree for span (i, j) rooted at X from the back-pointers.
        if i != j:
            newY1 = self.bp[(i, j, X)][1]
            newY2 = self.bp[(i, j, X)][2]
            news = self.bp[(i, j, X)][3]
            return ('["' + X + '", ' + self.print(i, news, newY1, words) + ', '
                    + self.print(news + 1, j, newY2, words) + ']')
        # Single word: words is 0-indexed while spans are 1-indexed.
        return '["' + X + '", "' + words[i - 1] + '"]'
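Since the output is a nested JSON-style array, one alternative is to build nested lists and let `json.dumps` handle quoting. The sketch below swaps in that approach and is not the original method; `build_tree` and the toy back-pointers are hypothetical.

```python
import json

def build_tree(bp, words, i, j, X):
    """Return the parse of span (i, j) rooted at X as nested lists."""
    if i == j:
        # Single word: words is 0-indexed while spans are 1-indexed.
        return [X, words[i - 1]]
    _, Y1, Y2, s = bp[(i, j, X)]
    return [X, build_tree(bp, words, i, s, Y1), build_tree(bp, words, s + 1, j, Y2)]

# Toy back-pointers in the (X, Y1, Y2, s) layout, made up for illustration.
words = ["the", "dog", "barks"]
bp = {(1, 3, "S"): ("S", "NP", "VP", 2), (1, 2, "NP"): ("NP", "DT", "NN", 1)}

tree_json = json.dumps(build_tree(bp, words, 1, 3, "S"))
# tree_json == '["S", ["NP", ["DT", "the"], ["NN", "dog"]], ["VP", "barks"]]'
```

Going through `json.dumps` also escapes words that contain a double quote or a backslash, which plain string concatenation would not.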
Results:
Type       Total   Precision   Recall   F1-Score
===============================================================
ADJP           7       0.000    0.000      0.000
ADVP          12       0.333    0.167      0.222
NP           571       0.629    0.676      0.651
PP           188       0.774    0.803      0.789
PRT            4       0.333    0.250      0.286
QP             1       0.000    0.000      0.000
S             26       0.333    0.231      0.273
SBAR           9       0.167    0.222      0.190
SBARQ        236       0.983    0.966      0.974
SQ           236       0.897    0.881      0.889
VP           168       0.634    0.351      0.452
WHADJP        26       0.852    0.885      0.868
WHADVP        60       0.952    0.983      0.967
WHNP         178       0.879    0.860      0.869
WHPP           5       1.000    0.800      0.889
total       1727       0.767    0.742      0.754