BLEU uses a weighted geometric mean here.
The weighted arithmetic mean we normally use in everyday life is
$$\frac{w_{0}x_{0}+w_{1}x_{1}+\cdots}{w_{0}+w_{1}+\cdots}$$
whereas the weighted geometric mean is
$$\sqrt[\sum{w}]{\prod{x^{w}}}$$
Starting from this, we can rewrite it step by step:
$$
\begin{aligned}
\sqrt[\sum{w}]{\prod{x^{w}}}
&= e^{\ln\left(\sqrt[\sum{w}]{\prod{x^{w}}}\right)} \\
&= e^{\frac{\ln\left(\prod{x^{w}}\right)}{\sum{w}}} \\
&= e^{\frac{\sum{w\ln x}}{\sum{w}}}
\end{aligned}
$$
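The identity above can be checked numerically. The following is a minimal sketch; the function names and the sample values are made up for illustration and are not part of the original code.

```python
from math import exp, log, prod, isclose

def weighted_geometric_mean_direct(xs, ws):
    # Root form: (prod x_i^{w_i}) ** (1 / sum w_i)
    return prod(x ** w for x, w in zip(xs, ws)) ** (1.0 / sum(ws))

def weighted_geometric_mean_log(xs, ws):
    # Exponential form: exp(sum(w_i * ln x_i) / sum w_i)
    return exp(sum(w * log(x) for x, w in zip(xs, ws)) / sum(ws))

# Hypothetical sample values chosen so the result is easy to verify by hand:
# 0.5^1 * 0.25^2 * 0.125^1 = 2^-8, and its 4th root is 2^-2 = 0.25.
xs = [0.5, 0.25, 0.125]
ws = [1.0, 2.0, 1.0]

direct = weighted_geometric_mean_direct(xs, ws)
via_log = weighted_geometric_mean_log(xs, ws)
print(direct, via_log)
print(isclose(direct, via_log))
```

Both forms agree up to floating-point rounding, which is all the derivation claims.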
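Most references write the weighted n-gram average without the denominator. As for why it is expressed in this exponential form rather than as $\prod_{n=1}^{N}P_{n}^{w_{n}}$, it is probably to simplify the computation: the product form requires many exponentiations and multiplications, while the exponential form reduces to a single weighted sum of logarithms.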
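One likely reason the denominator is missing: if the weights are assumed to be normalized (for example the uniform choice $w_{n}=1/N$, which is an assumption here rather than something the text states), then $\sum_{n=1}^{N}w_{n}=1$ and the denominator simply disappears:

$$
e^{\frac{\sum_{n=1}^{N}w_{n}\ln P_{n}}{\sum_{n=1}^{N}w_{n}}}
= e^{\sum_{n=1}^{N}w_{n}\ln P_{n}}
= \prod_{n=1}^{N}P_{n}^{w_{n}}
$$

With unnormalized weights, such as all weights equal to 1, the division by $\sum w$ has to be kept.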
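from collections import Counter
from math import exp, log

def bleuMultiGram(candidate, reference, maxn, weight=None):
    # Default weights are all 1; dividing by sum(weight) below normalizes them.
    if not weight:
        weight = [1.0] * maxn
    logSum = 0.0
    for i in range(1, maxn + 1):
        p = bleu(candidate, reference, i)
        if p <= 0:
            return 0.0  # a zero precision makes the geometric mean zero
        logSum += weight[i - 1] * log(p)
    # Weighted geometric mean: exp(sum(w * ln p) / sum(w))
    score = exp(logSum / sum(weight[:maxn]))
    c = len(ngram(candidate, 1))
    r = len(ngram(reference, 1))
    # Brevity penalty: 1 if the candidate is longer, otherwise exp(1 - r/c)
    bp = 1.0 if c > r else exp(1 - r / c)
    return bp * score
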
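def bleu(candidate, reference, n=1):
    # Clipped n-gram precision of the candidate against a single reference.
    candidateList = ngram(candidate, n)
    referenceList = ngram(reference, n)
    if not candidateList or not referenceList:
        return 0.0  # no n-grams to compare (also avoids division by zero)
    return NumOfIntersection(candidateList, referenceList) / len(candidateList)
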
def ngram(text, n=1):
    '''
    Return the n-grams of a whitespace-separated string.
    :param text: input sentence
    :param n: n-gram order
    :return: list of str
    '''
    words = text.split()
    # The range is empty when the sentence has fewer than n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def NumOfIntersection(candidate, reference):
    '''
    Return the number of elements the two lists have in common (duplicates counted).
    :param candidate: list of n-grams from the candidate
    :param reference: list of n-grams from the reference
    :return: int
    '''
    candidateCounter = Counter(candidate)
    referenceCounter = Counter(reference)
    # Counter intersection keeps the minimum count of each shared key.
    return sum((candidateCounter & referenceCounter).values())

print(bleu("the the the the", "the cat is standing on the ground", 1))
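
The brevity penalty can also be looked at in isolation. The following is a minimal sketch of the penalty as it is usually defined for BLEU, assuming `c` and `r` are the token counts of the candidate and the reference; the helper name `brevity_penalty` is hypothetical and does not appear in the code above.

```python
from math import exp

def brevity_penalty(c, r):
    # c: candidate length in tokens, r: reference length in tokens (assumed inputs).
    if c == 0:
        return 0.0  # an empty candidate gets no credit
    # No penalty when the candidate is longer; otherwise exp(1 - r/c), which is <= 1.
    return 1.0 if c > r else exp(1 - r / c)

print(brevity_penalty(10, 7))  # candidate longer than reference: no penalty
print(brevity_penalty(4, 7))   # candidate shorter than reference: penalized
```

The penalty compensates for the fact that precision alone rewards very short candidates: a candidate consisting of a single correct word would otherwise score a perfect unigram precision.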