BLEU详解以及简单的代码实现翻译评估

最新推荐文章于 2024-08-19 20:02:52 发布

weixin_ry5219775

最新推荐文章于 2024-08-19 20:02:52 发布

阅读量1.6k

点赞数

CC 4.0 BY-SA版权

原文链接：https://blog.youkuaiyun.com/allocator/article/details/79657792

本文深入解析BLEU指标，探讨其在机器翻译领域的应用，包括原始BLEU计算方法、改良型BLEU(n-gram)及短译句惩罚因子，并提供源代码实现。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

整个步骤:
我爱北京天安门
1.第一层循环分词长度的比如长度为1 我
2.第二层循环对预测中的分词(gram)进行统计
3.第三层循环对某个具体的分词进行统计（对预测结果中的每个词都做如下处理）
3.1 对预测中的某个词统计其在预测句子中的频率
3.2 预测中的某个词在所有参考句子中的频率得到一个列表
3.3 在预测句子中得到的频率和所有参考句子中的频率都相互比较取二者的最小值
得到一个列表（防止,the,the,the）
3.4 对3.3所的列表取最大值作为分子（预测结果中所有词的结果相加）
3.5 在预测中出现的频率作为分母（预测结果中所有词的结果相加相当于预测结果的长度?）
2.1 3.4的结果除以3.5的结果准确率
4. 把每种长度分词的准确率做加权平均到结果

https://blog.youkuaiyun.com/dejing6575/article/details/101474040
https://blog.youkuaiyun.com/allocator/article/details/79657792

在这里插入图片描述
就是去掉the the the 的影响

精准度:预测中有多少是预测对了的

             <div id="article_content" class="article_content clearfix">
        <link rel="stylesheet" href="https://csdnimg.cn/release/phoenix/template/css/ck_htmledit_views-211130ba7a.css">
                        <div id="content_views" class="markdown_views">
                <!-- flowchart 箭头图标 勿删 -->
                <svg xmlns="http://www.w3.org/2000/svg" style="display: none;">
                    <path stroke-linecap="round" d="M5,0 0,2.5 5,5z" id="raphael-marker-block" style="-webkit-tap-highlight-color: rgba(0, 0, 0, 0);"></path>
                </svg>
                                        <p></p><div class="toc">

引子

最近在做一个深度学习的小项目: Caption generation model 其中在快速评估模型的时候使用到了Bleu这一个指标, 于是花了一点时间来研究了这个指标代表的意义以及如何计算这个指标, 附带源码.

何为BLEU

在机器翻译领域, 我们需要一个指标来衡量机器翻译的结果与专业人工翻译结果的差异, 一般情况下,如果需要比较好的结果都需要专业的翻译人员介入评估模型翻译的好坏, 但是这样需要大量的人力参与. 因此是否有一种机器的评估方法用于判断翻译的好坏而不需要人工介入, 这种评估方法虽然准确度并不高, 但是可以粗略的反应出当前模型的好坏. 于是Bleu就是这样的情况下作为一个比较好的用于衡量翻译文本质量的指标.维基百科BLEU
BLEU(bilingual evaluation understudy) 中文名称为双语互译质量辅助工具, 计算这个指标, 需要使用机器翻译好的文本(称作candidate docs)以及一些专业翻译人员翻译的文本(称作reference docs). 本质上讲BLEU 就是用来衡量机器翻译文本与参考文本之间的相似程度的指标,取值范围在0-1, 取值越靠近1表示机器翻译结果越好. 然而这个指标也是经过多次的更新升级.

最初的BLEU

最初的BLEU计算特别简单, 通常的讲, 当我们自己计算两个文本之间的相似程度的时候, 我们会考虑单词的频率, 最早的BLEU就是采用了这种思想, 计算方法是: 使用一个累加器表示candidate中的词在reference doc中出现的次数, 从candidate doc 中的第一个词开始比较, 如果在参考文本中出现过, 那么计数加1. 最后使用这个累加值除以candidate doc 中的单词数目即可计算得到文本的BLEU取值, 我们称之为Precision, for example:

根据这个例子和上述的算法, 可以很容易的计算当前文本的precision. 整个candidate doc 的单词长度为7, 而且每一个单词都在reference doc里面出现过, 所以此时累加值为7, 因此准去度为:

P = 7 7 = 1

但是实际上这个翻译非常不理想, 这也是最初的BLEU评估指标不完善的地方, 当遇到出现较多常见词汇时, 翻译质量不高的译文还能够得到较高的precision, 因此也诞生了后续的改良型BLEU计算指标的算法.

改良型BLEU(n-gram)

改良型BLEU. 上面提到的计算BLEU的方法是以单个词为基准进行计算. 改良型的BLEU引入将多个词组合在一起形成一个gram的思想, 比如最初版的Bleu的计算可以看做一个单词构成的gram(这是一种特殊情况), 这种特殊组合被叫做uni-gram, 两个单词的组合叫做bi-gram 以此类推. 因此就可以构成1个单词长度到n个单词长度的多种单词组合(每一种单词长度可能存在不同的组合). 每一种长度的gram都可以计算出一个相应的precision Pn

长度为n

以上公式是长度为n

一般情况下权值全部取1, 因此可以得到简化的几何平均精度

P a v g = e \sum N - 1 n = 0 l n P n N

短译句的惩罚因子

如上的改良型BLEU的计算公式基本可以解决翻译中的常见的词汇对翻译结果评估的影响, 比如参考第一个翻译例子, 如果我们采用改良型的BLEU计算方法求得的评估值为. 但是上述的方法针对于翻译结果为短句依然会得出不准确的评估. for example:

根据上述的计算平均精度的公式,可以计算出:

P a v g = 1

因此在这个基础上引入了对于短翻译句子的惩罚因子. 此处定义一个概念, 当candidate doc 长度等于任何一个reference doc的长度的时候, 我们称此时为最佳匹配, 此时不需要对翻译的结果进行惩罚, 当candidate doc 的长度不等于任何reference doc 的长度的时候, 需要引入一个参考长度(记做Reflen

综上所述, 含有惩罚因子的BLEU最终的计算公式如下:

P a v g = θ * e \sum N - 1 n = 0 l n P n N

关于参考长度的选择其实没有固定的准则. 如果是比较严格的情况下可以选择将最长的reference doc的长度作为参考长度, 一旦翻译文档不是最佳匹配的情况都会受到短句惩罚因子的影响.
引入短句的惩罚因子, 对上述的短句翻译例子的最终平均精度计值为(此时选择的参考长度为reference doc中的最长句子长度,值为7):

P a v g = 0.082

从结果可以看出使用了惩罚因子过后, 很大程度上降低了短翻译句子的精度, 使得模型的评估更加准确.

总结

BLEU是一个非常简单快速粗略的评估指标, 当面对多个翻译模型且需要快速选择模型的场景, 可以使用这个指标来评估模型的好坏, 但是在需要精确评估翻译文本质量的场景, 这个指标就不是那么适用了.

附录(源代码)

# -*- coding:utf-8 -*-
"""
Description:
    1) 使用nltk包中的bleu计算工具来进行辅助计算
"""
import numpy as np
import re
from nltk.translate.bleu_score import corpus_bleu

def my_bleu_v1(candidate_token, reference_token):
    """
    :param candidate_set:
    :param reference_set:
    :description:
    最简单的计算方法是看candidate_sentence 中有多少单词出现在参考翻译中, 重复的也需要计算. 计算出的数量作为分子
    分母是候选句子中的单词数量
    :return: 候选句子单词在参考句子中出现的次数/候选句子单词数量
    """
    # 分母是候选句子中单词在参考句子中出现的次数 重复出现也要计算进去
    count = 0
    for token in candidate_token:
        if token in reference_token:
            count += 1
    a = count
    # 计算候选翻译的句子中单词的数量
    b = len(candidate_token)
    return a/b


def calculate_average(precisions, weights):
    """Calculate the geometric weighted mean."""
    tmp_res = 1
    for id, item in enumerate(precisions):
        tmp_res = tmp_res*np.power(item, weights[id])
    tmp_res = np.power(tmp_res, np.sum(weights))
    return tmp_res


def calculate_candidate(gram_list, candidate):
    """Calculate the count of gram_list in candidate."""
    gram_sub_str = ' '.join(gram_list)
    return len(re.findall(gram_sub_str, candidate))


def calculate_reference(gram_list, references):
    """Calculate the count of gram_list in references"""
    gram_sub_str = ' '.join(gram_list)
    gram_count = []
    for item in references:
        # calculate the count of the sub string
        gram_count.append(len(re.findall(gram_sub_str, item)))
    return gram_count


def my_bleu_v2(candidate_sentence, reference_sentences, max_gram, weights,mode=0):
    """
    :param candidate_sentence:
    :param reference_sentence:
    :description: 上诉的最初版本的bleu指标存在比较大的缺陷 如常用词语(the on) 等 由于出现的频率比较高
    会导致翻译结果比较差的时候也能够得到较高的bleu值
    改进行的bleu方法中使用到了n-grams precision方式更改分母的计算法则 使得不是简单的计算单个词汇出现次数
    原有的初始方法是一一个词为基准计算分母 现在改进方法采用n 个词作为一个组用于计算分母 其中n可以从1取到最大
    这样如果事先决定了所要计算gram的最大长度(N) 那么可以在candidate sentence 和 reference sentences 上计算出每一个
    长度的gram 的精度 然后对精度进行几何加权平均即可
    :return:
    """
    candidate_corpus = list(candidate_sentence.split(' '))
    # number of the reference sentences
    refer_len = len(reference_sentences)
    candidate_tokens_len = len(candidate_corpus)
    # 首先需要计算各种长度的gram 的precision值
    if mode == 0:
        # method1 to calculate the bleu
        # 计算当前gram 在candiate_sentence中出现的次数 同时计算这个gram 在所有的reference sentence中的出现的次数
        # 每一次计算时将当前candidate_sentence中当前gram出现次数与在当前reference sentence中出现的gram次数选择最小值
        # 作为这个gram相对于 参考文献j的截断次数
        # 然后将所有的参考文献对应的截断次数做最大值 作为这个gram在整个参考文献上的综合截断值 这个值就是当前gram对应的分子
        # 分母依然是这个gram 在candidate sentence中出现的次数
        # 在计算当前长度(n)的其他的gram的综合截断次数 然后加起来作为长度为n的gram的综合截断次数值 分母是所有长度为n的gram的相加的值
        # 两个值相除即可得到这个长度为n的gram 的precision值
        # procedure
        gram_precisions= []
        for i in range(max_gram):
            # calculate each gram precision
            # set current gram length
            curr_gram_len = i+1
            # calculate current gram length mole(分子)
            curr_gram_mole = 0
            # calculate current gram length deno(分母)
            curr_gram_deno = 0
            for j in range(0, candidate_tokens_len, curr_gram_len):
                if j + curr_gram_len > candidate_tokens_len:
                    continue
                else:
                    curr_gram_list = candidate_corpus[j:j+curr_gram_len]
                    gram_candidate_count = calculate_candidate(curr_gram_list, candidate_sentence)
                    # print(' current gram candidate count')
                    # print(gram_candidate_count)
                    gram_reference_count_list = calculate_reference(curr_gram_list, reference_sentences)
                    # print(' current gram reference count list')
                    # print(gram_reference_count_list)
                    truncation_list = []
                    for item in gram_reference_count_list:
                        truncation_list.append(np.min([gram_candidate_count, item]))
                    curr_gram_mole += np.max(truncation_list)
                    curr_gram_deno += gram_candidate_count
            print(' current length %d and gram mole %d and deno %d' % (i+1, curr_gram_mole, curr_gram_deno))
            gram_precisions.append(curr_gram_mole/curr_gram_deno)
        print('all the precisions about the grams')
        print(gram_precisions)

        # method2 to calculate the bleu
        # 第二种计算方法与第一种计算方法本质上的区别在于计算截断计数的区别(最终结果是一样的)
        # 先计算当前n长度的gram在所有的参考文献中的出现次数的最大值 然后在与当前gram在candidate sentence中出现的次数的最小值
        # 作为综合截断计数 本质上讲两种方法得到的结果是一样的 不在缀述

    # 其次对多元组合(n-gram)的precision 进行加权取平均作为最终的bleu评估指标
    # 一般选择的做法是计算几何加权平均 exp(sum(w*logP))
        average_res = calculate_average(gram_precisions, weights)
        print(' current average result')
        print(average_res)
    # 最后引入短句惩罚项 避免短句翻译结果取得较高的bleu值, 影响到整体评估
    # 涉及到最佳的匹配长度 当翻译的句子的词数量与任意的参考翻译句子词数量一样的时候 此时无需惩罚项
    # 如果不相等 那么需要设置一个参考长度r 当翻译的句子长度(c) 大于 r 的时候不需要进行惩罚 而 当c小于r
    # 需要在加权平均值前乘以一个惩罚项exp(1-r/c) 作为最后的bleu 指标输出
    # r 的选择可以这样确定 当翻译句子长度等于任何一个参考句子长度时不进行惩罚 但是当都不等于参考句子长度时
    # 可以选择参考句子中最长的句子作为r 当翻译句子比r 长时不进行惩罚 小于r时进行惩罚
    bp = 1
    reference_len_list = [len(item.split(' ')) for item in reference_sentences]
    if candidate_tokens_len in reference_len_list:
        bp = 1
    else:
        if candidate_tokens_len < np.max(reference_len_list):
            bp = np.exp(1-(np.max(reference_len_list)/candidate_tokens_len))
    return bp*average_res


if __name__ == '__main__':
    candidate_sentence = 'hello this is my code'
    reference_sentence = 'hello this code is not mine'
    candidate_token = candidate_sentence.split(' ')
    reference_token = reference_sentence.split(' ')
    bleu_v1_score = my_bleu_v1(candidate_token, reference_token)
    print('bleu version 1 score is %.2f ' % bleu_v1_score)


    # full bleu test on references and candidate
    predict_sentence = 'how old is the man'
    train_sentences = ['this is a dog and not is a cat', 'this is a cat and not is a dog', 'it is a dragon', 'i like play ball']
    bleu_v2_score = my_bleu_v2(predict_sentence, train_sentences, 4, weights=[0.25, 0.25, 0.25, 0.25], mode=0)
 
 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150

            <link href="https://csdnimg.cn/release/phoenix/mdeditor/markdown_views-60ecaf1f42.css" rel="stylesheet">
                                <div></div>
                    
        <div class="person-messagebox">
            <div class="left-message"><a href="https://blog.youkuaiyun.com/Allocator">
                <img src="https://profile.csdnimg.cn/3/E/A/3_allocator" class="avatar_pic" username="Allocator">
            </a></div>
            <div class="middle-message">
                                    <div class="title"><span class="tit "><a href="https://blog.youkuaiyun.com/Allocator" data-report-click="{&quot;mod&quot;:&quot;popu_379&quot;,&quot;ab&quot;:&quot;new&quot;}" target="_blank">Allocator</a></span>
                    <!-- 等级，level -->
                                            <img class="identity-icon" src="https://csdnimg.cn/identity/blog4.png">                                            </div>
                <div class="text"><span>原创文章 40</span><span>获赞 55</span><span>访问量 12万+</span></div>
            </div>
                            <div class="right-message">
                                        <a class="btn btn-sm  bt-button personal-watch" data-report-click="{&quot;mod&quot;:&quot;popu_379&quot;,&quot;ab&quot;:&quot;new&quot;}">关注</a>
                                                            <a href="https://im.youkuaiyun.com/im/main.html?userName=Allocator" target="_blank" class="btn btn-sm bt-button personal-letter">私信
                    </a>
                                </div>
                        </div>
                    
    </div>
</article>

结束