BLEU Notes

luputo

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/luo3300612/article/details/89842456

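This article introduces BLEU, a metric for automatically evaluating machine translation quality. BLEU combines n-gram precisions through a weighted average of their logarithms, capturing both the adequacy and the fluency of a translation. The article also covers modified n-gram precision, which handles repeated words, and the brevity penalty, which deducts points from short, low-recall candidates. BLEU scores lie between 0 and 1, and the metric can also be used to evaluate other sequence outputs.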

BLEU

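Translated and summarized from the original BLEU paper.

BLEU is a metric for automatically evaluating the quality of machine translation.

Basic idea

Consider two candidate translations: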

Candidate1: It is a guide to action which
ensures that the military always obeys
the commands of the party.
Candidate2: It is to insure the troops
forever hearing the activity guidebook
that party direct.

and three references:

Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Party.
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.

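Candidate 1 is the better translation: it shares many words with each of the references, while Candidate 2 shares far fewer. The basic idea behind BLEU is that a better translation shares more words (and word sequences) with the correct translations.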

BLEU(unigrams)

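BLEU is essentially a precision measure. To compute it, count the candidate words that appear in the references and divide by the total length of the candidate. Take unigrams (1-grams) as an example, meaning each single word is treated as a unit: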

Candidate: It is a guide to action which
ensures that the military always obeys
the commands of the party.

Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Party.
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.

Every word except "obeys" appears in at least one reference, so the unigram precision here is 17/18.
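The count above can be checked with a minimal sketch. The `tokenize` helper (lowercasing and keeping only alphabetic tokens) and the function name `unigram_precision` are assumptions made for this illustration, not something the paper prescribes.

```python
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def unigram_precision(candidate, references):
    # Plain (unmodified) precision: a candidate word counts as correct
    # if it appears anywhere in any reference.
    cand = tokenize(candidate)
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    matched = sum(1 for word in cand if word in ref_vocab)
    return matched, len(cand)

CAND1 = ("It is a guide to action which ensures that the military "
         "always obeys the commands of the party.")
REFS = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions "
    "of the party.",
]

print(unigram_precision(CAND1, REFS))  # (17, 18)
```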

Modified n-gram precision

The calculation in the previous section leads to the following problem:

Candidate: the the the the the the the.
Reference1: The cat is on the mat.
Reference2: There is a cat on the mat.

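A machine translation system can reach high precision simply by over-generating "reasonable" words. Here the unigram (1-gram) precision is 7/7 = 1.

To fix this, for each word we take the maximum number of times it occurs in any single reference as a cap, and candidate occurrences beyond that cap are not counted as correct (the count is clipped). Here "the" occurs at most 2 times in a single reference (2 times in Reference 1, 1 time in Reference 2, take the larger), so the modified 1-gram precision is 2/7.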
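The clipping rule can be sketched as follows. The tokenizer and the function name `modified_unigram_precision` are assumptions for illustration; the clipping itself (cap each word's count at its maximum count in any single reference) follows the description above.

```python
from collections import Counter
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def modified_unigram_precision(candidate, references):
    cand_counts = Counter(tokenize(candidate))
    # For every word, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(tokenize(ref)).items():
            max_ref[word] = max(max_ref[word], count)
    # Clip each candidate count at the reference cap.
    clipped = sum(min(count, max_ref[word]) for word, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

CAND = "the the the the the the the."
REFS = ["The cat is on the mat.", "There is a cat on the mat."]

print(modified_unigram_precision(CAND, REFS))  # (2, 7)
```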

n-gram precision

n-gram precision is computed by treating every n consecutive words as a unit. Using the original example again:

Candidate

Candidate1: It is a guide to action which
ensures that the military always obeys
the commands of the party.
Candidate2: It is to insure the troops
forever hearing the activity guidebook
that party direct.

Reference

Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Party.
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.

To compute the 2-gram (bigram) precision, treat every two adjacent words as a unit. Candidate 1 scores 10/17 and Candidate 2 scores 1/13. For the earlier "the the the the" example the result is 0.
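These bigram numbers can be reproduced with a short sketch that generalizes the clipped count to any n. The tokenizer and the helper names `ngrams` and `modified_precision` are assumptions for this illustration.

```python
from collections import Counter
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    # All runs of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

CAND1 = ("It is a guide to action which ensures that the military "
         "always obeys the commands of the party.")
CAND2 = ("It is to insure the troops forever hearing the activity "
         "guidebook that party direct.")
REFS = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions "
    "of the party.",
]

print(modified_precision(CAND1, REFS, 2))  # (10, 17)
print(modified_precision(CAND2, REFS, 2))  # (1, 13)
```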

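Modified n-gram precision captures two aspects of translation: adequacy and fluency. A candidate that uses the same words as the references tends to be adequate, and one that matches longer runs of consecutive words tends to be fluent.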

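Using modified n-gram precision alone

The figure below shows the average modified n-gram precision of a human translation and a machine translation over 127 sentences.
[Figure: average modified n-gram precision, human vs. machine translation]
The scores for the good human translation are much higher than those for the relatively poor machine translation, and the precision at every n-gram order separates the two.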

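The authors then check whether the metric stays consistent when comparing good and average human translators, and good and poor machine translation systems.

[Figure: modified n-gram precision for several human translators and machine translation systems]
H denotes a human translation and S a machine translation system. The ranking is the same for every n, so the n-gram scores are consistent with each other.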

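Combining the n-gram precisions

To combine the scores for different n into one number, we can take a weighted average over n. As the earlier figure shows, however, precision decays roughly exponentially as n grows, and the average has to account for that, for example by averaging logarithms (which amounts to a geometric mean). BLEU does exactly this: it uses the average logarithm with uniform weights.
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad w_n = \frac{1}{N}$$
Here $p_n$ is the modified n-gram precision. BP is explained below.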
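As a rough illustration of the log-average (without BP), the sketch below combines the two precisions computed earlier for Candidate 1, 17/18 and 10/17, with uniform weights. Using only N = 2 and returning 0 when any precision is 0 are assumptions made to keep the example small; the function name `combine_precisions` is hypothetical.

```python
import math

def combine_precisions(precisions, weights=None):
    # exp(sum_n w_n * log p_n); with uniform weights this is the geometric mean.
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n
    if any(p <= 0 for p in precisions):
        # log(0) is undefined; treating the score as 0 is an assumption here.
        return 0.0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

P1 = 17 / 18  # modified unigram precision of Candidate 1
P2 = 10 / 17  # modified bigram precision of Candidate 1
score = combine_precisions([P1, P2])

print(round(score, 4))  # 0.7454
```

The geometric mean lets a low higher-order precision pull the score down more than an arithmetic mean would, which matches the exponential decay noted above.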

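Sentence length

A translation should be neither too long nor too short, and a good metric should take this into account. Modified n-gram precision does so partially: as described above, surplus "reasonable" words are counted as errors. But it still cannot handle the following case: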

Candidate: of the

Reference 1: It is a guide to action that
ensures that the military will forever
heed Party commands.
Reference 2: It is the guiding principle
which guarantees the military forces
always being under the command of the
Party.
Reference 3: It is the practical guide for
the army always to heed the directions
of the party.

unigram precision = 2/2
bigram precision = 1/1

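The same situation shows up in simple classification tasks: if a classifier labels only its single most confident sample as positive, precision on the positive class can be 100% while recall is close to 0. So should we also introduce recall here?

The answer is no. Across multiple references, one concept may be expressed in several different ways. A good candidate should pick one of these expressions, not include them all, and cramming all of them into the candidate actually makes the translation worse. For example: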
Candidate

Candidate 1: I always invariably perpetually do.
Candidate 2: I always do.

Reference

Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.

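The first candidate has very high recall but is clearly worse than the second. We could instead measure recall over concepts rather than words, but that is obviously hard to do.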
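To make the point concrete, here is a sketch of one naive word-level recall: the fraction of distinct reference words, pooled over all references, that the candidate covers. This definition and the name `naive_unigram_recall` are assumptions for illustration only; the paper does not define recall this way.

```python
import re

def tokenize(text):
    # Assumed, simplistic tokenizer: lowercase and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def naive_unigram_recall(candidate, references):
    # Share of the pooled reference vocabulary that appears in the candidate.
    cand_vocab = set(tokenize(candidate))
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    return len(ref_vocab & cand_vocab), len(ref_vocab)

REFS = ["I always do.", "I invariably do.", "I perpetually do."]

print(naive_unigram_recall("I always invariably perpetually do.", REFS))  # (5, 5)
print(naive_unigram_recall("I always do.", REFS))                         # (3, 5)
```

Under this naive measure the worse candidate gets the perfect score, which is why rewarding word-level recall directly is not a good idea.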

Brevity penalty

Although we cannot reward high recall, we can deduct points from low-recall sentences by applying a brevity penalty factor.

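Because modified n-gram precision already penalizes overly long candidates, no further penalty is needed on that side. We introduce a brevity penalty (BP) factor. It is 1 when the candidate has the same length as some reference. Otherwise, we take the length of the reference closest in length to the candidate as the best match length, denoted r, and let c be the length of the candidate. (The paper computes r and c over the whole test corpus by summing over sentences; this note works with a single sentence for simplicity.) The BP for the candidate is
$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
When c > r, BP is 1, meaning long candidates are not penalized. The shorter the translation, the smaller c, the larger r/c, and hence the smaller BP.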
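A minimal sentence-level sketch of the brevity penalty follows. The function names, the tie-breaking toward the shorter reference, and returning 0 for an empty candidate are assumptions for illustration. The reference lengths 16, 18 and 16 are the word counts of the three references used throughout this note.

```python
import math

def closest_ref_length(cand_len, ref_lens):
    # Best match length r: the reference length closest to the candidate length.
    # Ties are broken toward the shorter reference (an assumption).
    return min(ref_lens, key=lambda r: (abs(r - cand_len), r))

def brevity_penalty(cand_len, ref_lens):
    if cand_len == 0:
        return 0.0  # assumed convention for an empty candidate
    r = closest_ref_length(cand_len, ref_lens)
    if cand_len > r:
        return 1.0
    return math.exp(1 - r / cand_len)

REF_LENS = [16, 18, 16]

print(brevity_penalty(18, REF_LENS))            # 1.0 (Candidate 1, c = 18)
print(round(brevity_penalty(2, REF_LENS), 6))   # 0.000912 (candidate "of the", c = 2)
```

With c = 2 and r = 16, BP is e^(1 - 8) = e^(-7), so the two-word candidate "of the" is pushed close to 0 despite its perfect n-gram precisions.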