A Simple Walkthrough of BLEU

SofanHe

Category: NLP. Tags: artificial intelligence

Article link: https://blog.youkuaiyun.com/H_18763886211/article/details/112726080

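This article walks through how BLEU (Bilingual Evaluation Understudy) is computed when evaluating the accuracy of a translation: the clipped count $Count_{clip}$ that penalizes candidates for over-repeating words, the n-gram score $p_n$, and the Brevity Penalty that corrects for length bias. Examples and experiments illustrate BLEU's strengths and weaknesses as a measure of translation quality, and the article also covers how to use the NLTK library in practice.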

BLEU: Bilingual Evaluation Understudy

Definition

$\begin{aligned} & BLEU = BP \cdot \exp\left(\sum_{n=1}^N w_n \log p_n\right)\\ & \log BLEU = \min\left(1 - {r \over c}, 0\right) + \sum_{n=1}^N w_n \log p_n\\ & BP = \begin{cases}1 & c > r\\ e^{1 - {r \over c}} & c \leq r\end{cases}, \quad c \leftarrow len(candidate),\ r \leftarrow len(reference)\\ & p_n = {\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram}) \over \sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}\\ & Count_{clip}(n\text{-gram}) = \min(Count, Max\_Reference\_Count) \end{aligned}$

That is the full definition of the quantities that go into BLEU. Each part is explained below.

Count clip

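This quantity is computed as follows: for an n-gram (a sequence of n words), count how many times it occurs in the candidate sentence, then cap that count at the largest number of times it occurs in any single reference. Here is an example: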

Candidate : the the the the the the
Reference 1 : The cat is on the mat.
Reference 2 : There is a cat on the mat.

Count_Clip('the') = min(6, max(2, 1)) = min(6, 2) = 2

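The benefit of clipping is that a candidate like this one, which keeps emitting a word that does occur in the references without any regard for the sentence as a whole, can no longer score well.

The baseline it is compared against simply counts how many words of the candidate occur in a reference, that is, $Count'(n\text{-gram}) = Count$. Under that scheme all 6 words of the candidate here would count as matches, giving a precision of 6/6 = 1.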
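The clipping rule can be sketched in a few lines of Python. This is only an illustration: the helper names `tokenize` and `clipped_counts` are made up for this post, and the tokenization (lowercasing, dropping periods, splitting on whitespace) is a simplifying assumption.

```python
from collections import Counter

def tokenize(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()

def clipped_counts(candidate, references, n=1):
    """Return {n-gram: Count_clip} for one candidate against several references."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(tokenize(candidate))
    refs = [ngrams(tokenize(r)) for r in references]
    # Cap each candidate count at the largest count found in any single reference.
    return {gram: min(count, max(ref[gram] for ref in refs))
            for gram, count in cand.items()}

candidate = "the the the the the the"
references = ["The cat is on the mat.", "There is a cat on the mat."]
clipped = clipped_counts(candidate, references)
print(clipped[("the",)])  # 2, matching min(6, max(2, 1)) above
```

Without the `min(...)` cap the same dictionary would hold 6 for `("the",)`, which is exactly the unclipped baseline.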

Pn

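The formula for $p_n$ is simply the n-gram score for one particular value of n. As before, the more n-grams the candidate shares with the references, the higher the score, and a candidate identical to a reference scores 1.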

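It helps to take the formula apart. When there is a single candidate, $|\{Candidates\}| = 1$, it reduces to:
$p_n = {\sum_{n\text{-gram} \in Candidate_i} Count_{clip}(n\text{-gram}) \over \sum_{n\text{-gram}' \in Candidate_i} Count(n\text{-gram}')}$
When the system produces several candidates, the numerator and the denominator are summed separately: all candidate sentences are treated as one pool, the corresponding $Count$ and $Count_{clip}$ are accumulated over that pool, and the two totals are divided to give $p_n$. So $p_n$ is an overall score for the whole set, not a mean of per-sentence scores.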
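A minimal sketch of the pooled computation, assuming sentences are already lowercased and split into token lists. The function names `ngram_counts` and `modified_precision` are invented for this example and do not come from any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidates, references_per_candidate, n):
    """p_n over a set of candidates: pooled clipped counts / pooled raw counts."""
    clipped_total, raw_total = 0, 0
    for cand, refs in zip(candidates, references_per_candidate):
        cand_counts = ngram_counts(cand, n)
        ref_counts = [ngram_counts(ref, n) for ref in refs]
        for gram, count in cand_counts.items():
            clipped_total += min(count, max(rc[gram] for rc in ref_counts))
            raw_total += count
    # Assumption: return 0.0 when the candidates contain no n-grams at all.
    return clipped_total / raw_total if raw_total else 0.0

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
cands = ["the the the the the the".split(), "the cat".split()]
p1 = modified_precision(cands, [refs, refs], 1)
print(p1)  # (2 + 2) / (6 + 2) = 0.5
```

Averaging the two per-sentence precisions instead (2/6 and 2/2) would give about 0.67, so pooling the counts and averaging the sentence scores are not the same thing.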

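The authors tested $p_n$ on its own, to show that the metric obtained just by modifying the n-gram count is already usable:
[Figure: modified n-gram precision of a human translation and a poor machine translation, for n = 1 to 4]
The figure above takes the same source text, has it translated by a human (dark blue bars) and by a poor machine translation system, and shows the modified n-gram precision for n from 1 to 4 (averaged, as far as I can tell).

A second experiment goes further: two human translators, H1 and H2, and three commercial machine translation systems, S1, S2, and S3, are all scored with the modified n-gram precision.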

BP

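Brevity Penalty: a factor that scores the candidate on its length.

Why is it needed? Because some very short translations grab only a few key words and do nothing else, yet still score very highly on the modified n-gram precision alone!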

Candidate : of the
Reference 1 : It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2 : It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3 : It is the practical guide for the army always to heed the directions of the Party.

The candidate's 1-gram and 2-gram scores are both 1.0, which is very high (yet this is clearly not a good translation).

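Having observed this, the authors noted that it only happens when the candidate is very short, so these opportunistic short sentences need to be penalized. The simplest fix is to multiply the score by a factor whenever the candidate is no longer than the reference. That factor is BP:
$BP = \begin{cases} 1&, c > r\\ e^{1 - {r \over c}}&, c \leq r\end{cases}$
At $c = r$ the factor is $e^0 = 1$, so only candidates strictly shorter than the reference lose score, and the candidate above is penalized automatically.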
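To see how strong the penalty is for the "of the" example, here is a rough sketch. It assumes whitespace tokenization, so the three references have 16, 18, and 16 words, and it assumes $r$ is taken as the reference length closest to $c$, which is 16 here. Returning 0 for an empty candidate or a zero precision is also a choice made for this sketch, since $\log 0$ is undefined.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0  # assumption: an empty candidate scores zero
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(precisions, c, r, weights=None):
    """BP * exp(sum of w_n * log p_n), with uniform weights by default."""
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)
    if min(precisions) == 0:
        return 0.0  # assumption: avoid log(0) by returning zero
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_avg)

# Candidate "of the": c = 2, p_1 = p_2 = 1.0, closest reference length r = 16.
score = bleu([1.0, 1.0], c=2, r=16)
print(score)  # exp(-7), roughly 9.1e-4
```

Even with perfect 1-gram and 2-gram precision, the two-word candidate ends up with a score close to zero.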

A sharp reader will then ask: why not use recall?

Candidate 1 : I always invariably perpetually do.
Candidate 2 : I always do.
Reference 1 : I always do.
Reference 2 : I invariably do.
Reference 3 : I perpetually do.

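A sharp reader can see at a glance that Candidate 1 has much higher recall, since it covers words from all three references, yet it is a clumsy translation and far worse than Candidate 2. Bringing recall in would therefore undermine this kind of machine translation evaluation.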
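The effect is easy to reproduce with a toy recall. Note the assumption: recall over several references has no single agreed definition, so the hypothetical `pooled_recall` below simply measures what fraction of the distinct words across all references the candidate covers.

```python
def tokenize(sentence):
    # Assumption: lowercase, drop periods, split on whitespace.
    return sentence.lower().replace(".", "").split()

def pooled_recall(candidate, references):
    """Fraction of distinct reference words (pooled over all references) found in the candidate."""
    cand = set(tokenize(candidate))
    pool = set().union(*(tokenize(ref) for ref in references))
    return len(cand & pool) / len(pool)

references = ["I always do.", "I invariably do.", "I perpetually do."]
recall_1 = pooled_recall("I always invariably perpetually do.", references)
recall_2 = pooled_recall("I always do.", references)
print(recall_1, recall_2)  # 1.0 0.6
```

Candidate 1 gets a perfect score under this measure even though no translator would write it, while Candidate 2, which matches Reference 1 exactly, is ranked lower.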