Basic Methods for Evaluating Generation Models
Suggestion: if your RAG evaluation has no strict rigor requirements, Ragas or TruLens is enough; writing your own evaluation framework is somewhat more cumbersome.
First, an overview of the evaluation metrics used for this RAG setup:
F1_Score (Precision_Score, Recall_Score)
The F1_Score class computes the F1 score, which measures the similarity between a prediction and the ground-truth answer. The steps are:
1. Normalize each prediction and each reference answer.
2. Compute precision, recall, and F1.
3. When there are multiple reference answers, compare against each one and keep the best score.
The Precision_Score and Recall_Score classes both inherit from F1_Score.
from collections import Counter
from typing import List, Union

import numpy as np

# normalize_answer (lowercasing + punctuation/whitespace cleanup) is defined in the full code below.

def compute_precision_recall_f1(prediction: str, ground_truth: str):
    """Compute precision, recall, and F1 between a single prediction and a single reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
class F1_Score:
    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        """Compute the average F1 score."""
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            f1s = [compute_precision_recall_f1(pred, r)[2] for r in ref]
            scores.append(max(f1s))  # keep the best score over multiple references
        return np.mean(scores)


class Precision_Score(F1_Score):
    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        """Compute the average precision."""
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            precisions = [compute_precision_recall_f1(pred, r)[0] for r in ref]
            scores.append(max(precisions))
        return np.mean(scores)


class Recall_Score(F1_Score):
    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        """Compute the average recall."""
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            recalls = [compute_precision_recall_f1(pred, r)[1] for r in ref]
            scores.append(max(recalls))
        return np.mean(scores)
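A minimal usage sketch (assuming the classes above are importable together with the normalize_answer helper defined later in the full code; the exact values depend on that normalization):

preds = ["The capital of France is Paris"]
refs = [["Paris is the capital of France", "The capital of France is Paris"]]
print(F1_Score()(preds, refs))         # best F1 over the two references per sample
print(Precision_Score()(preds, refs))  # best precision
print(Recall_Score()(preds, refs))     # best recall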
Exact Match (EM)
The Exact_Match_Score class checks whether the predicted answer exactly matches a ground-truth answer.
The normalized strings are compared directly; the score is 1 if they are identical, otherwise 0.
class Exact_Match_Score:
    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        """Compute the average Exact Match score (multiple references supported)."""
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            norm_pred = normalize_answer(pred)
            match = any(norm_pred == normalize_answer(r) for r in ref)
            scores.append(float(match))  # 1 for an exact match, otherwise 0
        return np.mean(scores)
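A quick sanity check of Exact_Match_Score (same normalize_answer assumption; only an identical normalized string scores 1):

preds = ["Paris", "The Earth orbits the Sun"]
refs = [["Paris"], ["The Earth orbits around the Sun"]]
print(Exact_Match_Score()(preds, refs))  # 0.5: the first sample matches exactly, the second does not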
Sub-string Exact Match (Sub-EM)
The Sub_ExactMatch class is similar to exact match, but it scores sub-string matches: if a normalized reference answer appears as a sub-string of the prediction, it also counts as a match (this is the behavior of the full code at the end of the post).
class Sub_ExactMatch:
    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        """Compute the sub-string exact-match (Sub-EM) score."""
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            norm_pred = normalize_answer(pred)
            match = any(normalize_answer(r) in norm_pred for r in ref)
            scores.append(float(match))
        return np.mean(scores)
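And a sketch for Sub_ExactMatch, where a reference contained anywhere in the prediction counts as a hit (illustrative only, same assumptions as above):

preds = ["Paris is the capital of France"]
refs = [["Paris"]]
print(Sub_ExactMatch()(preds, refs))  # 1.0: the normalized reference "paris" appears inside the prediction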
ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a common text-generation metric, used mainly for summarization tasks.
The Rouge_Score class and its subclasses (Rouge_1, Rouge_2, Rouge_L) compute ROUGE scores between a prediction and the reference answers.
The rouge package is used to compute the rouge-1, rouge-2, and rouge-l variants.
from rouge import Rouge


class Rouge_Score:
    def __init__(self, rouge_type="rouge-l"):
        self.rouge_type = rouge_type
        self.rouge = Rouge()

    def __call__(self, predictions: List[str], references: List[Union[str, List[str]]]) -> float:
        scores = []
        for pred, ref in zip(predictions, references):
            if isinstance(ref, str):
                ref = [ref]
            # Multiple references are allowed; keep the highest score
            max_score = 0.0
            for r in ref:
                try:
                    score = self.rouge.get_scores(pred, r)[0][self.rouge_type]['f']
                    max_score = max(max_score, score)
                except ValueError:
                    # Skip empty strings
                    continue
            scores.append(max_score)
        return np.mean(scores)


class Rouge_1(Rouge_Score):
    def __init__(self):
        super().__init__(rouge_type="rouge-1")


class Rouge_2(Rouge_Score):
    def __init__(self):
        super().__init__(rouge_type="rouge-2")


class Rouge_L(Rouge_Score):
    def __init__(self):
        super().__init__(rouge_type="rouge-l")
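Illustrative usage of the ROUGE wrappers, assuming the rouge package is installed; the scores come from the package's whitespace-token n-gram overlap:

preds = ["The capital of France is Paris"]
refs = [["Paris is the capital of France", "The capital of France is Paris"]]
print(Rouge_L()(preds, refs))  # ROUGE-L F-score against the best-matching reference
print(Rouge_1()(preds, refs))  # ROUGE-1 F-score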
Chinese evaluation: scores crash to zero
On English evaluation data the home-grown framework works well. Taking ROUGE-L as an example, the results look like this:
rouge-l results:
Overall score: {'rouge-l': 0.9266666616711111}
Per-sample scores:
Sample 1:
Prediction: The capital of France is Paris
References: ['Paris is the capital of France', 'The capital of France is Paris']
Score: 0.999999995
Sample 2:
Prediction: The sky is blue
References: ['The sky appears blue', 'The sky is blue']
Score: 0.999999995
Sample 3:
Prediction: Machine learning is a subset of artificial intelligence
References: ['Machine learning is part of AI', 'ML is a subset of artificial intelligence']
Score: 0.7999999950222222
Sample 4:
Prediction: Python is a programming language
References: ['Python is a programming language', 'Python is a coding language']
Score: 0.999999995
Sample 5:
Prediction: The Earth orbits around the Sun
References: ['The Earth moves around the Sun', 'Earth orbits the Sun']
Score: 0.8333333283333335
But after switching the evaluation set to Chinese, the scores collapse: almost nothing matches.
rouge-l results:
Overall score: {'rouge-l': 0.199999999}
Per-sample scores:
Sample 1:
Prediction: 北京是中国的首都
References: ['中国的首都是北京', '北京是中华人民共和国的首都']
Score: 0.0
Sample 2:
Prediction: 太阳是一颗恒星
References: ['太阳属于恒星', '太阳是恒星']
Score: 0.0
Sample 3:
Prediction: 人工智能是计算机科学的一个分支
References: ['人工智能是计算机科学的重要分支', 'AI是计算机科学的一部分']
Score: 0.0
Sample 4:
Prediction: 长城是中国古代伟大的建筑工程
References: ['万里长城是中国最伟大的建筑之一', '长城是古代中国的伟大工程']
Score: 0.0
Sample 5:
Prediction: 水是生命之源
References: ['水对生命非常重要', '水是生命之源']
Score: 0.999999995
Clearing up a misconception
The usual assumption goes like this: ROUGE matches complete natural-language sentences, so there is no need to tokenize manually before evaluating; under the hood it automatically finds the maximum n-gram (character/word) overlap rather than relying on your own segmentation, and running jieba.cut() and then joining the pieces back together would only break ROUGE's character-level matching. Just call rouge.get_scores("北京是中国的首都", "中国的首都是北京") and it will supposedly slide a character-level n-gram window over the Chinese text by default. F1, Recall, and Precision compare at the token level, so for them segmentation is worthwhile, whereas ROUGE handles character-level matching on its own and segmentation adds nothing.
But is that really the case?
Trying it by hand and passing pred and ground truth in directly as raw strings, the results are exactly the same as before.
So the explanation has to come from a natural difference between Chinese and English: character-level evaluation of Chinese treats each individual Chinese character as a unit, while English words are already separated by spaces. The rouge package does find the maximum n-gram overlap automatically, but its n-grams are built from whitespace-separated words; with no whitespace, an entire Chinese sentence collapses into a single "token", which is why only the literally identical sample above scores ~1 and everything else scores 0. A space-delimited word in English corresponds to a word segment in Chinese, and that is exactly what jieba segmentation provides.
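A minimal check of this point, assuming the rouge and jieba packages are installed (exact values may vary slightly across versions):

from rouge import Rouge
import jieba

rouge = Rouge()
pred, ref = "北京是中国的首都", "中国的首都是北京"

# Raw strings: with no whitespace the rouge package sees each sentence as a single token,
# so unless the two strings are identical the overlap is zero.
print(rouge.get_scores(pred, ref)[0]["rouge-l"]["f"])

# After jieba segmentation the words are space-separated and n-gram overlap works as intended.
pred_seg = " ".join(jieba.cut(pred))
ref_seg = " ".join(jieba.cut(ref))
print(rouge.get_scores(pred_seg, ref_seg)[0]["rouge-l"]["f"])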
Adding Chinese-friendly improvements
Improvement 1: normalize the input.
import re

def normalize_answer(s):
    """Helper that normalizes an answer string, with Chinese support."""
    if not s:
        return ""
    # Lowercase (only affects English characters)
    s = s.lower()
    # Remove punctuation (including Chinese punctuation); keep CJK characters, letters, digits, whitespace
    s = re.sub(r'[^\u4e00-\u9fff\u3400-\u4dbfa-zA-Z0-9\s]', '', s)
    # Collapse extra whitespace
    s = ' '.join(s.split())
    return s
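For example (illustrative calls):

print(normalize_answer("北京,是中国的首都!"))                 # -> 北京是中国的首都
print(normalize_answer("  The Capital of France is  Paris. "))  # -> the capital of france is paris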
Improvement 2: Chinese word segmentation.
def tokenize_chinese(text):
    """Chinese word segmentation helper."""
    try:
        import jieba
        return list(jieba.cut(text))
    except ImportError:
        # Fall back to per-character splitting if jieba is not installed
        return list(text)
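For example (shown with jieba installed; without it the per-character fallback is used):

print(tokenize_chinese("人工智能是计算机科学的一个分支"))
# a word-level list such as ['人工智能', '是', ...]; the exact segmentation depends on jieba's dictionary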
Full code
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Union


def normalize_answer(s):
    """Helper that normalizes an answer string, with Chinese support."""
    if not s:
        return ""
    # Lowercase (only affects English characters)
    s = s.lower()
    # Remove punctuation (including Chinese punctuation); keep CJK characters, letters, digits, whitespace
    s = re.sub(r'[^\u4e00-\u9fff\u3400-\u4dbfa-zA-Z0-9\s]', '', s)
    # Collapse extra whitespace
    s = ' '.join(s.split())
    return s


def tokenize_chinese(text):
    """Chinese word segmentation helper."""
    try:
        import jieba
        return list(jieba.cut(text))
    except ImportError:
        # Fall back to per-character splitting if jieba is not installed
        return list(text)
@dataclass
class EvaluationData:
    """Simple data class holding the evaluation data."""
    pred: List[str]                            # list of predicted answers
    golden_answers: List[List[str]]            # list of reference-answer lists
    prompt: List[str] = None                   # list of input prompts (optional)
    retrieval_result: List[List[dict]] = None  # retrieval results (optional)


class BaseMetric:
    """Base class for evaluation metrics."""
    metric_name = "base"

    def __init__(self, config):
        self.config = config
        self.dataset_name = config['dataset_name']
        self.use_chinese = config.get('use_chinese', True)  # enable Chinese support

    def calculate_metric(self, data):
        return {}, []
class F1_Score(BaseMetric):
    """Token-level F1 metric."""
    metric_name = "f1"

    def token_level_scores(self, prediction: str, ground_truths: Union[str, List[str]]):
        final_metric = {'f1': 0, 'precision': 0, 'recall': 0}
        if isinstance(ground_truths, str):
            ground_truths = [ground_truths]
        for ground_truth in ground_truths:
            normalized_prediction = normalize_answer(prediction)
            normalized_ground_truth = normalize_answer(ground_truth)
            # Tokenize (jieba for Chinese, per-character fallback)
            prediction_tokens = tokenize_chinese(normalized_prediction)
            ground_truth_tokens = tokenize_chinese(normalized_ground_truth)
            common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
            num_same = sum(common.values())
            if num_same == 0:
                continue
            precision = 1.0 * num_same / len(prediction_tokens) if len(prediction_tokens) > 0 else 0
            recall = 1.0 * num_same / len(ground_truth_tokens) if len(ground_truth_tokens) > 0 else 0
            f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
            final_metric['f1'] = max(f1, final_metric['f1'])
            final_metric['precision'] = max(precision, final_metric['precision'])
            final_metric['recall'] = max(recall, final_metric['recall'])
        return final_metric

    def calculate_metric(self, data):
        pred_list = data.pred
        golden_answers_list = data.golden_answers
        metric_score_list = [self.token_level_scores(pred, golden_answers)['f1']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        f1 = sum(metric_score_list) / len(metric_score_list) if metric_score_list else 0
        return {"f1": f1}, metric_score_list
class Recall_Score(F1_Score):
    """Token-level recall metric."""
    metric_name = "recall"

    def calculate_metric(self, data):
        pred_list = data.pred
        golden_answers_list = data.golden_answers
        metric_score_list = [self.token_level_scores(pred, golden_answers)['recall']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        recall = sum(metric_score_list) / len(metric_score_list)
        return {"recall": recall}, metric_score_list


class Precision_Score(F1_Score):
    """Token-level precision metric."""
    metric_name = "precision"

    def calculate_metric(self, data):
        pred_list = data.pred
        golden_answers_list = data.golden_answers
        metric_score_list = [self.token_level_scores(pred, golden_answers)['precision']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        precision = sum(metric_score_list) / len(metric_score_list)
        return {"precision": precision}, metric_score_list
class ExactMatch(BaseMetric):
    """Exact-match metric."""
    metric_name = "em"

    def __init__(self, config):
        super().__init__(config)
        self.is_regex = self.dataset_name == 'curatedtrec'

    def calculate_em(self, prediction: str, golden_answers: list) -> float:
        if isinstance(golden_answers, str):
            golden_answers = [golden_answers]
        normalized_prediction = normalize_answer(prediction)
        score = 0.0
        for golden_answer in golden_answers:
            if self.is_regex:
                golden_answer = re.compile(golden_answer, re.IGNORECASE)
                match = re.fullmatch(golden_answer, normalized_prediction)
                if match is not None:
                    score = 1.0
                    break
            else:
                golden_answer = normalize_answer(golden_answer)
                if golden_answer == normalized_prediction:
                    score = 1.0
                    break
        return score

    def calculate_metric(self, data):
        golden_answers_list = data.golden_answers
        pred_list = data.pred
        metric_score_list = [self.calculate_em(pred, golden_answers)
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        em_score = sum(metric_score_list) / len(metric_score_list)
        return {"em": em_score}, metric_score_list
class Sub_ExactMatch(BaseMetric):
    """Sub-string exact-match metric."""
    metric_name = "sub_em"

    def __init__(self, config):
        super().__init__(config)
        self.is_regex = self.dataset_name == 'curatedtrec'

    def calculate_sub_em(self, prediction: str, golden_answers: list) -> float:
        if isinstance(golden_answers, str):
            golden_answers = [golden_answers]
        normalized_prediction = normalize_answer(prediction)
        score = 0.0
        for golden_answer in golden_answers:
            if self.is_regex:
                golden_answer = re.compile(golden_answer, re.IGNORECASE)
                match = re.search(golden_answer, normalized_prediction)
                if match is not None:
                    score = 1.0
                    break
            else:
                golden_answer = normalize_answer(golden_answer)
                if golden_answer in normalized_prediction:
                    score = 1.0
                    break
        return score

    def calculate_metric(self, data):
        golden_answers_list = data.golden_answers
        pred_list = data.pred
        metric_score_list = [self.calculate_sub_em(pred, golden_answers)
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        sub_em_score = sum(metric_score_list) / len(metric_score_list)
        return {"sub_em": sub_em_score}, metric_score_list
class Rouge_Score(BaseMetric):
    """Base class for the ROUGE metrics."""
    metric_name = "rouge_score"

    def __init__(self, config):
        super().__init__(config)
        try:
            from rouge import Rouge
            self.scorer = Rouge()
        except ImportError:
            raise ImportError("Please install the rouge package first: pip install rouge")

    def calculate_rouge(self, pred: str, golden_answers: List[str]) -> dict:
        """
        Compute ROUGE scores.

        Args:
            pred: predicted text
            golden_answers: list of reference texts

        Returns:
            dict: rouge-1, rouge-2, and rouge-l F-scores
        """
        if not pred or not golden_answers:
            return {'rouge-1': 0.0, 'rouge-2': 0.0, 'rouge-l': 0.0}
        # The rouge package builds n-grams from whitespace-separated tokens, so Chinese text
        # must be segmented (jieba, with per-character fallback) and re-joined with spaces.
        pred_seg = ' '.join(tokenize_chinese(normalize_answer(pred)))
        output = {}
        for answer in golden_answers:
            if not answer:  # skip empty references
                continue
            answer_seg = ' '.join(tokenize_chinese(normalize_answer(answer)))
            try:
                scores = self.scorer.get_scores(pred_seg, answer_seg)
                for key in ['rouge-1', 'rouge-2', 'rouge-l']:
                    if key not in output:
                        output[key] = []
                    output[key].append(scores[0][key]['f'])
            except Exception as e:
                print(f"Error while computing ROUGE: {e}")
                continue
        # Return 0 if no valid score was produced
        if not output:
            return {'rouge-1': 0.0, 'rouge-2': 0.0, 'rouge-l': 0.0}
        # Keep the best score over the references for each variant
        for k, v in output.items():
            output[k] = max(v) if v else 0.0
        return output
class Rouge_1(Rouge_Score):
    """ROUGE-1 metric."""
    metric_name = "rouge-1"

    def calculate_metric(self, data):
        golden_answers_list = data.golden_answers
        pred_list = data.pred
        metric_score_list = [self.calculate_rouge(pred, golden_answers)['rouge-1']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        score = sum(metric_score_list) / len(metric_score_list)
        return {"rouge-1": score}, metric_score_list


class Rouge_2(Rouge_Score):
    """ROUGE-2 metric."""
    metric_name = "rouge-2"

    def calculate_metric(self, data):
        golden_answers_list = data.golden_answers
        pred_list = data.pred
        metric_score_list = [self.calculate_rouge(pred, golden_answers)['rouge-2']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        score = sum(metric_score_list) / len(metric_score_list)
        return {"rouge-2": score}, metric_score_list


class Rouge_L(Rouge_Score):
    """ROUGE-L metric."""
    metric_name = "rouge-l"

    def calculate_metric(self, data):
        golden_answers_list = data.golden_answers
        pred_list = data.pred
        metric_score_list = [self.calculate_rouge(pred, golden_answers)['rouge-l']
                             for pred, golden_answers in zip(pred_list, golden_answers_list)]
        score = sum(metric_score_list) / len(metric_score_list)
        return {"rouge-l": score}, metric_score_list
def main():
    # Build a Chinese sample dataset
    sample_data = EvaluationData(
        pred=[
            "北京是中国的首都",
            "太阳是一颗恒星",
            "人工智能是计算机科学的一个分支",
            "长城是中国古代伟大的建筑工程",
            "水是生命之源"
        ],
        golden_answers=[
            ["中国的首都是北京", "北京是中华人民共和国的首都"],
            ["太阳属于恒星", "太阳是恒星"],
            ["人工智能是计算机科学的重要分支", "AI是计算机科学的一部分"],
            ["万里长城是中国最伟大的建筑之一", "长城是古代中国的伟大工程"],
            ["水对生命非常重要", "水是生命之源"]
        ],
        prompt=[
            "中国的首都是哪里?",
            "太阳是什么天体?",
            "人工智能与计算机科学的关系是什么?",
            "长城的重要性是什么?",
            "水对生命有什么意义?"
        ]
    )
    # Evaluation configuration
    config = {
        'dataset_name': 'sample',
        'use_chinese': True,  # enable Chinese support
        'metric_setting': {
            'retrieval_recall_topk': 5,
            'bleu_max_order': 4,
            'bleu_smooth': False,
            'tokenizer_name': 'gpt-4'
        }
    }

    # Instantiate the metrics
    metrics = [
        F1_Score(config),
        ExactMatch(config),
        Sub_ExactMatch(config),
        Precision_Score(config),
        Recall_Score(config)
    ]

    # Add the ROUGE metrics if the rouge package is installed
    try:
        import rouge
        metrics.extend([
            Rouge_1(config),
            Rouge_2(config),
            Rouge_L(config)
        ])
    except ImportError:
        print("\nNote: the rouge package is not installed, so the ROUGE metrics are skipped. "
              "Install it with 'pip install rouge'.")
    # Print the evaluation results
    print("\n=== Evaluation results ===")
    for metric in metrics:
        score_dict, score_list = metric.calculate_metric(sample_data)
        print(f"\n{metric.metric_name} results:")
        print(f"Overall score: {score_dict}")
        print("Per-sample scores:")
        for i, (score, pred, golden) in enumerate(zip(score_list, sample_data.pred, sample_data.golden_answers)):
            print(f"\nSample {i + 1}:")
            print(f"Prediction: {pred}")
            print(f"References: {golden}")
            print(f"Score: {score}")


if __name__ == "__main__":
    main()