Evaluating Large Models in Plain Language: Evaluation Methods for Text Embedding and Text Generation Models

Original

Published 2025-10-07 16:35:54

License: CC 4.0 BY-SA

Tags:

#LLM #evaluation #text-embedding #text-generation

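Introduction

As large language models develop rapidly, evaluating model performance in a rigorous and comprehensive way has become an important problem in AI. This article covers evaluation methods for two core kinds of large models, text embedding models and text generation models, and uses working code examples to show how each evaluation is carried out.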

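1. Text Embedding Model Evaluation

1.1 What Is a Text Embedding Model?

A text embedding model converts text into a high-dimensional vector representation that captures the semantic information of the text. A good embedding model places semantically similar texts closer together in the vector space.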
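
To make "closer together in the vector space" concrete, here is a minimal sketch of cosine similarity, the measure used later in this article. The three-dimensional vectors and their names are made up for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings": two questions about refunds, one about weather.
refund_q1 = [0.9, 0.1, 0.2]
refund_q2 = [0.8, 0.2, 0.3]
weather_q = [0.1, 0.9, 0.1]

sim_related = cosine_similarity(refund_q1, refund_q2)
sim_unrelated = cosine_similarity(refund_q1, weather_q)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

If the vectors are already L2-normalized, the two norms are 1 and the cosine reduces to a plain dot product, which is the shortcut the evaluation code in section 1.3 relies on.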

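1.2 Core Evaluation Metrics

Accuracy

Definition: the proportion of correctly predicted samples out of all samples
Formula: Accuracy = correct predictions / total samples
Purpose: measures the model's overall prediction correctness
Use cases: binary classification tasks, such as sentence-similarity judgment
Limitation: can be misleading when classes are imbalanced; if 99% of samples belong to one class, a model that always predicts that class scores 99% accuracy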
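
The imbalance problem can be reproduced in a few lines. This is a sketch on synthetic labels (99 negatives, 1 positive), not data from LCQMC: a "model" that always predicts the majority class gets a high accuracy while finding no positives at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Synthetic, heavily imbalanced labels: 99 negatives and a single positive.
y_true = [0] * 99 + [1]
# A degenerate classifier that always answers "negative".
y_pred_majority = [0] * 100

acc_majority = accuracy(y_true, y_pred_majority)
positives_found = sum(1 for t, p in zip(y_true, y_pred_majority) if t == 1 and p == 1)
print(f"accuracy: {acc_majority:.2f}, positives found: {positives_found}")
```

This is why accuracy is usually reported together with precision, recall, and F1.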

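Precision

Definition: the proportion of predicted positives that are actually positive
Formula: Precision = true positives / (true positives + false positives)
Purpose: measures the "quality" of the model's positive predictions; high precision means few false alarms
Use cases: situations where a wrong positive is costly, such as a recommender system that must avoid recommending irrelevant content
Limitation: says nothing about missed positives; optimizing for precision alone tends to lower recall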

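Recall

Definition: the proportion of actual positives that are correctly predicted
Formula: Recall = true positives / (true positives + false negatives)
Purpose: measures the model's ability to find positives; high recall means few misses
Use cases: information retrieval, such as a search system that must make sure relevant content is found
Limitation: says nothing about false positives; optimizing for recall alone tends to produce more false alarms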

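F1 Score

Definition: the harmonic mean of precision and recall
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Purpose: balances precision and recall in a single summary number
Use cases: situations where precision and recall both matter
Limitation: weights precision and recall equally and ignores true negatives; because the harmonic mean is pulled toward the smaller of the two values, the single number also hides which one is weak, so report precision and recall alongside it when they differ a lot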
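
The three formulas above can be checked by hand on a small example. The sketch below computes precision, recall, and F1 directly from confusion counts; the labels are invented for illustration, and the zero-division handling (returning 0.0) is a choice made here, not something the formulas prescribe.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# The model flags 3 samples as positive: 2 correctly, 1 wrongly.
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7, which sits closer to the lower of the two values.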

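1.3 Evaluation Steps

import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Model name taken from the complete example in section 3.1
model = SentenceTransformer("BAAI/bge-small-zh")

# 1. Load the dataset and keep the first 100 test samples
dataset = load_dataset('C-MTEB/LCQMC', split='test')
test_samples = list(dataset)[0:100]

# 2. Extract the sentence pairs
s1 = [x['sentence1'] for x in test_samples]
s2 = [x['sentence2'] for x in test_samples]

# 3. Encode into L2-normalized vectors
emb1 = model.encode(s1, normalize_embeddings=True)
emb2 = model.encode(s2, normalize_embeddings=True)

# 4. Compute similarity (row-wise dot product; equals cosine similarity after normalization)
similarities = (emb1 * emb2).sum(axis=1)

# 5. Classify with a threshold
threshold = 0.5
y_pred = (similarities > threshold).astype(np.int32)
y_true = np.array([int(x['label']) for x in test_samples])

# 6. Compute the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)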
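
The threshold of 0.5 in step 5 is an empirical default. A better value can be picked by sweeping candidate thresholds on a held-out validation set and keeping the one with the best F1. The sketch below shows that idea with invented similarity scores and labels; the function names are illustrative, and in practice the inputs would be the `similarities` and `y_true` arrays computed on a validation split rather than on the test samples.

```python
def f1_at_threshold(similarities, labels, threshold):
    """F1 of the rule 'predict 1 when similarity > threshold'."""
    preds = [1 if s > threshold else 0 for s in similarities]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(similarities, labels, candidates):
    """Candidate with the highest F1 (the first one wins on ties)."""
    return max(candidates, key=lambda th: f1_at_threshold(similarities, labels, th))

# Invented validation data: cosine similarities and gold labels.
val_similarities = [0.92, 0.88, 0.81, 0.79, 0.74, 0.70, 0.66, 0.61]
val_labels = [1, 1, 1, 0, 1, 0, 0, 0]

# Candidate grid 0.50, 0.55, ..., 0.95.
candidates = [i / 100 for i in range(50, 100, 5)]
chosen = best_threshold(val_similarities, val_labels, candidates)
print(f"chosen threshold: {chosen}, F1: {f1_at_threshold(val_similarities, val_labels, chosen):.3f}")
```

Selecting the threshold on the same samples that are later used for reporting would inflate the scores, which is why the sweep belongs on a separate validation set.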

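2. Text Generation Model Evaluation

2.1 What Is a Text Generation Model?

A text generation model produces coherent, meaningful text from an input prompt, for example summaries, translations, or dialogue. Evaluating a generation model means looking at the quality, fluency, and relevance of the text it produces.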

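2.2 Core Evaluation Metrics

BLEU Score (Bilingual Evaluation Understudy)

Definition: an automatic evaluation metric based on n-gram matching
Calculation: the proportion of n-grams in the generated text that also appear in the reference text (clipped n-gram precision), combined across n-gram orders and multiplied by a brevity penalty for outputs shorter than the reference
Purpose: measures how closely the wording of the generated text matches the reference text
Use cases: machine translation, text summarization, and other tasks that compare output against a reference text
Limitation: ignores meaning and only looks at surface overlap; synonyms and paraphrases receive no credit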
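
The calculation can be sketched in a few lines. The code below is a simplified, single-reference BLEU written for illustration: it uses character-level n-grams (a common choice for Chinese), only n = 1 and 2 by default, and no smoothing. It is not the NLTK implementation used later in this article, and the function names are made up here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * geo_mean

# Character-level tokens. "很好" and "不错" are near-synonyms ("very good" / "not bad").
reference = list("今天天气很好")
candidate = list("今天天气不错")
score = simple_bleu(candidate, reference)
print(f"BLEU (n<=2, character level): {score:.3f}")
```

The candidate means almost the same as the reference, yet it loses credit for every n-gram that touches "不错", which is the synonym limitation in practice. The smoothing function used in section 2.3 exists to avoid the hard zero that this unsmoothed version returns when a higher-order n-gram has no match.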

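ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-1: unigram recall; measures vocabulary coverage
ROUGE-2: bigram recall; takes local word order into account
ROUGE-L: based on the longest common subsequence; measures similarity of sentence structure
Purpose: evaluates how much of the reference content the generated text covers and how similar its structure is
Use cases: text summarization, content generation, and similar tasks
Limitation: focuses mainly on content coverage and says little about fluency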
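
As a rough illustration of the three variants, the sketch below computes the recall side of ROUGE-1, ROUGE-2, and ROUGE-L on character-level tokens. It is a simplified version written for this explanation, with hypothetical function names and an invented sentence pair; libraries such as `rouge_score` also report precision and an F-measure, and the code in section 2.3 uses the F-measure.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also occur in the candidate."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

# Invented example: the candidate drops "宣布" ("announced") from the reference.
reference = list("央行宣布下调利率")
candidate = list("央行下调利率")

r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
rl = rouge_l_recall(candidate, reference)
print(f"ROUGE-1 recall={r1:.3f} ROUGE-2 recall={r2:.3f} ROUGE-L recall={rl:.3f}")
```

The candidate covers 6 of the 8 reference characters and 4 of the 7 reference bigrams, and since it is an in-order subsequence of the reference, its LCS length is also 6.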

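Perplexity

Definition: how "perplexed" a language model is by a piece of text; formally, the exponential of the average per-token negative log-likelihood
Calculation: the exponential of the language model's cross-entropy loss on the text
Purpose: reflects the fluency of the generated text and the model's confidence; lower is better
Use cases: evaluating the language quality of generated text
Limitation: the generated text has to be fed back through a model, which is computationally expensive; when the scoring model is the same one that produced the text, the number reflects the model's confidence in its own output rather than correctness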
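
The relationship between cross-entropy and perplexity is easy to verify on toy numbers. The sketch below assumes we already have the natural-log probability the model assigned to each token of a text; the probabilities are invented, and the function name is illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Invented per-token probabilities for two four-token texts.
confident = [math.log(0.5)] * 4   # model gives each token probability 0.5
uncertain = [math.log(0.1)] * 4   # model gives each token probability 0.1

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(f"confident: {ppl_confident:.2f}, uncertain: {ppl_uncertain:.2f}")
```

A perplexity of 2 can be read as the model being, on average, as unsure as if it were choosing uniformly between 2 tokens at each step; a perplexity of 10 corresponds to 10 equally likely options. In the Hugging Face code of section 2.3, `outputs.loss` plays the role of `avg_nll`.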

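2.3 Evaluation Steps

# Assumes `model` and `tokenizer` are an already-loaded Hugging Face causal LM and its
# tokenizer, and `generation_config` is a dict of keyword arguments for generate().
import torch
from datasets import load_dataset

# 1. Load the dataset and keep the first 20 samples
dataset = load_dataset('suolyer/lcsts', split='train')
test_samples = list(dataset)[0:20]

# 2. Extract the inputs and reference texts
inputs = [x['input'] for x in test_samples]
references = [x['output'] for x in test_samples]

# 3. Generate text
generated_results = []
for input_text, reference in zip(inputs, references):
    # "摘要：" means "Summary:"; kept in Chinese because the dataset is Chinese
    prompt = f"{input_text}\n\n摘要："
    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**encoded, **generation_config)
    # Keep only the newly generated tokens, dropping the echoed prompt
    new_tokens = output_ids[0][encoded["input_ids"].shape[1]:]
    generated = tokenizer.decode(new_tokens, skip_special_tokens=True)
    generated_results.append({
        "input": input_text,
        "reference": reference,
        "generated": generated
    })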

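# 4. Compute the evaluation metrics
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

class CharTokenizer:
    """Character-level tokenizer for ROUGE. The default rouge_score tokenizer keeps
    only ASCII letters and digits, so Chinese text would otherwise score 0.
    (The `tokenizer` argument requires rouge-score >= 0.1.2.)"""
    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]

def evaluate_generated_text(generated_list):
    smoother = SmoothingFunction()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'],
                                      tokenizer=CharTokenizer())

    total_bleu = 0.0
    total_rouge1 = 0.0
    total_rouge2 = 0.0
    total_rougeL = 0.0
    total_perplexity = 0.0

    for res in generated_list:
        generated = res["generated"]
        reference = res["reference"]

        # BLEU: passing raw strings makes NLTK treat each character as a token
        bleu_score = sentence_bleu([reference], generated,
                                   smoothing_function=smoother.method1)
        total_bleu += bleu_score

        # ROUGE (F-measure of each variant)
        rouge_scores = scorer.score(reference, generated)
        total_rouge1 += rouge_scores["rouge1"].fmeasure
        total_rouge2 += rouge_scores["rouge2"].fmeasure
        total_rougeL += rouge_scores["rougeL"].fmeasure

        # Perplexity: exp of the model's cross-entropy loss on the generated text
        encoded = tokenizer(generated, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model(**encoded, labels=encoded["input_ids"])
        perplexity = torch.exp(outputs.loss).item()
        total_perplexity += perplexity

    # Return the average scores
    return {
        "BLEU": total_bleu / len(generated_list),
        "ROUGE-1": total_rouge1 / len(generated_list),
        "ROUGE-2": total_rouge2 / len(generated_list),
        "ROUGE-L": total_rougeL / len(generated_list),
        "Perplexity": total_perplexity / len(generated_list)
    }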

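3. Complete Code Examples

3.1 Complete Code for Text Embedding Model Evaluation (classification task: predict one class)

"""
@description: Evaluate a text embedding model (using LCQMC sentence matching as the example)
@author: frank
@date: 2025-08-09
@version: 1.0

Evaluates the BAAI/bge-small-zh model on LCQMC (Chinese question matching).
Core flow: load data -> encode sentence vectors (L2-normalized) -> compute cosine similarity -> threshold-based binary classification evaluation.
"""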

# -*- coding: utf-8 -*-

from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Load a local model; a Hugging Face model name such as "BAAI/bge-small-zh" also works
model_name = "D:\\xx\\pythonProject\\bge-small-zh"
model = SentenceTransformer(model_name)

# Intrinsic quality evaluation: similarity correlation (sentence matching)
# Task: Chinese sentence matching (LCQMC, Large-scale Chinese Question Matching Corpus)
# Note: `C-MTEB/LCQMC` is the MTEB-curated version; fields usually include sentence1, sentence2, label/score (0/1)
# If you hit network or mirror problems, set the HF_ENDPOINT environment variable (for example to a mirror in China) or cache the dataset locally beforehand.
# For example, in PowerShell: $env:HF_ENDPOINT = "https://hf-mirror.com"
dataset = load_dataset('C-MTEB/LCQMC', split='test')

# Convert to a Python list for easier processing and printing
test_samples = list(dataset)[0:100]

# Basic data checks and printed output
print('test_samples[0:5]:', test_samples[0:5])
print(f'num_samples: {len(test_samples)}')
print(f'first_item_keys: {list(test_samples[0].keys())}')

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 1) Extract the sentence texts
# Field semantics: sentence1 and sentence2 form one paired sample; label 0/1 says whether they are semantically equivalent
s1 = [x['sentence1'] for x in test_samples]
s2 = [x['sentence2'] for x in test_samples]

# 2) Encode sentence vectors
# Note: normalize_embeddings=True applies L2 normalization so every vector has length 1 and only direction is kept;
#       the dot product of two such vectors is then their cosine similarity (range roughly [-1, 1]).
emb1 = model.encode(s1, normalize_embeddings=True, show_progress_bar=True)
emb2 = model.encode(s2, normalize_embeddings=True, show_progress_bar=True)
print(f'emb1.shape: {emb1.shape}, emb2.shape: {emb2.shape}')

# 3) Compute cosine similarity (dot product = cosine after normalization), vectorized for efficiency
similarities = (emb1 * emb2).sum(axis=1)
print(f'similarities.shape: {similarities.shape}')
print(f'similarities stats -> min: {similarities.min():.4f}, max: {similarities.max():.4f}, mean: {similarities.mean():.4f}')

# 4) Prepare the labels
# Supports two field names: prefer label, otherwise fall back to score
if 'label' in test_samples[0]:
    y_true = np.array([int(x['label']) for x in test_samples], dtype=np.int32)
elif 'score' in test_samples[0]:
    y_true = np.array([int(x['score']) for x in test_samples], dtype=np.int32)
else:
    raise KeyError('Label field not found: expected "label" or "score"')

# 5) Predict with a threshold (binary classification)
# 0.5 is an empirical threshold; for a better one, grid-search on a validation set
threshold = 0.5
y_pred = (similarities > threshold).astype(np.int32)

# 6) Evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score