[Datawhale-tinyuniverse] Learning TinyEval

Latest recommended article published on 2026-01-06 17:01:04

License: CC 4.0 BY-SA

Article tags:

#LanguageModels #NaturalLanguageProcessing #ArtificialIntelligence

Learning TinyEval
I. What is Eval?
II. Practice
Summary

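Learning TinyEval

Following Tiny-Eval¹, open-sourced by Datawhale, this post runs the code to learn the basic workflow for evaluating a large language model (LLM) on different tasks. It focuses on one task: using the F1 score to evaluate short answers generated from a question and an input text.

I. What is Eval?

Evaluating LLM performance means designing a separate evaluation for each task, so that models can be compared on how well they handle a specific type of problem.

II. Practice

Everything is run in a local Jupyter notebook. Zhipu is the API provider and glm-3-turbo is the LLM. Obtain an API key and store it in an environment variable.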

1. Install the required third-party libraries

python-dotenv
zhipuai
datasets
rouge
jieba

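2. Dataset

The short-answer dataset provided by Datawhale² is used. (Zhipu's safety filtering is fairly strict, so the model declines to answer many of the questions.)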

# Main fields of each dataset record
{
    "input": str,     # question
    "context": str,   # source text
    "answers": [],    # reference answers
}
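To make the record layout concrete, the sketch below parses one JSONL line and checks that the three fields are present. The sample record and the helper name `parse_record` are illustrative assumptions, not part of Tiny-Eval.

```python
import json

REQUIRED_FIELDS = ("input", "context", "answers")  # the fields described above

def parse_record(line: str) -> dict:
    """Parse one JSONL line and verify that the expected fields are present."""
    record = json.loads(line)
    missing = [k for k in REQUIRED_FIELDS if k not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    if not isinstance(record["answers"], list):
        raise TypeError("'answers' should be a list of reference answers")
    return record

# Hypothetical one-line sample in the same shape as the dataset
sample_line = json.dumps({
    "input": "Where was the meeting held?",
    "context": "The meeting was held at Xiamen University.",
    "answers": ["Xiamen University."],
})
record = parse_record(sample_line)
print(record["answers"][0])
```

Checking the fields up front makes a malformed line fail at load time rather than in the middle of an evaluation run.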

2. Load the API_KEY

Load the API key for the LLM from the environment.

# Load the environment variable ZHIPUAI_API_KEY
import os
from dotenv import load_dotenv

load_dotenv()
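If the key is missing, the failure otherwise only shows up at the first API call. A minimal sketch for failing early is shown below; the helper name `require_env` and the demo variable `DEMO_API_KEY` are assumptions made for illustration, and in the notebook the argument would be `"ZHIPUAI_API_KEY"`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value

# Demo with a placeholder variable so that a real key is never printed
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(len(require_env("DEMO_API_KEY")))
```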

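3. Build the LLM

import os
from zhipuai import ZhipuAI
from typing import Dict, List, Optional, Tuple, Union

class BaseModel:
    """Minimal interface shared by all chat models."""

    def __init__(self, path: str = '') -> None:
        self.path = path

    def chat(self, question: str, max_gen: int) -> str:
        # Subclasses return the model's answer to the question
        raise NotImplementedError

    def load_model(self):
        pass

class ZhipuChat(BaseModel):
    """Chat model backed by the Zhipu API."""

    def __init__(self, path: str = '', model: str = "glm-3-turbo") -> None:
        super().__init__(path)
        # The API key is read from the ZHIPUAI_API_KEY environment variable
        self.client = ZhipuAI(api_key=os.getenv("ZHIPUAI_API_KEY"))
        self.model = model

    def chat(self, question: str, max_gen: int) -> str:
        # Single-turn conversation: the prompt is the only message
        history = []
        history.append({'role': 'user', 'content': question})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=history,
            max_tokens=max_gen,   # upper bound on the number of generated tokens
            temperature=0.1       # low temperature for more stable answers
        )
        return response.choices[0].message.content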
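While debugging the rest of the pipeline it can help to avoid real API calls. The class below is a hypothetical stand-in (it is not part of Tiny-Eval or the zhipuai SDK) that exposes the same `chat(question, max_gen)` method as `ZhipuChat` and returns canned answers.

```python
from typing import Dict

class FakeChat:
    """Offline stand-in with the same chat(question, max_gen) interface as ZhipuChat."""

    def __init__(self, canned: Dict[str, str], default: str = "") -> None:
        self.canned = canned      # maps a substring of the prompt to a fixed reply
        self.default = default    # reply used when no substring matches
        self.calls = 0            # number of chat() calls, handy for assertions

    def chat(self, question: str, max_gen: int) -> str:
        self.calls += 1
        for key, reply in self.canned.items():
            if key in question:
                # Rough stand-in for max_tokens: truncates by characters, not tokens
                return reply[:max_gen]
        return self.default

fake = FakeChat({"哪所大学": "厦门大学"})
print(fake.chat("问题：年会在哪所大学举办的？\n回答：", 64))
```

Because the scoring code only needs an object with a `chat` method, such a stub can be swapped in wherever the real model object is used.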

4. Prompt template

The prompt for the short-answer task is defined as follows.

dataset2prompt = {
    "multifieldqa_zh": "阅读以下文字并用中文简短回答：\n\n{context}\n\n现在请基于上面的文章回答下面的问题，只告诉我答案，不要输出任何其他字词。\n\n问题：{input}\n回答：",
    }
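The template is filled in with `str.format`, which only consumes the placeholders that appear in the string, so a record can be unpacked with `**` even when it carries extra keys. A small sketch with a made-up record and a shortened, hypothetical template:

```python
# Shortened stand-in for the real template; only {context} and {input} are used
demo_template = "Read the text:\n\n{context}\n\nQuestion: {input}\nAnswer:"

record = {
    "input": "Which university hosted the meeting?",
    "context": "The meeting was hosted by Xiamen University.",
    "answers": ["Xiamen University."],
    "length": 42,   # extra keys are simply ignored by str.format
}

prompt = demo_template.format(**record)
print(prompt)
```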

5.F1 score

import string
import jieba
from rouge import Rouge
from collections import Counter
jieba.setLogLevel(jieba.logging.INFO)


def normalize_zh_aswer(s):
    """Lowercase the text, remove punctuation, and remove whitespace."""

    def white_space_fix(text):
        return "".join(text.split())
    
    def remove_punc(text):
        cn_punctuation = "！？｡。＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
        all_punctuation = set(string.punctuation + cn_punctuation)
        return ''.join(ch for ch in text if ch not in all_punctuation)
    
    def lower(text):
        return text.lower()
    
    return white_space_fix(remove_punc(lower(s)))

def qa_f1_zh_score(prediction, ground_truth, **kwargs):
    # Segment both strings into words with jieba
    prediction_tokens = list(jieba.cut(prediction, cut_all=False))
    #print("pred jieba tokens:")
    #print(prediction_tokens)
    ground_truth_tokens = list(jieba.cut(ground_truth, cut_all=False))
    #print("truth jieba tokens:")
    #print(ground_truth_tokens)
    # Normalize every token; punctuation tokens become empty strings
    prediction_tokens_norm = [normalize_zh_aswer(t) for t in prediction_tokens]
    #print("pred normalized:")
    #print(prediction_tokens_norm)
    ground_truth_tokens_norm = [normalize_zh_aswer(t) for t in ground_truth_tokens]
    #print("truth normalized:")
    #print(ground_truth_tokens_norm)
    # Drop the empty tokens left over from normalization
    prediction_tokens = [t for t in prediction_tokens_norm if len(t) > 0]
    #print("pred after filtering:")
    #print(prediction_tokens)
    ground_truth_tokens = [t for t in ground_truth_tokens_norm if len(t) > 0]
    #print("truth after filtering:")
    #print(ground_truth_tokens)
    return f1_score(prediction_tokens, ground_truth_tokens)

def f1_score(prediction, ground_truth, **kwargs):
    # Counter stores each token with its count; the & operator keeps only the
    # tokens present in both Counters, each with the smaller of its two counts
    common = Counter(prediction) & Counter(ground_truth)
    num_same = sum(common.values())                       # number of tokens shared by prediction and ground truth
    #print("Tokens shared by pred and truth: " + str(num_same))
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction)          # shared tokens / tokens in the prediction
    #print("Precision: " + str(precision))
    recall = 1.0 * num_same / len(ground_truth)           # shared tokens / tokens in the ground truth
    #print("Recall: " + str(recall))
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
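As a worked check of the formula, take a prediction of 8 tokens and a reference of 2 tokens that share 2 tokens: precision is 2/8 = 0.25, recall is 2/2 = 1.0, and F1 = 2 × 0.25 × 1.0 / (0.25 + 1.0) = 0.4. The sketch below restates the same computation in a self-contained form (the name `token_f1` exists only for this example) and also shows how `Counter` intersection treats repeated tokens.

```python
from collections import Counter
from typing import List

def token_f1(prediction: List[str], ground_truth: List[str]) -> float:
    """Token-level F1 based on multiset overlap, mirroring f1_score above."""
    common = Counter(prediction) & Counter(ground_truth)  # smaller count per shared token
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction)
    recall = num_same / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

pred = ["奇力", "锅炉", "公司", "支付", "了", "10", "万元", "预付款"]
truth = ["10", "万元"]
print(round(token_f1(pred, truth), 4))   # 0.4

# A repeated token only counts as often as it appears on both sides:
# one shared "a", so precision = 1/3 and recall = 1/2
print(round(token_f1(["a", "a", "b"], ["a", "c"]), 4))   # 0.4
```

The first case illustrates a limitation of the metric: the prediction contains the whole reference answer, yet the extra words pull precision down to 0.25.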

6. Practice

First, try three individual questions.

from datasets import load_dataset
# Get the prompt template
prompt_format = dataset2prompt.get("multifieldqa_zh")
dataset = 'multifieldqa_zhtest'
# Load the data
data = load_dataset('json', data_files=f'Eval/dataset/{dataset}.jsonl', split='train')

Data: {'input': '全国美国文学研究会的第十八届年会在哪所大学举办的？', 'context': '全国美国文…散文学中的‘家叙事’”。', 'answers': ['厦门大学。'], 'length': 9593, 'dataset': 'multifieldqa_zh', 'language': 'zh', 'all_classes': None, '_id': '5b1b8e937b83c3ff9b75ac386fae9c4575c4b9f26a4fbdad'}
Prompt: '阅读以下文字并用中文简短回答：\n\n全国美国文学研究会\n受秘书处委托，由我向美文会会员单位的各位代表…“美国印度裔离散文学中的‘家叙事’”。\n\n现在请基于上面的文章回答下面的问题，只告诉我答案，不要输出任何其他字词。\n\n问题：全国美国文学研究会的第十八届年会在哪所大学举办的？\n回答：'

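# Initialize the model
modelChat = ZhipuChat()
maxGen = 64  # maximum number of tokens to return
# Generate and score an answer for each question
for json_obj in data:
    prompt = prompt_format.format(**json_obj)  # fill the template with this record
    pred = modelChat.chat(prompt, maxGen)
    score = qa_f1_zh_score(pred, json_obj['answers'][0])
    print('f1 score: ' + str(score))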

Output:

pred jieba tokens:
['厦门大学']
truth jieba tokens:
['厦门大学', '。']
pred normalized:
['厦门大学']
truth normalized:
['厦门大学', '']
pred after filtering:
['厦门大学']
truth after filtering:
['厦门大学']
Tokens shared by pred and truth: 1
Precision: 1.0
Recall: 1.0
f1 score: 1.0

pred jieba tokens:
['奇力', '锅炉', '公司', '支付', '了', '10', '万元', '预付款', '。']
truth jieba tokens:
['10', '万元', '。']
pred normalized:
['奇力', '锅炉', '公司', '支付', '了', '10', '万元', '预付款', '']
truth normalized:
['10', '万元', '']
pred after filtering:
['奇力', '锅炉', '公司', '支付', '了', '10', '万元', '预付款']
truth after filtering:
['10', '万元']
Tokens shared by pred and truth: 2
Precision: 0.25
Recall: 1.0
f1 score: 0.4

pred jieba tokens:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议', '。']
truth jieba tokens:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议', '。']
pred normalized:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议', '']
truth normalized:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议', '']
pred after filtering:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议']
truth after filtering:
['郗鉴', '拒绝', '了', '外戚', '庾亮', '废', '王导', '的', '建议']
Tokens shared by pred and truth: 9
Precision: 1.0
Recall: 1.0
f1 score: 1.0

Because of the safety filtering, only 66 questions were kept so that every question in the evaluation could be answered.

total_score = 0
for json_obj in data:
    prompt = prompt_format.format(**json_obj)
    pred = modelChat.chat(prompt, maxGen)
    score = qa_f1_zh_score(pred,json_obj['answers'][0])
    score = max(0, score)
    total_score += score
print("Zhipu glm-3-turbo score:")
print(round(100 * total_score / len(data), 2))

Zhipu glm-3-turbo score: 61.1
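Since some prompts are rejected by the provider, an alternative to trimming the dataset by hand is to catch the failure per question and average only over the questions that were actually answered. This is a minimal sketch under stated assumptions: `model` only needs a `chat(prompt, max_gen)` method, `score_fn` is any function shaped like `qa_f1_zh_score`, and treating every exception as a skipped question is a simplification (the specific exception type raised by the SDK is not assumed here). `evaluate`, `StubModel`, and the demo records are hypothetical names and data.

```python
from typing import Callable, Iterable, Tuple

def evaluate(model, records: Iterable[dict], template: str,
             score_fn: Callable[[str, str], float],
             max_gen: int = 64) -> Tuple[float, int, int]:
    """Return (average score in percent, answered count, skipped count)."""
    total, answered, skipped = 0.0, 0, 0
    for record in records:
        prompt = template.format(**record)
        try:
            pred = model.chat(prompt, max_gen)
        except Exception:          # e.g. request rejected or network error
            skipped += 1
            continue
        total += score_fn(pred, record["answers"][0])
        answered += 1
    average = round(100 * total / answered, 2) if answered else 0.0
    return average, answered, skipped

class StubModel:
    """Hypothetical model that rejects any prompt containing 'blocked'."""
    def chat(self, prompt: str, max_gen: int) -> str:
        if "blocked" in prompt:
            raise RuntimeError("request rejected")
        return "yes"

demo_records = [
    {"input": "q1", "context": "c1", "answers": ["yes"]},
    {"input": "blocked", "context": "c2", "answers": ["no"]},
    {"input": "q3", "context": "c3", "answers": ["no"]},
]

def exact_match(pred: str, truth: str) -> float:
    """Simplified scorer used only for this demo."""
    return float(pred == truth)

print(evaluate(StubModel(), demo_records, "{context}\n{input}", exact_match))
```

Reporting the skipped count next to the score makes it clear how many questions the average is based on.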

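Summary

This post evaluated the Zhipu model on a short-answer dataset. Judging from the individual examples, the model performs quite well, but the metric has clear limitations and does not fully reflect the model's real performance: a correct answer that is longer than the reference, for instance, is penalized on precision.
When using an LLM to solve a problem, first evaluate the candidate models on a dataset, then choose a suitable model based on the results.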
