Preface
This article is not meant for readers who plan to work seriously in this area. It is only a brief overview of large language model (LLM) evaluation and a quick showcase of current results; over the next few days I plan to write follow-up posts covering specific evaluation metrics in more detail. The article is mainly based on the arXiv paper A Survey on Evaluation of Large Language Models, which I recommend reading in full if you intend to dig deeper into the field. If you merely work with LLMs, want a quick sense of the current state of evaluation work, and find the English paper heavy going, then this article should be a reasonable fit.
Evaluating Large Language Models
Why do we need to evaluate LLMs, and why does this attract so much research?
- A sound evaluation scheme helps us better understand the strengths and weaknesses of LLMs. For example, after a series of tests, PromptBench found that LLMs are highly sensitive to adversarial prompts.
- Sound evaluation helps guide the design of future interaction schemes.
- The safety and reliability of LLMs are critically important in their current usage contexts.
- As LLMs keep developing and new capabilities emerge, existing evaluation strategies may gradually become outdated.
LLM evaluation can be summarized with three simple words: What, Where, and How.
- What should we evaluate?
- Where should we find the resources to evaluate it?
- How should we evaluate it?
An Overview of Evaluation Methods for AI Models
The standard methods for evaluating traditional AI models can be summarized as the following:
- k-fold cross-validation
- holdout validation
- leave one out cross-validation
- bootstrap
- reduced set
However, as training scales have grown larger and larger, these traditional evaluation schemes can no longer effectively evaluate deep learning models, and LLMs in particular. Evaluation on a static validation/test set, such as GLUE, has therefore become the standard practice for deep learning evaluation; the sketch below contrasts the two styles.
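To make the contrast concrete, here is a minimal sketch (my own illustration, not from the survey) comparing k-fold cross-validation with a single fixed held-out split of the kind static benchmarks such as GLUE rely on. It assumes scikit-learn and a synthetic toy dataset.

```python
# Minimal sketch: classical k-fold cross-validation vs. a single fixed
# held-out split (GLUE-style static test set). Toy data, illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Classical evaluation: 5-fold cross-validation, every sample is used
# for both training and testing across the folds.
cv_scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", cv_scores.mean())

# Benchmark-style evaluation: one fixed test split that is never trained on,
# analogous to reporting a score on a static set such as GLUE.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print("Held-out accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```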
What Should We Evaluate?
This part shows where LLM evaluation should focus. The survey cited at the beginning presents this material very systematically; here I only list the categories and attach a few existing results.
Basic Natural Language Processing
Most existing work concentrates on natural language processing. Broadly, it can be summarized along four dimensions: understanding, reasoning, generation, and performance on tasks such as multilingual processing and factuality.
- Natural language understanding
  - Sentiment analysis
  - Text classification
  - Natural language inference (NLI)
  - Semantic understanding
  - Social knowledge understanding
- Reasoning
  Difference from NLI: NLI asks whether a given "hypothesis" logically follows from a given "premise", whereas reasoning is usually broken down into the following four modules:
  - Mathematical reasoning
  - Commonsense reasoning
  - Logical reasoning
  - Domain-specific reasoning
- Natural language generation
  - Summarization
  - Dialogue
  - Translation
  - Question answering
  - Sentence style transfer and rewriting
  - Writing tasks
  - Text generation
- Multilingual tasks
- Factuality
  Appendix: current evaluation results on factuality
  Basic test results: GPT-4 and BingChat currently fall only about 15% short of complete factual accuracy.
  Current methods for evaluating factual consistency lack a unified comparison framework, and the scores and binary labels they produce are of limited reference value.
  Some interesting existing work on factuality evaluation includes (the atomic-fact approach is sketched in code after this list):
  - Converting relevance scores into binary labels without relying on external knowledge: https://arxiv.org/abs/2204.04991 (NAACL 2022)
  - An information-theoretic evaluation method: https://arxiv.org/abs/2306.06264
  - Decomposing generations into atomic facts and evaluating their correctness: https://arxiv.org/abs/2305.14251 (EMNLP 2023)
  - The TruthfulQA dataset: https://arxiv.org/abs/2109.07958 (ACL 2022)
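As a rough illustration of the atomic-fact approach above, here is a minimal sketch of that style of scoring. `decompose_into_facts` and `is_supported` are hypothetical stand-ins for an LLM-based fact decomposer and a retrieval-backed verifier, and the toy data is made up.

```python
# Minimal sketch of atomic-fact factuality scoring: split a generation into
# atomic facts and report the fraction supported by a knowledge source.
# `decompose_into_facts` and `is_supported` are hypothetical placeholders.
from typing import Callable, List

def factual_precision(
    generation: str,
    decompose_into_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    facts = decompose_into_facts(generation)
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)

# Toy usage with stub components (invented demo data).
demo_facts = {"Paris is the capital of France.": True,
              "Paris has a population of 80 million.": False}
score = factual_precision(
    "Paris is the capital of France. Paris has a population of 80 million.",
    decompose_into_facts=lambda text: [s.strip() + "." for s in text.split(".") if s.strip()],
    is_supported=lambda fact: demo_facts.get(fact, False),
)
print(score)  # 0.5: one of two atomic facts is supported
```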
Robustness, Ethics, Bias, and Trustworthiness
- Robustness
  - Robustness to out-of-distribution (OOD) inputs
  - Robustness to adversarial inputs
- Bias
  - Social bias
  - Cultural bias
  - Moral bias
  - Political bias
  - Cultural-value bias
- Ethics
- Trustworthiness
Natural Science and Engineering
- Mathematics
- General science
- Engineering
  - Code generation: LLMs already show strong performance on greedy, dynamic-programming, and search problems, but still have room to improve on data structures such as graphs and trees.
  - Software engineering
  - Commonsense planning
Other Tasks
- Medical applications
  - Medical queries
  - Medical examinations
  - Medical assistants
- Social science
- Agent applications
- Education
- Search and recommendation systems
- Personality testing
- Other specific domains
Where: Datasets and Benchmarks
I have only skimmed this part myself; readers who are interested should consult the original survey, or jump straight to whichever of the benchmarks below match their current needs. (A small sketch of the Elo rating scheme used by Chatbot Arena follows the table.)
Benchmark | Focus | Domain | Evaluation Criteria |
---|---|---|---|
SOCKET | Social knowledge | Specific downstream task | Social language understanding |
MME | Multimodal LLMs | Multi-modal task | Ability of perception and cognition |
Xiezhi | Comprehensive domain knowledge | General language task | Overall performance across multiple benchmarks |
Choice-75 | Script learning | Specific downstream task | Overall performance of LLMs |
CUAD | Legal contract review | Specific downstream task | Legal contract understanding |
TRUSTGPT | Ethics | Specific downstream task | Toxicity, bias, and value-alignment |
MMLU | Text models | General language task | Multitask accuracy |
MATH | Mathematical problem | Specific downstream task | Mathematical ability |
APPS | Coding challenge competence | Specific downstream task | Code generation ability |
CELLO | Complex instructions | Specific downstream task | Four designated evaluation criteria |
C-Eval | Chinese evaluation | General language task | 52 Exams in a Chinese context |
EmotionBench | Empathy ability | Specific downstream task | Emotional changes |
OpenLLM | Chatbots | General language task | Leaderboard rankings |
DynaBench | Dynamic evaluation | General language task | NLI, QA, sentiment, and hate speech |
Chatbot Arena | Chat assistants | General language task | Crowdsourcing and Elo rating system |
AlpacaEval | Automated evaluation | General language task | Metrics, robustness, and diversity |
CMMLU | Chinese multi-tasking | Specific downstream task | Multi-task language understanding capabilities |
HELM | Holistic evaluation | General language task | Multi-metric |
API-Bank | Tool utilization | Specific downstream task | API call retrieval and planning |
M3KE | Multi-task | Specific downstream task | Multi-task accuracy |
MMBench | Large vision-language models | Multi-modal task | Multifaceted capabilities of VLMs |
SEED-Bench | Multimodal Large Language Models | Multi-modal task | Generative understanding of MLLMs |
UHGEval | Hallucination of Chinese LLMs | Specific downstream task | Form, metric, and granularity |
ARB | Advanced reasoning ability | Specific downstream task | Multidomain advanced reasoning ability |
BIG-bench | Capabilities and limitations of LMs | General language task | Model performance and calibration |
MultiMedQA | Medical QA | Specific downstream task | Accuracy and human evaluation |
CValues | Safety and responsibility | Specific downstream task | Alignment ability of LLMs |
LVLM-eHub | LVLMs | Multi-modal task | Multimodal capabilities of LVLMs |
ToolBench | Software tools | Specific downstream task | Execution success rate |
FRESHQA | Dynamic QA | Specific downstream task | Correctness and hallucination |
CMB | Chinese comprehensive medicine | Specific downstream task | Expert evaluation and automatic evaluation |
PandaLM | Instruction tuning | General language task | Winrate judged by PandaLM |
Dialogue CoT | In-depth dialogue | Specific downstream task | Helpfulness and acceptability of LLMs |
BOSS | OOD robustness in NLP | General language task | OOD robustness |
MM-Vet | Complicated multi-modal tasks | Multi-modal task | Integrated vision-language capabilities |
LAMM | Multi-modal point clouds | Multi-modal task | Task-specific metrics |
GLUE-X | OOD robustness for NLP tasks | General language task | OOD robustness |
KoLA | Knowledge-oriented evaluation | General language task | Self-contrast metrics |
AGIEval | Human-centered foundational models | General language task | General |
PromptBench | Adversarial prompt resilience | General language task | Adversarial robustness |
MT-Bench | Multi-turn conversation | General language task | Winrate judged by GPT-4 |
M3Exam | Multilingual, multimodal and multilevel | Specific downstream task | Task-specific metrics |
GAOKAO-Bench | Chinese Gaokao examination | Specific downstream task | Accuracy and scoring rate |
SafetyBench | Safety | Specific downstream task | Safety abilities of LLMs |
LLMEval2 | LLM Evaluator | General language task | Accuracy, macro-F1 and kappa correlation coefficient |
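As an aside on the table above: Chatbot Arena ranks chat assistants from crowdsourced pairwise battles using an Elo rating system. Below is a minimal sketch of a standard Elo update; the K-factor of 32 and the initial rating of 1000 are illustrative assumptions, not Chatbot Arena's actual configuration.

```python
# Minimal sketch of Elo updates over pairwise model "battles", the kind of
# rating Chatbot Arena derives from crowdsourced comparisons.
# K=32 and the 1000 starting rating are illustrative assumptions.
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)
battles = [("model-x", "model-y", 1.0), ("model-y", "model-z", 0.5),
           ("model-x", "model-z", 1.0)]
for a, b, outcome in battles:
    update_elo(ratings, a, b, outcome)
print(dict(ratings))  # model-x rises, model-z falls
```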
How: How Should We Evaluate?
Evaluation can be roughly divided into automatic evaluation and human evaluation; their differences and the work each involves are summarized in the table below (a small sketch of the exact-match and F1 metrics follows the table).
Criterion | Automatic evaluation | Human evaluation |
---|---|---|
Accuracy | Exact match, Quasi-exact match, F1 score, ROUGE score | Mainly checks factual consistency and accuracy |
Calibrations | Expected calibration error, Area under the curve | None |
Fairness | Demographic parity difference, Equalized odds difference | None |
Robustness | Attack success rate, Performance drop rate | None |
Relevance | None | As the name suggests: how relevant the response is |
Fluency | None | As the name suggests: how fluent the generated text is |
Transparency | None | How transparent the decision process is, i.e. why the model produced this particular response |
Safety | None | Checks the potential harmfulness of the generated text |
Human alignment | None | Checks the degree of alignment with human values, preferences, and expectations |
Number of evaluators | None | Adequate representation, Statistical significance |
Evaluator’s expertise level | None | Relevant domain expertise, Task familiarity, Methodological training |
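The automatic accuracy metrics in the table (exact match and token-level F1) are straightforward to compute; below is a minimal SQuAD-style sketch, where the text normalization is my own simplification rather than any benchmark's official implementation.

```python
# Minimal sketch of the exact-match and token-level F1 metrics listed in the
# table above; the normalization is a simplification for illustration.
import re
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation
    return " ".join(text.split())             # collapse whitespace

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 0.0
print(token_f1("The Eiffel Tower", "eiffel tower"))     # 0.8
```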