1. Why does RAG need to be evaluated?
When exploring and optimizing RAG (retrieval-augmented generation), evaluating its performance effectively becomes a key question.
2. How do you synthesize a RAG test set?
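Suppose you have built a RAG system and now want to evaluate its performance. To do so, you need an evaluation dataset with the following columns:
- question: the question posed to the RAG system under evaluation
- ground_truths: the true answer(s) to the question
- answer: the answer predicted by the RAG system
- contexts: the list of relevant passages the RAG system used to generate the answer
The first two columns are ground-truth data; the last two are the RAG system's predictions.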
To create such a dataset, we first generate (question, answer) tuples, then run the questions through the RAG system to obtain its predictions.
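The two-stage process above can be sketched as follows. This is a minimal illustration, not the author's implementation: `EvalRecord`, `build_eval_dataset`, and `fake_rag` are hypothetical names, and the RAG system is assumed to be any callable that returns an answer together with the contexts it used.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EvalRecord:
    """One row of the evaluation dataset (column names follow the text)."""
    question: str
    ground_truths: List[str]
    answer: str = ""
    contexts: List[str] = field(default_factory=list)


# Assumed interface: the RAG system maps a question to (answer, contexts).
RagFn = Callable[[str], Tuple[str, List[str]]]


def build_eval_dataset(qa_pairs: List[Tuple[str, str]], rag: RagFn) -> List[EvalRecord]:
    """Run every generated question through the RAG system and collect rows."""
    records = []
    for question, truth in qa_pairs:
        answer, contexts = rag(question)
        records.append(EvalRecord(question, [truth], answer, list(contexts)))
    return records


def fake_rag(question: str) -> Tuple[str, List[str]]:
    """Stand-in for a real RAG system; always returns the same canned output."""
    return "ABC", ["Python was designed to replace the ABC programming language."]


if __name__ == "__main__":
    pairs = [("What language was Python designed to replace?", "ABC programming language")]
    for record in build_eval_dataset(pairs, fake_rag):
        print(record)
```

In a real setup, `fake_rag` would be replaced by a call into the RAG pipeline under test; the first two fields come from the generation step and the last two from that call.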
- Generate questions and ground-truth answers (in practice these may be inaccurate)
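To generate (question, answer) tuples, we first prepare the RAG data: we split it into chunks and embed the chunks into a vector database. Once that is done, we instruct an LLM to generate num_questions questions from a given topic, which yields the question-answer tuples.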
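To make the preparation step concrete, here is a rough sketch of chunking and indexing. It assumes a word-count chunker and a toy bag-of-words "embedding" with cosine similarity in place of a real embedding model and vector database; `split_into_chunks`, `embed`, and `ToyVectorStore` are illustrative names only.

```python
import math
import re
from collections import Counter
from typing import List


def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (assumption, not a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self, chunks: List[str]) -> None:
        self.chunks = list(chunks)
        self.vectors = [embed(c) for c in self.chunks]

    def search(self, query: str, k: int = 3) -> List[str]:
        """Return the k chunks most similar to the query, best first."""
        q = embed(query)
        scored = [(cosine(q, v), i) for i, v in enumerate(self.vectors)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.chunks[i] for _, i in scored[:k]]
```

A production system would swap `embed` for a learned embedding model and `ToyVectorStore` for an actual vector database; only the overall shape (split, embed, search by similarity) is meant to carry over.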
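To generate questions and answers from a given context, follow these steps:
- Pick a random chunk and use it as the root context
- Retrieve the K most similar contexts from the vector database
- Concatenate the text of the root context and its K neighbouring contexts to build a larger context
- Use this larger context and num_questions in the following prompt template to generate questions and answers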
Your task is to formulate exactly {num_questions} questions from given context and provide the answer to each one.
End each question with a '?' character and then in a newline write the answer to that question using only the context provided.
Separate each question/answer pair by "XXX". Each question must start with "question:". Each answer must start with "answer:".
The question must satisfy the rules given below:
1. The question should make sense to humans even when read without the given context.
2. The question should be fully answered from the given context.
3. The question should be framed from a part of context that contains important information. It can also be from tables, code, etc.
4. The answer to the question should not contain any links.
5. The question should be of moderate difficulty.
6. The question must be reasonable and must be understood and responded by humans.
7. Do not use phrases like 'provided context', etc in the question.
8. Avoid framing question using word "and" that can be decomposed into more than one question.
9. The question should not contain more than 10 words, make use of abbreviations wherever possible.
context: {context}
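The context-building steps and the output format requested by the template can be sketched as below. This is an illustration under assumptions: `TEMPLATE` is a shortened stand-in for the full prompt (same `{num_questions}` and `{context}` placeholders), `retrieve` is a hypothetical callable wrapping whatever vector database is in use, and the actual LLM call is left out.

```python
import random
import re
from typing import Callable, List, Tuple

# Shortened stand-in for the full prompt template; same two placeholders.
TEMPLATE = (
    "Your task is to formulate exactly {num_questions} questions from given "
    "context and provide the answer to each one.\n"
    'Separate each question/answer pair by "XXX". Each question must start '
    'with "question:". Each answer must start with "answer:".\n'
    "context: {context}"
)

# Assumed interface: retrieve(query, k) returns the k most similar chunks.
Retriever = Callable[[str, int], List[str]]


def build_context(chunks: List[str], retrieve: Retriever, k: int, rng: random.Random) -> str:
    """Pick a random root chunk and append up to k similar chunks to it."""
    root = rng.choice(chunks)
    # Ask for k + 1 hits because the root itself is usually the top match.
    neighbours = [c for c in retrieve(root, k + 1) if c != root][:k]
    return "\n\n".join([root] + neighbours)


def build_prompts(chunks: List[str], retrieve: Retriever, k: int,
                  num_questions: int, num_count: int, seed: int = 0) -> List[str]:
    """Build num_count prompts, each from a freshly sampled context."""
    rng = random.Random(seed)
    return [
        TEMPLATE.format(num_questions=num_questions,
                        context=build_context(chunks, retrieve, k, rng))
        for _ in range(num_count)
    ]


PAIR_RE = re.compile(r"question:\s*(.*?)\s*answer:\s*(.*)", re.IGNORECASE | re.DOTALL)


def parse_qa_pairs(response: str) -> List[Tuple[str, str]]:
    """Split an LLM response on "XXX" and extract (question, answer) tuples."""
    pairs = []
    for block in response.split("XXX"):
        match = PAIR_RE.search(block)
        if match and match.group(1) and match.group(2).strip():
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs  # malformed blocks are silently skipped
```

Each prompt from `build_prompts` would be sent to the LLM of your choice, and each response passed to `parse_qa_pairs`. Skipping malformed blocks is a design choice made here for brevity; logging them may be preferable when checking how well the model follows the format.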
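- Repeat the steps above num_count times, changing the context each time so that different questions are generated.
Based on this workflow, here is a sample of the questions and answers I generated.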
|    | question | ground_truths |
|---:|:---------|:--------------|
| 8 | What is the difference between lists and tuples in Python? | ['Lists are mutable and cannot be used as dictionary keys, while tuples are immutable and can be used as dictionary keys if all elements are immutable.'] |
| 4 | What is the name of the Python variant optimized for microcontrollers? | ['MicroPython and CircuitPython'] |
| 13 | What is the name of the programming language that Python was designed to replace? | ['ABC programming language'] |
| 17 | How often do bugfix releases occur?