LLMs-from-scratch (Chapter 7: Finetuning To Follow Instructions)

Supplementary code for the book Build a Large Language Model From Scratch by Sebastian Raschka

Code repository: https://github.com/rasbt/LLMs-from-scratch

Chapter 7: Finetuning To Follow Instructions

from importlib.metadata import version

pkgs = [
    "numpy",       # PyTorch 与 TensorFlow 依赖
    "matplotlib",  # 可视化库
    "tiktoken",    # 分词器
    "torch",       # 深度学习库
    "tqdm",        # 进度条
    "tensorflow",  # 用于加载 OpenAI 预训练权重
]
for p in pkgs:
    print(f"{p} version: {version(p)}")
numpy version: 2.0.2
matplotlib version: 3.10.7
tiktoken version: 0.12.0
torch version: 2.5.1+cu124
tqdm version: 4.67.1
tensorflow version: 2.18.0

Note: the tensorflow version matters here; with a different version you may run into an error when loading the pretrained weights (Checksum does not match: stored 1320237141 vs. calculated on the restored bytes 4042776902).

7.1 Introduction to instruction finetuning

  • In chapter 5, we saw that pretraining lets the model learn to generate text by predicting one word at a time.
  • Hence, a pretrained LLM is good at text completion, but it is not good at strictly following instructions.
  • In this chapter, we teach the LLM to follow instructions better.
  • The topics covered in this chapter are summarized in the figure below.

7.2 Preparing a dataset for supervised instruction finetuning

  • We will work with an instruction dataset prepared for this chapter.
import json
import os
import requests


def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text_data = response.text
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


# The book originally used the code below.
# However, urllib's older protocol settings can cause problems in some setups (e.g., when using a VPN).
# The `requests` version above is more robust in that regard.

"""
import urllib

def download_and_load_file(file_path, url):

    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    else:
        with open(file_path, "r", encoding="utf-8") as file:
            text_data = file.read()

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data
"""


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))
Number of entries: 1100
  • Each item in the data list we loaded from the JSON file above is a dictionary in the following form:
print("Example entry:\n", data[50])
Example entry:
 {'instruction': 'Identify the correct spelling of the following word.', 'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}
  • Note that the 'input' field can be empty:
print("Another example entry:\n", data[999])
Another example entry:
 {'instruction': "What is an antonym of 'complicated'?", 'input': '', 'output': "An antonym of 'complicated' is 'simple'."}
  • Instruction finetuning is often called "supervised instruction finetuning" because it involves training the model on a dataset with explicit input-output pairs.
  • There are different ways to format the entries as inputs to the LLM; the figure below illustrates two example formats that were used for training the Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Phi-3 (https://arxiv.org/abs/2404.14219) LLMs, respectively.
  • In this chapter, we use the Alpaca-style prompt formatting, which was the original prompt template for instruction finetuning.
  • Below, we format the entries as input text that we pass to the LLM:
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
  • A formatted example with an input field looks as follows:
model_input = format_input(data[50])
desired_response = f"\n\n### Response:\n{data[50]['output']}"

print(model_input + desired_response)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
  • A formatted example without an input field looks as follows:
model_input = format_input(data[999])
desired_response = f"\n\n### Response:\n{data[999]['output']}"

print(model_input + desired_response)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is an antonym of 'complicated'?

### Response:
An antonym of 'complicated' is 'simple'.
  • Before we prepare the PyTorch data loaders in the next section, we divide the dataset into a training, validation, and test set:
train_portion = int(len(data) * 0.85)  # 85% for training
test_portion = int(len(data) * 0.1)    # 10% for testing
val_portion = len(data) - train_portion - test_portion  # Remaining 5% for validation

train_data = data[:train_portion]
test_data = data[train_portion:train_portion + test_portion]
val_data = data[train_portion + test_portion:]
print("Training set length:", len(train_data))
print("Validation set length:", len(val_data))
print("Test set length:", len(test_data))
Training set length: 935
Validation set length: 55
Test set length: 110

7.3 Organizing data into training batches

  • We tackle the batching of the dataset in several steps, as summarized in the figure below.
  • First, we implement an InstructionDataset class that pre-tokenizes all inputs in the dataset, similar to the SpamDataset in chapter 6:
import torch
from torch.utils.data import Dataset


class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data

        # Pre-tokenize texts
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(
                tokenizer.encode(full_text)
            )

    def __getitem__(self, index):
        return self.encoded_texts[index]

    def __len__(self):
        return len(self.data)
  • Similar to chapter 6, we want to collect multiple training examples in a batch to accelerate training; this requires padding all inputs to the same length.
  • Also as in the previous chapter, we use the <|endoftext|> token as a padding token.
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
[50256]
  • In chapter 6, we padded all examples in the dataset to the same length.
    • Here, we take a more flexible approach and write a custom "collate" function that we pass to the data loader.
    • This collate function pads the examples in each batch to the same length (while different batches can have different lengths).
def custom_collate_draft_1(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    # Find the longest sequence in the batch and increase the max length by +1
    # (one extra padding token is added below)
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs
    inputs_lst = []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to batch_max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        # Via padded[:-1], we remove the extra padding token added above
        # (this extra padding token becomes relevant in later code)
        inputs = torch.tensor(padded[:-1])
        inputs_lst.append(inputs)

    # Convert the list of inputs to a tensor and transfer it to the target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    return inputs_tensor
inputs_1 = [0, 1, 2, 3, 4]
inputs_2 = [5, 6]
inputs_3 = [7, 8, 9]

batch = (
    inputs_1,
    inputs_2,
    inputs_3
)

print(custom_collate_draft_1(batch))
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
  • Above, we only returned the inputs to the LLM; however, for LLM training, we also need the targets.
  • Similar to pretraining an LLM, the targets are the inputs shifted by 1 position to the right, so the LLM learns to predict the next token.
def custom_collate_draft_2(
    batch,
    pad_token_id=50256,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets
        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert lists to tensors and transfer to the target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor
inputs, targets = custom_collate_draft_2(batch)
print(inputs)
print(targets)
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256, 50256, 50256, 50256],
        [    8,     9, 50256, 50256, 50256]])
  • Next, we introduce an ignore_index value to replace all the padding token IDs; this allows us to ignore the padding positions in the loss function (more on that shortly).
  • Concretely, we replace the padding tokens with ID 50256 by -100, as illustrated in the figure below.
  • In addition, we introduce an allowed_max_length parameter to optionally limit the length of the samples; this is useful if you work with datasets whose sequences exceed the 1024-token context size supported by the GPT-2 model.
def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    # Find the longest sequence in the batch
    batch_max_length = max(len(item)+1 for item in batch)

    # Pad and prepare inputs and targets
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        # Add an <|endoftext|> token
        new_item += [pad_token_id]
        # Pad sequences to max_length
        padded = (
            new_item + [pad_token_id] *
            (batch_max_length - len(new_item))
        )
        inputs = torch.tensor(padded[:-1])  # Truncate the last token for inputs
        targets = torch.tensor(padded[1:])  # Shift +1 to the right for targets

        # New: Replace all but the first padding token in targets by ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # New: Optionally truncate to the maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    # Convert lists to tensors and transfer to the target device
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor
inputs, targets = custom_collate_fn(batch)
print(inputs)
print(targets)
tensor([[    0,     1,     2,     3,     4],
        [    5,     6, 50256, 50256, 50256],
        [    7,     8,     9, 50256, 50256]])
tensor([[    1,     2,     3,     4, 50256],
        [    6, 50256,  -100,  -100,  -100],
        [    8,     9, 50256,  -100,  -100]])
  • Let's see what this replacement by -100 accomplishes.
  • For illustration purposes, assume we have a small classification task with two class labels, 0 and 1, similar to chapter 6.
  • If we have the following logits (outputs of the last layer of the model), we calculate the following cross entropy loss:
logits_1 = torch.tensor(
    [[-1.0, 1.0],  # 1st training example
     [-0.5, 1.5]]  # 2nd training example
)
targets_1 = torch.tensor([0, 1])


loss_1 = torch.nn.functional.cross_entropy(logits_1, targets_1)
print(loss_1)
tensor(1.1269)
  • Now, adding one more training example will, as expected, influence the loss:
logits_2 = torch.tensor(
    [[-1.0, 1.0],
     [-0.5, 1.5],
     [-0.5, 1.5]]  # New 3rd training example
)
targets_2 = torch.tensor([0, 1, 1])

loss_2 = torch.nn.functional.cross_entropy(logits_2, targets_2)
print(loss_2)
tensor(0.7936)
  • Let's see what happens if we replace the class label of one of the examples with -100:
targets_3 = torch.tensor([0, 1, -100])

loss_3 = torch.nn.functional.cross_entropy(logits_2, targets_3)
print(loss_3)
print("loss_1 == loss_3:", loss_1 == loss_3)
tensor(1.1269)
loss_1 == loss_3: tensor(True)
  • As we can see, the loss on these 3 training examples is the same as the loss we calculated from the 2 training examples, which means the cross entropy loss function ignored the training example with the -100 label.

  • By default, PyTorch's cross_entropy(..., ignore_index=-100) setting ignores targets labeled -100.

  • Using this -100 ignore_index, we can ignore the additional <|endoftext|> (padding) tokens in the batches that we used to pad the training examples to equal length.

  • However, we don't want to ignore the first instance of the <|endoftext|> (padding) token (50256), because it can signal to the LLM that the response is complete.

  • In practice, it is also common to mask out the target token IDs that correspond to the instruction, as illustrated in the figure (a recommended exercise after completing this chapter); a minimal sketch of this masking follows below.
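The following is only a minimal sketch of such instruction masking, not the book's reference solution; it assumes the tokenized length of the formatted instruction-plus-input text is known for each example, and the helper name mask_instruction_targets is hypothetical:

import torch

def mask_instruction_targets(targets, instruction_length, ignore_index=-100):
    # targets: 1D tensor of target token IDs for one example (already shifted by +1)
    # instruction_length: number of tokens in the formatted instruction-plus-input text
    targets = targets.clone()
    # The first instruction_length-1 target positions correspond to predicting
    # tokens inside the instruction, so they are excluded from the loss;
    # the first response token (target index instruction_length-1) is kept
    targets[:max(instruction_length - 1, 0)] = ignore_index
    return targets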

7.4 Creating data loaders for an instruction dataset

  • In this section, we use the InstructionDataset class and the custom_collate_fn function to set up the training, validation, and test data loaders.
  • Another detail of the custom_collate_fn above is that we now move the data to the target device (e.g., a GPU) directly while collating the batches, instead of doing it in the main training loop; this is more efficient because the data loader can carry it out in the background while assembling batches.
  • Using partial from Python's functools standard library, we pre-fill the device argument of the original function and obtain a new callable.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Note:
# If you are using Apple Silicon (e.g., an M3 MacBook Air), you can uncomment
# the code below to run on Apple's GPU (MPS), which is faster;
# however, the resulting loss values may be slightly different.

#if torch.cuda.is_available():
#    device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#else:
#    device = torch.device("cpu")

print("Device:", device)
Device: cuda
from functools import partial

customized_collate_fn = partial(
    custom_collate_fn,
    device=device,
    allowed_max_length=1024
)
  • Next, we instantiate the data loaders similar to previous chapters, except that we now provide our own collate function for the batching process:
from torch.utils.data import DataLoader


num_workers = 0
batch_size = 8

torch.manual_seed(123)

train_dataset = InstructionDataset(train_data, tokenizer)
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=True,
    drop_last=True,
    num_workers=num_workers
)
val_dataset = InstructionDataset(val_data, tokenizer)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)

test_dataset = InstructionDataset(test_data, tokenizer)
test_loader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    collate_fn=customized_collate_fn,
    shuffle=False,
    drop_last=False,
    num_workers=num_workers
)
  • Let's see what the dimensions of the resulting input and target batches look like:
print("Train loader:")
for inputs, targets in train_loader:
    print(inputs.shape, targets.shape)
Train loader:
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 73]) torch.Size([8, 73])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 79]) torch.Size([8, 79])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 60]) torch.Size([8, 60])
torch.Size([8, 59]) torch.Size([8, 59])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 91]) torch.Size([8, 91])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 89]) torch.Size([8, 89])
torch.Size([8, 59]) torch.Size([8, 59])
torch.Size([8, 88]) torch.Size([8, 88])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 76]) torch.Size([8, 76])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 69]) torch.Size([8, 69])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 60]) torch.Size([8, 60])
torch.Size([8, 60]) torch.Size([8, 60])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 58]) torch.Size([8, 58])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 87]) torch.Size([8, 87])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 71]) torch.Size([8, 71])
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 65]) torch.Size([8, 65])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 60]) torch.Size([8, 60])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 57]) torch.Size([8, 57])
torch.Size([8, 72]) torch.Size([8, 72])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 62]) torch.Size([8, 62])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 70]) torch.Size([8, 70])
torch.Size([8, 91]) torch.Size([8, 91])
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 80]) torch.Size([8, 80])
torch.Size([8, 81]) torch.Size([8, 81])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 82]) torch.Size([8, 82])
torch.Size([8, 63]) torch.Size([8, 63])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 68]) torch.Size([8, 68])
torch.Size([8, 67]) torch.Size([8, 67])
torch.Size([8, 77]) torch.Size([8, 77])
torch.Size([8, 91]) torch.Size([8, 91])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 61]) torch.Size([8, 61])
torch.Size([8, 75]) torch.Size([8, 75])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 78]) torch.Size([8, 78])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 64]) torch.Size([8, 64])
torch.Size([8, 83]) torch.Size([8, 83])
torch.Size([8, 66]) torch.Size([8, 66])
torch.Size([8, 74]) torch.Size([8, 74])
torch.Size([8, 69]) torch.Size([8, 69])
  • As we can see based on the output above, all batches have a batch size of 8 but different lengths, as expected.
  • Let's also double-check that the inputs contain the <|endoftext|> padding tokens corresponding to token ID 50256 by printing the contents of the first training example in the inputs batch:
print(inputs[0])
tensor([21106,   318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,
          257,  2882,   326, 20431, 32543,   262,  2581,    13,   198,   198,
        21017, 46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,
          985,   576,    13,   198,   198, 21017, 23412,    25,   198,   464,
         5156,   318,   845, 13779,    13,   198,   198, 21017, 18261,    25,
          198,   464,  5156,   318,   355, 13779,   355,   257,  4936,    13,
        50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256],
       device='cuda:0')
  • Similarly, we visually double-check that the targets contain the -100 placeholder tokens:
print(targets[0])
tensor([  318,   281, 12064,   326,  8477,   257,  4876,    13, 19430,   257,
         2882,   326, 20431, 32543,   262,  2581,    13,   198,   198, 21017,
        46486,    25,   198, 30003,  6525,   262,  6827,  1262,   257,   985,
          576,    13,   198,   198, 21017, 23412,    25,   198,   464,  5156,
          318,   845, 13779,    13,   198,   198, 21017, 18261,    25,   198,
          464,  5156,   318,   355, 13779,   355,   257,  4936,    13, 50256,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
       device='cuda:0')

7.5 Loading a pretrained LLM

  • In this section, we use the same code as in section 5.5 of chapter 5 and section 6.4 of chapter 6 to load a pretrained GPT model.
  • However, instead of loading the smallest 124-million-parameter model, we load the medium version with 355 million parameters, since the 124-million-parameter model is too small to achieve good quality via instruction finetuning.
# from gpt_download import download_and_load_gpt2
# from previous_chapters import GPTModel, load_weights_into_gpt
# If the `previous_chapters.py` file is not available locally,
# you can import it from the `llms-from-scratch` PyPI package.
# For details, see: https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg
# E.g.,
import sys
sys.path.append('d:/agent-llm2/LLMs-from-scratch')  # Add the project root directory to the path

from pkg.llms_from_scratch.ch04 import GPTModel
from pkg.llms_from_scratch.ch05 import download_and_load_gpt2, load_weights_into_gpt


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-medium (355M)"
# CHOOSE_MODEL = "gpt2-small (124M)"

BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
settings, params = download_and_load_gpt2(
    model_size=model_size,
    models_dir="gpt2"
)

model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval();
File already exists and is up-to-date: gpt2\355M\checkpoint
File already exists and is up-to-date: gpt2\355M\encoder.json
File already exists and is up-to-date: gpt2\355M\hparams.json
File already exists and is up-to-date: gpt2\355M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2\355M\model.ckpt.index
File already exists and is up-to-date: gpt2\355M\model.ckpt.meta
File already exists and is up-to-date: gpt2\355M\vocab.bpe
  • Before we start finetuning in the next section, let's see how the model performs on one of the validation tasks:
torch.manual_seed(123)

input_text = format_input(val_data[0])
print(input_text)
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Convert the active sentence to passive: 'The chef cooks the meal every day.'
# from previous_chapters import (
#     generate,
#     text_to_token_ids,
#     token_ids_to_text
# )
# Alternatively:
from pkg.llms_from_scratch.ch05 import (
   generate,
   text_to_token_ids,
   token_ids_to_text
)


token_ids = generate(
    model=model,
    idx=text_to_token_ids(input_text, tokenizer),
    max_new_tokens=35,
    context_size=BASE_CONFIG["context_length"],
    eos_id=50256,
)
generated_text = token_ids_to_text(token_ids, tokenizer)
  • Note that the generate function we used in previous chapters returns the combined input and output text, which was convenient earlier for generating legible text.
  • To isolate the response, we can subtract the length of the instruction from the start of the generated_text:
response_text = (
    generated_text[len(input_text):]
    .replace("### Response:", "")
    .strip()
)
print(response_text)
The chef cooks the meal every day.

### Instruction:

Convert the active sentence to passive: 'The chef cooks the
  • As we can see, the model is not yet capable of following the instructions; it creates a "Response" section but simply repeats the original input sentence as well as the instruction.

7.6 Finetuning the LLM on instruction data

  • In this section, we finetune the model.
  • Note that we can reuse all the loss calculation and training functions from previous chapters:
# from previous_chapters import (
#     calc_loss_loader,
#     train_model_simple
# )
# Alternatively:
from pkg.llms_from_scratch.ch05 import (
   calc_loss_loader,
   train_model_simple,
)

  • Let's calculate the initial training and validation set loss before we start training (as in previous chapters, the goal is to minimize the loss):
model.to(device)

torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
    val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)

print("Training loss:", train_loss)
print("Validation loss:", val_loss)
Training loss: 3.82590970993042
Validation loss: 3.761934280395508
  • Note that the training is a bit more expensive than in previous chapters, since we are using a larger model (355 million instead of 124 million parameters).
  • The runtimes for various devices are shown for reference below (running this notebook on a compatible GPU requires no code changes).
Model                 Device                  Runtime for 2 Epochs
gpt2-medium (355M)    CPU (M3 MacBook Air)    15.78 minutes
gpt2-medium (355M)    GPU (M3 MacBook Air)    10.77 minutes
gpt2-medium (355M)    GPU (L4)                1.83 minutes
gpt2-medium (355M)    GPU (A100)              0.86 minutes
gpt2-small (124M)     CPU (M3 MacBook Air)    5.74 minutes
gpt2-small (124M)     GPU (M3 MacBook Air)    3.73 minutes
gpt2-small (124M)     GPU (L4)                0.69 minutes
gpt2-small (124M)     GPU (A100)              0.39 minutes
  • 我使用 "gpt2-medium (355M)" 模型运行了本笔记本
import time

start_time = time.time()

torch.manual_seed(123)

optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.1)

num_epochs = 2

train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)

end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")
Ep 1 (Step 000000): Train loss 2.637, Val loss 2.626
Ep 1 (Step 000005): Train loss 1.174, Val loss 1.102
Ep 1 (Step 000010): Train loss 0.872, Val loss 0.944
Ep 1 (Step 000015): Train loss 0.857, Val loss 0.906
Ep 1 (Step 000020): Train loss 0.776, Val loss 0.881
Ep 1 (Step 000025): Train loss 0.754, Val loss 0.859
Ep 1 (Step 000030): Train loss 0.799, Val loss 0.836
Ep 1 (Step 000035): Train loss 0.714, Val loss 0.808
Ep 1 (Step 000040): Train loss 0.672, Val loss 0.806
Ep 1 (Step 000045): Train loss 0.633, Val loss 0.789
Ep 1 (Step 000050): Train loss 0.662, Val loss 0.783
Ep 1 (Step 000055): Train loss 0.760, Val loss 0.763
Ep 1 (Step 000060): Train loss 0.719, Val loss 0.743
Ep 1 (Step 000065): Train loss 0.652, Val loss 0.735
Ep 1 (Step 000070): Train loss 0.532, Val loss 0.729
Ep 1 (Step 000075): Train loss 0.569, Val loss 0.728
Ep 1 (Step 000080): Train loss 0.605, Val loss 0.725
Ep 1 (Step 000085): Train loss 0.509, Val loss 0.709
Ep 1 (Step 000090): Train loss 0.562, Val loss 0.691
Ep 1 (Step 000095): Train loss 0.500, Val loss 0.681
Ep 1 (Step 000100): Train loss 0.502, Val loss 0.676
Ep 1 (Step 000105): Train loss 0.564, Val loss 0.670
Ep 1 (Step 000110): Train loss 0.555, Val loss 0.666
Ep 1 (Step 000115): Train loss 0.507, Val loss 0.664
Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Convert the active sentence to passive: 'The chef cooks the meal every day.'  ### Response: The meal is prepared every day by the chef.<|endoftext|>The following is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Convert the active sentence to passive:
Ep 2 (Step 000120): Train loss 0.435, Val loss 0.672
Ep 2 (Step 000125): Train loss 0.451, Val loss 0.687
Ep 2 (Step 000130): Train loss 0.447, Val loss 0.683
Ep 2 (Step 000135): Train loss 0.405, Val loss 0.682
Ep 2 (Step 000140): Train loss 0.410, Val loss 0.680
Ep 2 (Step 000145): Train loss 0.369, Val loss 0.680
Ep 2 (Step 000150): Train loss 0.382, Val loss 0.675
Ep 2 (Step 000155): Train loss 0.412, Val loss 0.675
Ep 2 (Step 000160): Train loss 0.415, Val loss 0.684
Ep 2 (Step 000165): Train loss 0.379, Val loss 0.687
Ep 2 (Step 000170): Train loss 0.323, Val loss 0.682
Ep 2 (Step 000175): Train loss 0.337, Val loss 0.670
Ep 2 (Step 000180): Train loss 0.392, Val loss 0.657
Ep 2 (Step 000185): Train loss 0.415, Val loss 0.658
Ep 2 (Step 000190): Train loss 0.340, Val loss 0.649
Ep 2 (Step 000195): Train loss 0.329, Val loss 0.635
Ep 2 (Step 000200): Train loss 0.310, Val loss 0.635
Ep 2 (Step 000205): Train loss 0.352, Val loss 0.632
Ep 2 (Step 000210): Train loss 0.366, Val loss 0.631
Ep 2 (Step 000215): Train loss 0.396, Val loss 0.634
Ep 2 (Step 000220): Train loss 0.299, Val loss 0.646
Ep 2 (Step 000225): Train loss 0.345, Val loss 0.659
Ep 2 (Step 000230): Train loss 0.291, Val loss 0.660
Below is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: Convert the active sentence to passive: 'The chef cooks the meal every day.'  ### Response: The meal is cooked every day by the chef.<|endoftext|>The following is an instruction that describes a task. Write a response that appropriately completes the request.  ### Instruction: What is the capital of the United Kingdom
Training completed in 4.55 minutes.
  • As we can see based on the output above, the model trains well, as indicated by the decreasing training and validation loss values.
  • Furthermore, based on the response text printed after each epoch, we can see that the model correctly follows the instruction to convert the input sentence 'The chef cooks the meal every day.' into passive voice 'The meal is cooked every day by the chef.' (we will properly format and evaluate the responses in a later section).
  • Finally, let's take a look at the training and validation loss curves:
# from previous_chapters import plot_losses
# Alternatively:
from pkg.llms_from_scratch.ch05 import plot_losses

epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

[Figure: training and validation loss curves]

  • As we can see, the loss decreases sharply at the beginning of the first epoch, which means the model starts learning quickly.
  • We can see slight overfitting setting in at around 1 training epoch.

7.7 Extracting and saving responses

  • In this section, we save the test set responses for scoring in the next section.
  • We also save a copy of the model for future use.
  • But first, let's take a brief look at the responses generated by the finetuned model:
torch.manual_seed(123)


for entry in test_data[:3]:

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )

    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print("-------------------------------------")
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a bullet.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.
-------------------------------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Name the author of 'Pride and Prejudice'.

Correct response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.
-------------------------------------
  • As we can see based on the test set instructions, given responses, and the model's responses above, the model performs relatively well.

  • The answers to the first and last instructions are clearly correct.

  • The second answer is close; the model answers with "cumulus cloud" instead of "cumulonimbus" (note, though, that cumulus clouds can develop into cumulonimbus clouds, which are capable of producing thunderstorms).

  • Most importantly, model evaluation is not as straightforward as in the previous chapter, where we just had to calculate the percentage of correct spam/non-spam class labels to obtain the classification accuracy.

  • In practice, instruction-finetuned LLMs such as chatbots are evaluated via multiple approaches, for example short-answer and multiple-choice benchmarks, human preference comparisons, and automated benchmarks that use another LLM as a judge.

  • In the next section, we use an approach similar to AlpacaEval and use another LLM to evaluate the responses of our model; however, we will use our own test set instead of a publicly available benchmark dataset.

  • For this, we add the model responses to the test_data dictionary and save it as an "instruction-data-with-response.json" file for record-keeping, so that we can load and analyze it in separate Python sessions if needed.

from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):

    input_text = format_input(entry)

    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)
    response_text = generated_text[len(input_text):].replace("### Response:", "").strip()

    test_data[i]["model_response"] = response_text


with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)  # "indent" for pretty-printing
100%|██████████| 110/110 [00:47<00:00,  2.29it/s]
  • Let's double-check one of the entries to see whether the responses have been added to the test_data dictionary correctly:
print(test_data[0])
{'instruction': 'Rewrite the sentence using a simile.', 'input': 'The car is very fast.', 'output': 'The car is as fast as lightning.', 'model_response': 'The car is as fast as a bullet.'}
  • Finally, we also save the model in case we want to reuse it in the future:
import re


file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

# Load model via
# model.load_state_dict(torch.load("gpt2-medium355M-sft.pth"))
Model saved as gpt2-medium355M-sft.pth
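As a brief illustration, reloading the saved weights in a later session could look like the sketch below; it assumes GPTModel and BASE_CONFIG are defined as in section 7.5:

import torch

# Sketch: re-create the architecture and load the finetuned weights
model = GPTModel(BASE_CONFIG)
model.load_state_dict(torch.load("gpt2-medium355M-sft.pth", map_location="cpu"))
model.eval();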

7.8 Evaluating the finetuned LLM

  • In this section, we automate the response evaluation of the finetuned LLM using another, larger LLM.

  • In particular, we use an instruction-finetuned 8-billion-parameter Llama 3 model by Meta AI that can be run locally via ollama (https://ollama.com).

  • (Alternatively, if you prefer using a more capable LLM like GPT-4 via the OpenAI API, please see the llm-instruction-eval-openai.ipynb notebook.)

  • Ollama is an application for running LLMs efficiently.

  • It is a wrapper around llama.cpp (https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency.

  • Note that Ollama is a tool for generating text with LLMs (inference), not for training or finetuning them.

  • Prior to running the code below, install Ollama by visiting https://ollama.com (for instance, click the "Download" button and download the application for your operating system).

  • For macOS and Windows users: double-click the Ollama application you downloaded; if it prompts you to install the command line usage, say "yes".

  • Linux users can use the installation command provided on the Ollama website.

  • In general, before we can use Ollama from the command line, we have to either start the Ollama application or run ollama serve in a separate terminal.


Note

  • When running ollama serve in the terminal, you may encounter the error: Error: listen tcp 127.0.0.1:11434: bind: address already in use
  • If that's the case, try OLLAMA_HOST=127.0.0.1:11435 ollama serve (and if this address is also in use, keep incrementing the port number until you find one that is free)

  • With the Ollama application or ollama serve running in a different terminal, execute the following command on the command line to try out the 8-billion-parameter Llama 3 model (the model, which takes up about 4.7 GB of storage space, is automatically downloaded the first time you run this command):
# 8B model
ollama run llama3

The output looks as follows:

$ ollama run llama3
pulling manifest
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
  • Note that llama3 refers to the instruction-finetuned 8-billion-parameter Llama 3 model.

  • Using Ollama with the "llama3" model (8B parameters) requires 16 GB of RAM; if your machine doesn't support this, you can try a smaller model, such as the 3.8B-parameter phi-3 model (set model = "phi-3"), which only requires 8 GB of RAM.

  • Alternatively, if your machine supports it, you can use the larger 70-billion-parameter Llama 3 model by replacing llama3 with llama3:70b.

  • After the download has completed, you will see a command line prompt that allows you to chat with the model.

  • Try a prompt like "What do llamas eat?", which should return an output similar to the following:

>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered
stomach and eat plants that are high in fiber. In the wild, llamas
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall
grasses, wheat, oats, and barley.
  • You can end this session using the input /bye.

  • The following code checks whether the Ollama session is running correctly before we proceed to use Ollama to evaluate the test set responses generated in the previous section:

import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        if process_name in proc.info["name"]:
            running = True
            break
    return running

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", check_if_running("ollama"))
Ollama running: True
# This cell is optional; it allows you to restart the notebook
# and only run section 7.7 without rerunning any of the previous code
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"

with open(file_path, "r") as file:
    test_data = json.load(file)


def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
  • Now, an alternative to the ollama run command we used earlier is interacting with the model via its REST API in Python, which the following function demonstrates.
  • Before running the next cells in this notebook, make sure that Ollama is still running (the previous code cells should print "Ollama running: True").
  • Next, run the following code cell to query the model:
import requests  # noqa: F811
# import urllib.request

def query_model(
    prompt,
    model="llama3",
    # If you used OLLAMA_HOST=127.0.0.1:11435 ollama serve
    # update the address from 11434 to 11435
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    
    """
    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data
    """

    # The book originally used the commented-out above, which is based
    # on urllib. It works generally fine, but some readers reported
    # issues with using urlib when using a (company) VPN.
    # The code below uses the requests library, which doesn't seem
    # to have these issues.

    # Send the POST request
    with requests.post(url, json=data, stream=True, timeout=30) as r:
        r.raise_for_status()
        response_data = ""
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue
            response_json = json.loads(line)
            if "message" in response_json:
                response_data += response_json["message"]["content"]

    return response_data


model = "llama3"
result = query_model("What do Llamas eat?", model)
print(result)
Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.
2. Bark: In some cases, llamas may eat the bark of certain trees, like aspen or cottonwood.
3. Mosses and lichens: These non-vascular plants can be a tasty snack for llamas.

In captivity, llama owners typically provide a balanced diet that includes a mix of hay, grains, and fruits/vegetables. It's essential to consult with a veterinarian or experienced llama breeder to determine the best feeding plan for your llama.
  • Now, using the query_model function we defined above, we can evaluate the responses of our finetuned model; let's try it out on the first three test set responses we looked at in a previous section:
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")
Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a bullet.

Score:
>> I'd rate the model response "The car is as fast as a bullet." an 85 out of 100.

Here's why:

* The response uses a simile correctly, comparing the speed of the car to something else (in this case, a bullet).
* The comparison is relevant and makes sense, as bullets are known for their high velocity.
* The phrase "as fast as" is used correctly to introduce the simile.

The only reason I wouldn't give it a perfect score is that some people might find the comparison slightly less vivid or evocative than others. For example, comparing something to lightning (as in the original response) can be more dramatic and attention-grabbing. However, "as fast as a bullet" is still a strong and effective simile that effectively conveys the idea of the car's speed.

Overall, I think the model did a great job!

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.

Score:
>> I'd score this model response as 40 out of 100.

Here's why:

* The model correctly identifies that thunderstorms are related to clouds (correctly identifying the type of phenomenon).
* However, it incorrectly specifies the type of cloud associated with thunderstorms. Cumulus clouds are not typically associated with thunderstorms; cumulonimbus clouds are.
* The response lacks precision and accuracy in its description.

Overall, while the model attempts to address the instruction, it provides an incorrect answer, which is a significant error.

-------------------------

Dataset response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.

Score:
>> I'd rate my own response as 95 out of 100. Here's why:

* The response accurately answers the question by naming the author of 'Pride and Prejudice' as Jane Austen.
* The response is concise and clear, making it easy to understand.
* There are no grammatical errors or ambiguities that could lead to confusion.

The only reason I wouldn't give myself a perfect score is that the response is slightly redundant - it's not necessary to rephrase the question in the answer. A more concise response would be simply "Jane Austen."

-------------------------

Note: A better evaluation prompt

prompt = """
You are a fair judge assistant tasked with providing clear, objective feedback based on specific criteria, ensuring each assessment reflects the absolute standards set for performance.
You will be given an instruction, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing the evaluation criteria.
Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
Please do not generate any other opening, closing, and explanations.

Here is the rubric you should use to build your answer:
1: The response fails to address the instructions, providing irrelevant, incorrect, or excessively verbose information that detracts from the user's request.
2: The response partially addresses the instructions but includes significant inaccuracies, irrelevant details, or excessive elaboration that detracts from the main task.
3: The response follows the instructions with some minor inaccuracies or omissions. It is generally relevant and clear, but may include some unnecessary details or could be more concise.
4: The response adheres to the instructions, offering clear, accurate, and relevant information in a concise manner, with only occasional, minor instances of excessive detail or slight lack of clarity.
5: The response fully adheres to the instructions, providing a clear, accurate, and relevant answer in a concise and efficient manner. It addresses all aspects of the request without unnecessary details or elaboration

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the instruction, the reference answer, and the response.

Instruction: {instruction}
Reference Answer: {reference}
Answer: {answer}


Please provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
  • For more context and information, see this GitHub discussion: link
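As a rough sketch (not part of the book's code), the rubric template above, stored in the prompt variable, could be filled in and sent to the judge model via the query_model function defined earlier; the placeholder names {instruction}, {reference}, and {answer} come from the template itself:

# Sketch: score one test entry with the rubric prompt defined above
entry = test_data[0]
filled_prompt = prompt.format(
    instruction=format_input(entry),
    reference=entry["output"],
    answer=entry["model_response"],
)
print(query_model(filled_prompt, model="llama3"))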

  • As we can see, the Llama 3 model provides reasonable evaluations and also gives partial points if a model is not entirely correct, as the "cumulus cloud" answer shows.
  • Note that the previous prompt returns very verbose evaluations; we can tweak the prompt to generate integer scores in the range 0 to 100 (where 100 is best) so that we can calculate an average score for our model.
  • The evaluation of the 110 entries in the test set takes about 1 minute on an M3 MacBook Air.
def generate_model_scores(json_data, json_key, model="llama3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    return scores


scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")
Scoring entries: 100%|████████████████████████| 110/110 [01:10<00:00,  1.57it/s]

Number of scores: 110 of 110
Average score: 50.32
  • Our model achieves an average score above 50, which we can use as a reference point to compare it to other models or to try out other training settings that may improve it.

  • Note that Ollama is not fully deterministic across operating systems (as of this writing), so the numbers you obtain may differ slightly from those shown above.

  • For reference:

    • The Llama 3 8B base model achieves a score of 58.51.
    • The Llama 3 8B instruct model achieves a score of 82.65.

7.9 Conclusions

7.9.1 What's next

  • This is the final chapter of this book.
  • We covered the major steps of the LLM development cycle: implementing an LLM architecture, pretraining an LLM, and finetuning it.
  • An optional step that sometimes follows the instruction finetuning described in this chapter is preference finetuning.

  • Preference finetuning is useful for customizing a model to better align with specific user preferences; if you are interested, see the …/04_preference-tuning-with-dpo folder.

  • This GitHub repository also contains a large selection of additional bonus material; for more information, see the Bonus Material section on the repository's README page.

7.9.2 Staying up to date in a fast-moving field

  • No code in this section.

7.9.3 Final words

  • I hope you enjoyed this journey of implementing an LLM from the ground up and coding the pretraining and finetuning functions.
  • In my opinion, implementing an LLM from scratch is the best way to understand how LLMs work; I hope you gained a deeper understanding through this approach.
  • While this book serves educational purposes, you may want to use different and more powerful LLMs for real-world applications.

Summary and takeaways

What's next?
