Supplementary code for the book Build a Large Language Model (From Scratch) by Sebastian Raschka. Code repository: https://github.com/rasbt/LLMs-from-scratch
FLOPS Analysis
- FLOPs (floating point operations) measure the computational complexity of a neural network model by counting the number of floating point operations it executes (a small counting example follows below)
- Higher FLOPs indicate more intensive computation and higher energy consumption
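As a minimal illustration of what is being counted, the sketch below tallies the multiply-accumulate operations (MACs) of a single linear layer by hand and converts them to FLOPs with the usual 1 MAC ≈ 2 FLOPs convention. The layer sizes are arbitrary example values, not taken from the models analyzed in this notebook.

# Hypothetical example: one linear layer with 768 inputs and 3072 outputs,
# applied to a batch of 4 vectors (sizes chosen only for illustration)
batch_size, d_in, d_out = 4, 768, 3072

# Each output element needs d_in multiply-accumulate operations (MACs);
# one MAC is conventionally counted as 2 FLOPs (1 multiply + 1 add)
macs = batch_size * d_in * d_out
flops = 2 * macs
print(f"{macs=:,}  {flops=:,}")  # macs=9,437,184  flops=18,874,368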
# pip install -r requirements-extra.txt
from importlib.metadata import version
pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")
thop version: 0.1.1-2209072238
torch version: 2.4.1+cu121
Simple benchmark with fixed batch size
- forward pass only
import torch
from thop import profile
# For installation instructions, see:
# https://github.com/rasbt/LLMs-from-scratch/tree/main/pkg
from llms_from_scratch.ch04 import GPTModel
BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}
model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
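As a quick sanity check on the parameter counts in the config names, the following back-of-the-envelope sketch (an estimate added here for illustration, not code from the book) approximates the parameter count from the config alone. It ignores biases and LayerNorm parameters and assumes the output head shares its weights with the token embedding, so the head is not counted separately.

def approx_param_count(cfg):
    # Token + positional embeddings
    emb_params = cfg["vocab_size"] * cfg["emb_dim"] + cfg["context_length"] * cfg["emb_dim"]
    # Per transformer block: ~4*d^2 for the attention projections (Q, K, V, output)
    # plus ~8*d^2 for the feed-forward network with a 4x expansion
    per_block = 12 * cfg["emb_dim"] ** 2
    return emb_params + cfg["n_layers"] * per_block

for size in model_configs:
    cfg = BASE_CONFIG.copy()
    cfg.update(model_configs[size])
    print(f"{size:18}: ~{approx_param_count(cfg) / 1e6:.0f}M parameters")

Under these assumptions the estimates come out close to the 124M, 355M, 774M, and 1558M figures in the config names.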
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 2
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)
for size in model_configs:
    BASE_CONFIG.update(model_configs[size])
    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACs = multiply-accumulate operations
    # MACs are typically counted as two FLOPs (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2*macs
    print(f"{size:18}: {flops:.1e} FLOPS")

    del model
    torch.cuda.empty_cache()
gpt-small (124M) : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M) : 3.2e+12 FLOPS
gpt-xl (1558M) : 6.4e+12 FLOPS
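These numbers line up with a rough analytical estimate: assuming the profiling mainly counts the nn.Linear layers, each token requires about 12·L·d² MACs in the transformer blocks plus d·V MACs in the output head (L = layers, d = embedding dimension, V = vocabulary size). The sketch below recomputes the forward-pass FLOPs for the batch of 2 × 1024 tokens under these assumptions; it is a rough cross-check added here, not part of the original benchmark, and it ignores the attention-score matmuls.

def approx_forward_flops(cfg, batch_size, seq_len):
    # MACs per token: 12*L*d^2 for the transformer blocks (attention + feed-forward
    # projections) plus d*V for the output head; attention-score matmuls are ignored
    d, L, V = cfg["emb_dim"], cfg["n_layers"], cfg["vocab_size"]
    macs_per_token = 12 * L * d**2 + d * V
    return 2 * macs_per_token * batch_size * seq_len  # 1 MAC ~ 2 FLOPs

for size in model_configs:
    cfg = BASE_CONFIG.copy()
    cfg.update(model_configs[size])
    print(f"{size:18}: {approx_forward_flops(cfg, batch_size=2, seq_len=1024):.1e} FLOPS (estimated)")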
Simple benchmark with automatic batch size finding
- forward pass only
for size in model_configs:
    print(f"\nProcessing {size}")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096

    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            model = GPTModel(config).bfloat16().to(device)

            # MACs = multiply-accumulate operations
            # MACs are typically counted as two FLOPs (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f" Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                # Try a smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e
Processing gpt-small (124M)
Batch size 256: 6.5e+13 FLOPS
Batch size 384: 9.7e+13 FLOPS
Batch size 388: 9.8e+13 FLOPS
Batch size 389: 9.8e+13 FLOPS
Processing gpt-medium (355M)
Batch size 256: 1.9e+14 FLOPS
Batch size 260: 1.9e+14 FLOPS
Batch size 262: 1.9e+14 FLOPS
Batch size 263: 1.9e+14 FLOPS
Processing gpt-large (774M)
Batch size 256: 4.0e+14 FLOPS
Processing gpt-xl (1558M)
Batch size 128: 4.1e+14 FLOPS
Batch size 136: 4.3e+14 FLOPS
Batch size 140: 4.5e+14 FLOPS
Batch size 142: 4.5e+14 FLOPS
Batch size 143: 4.6e+14 FLOPS
Benchmark with automatic batch size finding and Model FLOPs Utilization (MFU)
- Model FLOPs Utilization (MFU) explanation from the PaLM paper:
We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens per second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the operations required to compute the forward and backward passes, and not rematerialization.
$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
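To make the definition concrete, the sketch below plugs purely illustrative numbers into these formulas; none of the values are measurements, and the A100 peak value is simply taken from the table further down.

# Illustrative values only; the A100 bfloat16 peak is from the table below
max_flops_per_second = 77.97e12   # assumed peak FLOPs per second
total_flops_per_token = 3.0e9     # assumed forward + backward FLOPs per token
batch_size, seq_len = 8, 1024     # assumed batch shape
total_time = 0.5                  # assumed seconds per forward + backward pass

tokens_per_second = batch_size * seq_len / total_time
theoretical_max_tokens_per_second = max_flops_per_second / total_flops_per_token
mfu = tokens_per_second / theoretical_max_tokens_per_second
print(f"MFU: {mfu:.4f}")  # ~0.63 for these made-up numbers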
- forward and backward pass
# Theoretical max FLOPs per second provided by the GPU manufacturer
flops_per_second = {
    # https://www.techpowerup.com/gpu-specs/h100-pcie-80-gb.c3899
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 204.9e12
    },
    # https://www.techpowerup.com/gpu-specs/l4.c4091
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30.29e12
    },
    # https://www.techpowerup.com/gpu-specs/tesla-t4.c3316
    "T4": {
        torch.float32: 8.1e12,    # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 65.13e12
    },
    # https://www.techpowerup.com/gpu-specs/a10g.c3798
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 31.52e12
    },
    # https://www.techpowerup.com/gpu-specs/a100-pcie-40-gb.c3623
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 77.97e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
    "RTX_3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 29.77e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
    "RTX_3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 35.58e12
    }
}
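As a quick usage check, the peak value for a given GPU and dtype is just a nested dictionary lookup; the example below (illustrative, with a `.get` fallback for unlisted entries, similar to the one the benchmark below uses for dtypes) reads the assumed bfloat16 peak for an A100.

# Example lookup: assumed peak throughput for bfloat16 on an A100; returns 0.0 for
# GPUs or dtypes that are not listed in the table above
gpu, dtype = "A100", torch.bfloat16
peak = flops_per_second.get(gpu, {}).get(dtype, 0.0)
print(f"Assumed peak for {gpu} / {dtype}: {peak:.2e} FLOPS/s")  # 7.80e+13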
import time
def get_gpu_model(flops_per_second_dict):
    device_name = torch.cuda.get_device_name(0)
    for model in flops_per_second_dict.keys():
        if model in device_name:
            return model
    return "Unknown"  # Default if no matching GPU model is found
gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)
if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward & backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time

                # Calculate FLOPs for the forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for the backward pass (typically 2x the forward pass)
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # or total_flops = flops_forward * 3

                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f" Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Try a smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e

else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")
GPU Model: A100
Processing gpt-small (124M)
Batch size 16: Tokens/sec: 34248.82, MFU: 0.3256
Batch size 24: Tokens/sec: 62568.34, MFU: 0.5948
Processing gpt-medium (355M)
Batch size 4: Tokens/sec: 20159.93, MFU: 0.5483
Batch size 6: Tokens/sec: 21717.66, MFU: 0.5907
Batch size 7: Tokens/sec: 22536.25, MFU: 0.6130
Processing gpt-large (774M)
Batch size 8: Tokens/sec: 12465.21, MFU: 0.7406
Processing gpt-xl (1558M)
Batch size 4: Tokens/sec: 6779.92, MFU: 0.8113
- a value of 1.0 is best (equal to 100%)
- note that the batch sizes are smaller here than before because we also carry out the backward pass, which is more memory-intensive (a rough cross-check of these MFU values follows below)
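As the rough cross-check mentioned above, one can use the common heuristic that a transformer's forward + backward pass costs roughly 6 FLOPs per parameter per token (about 2N forward and 4N backward). The sketch below applies it to the gpt-small run with batch size 24 and the A100 bfloat16 peak from the table above; this is an order-of-magnitude estimate under that assumption, not a measurement.

# Rough MFU cross-check for gpt-small using the ~6 * N FLOPs-per-token heuristic
n_params = 124e6                       # approximate gpt-small parameter count
flops_per_token = 6 * n_params         # forward + backward, heuristic estimate
peak = flops_per_second["A100"][torch.bfloat16]

theoretical_max_tokens_per_second = peak / flops_per_token
observed_tokens_per_second = 62568.34  # batch-size-24 result reported above
print(f"Estimated MFU: {observed_tokens_per_second / theoretical_max_tokens_per_second:.2f}")
# ~0.60, in line with the 0.5948 measured above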


