Step-by-Step Guide to Building Your Own Large Language Model (LLM) From Scratch — Study Notes, Chapter 4: The GPT Large Language Model Architecture — Measuring the Computational Complexity of Neural Network Models

FLOPS Analysis

  • FLOPs (floating-point operations) measure the computational complexity of neural network models by counting the number of floating-point operations executed (a small worked example follows below)
  • A higher FLOP count indicates more intensive computation and higher energy consumption
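
As a quick illustration of what is being counted, the sketch below (my own example, not from the book's code) estimates the forward-pass FLOPs of a single linear layer by hand, using the convention that one multiply-accumulate counts as two floating-point operations, and compares the result with the count reported by the thop package installed in the next cells.

    import torch
    from thop import profile

    # Each output element of a linear layer needs in_features multiplies and
    # roughly in_features additions, i.e. about 2 * in_features FLOPs.
    layer = torch.nn.Linear(768, 3072)
    x = torch.randn(1, 1024, 768)  # (batch, tokens, features)

    manual_flops = 2 * 1 * 1024 * 768 * 3072
    macs, _ = profile(layer, inputs=(x,), verbose=False)

    print(f"manual estimate : {manual_flops:.2e} FLOPs")
    print(f"thop (2 * MACs) : {2 * macs:.2e} FLOPs")

Both lines should print roughly 4.8e9, since thop counts MACs and one MAC corresponds to two FLOPs.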

In [ ]:

# pip install -r requirements-extra.txt

In [ ]:

from importlib.metadata import version

pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")
thop version: 0.1.1-2209072238
torch version: 2.4.1+cu121

Simple benchmark with fixed batch size

  • forward pass only

In [ ]:

import torch
from thop import profile

from previous_chapters import GPTModel


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 2
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

for size in model_configs:
    BASE_CONFIG.update(model_configs[size])

    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACS = multiply-accumulate operations
    # MACS are typically counted as two FLOPS (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2*macs
    print(f"{size:18}: {flops:.1e} FLOPS")

    del model
    torch.cuda.empty_cache()
gpt-small (124M)  : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M)  : 3.2e+12 FLOPS
gpt-xl (1558M)    : 6.4e+12 FLOPS
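
These numbers agree with the common rule of thumb that a transformer forward pass costs roughly 2 FLOPs per parameter per token. The cross-check below is my own back-of-the-envelope estimate (using the nominal parameter counts from the configuration names), not part of the original notebook.

    # Rule of thumb: forward-pass FLOPs ≈ 2 * parameter_count * number_of_tokens
    param_counts = {
        "gpt-small (124M)": 124e6,
        "gpt-medium (355M)": 355e6,
        "gpt-large (774M)": 774e6,
        "gpt-xl (1558M)": 1558e6,
    }

    tokens = 2 * 1024  # batch_size * context_length used above

    for size, n_params in param_counts.items():
        print(f"{size:18}: ~{2 * n_params * tokens:.1e} FLOPS (rule-of-thumb)")

For example, 2 × 124e6 × 2048 ≈ 5.1e11, matching the profiled value for gpt-small above.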

Simple benchmark with automatic batch size finding

  • forward pass only

In [ ]:

for size in model_configs:
    print(f"\nProcessing {size}")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096

    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            model = GPTModel(config).bfloat16().to(device)

            # MACS = multiply-accumulate operations
            # MACS are typically counted as two FLOPS (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                # Try smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e

The purpose of this code is to compute the floating-point operation count (FLOPS) for each model configuration while searching for a workable batch size, backing off whenever an out-of-memory error occurs. A detailed walk-through follows; a standalone sketch of the batch-size search pattern appears after the sample output below.

  1. Loop over the model configurations

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])
    
     
    • for size in model_configs: : iterates over each model-size configuration in the model_configs dictionary.
    • print(f"\nProcessing {size}") : prints the model size currently being processed.
    • config = BASE_CONFIG.copy() : copies the base configuration BASE_CONFIG.
    • config.update(model_configs[size]) : updates the base configuration with the settings specific to the current model size.
  2. Initialize the batch-size parameters

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096
    
     
    • min_batch_size : the smallest batch size to try, initialized to 1.
    • max_batch_size : the largest batch size that has run successfully so far, initialized to None.
    • max_possible_batch_size : the largest batch size to attempt, initialized to 4096.
  3. Binary-search for a workable batch size

    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            # Generate the input tensor
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )
    
            # Instantiate the model, cast it to bfloat16, and move it to the device
            model = GPTModel(config).bfloat16().to(device)
    
            # Count MACs and the number of parameters
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")
    
            # Adjust the batch-size bounds
            min_batch_size = batch_size + 1
            max_batch_size = batch_size
    
            # Free memory
            del model, input_tensor
            torch.cuda.empty_cache()
    
        except RuntimeError as e:
            if "out of memory" in str(e):
                # Lower the maximum feasible batch size
                max_possible_batch_size = batch_size - 1
    
                # Free memory
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e
    
     
    • while min_batch_size <= max_possible_batch_size: : loop condition; keep searching while the lower bound does not exceed the largest batch size still considered possible.
    • batch_size = (min_batch_size + max_possible_batch_size) // 2 : picks the next batch size to try via binary search.
    • The try block:
      • input_tensor = torch.randint(0, config["vocab_size"], (batch_size, config["context_length"]), device=device) : creates a random integer tensor as the model input, with shape (batch_size, config["context_length"]) and values in the range 0 to config["vocab_size"], placed on the target device.
      • model = GPTModel(config).bfloat16().to(device) : instantiates GPTModel from the current configuration, casts it to the bfloat16 data type, and moves it to the target device.
      • macs, params = profile(model, inputs=(input_tensor,), verbose=False) : uses the profile function to count the model's multiply-accumulate operations (MACs) and its parameters; verbose=False suppresses the detailed report.
      • flops = 2 * macs : converts MACs to floating-point operations (FLOPS), since one MAC is conventionally counted as two FLOPs.
      • print(f"  Batch size {batch_size}: {flops:.1e} FLOPS") : prints the FLOPS for the current batch size.
      • min_batch_size = batch_size + 1 : if the current batch size ran successfully, raise the lower bound.
      • max_batch_size = batch_size : records the largest batch size that has succeeded so far.
      • del model, input_tensor and torch.cuda.empty_cache() : delete the model and the input tensor and clear the GPU cache to free memory.
    • The except RuntimeError as e block:
      • If the error message contains out of memory, memory was insufficient: lower the maximum possible batch size and attempt to clean up memory.
      • If it is not an out-of-memory error, re-raise the exception.
Processing gpt-small (124M)
  Batch size 256: 6.5e+13 FLOPS
  Batch size 384: 9.7e+13 FLOPS
  Batch size 388: 9.8e+13 FLOPS
  Batch size 389: 9.8e+13 FLOPS

Processing gpt-medium (355M)
  Batch size 256: 1.9e+14 FLOPS
  Batch size 260: 1.9e+14 FLOPS
  Batch size 262: 1.9e+14 FLOPS
  Batch size 263: 1.9e+14 FLOPS

Processing gpt-large (774M)
  Batch size 256: 4.0e+14 FLOPS

Processing gpt-xl (1558M)
  Batch size 128: 4.1e+14 FLOPS
  Batch size 136: 4.3e+14 FLOPS
  Batch size 140: 4.5e+14 FLOPS
  Batch size 142: 4.5e+14 FLOPS
  Batch size 143: 4.6e+14 FLOPS
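
As noted above, the batch-size search can also be factored into a standalone helper. The following is a minimal sketch under the same assumptions as the notebook (GPTModel from previous_chapters and thop's profile are available); it illustrates the binary-search pattern and is not code from the book.

    import torch
    from thop import profile

    from previous_chapters import GPTModel


    def find_max_batch_size(config, device, max_possible_batch_size=4096):
        """Binary-search the largest batch size that fits in GPU memory.

        Returns (best_batch_size, flops_at_that_batch_size); both are None
        if even a batch size of 1 runs out of memory.
        """
        min_batch_size, best_batch_size, best_flops = 1, None, None

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )
                model = GPTModel(config).bfloat16().to(device)

                macs, _ = profile(model, inputs=(input_tensor,), verbose=False)
                best_batch_size, best_flops = batch_size, 2 * macs

                min_batch_size = batch_size + 1  # it fits -> try a larger batch
                del model, input_tensor
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" not in str(e).lower():
                    raise
                max_possible_batch_size = batch_size - 1  # too big -> shrink
                try:
                    del model, input_tensor
                except NameError:
                    pass
                torch.cuda.empty_cache()

        return best_batch_size, best_flops

Calling it for one configuration would look like find_max_batch_size({**BASE_CONFIG, **model_configs["gpt-small (124M)"]}, device).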

Benchmark with automatic batch size finding and Model FLOP Utilization (MFU)

  • Model FLOPs Utilization (MFU) explanation from the PaLM paper

We propose a new metric for efficiency that is implementation-independent and allows a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens processed per second) to the theoretical maximum throughput of a system running at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the operations required to compute the forward and backward passes, and not recomputation.

$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
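
Translated directly into code, these three definitions reduce to a few lines. The helper below is my own illustrative sketch (the function name and arguments are not from the notebook); the benchmark cell further down effectively inlines the same arithmetic, where flops_per_token counts the forward plus backward pass (approximated as three times the forward-pass FLOPs).

    def compute_mfu(batch_size, seq_len, total_time_seconds,
                    flops_per_token, max_flops_per_second):
        """Model FLOPs Utilization, following the formulas above."""
        # Tokens per Second = Batch Size * Sequence Length / Total Time
        tokens_per_second = batch_size * seq_len / total_time_seconds

        # Theoretical Max Tokens per Second = Max FLOPs per Second / Total FLOPs per Token
        theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token

        # MFU = Observed Tokens per Second / Theoretical Max Tokens per Second
        return tokens_per_second / theoretical_max_tokens_per_second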

  • forward and backward pass

In [ ]:

# Theoretical max flops per second provided by the GPU manufacturer

flops_per_second = {
    # https://www.techpowerup.com/gpu-specs/h100-pcie-80-gb.c3899
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 204.9e12
    },
    # https://www.techpowerup.com/gpu-specs/l4.c4091
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30.29e12
    },
    # https://www.techpowerup.com/gpu-specs/tesla-t4.c3316
    "T4": {
        torch.float32: 8.1e12,  # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 65.13e12
    },
    # https://www.techpowerup.com/gpu-specs/a10g.c3798
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 31.52e12
    },
    # https://www.techpowerup.com/gpu-specs/a100-pcie-40-gb.c3623
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 77.97e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
    "RTX_3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 29.77e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
    "RTX_3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 35.58e12
    }
}

In [ ]:

import time

def get_gpu_model(flops_per_second_dict):
    device_name = torch.cuda.get_device_name(0)
    for model in flops_per_second_dict.keys():
        if model in device_name:
            return model
    return "Unknown"  # Default if no matching model is found


gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)

if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward & backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time

                # Calculate FLOPs for forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for backward pass (typically 2x forward FLOPs)
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # Or total_flops = flops_forward * 3

                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Try smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e

else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")
GPU Model: A100

Processing gpt-small (124M)
  Batch size 16: Tokens/sec: 34248.82, MFU: 0.3256
  Batch size 24: Tokens/sec: 62568.34, MFU: 0.5948

Processing gpt-medium (355M)
  Batch size 4: Tokens/sec: 20159.93, MFU: 0.5483
  Batch size 6: Tokens/sec: 21717.66, MFU: 0.5907
  Batch size 7: Tokens/sec: 22536.25, MFU: 0.6130

Processing gpt-large (774M)
  Batch size 8: Tokens/sec: 12465.21, MFU: 0.7406

Processing gpt-xl (1558M)
  Batch size 4: Tokens/sec: 6779.92, MFU: 0.8113
  • A value of 1.0 is best (equal to 100%)
  • Note that the batch sizes are smaller than previously because we also carry out the backward pass here, which is more memory-intensive