FLOPS Analysis
- FLOPs (floating-point operations) measure the computational complexity of a neural network model by counting the floating-point operations it executes; this is distinct from FLOPS (floating-point operations per second), which describes hardware throughput
- High FLOPs indicate more intensive computation and higher energy consumption
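To make "counting floating-point operations" concrete, here is a minimal sketch that tallies the multiplications and additions in a naive matrix product. The function name `count_matmul_flops` is hypothetical and exists only for illustration; it follows the multiply-accumulate (MAC) convention that the benchmark cells in this notebook rely on, where one MAC is counted as two FLOPs.

```python
def count_matmul_flops(m, k, n):
    """Count operations in a naive (m x k) @ (k x n) matrix product."""
    multiplies = 0
    additions = 0
    for _ in range(m):          # each output row
        for _ in range(n):      # each output column
            for _ in range(k):  # one multiply-accumulate per inner step
                multiplies += 1
                additions += 1
    macs = multiplies           # one MAC = one multiply plus one add
    return macs, multiplies + additions


macs, flops = count_matmul_flops(m=4, k=8, n=3)
print(f"MACs: {macs}, FLOPs: {flops}")
```

The count is `m * k * n` MACs, or `2 * m * k * n` FLOPs, regardless of how fast the hardware executes them.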
In [ ]:
# pip install -r requirements-extra.txt
In [ ]:
from importlib.metadata import version

pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")
thop version: 0.1.1-2209072238
torch version: 2.4.1+cu121
Simple benchmark with fixed batch size
- forward pass only
In [ ]:
import torch
from thop import profile

from previous_chapters import GPTModel


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 2
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

for size in model_configs:
    BASE_CONFIG.update(model_configs[size])

    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACS = multiply-accumulate operations
    # MACS are typically counted as two FLOPS (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2*macs
    print(f"{size:18}: {flops:.1e} FLOPS")

    del model
    torch.cuda.empty_cache()
gpt-small (124M)  : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M)  : 3.2e+12 FLOPS
gpt-xl (1558M)    : 6.4e+12 FLOPS
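The reported numbers can be cross-checked by hand. The sketch below assumes that each transformer block has four attention projections (query, key, value, output) of size `emb_dim x emb_dim` and a feed-forward network with 4x expansion, that the output head of size `emb_dim x vocab_size` is not tied to the embedding, and that only these linear layers contribute to the count. These are assumptions about `GPTModel` and about what `profile` counts, not something this notebook states. `estimate_forward_flops` is a hypothetical helper.

```python
def estimate_forward_flops(emb_dim, n_layers, vocab_size, batch_size, context_length):
    # Assumed per-block weights: 4 attention projections + 2 feed-forward layers
    # with 4x expansion = (4 + 8) * emb_dim^2 multiply-accumulates per token
    macs_per_token = n_layers * 12 * emb_dim**2
    # Assumed untied output head projecting onto the vocabulary
    macs_per_token += emb_dim * vocab_size
    tokens = batch_size * context_length
    return 2 * macs_per_token * tokens  # one MAC counted as two FLOPs


configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48},
}

for name, cfg in configs.items():
    flops = estimate_forward_flops(
        cfg["emb_dim"], cfg["n_layers"],
        vocab_size=50257, batch_size=2, context_length=1024,
    )
    print(f"{name:18}: {flops:.1e} FLOPS")
```

Under these assumptions the estimate rounds to the same values as the profiled output above. The attention-score matrix products are not part of this estimate, so it should be read as a rough sanity check rather than an exact operation count.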
Simple benchmark with automatic batch size finding
- forward pass only
In [ ]:
for size in model_configs:
    print(f"\nProcessing {size}")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096

    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            model = GPTModel(config).bfloat16().to(device)

            # MACS = multiply-accumulate operations
            # MACS are typically counted as two FLOPS (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                # Try smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e
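The code above computes the FLOPs of each model configuration while trying different batch sizes (`batch_size`), and it shrinks the search range whenever an out-of-memory error occurs. Step by step: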
- Loop over the model configurations:
  - `for size in model_configs:` iterates over every model size in the `model_configs` dictionary.
  - `print(f"\nProcessing {size}")` prints the model size currently being processed.
  - `config = BASE_CONFIG.copy()` copies the base configuration `BASE_CONFIG`, so the original dictionary is not modified.
  - `config.update(model_configs[size])` overwrites the copy with the settings specific to the current model size.
- Initialize the batch size bounds:
  - `min_batch_size` is the lower bound of the search range, initially 1.
  - `max_batch_size` records the largest batch size that has run successfully so far, initially `None`.
  - `max_possible_batch_size` is the upper bound of the search range, initially 4096.
- Binary search for the largest batch size that fits:
  - `while min_batch_size <= max_possible_batch_size:` keeps looping while the lower bound does not exceed the upper bound.
  - `batch_size = (min_batch_size + max_possible_batch_size) // 2` picks the midpoint of the current range as the next batch size to try.
  - Inside the `try` block:
    - `input_tensor = torch.randint(0, config["vocab_size"], (batch_size, config["context_length"]), device=device)` creates a random integer tensor of shape `(batch_size, config["context_length"])` directly on the target device, with values from 0 (inclusive) to `config["vocab_size"]` (exclusive).
    - `model = GPTModel(config).bfloat16().to(device)` instantiates `GPTModel` with the current configuration, converts it to the `bfloat16` data type, and moves it to the target device.
    - `macs, params = profile(model, inputs=(input_tensor,), verbose=False)` uses `profile` to count the model's multiply-accumulate operations (MACs) and parameters; `verbose=False` suppresses detailed output.
    - `flops = 2 * macs` converts MACs to FLOPs, since one MAC is typically counted as two FLOPs.
    - `print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")` prints the FLOPs for the current batch size.
    - `min_batch_size = batch_size + 1` raises the lower bound after a successful run, so the next attempt uses a larger batch size.
    - `max_batch_size = batch_size` records the largest batch size that has run successfully.
    - `del model, input_tensor` and `torch.cuda.empty_cache()` delete the model and the input tensor and release cached GPU memory.
  - Inside the `except RuntimeError as e` block:
    - If the error message contains `out of memory`, the upper bound `max_possible_batch_size` is lowered to `batch_size - 1` and the code attempts the same cleanup; the inner `try`/`except NameError` covers the case where `model` or `input_tensor` was never created.
    - Any other `RuntimeError` is re-raised.
Processing gpt-small (124M)
  Batch size 256: 6.5e+13 FLOPS
  Batch size 384: 9.7e+13 FLOPS
  Batch size 388: 9.8e+13 FLOPS
  Batch size 389: 9.8e+13 FLOPS

Processing gpt-medium (355M)
  Batch size 256: 1.9e+14 FLOPS
  Batch size 260: 1.9e+14 FLOPS
  Batch size 262: 1.9e+14 FLOPS
  Batch size 263: 1.9e+14 FLOPS

Processing gpt-large (774M)
  Batch size 256: 4.0e+14 FLOPS

Processing gpt-xl (1558M)
  Batch size 128: 4.1e+14 FLOPS
  Batch size 136: 4.3e+14 FLOPS
  Batch size 140: 4.5e+14 FLOPS
  Batch size 142: 4.5e+14 FLOPS
  Batch size 143: 4.6e+14 FLOPS
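The search logic can be studied without a GPU by replacing the model run with a predicate. The following is a minimal sketch, not the notebook's code: `find_max_batch_size` and `fits` are hypothetical names, and the memory limit of 389 is simply taken from the gpt-small output above. It assumes that if a batch size fits in memory, every smaller batch size fits as well.

```python
def find_max_batch_size(fits, low=1, high=4096):
    """Binary search for the largest batch size for which fits(batch_size) is True."""
    best = None
    successes = []
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid            # remember the largest success so far
            successes.append(mid)
            low = mid + 1         # try a larger batch size next
        else:
            high = mid - 1        # simulated out-of-memory: try a smaller one
    return best, successes


# Hypothetical memory limit: pretend every batch size up to 389 fits
best, successes = find_max_batch_size(lambda b: b <= 389)
print(best, successes)
```

Only successful attempts print a line in the benchmark, which is why the output lists a handful of batch sizes even though the search probes more midpoints (the failed ones raise an out-of-memory error and are skipped silently).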
Benchmark with automatic batch size finding and Model FLOP Utilization (MFU)
- Model FLOPs Utilization (MFU) explanation from the PaLM paper
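We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the "theoretical maximum" throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.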
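$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$
where
$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$
and
$$\text{Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$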
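The three formulas translate directly into code. The sketch below uses hypothetical function names and purely illustrative inputs (they are not measurements from this notebook); it only shows how the quantities combine.

```python
def tokens_per_second(batch_size, sequence_length, total_time_seconds):
    return batch_size * sequence_length / total_time_seconds


def theoretical_max_tokens_per_second(max_flops_per_second, total_flops_per_token):
    return max_flops_per_second / total_flops_per_token


def model_flops_utilization(observed_tps, max_tps):
    return observed_tps / max_tps


# Illustrative numbers only: 8 sequences of 1024 tokens processed in 2 seconds,
# on hardware with a peak of 100 TFLOP/s, for a model needing 10 GFLOPs per token
observed = tokens_per_second(batch_size=8, sequence_length=1024, total_time_seconds=2.0)
max_tps = theoretical_max_tokens_per_second(
    max_flops_per_second=100e12, total_flops_per_token=10e9
)
mfu = model_flops_utilization(observed, max_tps)
print(f"Observed: {observed:.0f} tokens/sec, max: {max_tps:.0f} tokens/sec, MFU: {mfu:.4f}")
```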
- forward and backward pass
In [ ]:
# Theoretical max flops per second provided by the GPU manufacturer

flops_per_second = {
    # https://www.techpowerup.com/gpu-specs/h100-pcie-80-gb.c3899
    "H100": {
        torch.float32: 51.22e12,  # 51.22 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 204.9e12,  # 204.9 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 204.9e12
    },
    # https://www.techpowerup.com/gpu-specs/l4.c4091
    "L4": {
        torch.float32: 30.29e12,  # 30.29 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30.29e12,  # 30.29 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30.29e12
    },
    # https://www.techpowerup.com/gpu-specs/tesla-t4.c3316
    "T4": {
        torch.float32: 8.1e12,  # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 65.13e12,  # 65.13 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 65.13e12
    },
    # https://www.techpowerup.com/gpu-specs/a10g.c3798
    "A10G": {
        torch.float32: 31.52e12,  # 31.52 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 31.52e12,  # 31.52 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 31.52e12
    },
    # https://www.techpowerup.com/gpu-specs/a100-pcie-40-gb.c3623
    "A100": {
        torch.float32: 19.49e12,  # 19.49 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 77.97e12,  # 77.97 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 77.97e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3080.c3621
    "RTX_3080": {
        torch.float32: 29.77e12,  # 29.77 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 29.77e12,  # 29.77 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 29.77e12
    },
    # https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622
    "RTX_3090": {
        torch.float32: 35.58e12,  # 35.58 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 35.58e12,  # 35.58 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 35.58e12
    }
}
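The benchmark cell that follows looks up the GPU by checking whether a dictionary key occurs as a substring of the device name. The sketch below mirrors that lookup in plain Python so its behavior can be checked without a GPU. `match_gpu_key` is a hypothetical stand-in, and the device-name strings are assumed examples; the exact strings reported by a driver may differ.

```python
def match_gpu_key(device_name, known_keys):
    """Return the first key that occurs as a substring of device_name, else "Unknown"."""
    for key in known_keys:
        if key in device_name:
            return key
    return "Unknown"


known_keys = ["H100", "L4", "T4", "A10G", "A100", "RTX_3080", "RTX_3090"]

# Assumed example device names, for illustration only
example_names = [
    "NVIDIA A100-SXM4-40GB",
    "Tesla T4",
    "NVIDIA L40S",
    "NVIDIA GeForce RTX 3090",
]
for name in example_names:
    print(f"{name!r} -> {match_gpu_key(name, known_keys)}")
```

With these example strings, a name containing `L40S` would match the `L4` key, and a name written with a space (`RTX 3090`) would not match the `RTX_3090` key. If your device falls into either case, it may be worth checking the printed GPU model before trusting the MFU numbers.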
In [ ]:
import time

def get_gpu_model(flops_per_second_dict):
    device_name = torch.cuda.get_device_name(0)
    for model in flops_per_second_dict.keys():
        if model in device_name:
            return model
    return "Unknown"  # Default if no matching model is found


gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)

if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward & backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time

                # Calculate FLOPs for forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for backward pass (typically 2x forward FLOPs)
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # Or total_flops = flops_forward * 3

                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Try smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e

else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")
GPU Model: A100

Processing gpt-small (124M)
  Batch size 16: Tokens/sec: 34248.82, MFU: 0.3256
  Batch size 24: Tokens/sec: 62568.34, MFU: 0.5948

Processing gpt-medium (355M)
  Batch size 4: Tokens/sec: 20159.93, MFU: 0.5483
  Batch size 6: Tokens/sec: 21717.66, MFU: 0.5907
  Batch size 7: Tokens/sec: 22536.25, MFU: 0.6130

Processing gpt-large (774M)
  Batch size 8: Tokens/sec: 12465.21, MFU: 0.7406

Processing gpt-xl (1558M)
  Batch size 4: Tokens/sec: 6779.92, MFU: 0.8113
- a value of 1.0 is best (equal to 100%)
- Note that the batch sizes are smaller than previously because we also carry out the backward pass here, which is more memory-intensive
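MFU can equivalently be read as achieved FLOPs per second divided by peak FLOPs per second. As a rough consistency check on the output above, the sketch below recomputes MFU from the reported tokens per second. It assumes that only the linear layers are counted (four attention projections and a 4x feed-forward network per block, plus an untied output head) and reuses the notebook's estimate that the backward pass costs twice the forward pass; `train_flops_per_token` is a hypothetical helper.

```python
PEAK_FLOPS_A100_BF16 = 77.97e12  # bfloat16 entry for "A100" in flops_per_second
VOCAB_SIZE = 50257


def train_flops_per_token(emb_dim, n_layers, vocab_size=VOCAB_SIZE):
    # Assumption: only linear-layer weights are counted, one MAC = two FLOPs
    forward = 2 * (n_layers * 12 * emb_dim**2 + emb_dim * vocab_size)
    return 3 * forward  # forward + backward (backward estimated as 2x forward)


# name: (emb_dim, n_layers, reported tokens/sec at the largest successful batch size)
reported = {
    "gpt-small (124M)": (768, 12, 62568.34),
    "gpt-medium (355M)": (1024, 24, 22536.25),
    "gpt-large (774M)": (1280, 36, 12465.21),
    "gpt-xl (1558M)": (1600, 48, 6779.92),
}

mfus = {}
for name, (emb_dim, n_layers, tps) in reported.items():
    achieved_flops_per_second = tps * train_flops_per_token(emb_dim, n_layers)
    mfus[name] = achieved_flops_per_second / PEAK_FLOPS_A100_BF16
    print(f"{name:18}: {achieved_flops_per_second:.2e} FLOP/s, MFU: {mfus[name]:.4f}")
```

Under these assumptions the recomputed values agree with the reported MFU column to about three decimal places, which also shows how strongly the result depends on the peak FLOPs figure chosen for the GPU.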