Where Does GPU Memory Go During LLM Inference?

Latest recommended article published on 2025-12-14 20:42:32

License: CC 4.0 BY-SA

$\text{Total memory} = (\text{Model size} + \text{KV cache size} + \text{Activation memory}) / \text{Parallelism}$

where

The model size is the number of parameters * the size of the data type.
The KV cache size is the total number of tokens * the size of the KV cache data type * the number of layers * the KV hidden dimension.
The activation memory is determined by the TRT engine and can be a few GBs regardless of the degree of parallelism used.
For LLaMA v2 70B with FP16 weights and an FP8 KV cache, the model size is 70B parameters * 2 bytes = 140GB. The KV cache size is 32K tokens * 1 byte * 80 layers * 2048 KV hidden dimension = 5GB per 32K tokens. That gives 145GB spread across 8 GPUs, or ~18GB per GPU, plus a few GBs of flat scratch/activation memory allocated by the TRT engine and the TRT-LLM runtime.
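The arithmetic in this example can be reproduced with a short script. This is only a rough sketch: the function name `estimate_weights_and_kv_per_gpu_gb` and its parameters are made up for illustration and are not part of TRT-LLM. It follows the worked example by dividing only weights and KV cache by the parallelism degree, leaving the flat scratch/activation memory out, and it assumes 1 GB = 10^9 bytes (so the KV cache comes out slightly above the rounded 5GB figure).

```python
def estimate_weights_and_kv_per_gpu_gb(
    num_params: float,
    weight_bytes: int,
    num_tokens: int,
    kv_bytes: int,
    num_layers: int,
    kv_hidden_dim: int,
    parallelism: int,
) -> float:
    """Per-GPU memory (GB) for weights plus KV cache.

    Hypothetical helper for illustration only. Activation/scratch memory
    is excluded because the text describes it as a flat per-GPU cost.
    """
    model_bytes = num_params * weight_bytes
    kv_cache_bytes = num_tokens * kv_bytes * num_layers * kv_hidden_dim
    # Assumption: 1 GB = 1e9 bytes.
    return (model_bytes + kv_cache_bytes) / parallelism / 1e9


# LLaMA v2 70B, FP16 weights (2 bytes), FP8 KV cache (1 byte), 8 GPUs.
per_gpu_gb = estimate_weights_and_kv_per_gpu_gb(
    num_params=70e9,
    weight_bytes=2,
    num_tokens=32 * 1024,
    kv_bytes=1,
    num_layers=80,
    kv_hidden_dim=2048,
    parallelism=8,
)
print(f"~{per_gpu_gb:.1f} GB per GPU before scratch/activation memory")
```

The result lands at roughly 18GB per GPU, matching the figure above; the scratch/activation memory still has to be added on top for each GPU.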

Note that the KV hidden dimension is derived from the number of KV heads times the hidden dimension of each head. LLaMA v2 70B has a hidden dimension of 8192 and uses grouped-query attention, where 8 key heads and 8 value heads are associated with 64 query heads. Each head has a hidden dimension of 8192/64 = 128, so the total hidden dimension for KV is 128 * 8 * 2 = 2048 (the factor of 2 accounts for K and V).
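The same derivation can be written as a small helper. The function name `kv_hidden_dim` and its parameter names are illustrative assumptions, not identifiers from any library.

```python
def kv_hidden_dim(hidden_size: int, num_query_heads: int, num_kv_heads: int) -> int:
    """KV hidden dimension under grouped-query attention (illustrative helper)."""
    # Per-head dimension is set by the query heads: 8192 / 64 = 128.
    head_dim = hidden_size // num_query_heads
    # Multiply by the number of KV heads, then by 2 for K and V.
    return head_dim * num_kv_heads * 2


# LLaMA v2 70B: hidden size 8192, 64 query heads, 8 KV heads.
llama2_70b_kv_dim = kv_hidden_dim(hidden_size=8192, num_query_heads=64, num_kv_heads=8)
print(llama2_70b_kv_dim)
```

With standard multi-head attention, where the number of KV heads equals the number of query heads, the same helper would return 2 * 8192 = 16384, which shows how much grouped-query attention shrinks the KV cache.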

The total number of tokens is determined by beam width, batch size, and maximum sequence length.
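As a rough illustration, the token count can be bounded by multiplying the three factors. The text only says the total is "determined by" them, so treating it as a plain product is an assumption here (a worst-case upper bound), and the batch size, beam width, and sequence length below are hypothetical values chosen so the result matches the 32K tokens used in the example.

```python
def max_total_tokens(batch_size: int, beam_width: int, max_seq_len: int) -> int:
    """Worst-case token count held in the KV cache.

    Assumption: every sequence in every beam reaches max_seq_len.
    Illustrative helper, not a TRT-LLM API.
    """
    return batch_size * beam_width * max_seq_len


# Hypothetical configuration: 8 requests, no beam search, 4096-token sequences.
tokens = max_total_tokens(batch_size=8, beam_width=1, max_seq_len=4096)
print(tokens)
```

Under this assumption, doubling any one of the three factors doubles the KV cache size.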
