NVIDIA GPU profiling with Nsight Systems

Luchang-Li

Last modified: 2025-05-12 20:11:06

First published: 2024-08-13 09:32:58

Article link: https://blog.csdn.net/u013701860/article/details/141154811

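Why profiling matters

Once you have found the cause of a problem and where the bottleneck is, the problem is more than half solved.

How to analyze: start with computation graph analysis and profiling analysis. This article focuses on profiling analysis.

Computation graph analysis: check whether the graph is sufficiently optimized, for example whether format conversion, operator fusion, and model quantization have been handled well.

Operator profiling: analyze operator performance and bottlenecks, check whether convolution and matrix multiplication account for the vast majority of the time, and optimize the operators that turn out to be bottlenecks.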
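As a rough illustration of the "which operators dominate" question, the sketch below ranks kernels by their share of total time. The function name `kernel_time_share` and the sample durations are made up for illustration; real numbers would come from a profile.

```python
from collections import defaultdict


def kernel_time_share(kernels):
    """Return [(name, total_duration, share)] sorted by total duration, descending.

    `kernels` is an iterable of (kernel_name, duration) pairs. The unit of the
    duration does not matter as long as it is consistent.
    """
    totals = defaultdict(float)
    for name, dur in kernels:
        totals[name] += dur
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, dur, dur / grand_total) for name, dur in ranked]


# Hypothetical durations, for illustration only.
sample = [("conv", 60.0), ("gemm", 25.0), ("conv", 5.0), ("softmax", 10.0)]
for name, dur, share in kernel_time_share(sample):
    print(f"{name:10s} {dur:8.1f} {share:6.1%}")
```

If one or two kernels hold most of the share, those are the natural candidates for targeted optimization.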

Nsight Systems

User Guide — nsight-systems 2024.5 documentation

Profiling Deep Learning with Nsight Systems

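Nsight Systems replaces the older nvprof tool and provides more powerful profiling capabilities. You can still use nvprof-style functionality from within Nsight Systems if you like (for example, nsys nvprof python resnet_test.py).

Example command: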

nsys profile --trace=cuda,nvtx,osrt,cudnn,cublas --export sqlite --force-overwrite true -o analysis_test python resnet_test.py

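This generates an .nsys-rep report file (analysis_test.nsys-rep here), which can be opened and visualized in the Nsight Systems GUI installed on Windows.

To export the detailed profiling data as a JSON file: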

nsys export --type json --force-overwrite=true -o profiling.json analysis_test.nsys-rep

Or use this tool, which is more convenient:

https://github.com/chenyu-jiang/nsys2json

python nsys2json.py -f analysis_test.sqlite -o analysis_test_2_json.json
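The analysis_test.sqlite file produced by --export sqlite in the profile command above can also be queried directly. The sketch below is a minimal example under an assumption about the export schema: that kernel records live in a table named CUPTI_ACTIVITY_KIND_KERNEL with start/end timestamps in nanoseconds, and that its shortName column is an id into the StringIds table. Check the schema of your own export before relying on it. The helper name `top_kernels` is hypothetical.

```python
import sqlite3

# Assumed schema (verify against your own export):
#   CUPTI_ACTIVITY_KIND_KERNEL(start, end, shortName, ...)
#   StringIds(id, value)
# "end" is quoted because it is also an SQL keyword.
QUERY = """
SELECT s.value AS name,
       COUNT(*) AS calls,
       SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
ORDER BY total_ns DESC
LIMIT ?
"""


def top_kernels(conn, limit=10):
    """Return (kernel name, call count, total time in ns) rows, slowest first."""
    return conn.execute(QUERY, (limit,)).fetchall()


# Usage, assuming analysis_test.sqlite exists in the current directory:
#   conn = sqlite3.connect("analysis_test.sqlite")
#   for name, calls, total_ns in top_kernels(conn):
#       print(f"{total_ns / 1e6:10.3f} ms  {calls:6d}  {name}")
```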

Marking execution ranges with NVTX, e.g. in PyTorch

https://pytorch.org/docs/stable/cuda.html#nvidia-tools-extension-nvtx

torch.cuda.nvtx.mark
Describe an instantaneous event that occurred at some point.

torch.cuda.nvtx.range_push
Push a range onto a stack of nested range spans.

torch.cuda.nvtx.range_pop
Pop a range off of a stack of nested range spans.

torch.cuda.nvtx.range
Context manager / decorator that pushes an NVTX range at the beginning of its scope, and pops it at the end.

Usage example:

    with torch.cuda.nvtx.range(f"resnet_inference_iter{i}"):
        logits = model(pixel_values).logits

Alternatively, use the nvtx Python package directly, without PyTorch:

https://github.com/NVIDIA/NVTX
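A minimal sketch of marking ranges with the nvtx Python package, using its `nvtx.annotate` context manager. The `run_iterations` helper and the stand-in workload are hypothetical, and the fallback branch is only there so the sketch still runs on a machine where the package is not installed.

```python
import contextlib

try:
    import nvtx  # pip install nvtx

    annotate = nvtx.annotate
except ImportError:
    # Fallback (illustration only): a no-op context manager when nvtx is missing.
    def annotate(message=None, **kwargs):
        return contextlib.nullcontext()


def run_iterations(step, num_iters):
    """Call step(i) num_iters times, wrapping each call in its own NVTX range."""
    results = []
    for i in range(num_iters):
        with annotate(f"inference_iter{i}"):
            results.append(step(i))
    return results


# Stand-in workload; in practice `step` would run the model.
print(run_iterations(lambda i: i * i, 3))
```

The ranges only show up in the report when the script is run under nsys with nvtx included in --trace.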

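With the ranges marked, each marked range can be analyzed clearly in the timeline. Without marks, the profile mixes in model loading, pre-processing, post-processing, and so on, which makes it hard to tell where each step begins.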

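How to collect/extract operator profiling information only within each NVTX range?

After converting to JSON with the https://github.com/chenyu-jiang/nsys2json tool mentioned above, each event carries an NVTXRegions field, which makes it easy to get the kernels that ran inside a marked range. For example: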

    {
        "name": "implicit_convolve_sgemm",
        "ph": "X",
        "cat": "cuda",
        "ts": 3530122.805,
        "dur": 51.968,
        "tid": "Stream 7",
        "pid": "Device 0",
        "args": {
            "NVTXRegions": [
                "resnet_inference_iter1"
            ]
        }
    },
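Given events shaped like the one above, a small sketch can sum kernel time per kernel name for a single NVTX range. The helper names are hypothetical, and `load_events` assumes the file is either a bare list of events or a Chrome-trace style object with a "traceEvents" key; adjust it to what the converter actually writes.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load events from a JSON trace (assumed layout, see the note above)."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def kernels_in_region(events, region):
    """Sum the "dur" of CUDA events tagged with `region`, keyed by kernel name."""
    totals = defaultdict(float)
    for ev in events:
        if ev.get("cat") != "cuda":
            continue
        if region in ev.get("args", {}).get("NVTXRegions", []):
            totals[ev["name"]] += ev["dur"]
    return dict(totals)


# The first event is the one shown above; the others are made up for illustration.
events = [
    {"name": "implicit_convolve_sgemm", "cat": "cuda", "dur": 51.968,
     "args": {"NVTXRegions": ["resnet_inference_iter1"]}},
    {"name": "implicit_convolve_sgemm", "cat": "cuda", "dur": 10.0,
     "args": {"NVTXRegions": ["resnet_inference_iter1"]}},
    {"name": "relu_kernel", "cat": "cuda", "dur": 3.0,
     "args": {"NVTXRegions": ["resnet_inference_iter2"]}},
]
print(kernels_in_region(events, "resnet_inference_iter1"))
```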

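Mapping kernels to ONNX model operators

Following the approach in https://github.com/NVIDIA/TensorRT/blob/release/10.3/tools/experimental/trt-engine-explorer/utils/process_engine.py, build the engine as shown below and save the build log (${model_path}.engine.build.log). The mapping between kernels and ONNX operators can then be found in that log.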

model_path=model.onnx
trtexec --verbose --onnx=${model_path} --saveEngine=${model_path}.engine --exportLayerInfo=${model_path}.engine.graph.json --timingCacheFile=./timing.cache --profilingVerbosity=detailed --fp16 &>${model_path}.engine.build.log

For example:

Name: _gemm_mha_v2_myl16_6, LayerType: kgen, Inputs: [ { Name: __mye20293, Dimensions: [16,4096,40], Format/Datatype: Half }, { Name: __mye20293, Dimensions: [16,40,4096], Format/Datatype: Half }, { Name: __mye20293, Dimensions: [16,4096,40], Format/Datatype: Half }], Outputs: [ { Name: __mye19743, Dimensions: [16,4096,40], Format/Datatype: Half }], TacticName: _gemm_mha_v2_0x85dfe4c293b52ef5f677bd87da6f0e92, StreamId: 0, Metadata: [ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/MatMul_1][ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/Softmax][ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/MatMul][ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/Mul][ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/Mul_1]
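Log lines of this shape can be turned into a layer-to-ONNX-node mapping with a few regular expressions. This is a sketch that assumes every layer line looks like the example above (Name, LayerType, TacticName, and [ONNX Layer: ...] entries); the function name `parse_layer_line` is hypothetical, and the sample line is a shortened copy of the one shown.

```python
import re

LAYER_RE = re.compile(r"Name: (?P<name>[^,]+), LayerType: (?P<type>[^,]+),")
TACTIC_RE = re.compile(r"TacticName: ([^,]+),")
ONNX_RE = re.compile(r"\[ONNX Layer: ([^\]]+)\]")


def parse_layer_line(line):
    """Return a dict describing one layer line, or None if it is not a layer line."""
    m = LAYER_RE.search(line)
    if m is None:
        return None
    tactic = TACTIC_RE.search(line)
    return {
        "name": m.group("name"),
        "layer_type": m.group("type"),
        "tactic": tactic.group(1) if tactic else None,
        "onnx_layers": ONNX_RE.findall(line),
    }


# Shortened copy of the log line above (inputs/outputs and some ONNX layers removed).
sample_line = (
    "Name: _gemm_mha_v2_myl16_6, LayerType: kgen, "
    "TacticName: _gemm_mha_v2_0x85dfe4c293b52ef5f677bd87da6f0e92, StreamId: 0, "
    "Metadata: [ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/MatMul_1]"
    "[ONNX Layer: /down_blocks.0/attentions.0/transformer_blocks.0/attn1/Softmax]"
)
info = parse_layer_line(sample_line)
print(info["name"], "->", info["onnx_layers"])
```

Running `parse_layer_line` over every line of the build log and keeping the non-None results gives a table that can be joined with kernel names from the profile.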

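Profiling programs that never terminate

In some scenarios the profiling target does not have a finite lifetime, for example a training job or a vllm/sglang inference service: unless killed, it runs indefinitely. The command above alone is not enough in that case.

See: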

Benchmark and Profiling — SGLang

Launch the program as in the example above, but combine it with nsys start and nsys stop to begin and end profiling.

Launch the sglang server under nsys:

nsys profile --trace=cuda,nvtx,osrt,cudnn,cublas --force-overwrite true -o sglang_perf \
--start-later=true python3 -m sglang.launch_server --model-path Qwen/Qwen2-1.5B-Instruct

Note the extra --start-later option: collection does not begin until a later nsys start. Without it, profiling starts as soon as the program launches.

List the nsys sessions:

nsys sessions list
#              ID         TIME                       STATE LAUNCH NAME
#       102602720        00:50           DelayedCollection      1 profile-2602703

Then profile only the part you need:

nsys start --session=profile-2602703

# do something

nsys stop --session=profile-2602703

Starting and stopping can be repeated multiple times.
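The manual steps above (list sessions, start, do something, stop) can also be scripted. The sketch below parses output in the format shown for nsys sessions list and builds the matching nsys start / nsys stop commands. The helper names are hypothetical, and the parser assumes the five-column layout shown above.

```python
import subprocess
import time


def parse_sessions(text):
    """Parse `nsys sessions list` output (ID, TIME, STATE, LAUNCH, NAME columns)."""
    sessions = []
    for line in text.splitlines():
        fields = line.lstrip("# ").split()
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # skip the header and blank lines
        sess_id, elapsed, state, launch, name = fields
        sessions.append({"id": sess_id, "time": elapsed, "state": state,
                         "launch": launch, "name": name})
    return sessions


def nsys_cmd(action, session):
    """Build the argument list for `nsys start` or `nsys stop` on a session."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["nsys", action, f"--session={session}"]


def profile_window(session, seconds):
    """Collect for roughly `seconds` seconds (requires nsys on PATH)."""
    subprocess.run(nsys_cmd("start", session), check=True)
    time.sleep(seconds)  # send requests to the server during this window
    subprocess.run(nsys_cmd("stop", session), check=True)


listing = """
#              ID         TIME                       STATE LAUNCH NAME
#       102602720        00:50           DelayedCollection      1 profile-2602703
"""
session = parse_sessions(listing)[0]["name"]
print(nsys_cmd("start", session))
```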

Checking GPU utilization with nvidia-smi

https://zhuanlan.zhihu.com/p/667658845

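A Python 3 program for getting NVIDIA GPU information via nvidia-smi (CSDN blog)

Python example that polls nvidia-smi at a fixed interval (nvidia-smi -l 1 offers similar polling, but the -l interval cannot go below one second):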

import subprocess
import time
from datetime import datetime


def getcmdoutput(cmd):
    """Run a shell command and return its output as a list of lines."""
    output = subprocess.getoutput(cmd)
    return output.split('\n')


gap = 0.3  # sampling interval in seconds
period = 2 * 60  # total sampling time in seconds
loop_num = int(period / gap)

cmd = "nvidia-smi"

for i in range(loop_num):
    output = getcmdoutput(cmd)
    # Print a timestamp before each nvidia-smi sample.
    cur_date = datetime.now()
    print("current date", cur_date)
    for out in output:
        print(out)
    # The effective interval is gap plus the time nvidia-smi takes to run.
    time.sleep(gap)
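If only a few numbers are needed rather than the full nvidia-smi table, the CSV query mode is easier to parse. The sketch below uses nvidia-smi's --query-gpu and --format=csv,noheader,nounits options; the chosen fields and the helper names are just one possible selection, and the sample text in the last lines is made up for illustration.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines like '0, 35, 1234' into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append({"index": int(index),
                     "util_percent": int(util),
                     "mem_used_mib": int(mem)})
    return rows


def sample_gpus():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Hypothetical output for a two-GPU machine.
print(parse_gpu_csv("0, 35, 1234\n1, 0, 10\n"))
```

`sample_gpus()` can replace the `getcmdoutput(cmd)` call in the polling loop above when the values are going to be logged or plotted.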