GPU (7): CUDA Event Timing and nvprof

Original article, last modified 2024-03-31 22:22:55


Licensed under CC 4.0 BY-SA

Tags: #GPU #CUDA

First published 2024-03-31 22:00:23


CUDA runtime API reference: https://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf

1. CUDA Event Timing

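Whether a program runs on the CPU or on a CUDA GPU, performance is always a major concern, and elapsed time is the most direct metric for evaluating it. To time a function, we usually take a timestamp when it starts, another when it ends, and compute the difference. That works fine for an ordinary program, but a CUDA program is heterogeneous: a kernel launch returns to the host before the device has finished, so if the timing code runs on the CPU, the host must synchronize with the device before taking the second timestamp. This section discusses CUDA event timing, in which the timestamps are recorded on the device, as a way to measure kernel execution time.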
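For contrast with the event-based approach, here is a minimal sketch of host-side timing with explicit synchronization. The kernel `scale_kernel` and the helper `host_timed_launch_ms` are hypothetical names introduced only for this illustration; they do not appear in the original program.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel, used only to give the timer something to measure.
__global__ void scale_kernel(float *data, int n) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        data[idx] *= 2.0f;
    }
}

// Times one launch of scale_kernel with a host clock.
// Returns the elapsed wall-clock time in milliseconds, or -1.0 on error.
double host_timed_launch_ms(float *d_data, int n) {
    // Drain any earlier device work so it is not counted.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;

    const auto t0 = std::chrono::steady_clock::now();
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // The launch above is asynchronous. Without this call the second
    // timestamp would be taken while the kernel may still be running.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1.0;
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling `host_timed_launch_ms(d_data, 1024)` on a buffer obtained from cudaMalloc returns a time that includes launch overhead as seen from the host, which is one reason event timing is often preferred for measuring the kernel itself.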

1.1 CUDA functions for event timing

The CUDA functions used in this section are defined as follows:

cudaEventCreate

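Definition: __host__ cudaError_t cudaEventCreate(cudaEvent_t *event)

Purpose: creates an event object.

cudaEventRecord

Definition: __host__ __device__ cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream = 0)

Purpose: records an event in the given stream (the default stream when stream is omitted).

cudaEventQuery

Definition: __host__ cudaError_t cudaEventQuery(cudaEvent_t event)

Purpose: queries the status of an event; returns cudaSuccess once the work captured by the event has completed and cudaErrorNotReady otherwise.

cudaEventSynchronize

Definition: __host__ cudaError_t cudaEventSynchronize(cudaEvent_t event)

Purpose: blocks the host until the event has completed.

cudaEventElapsedTime

Definition: __host__ cudaError_t cudaEventElapsedTime(float *ms, cudaEvent_t start, cudaEvent_t end)

Purpose: computes the elapsed time between two events, in milliseconds.

cudaEventDestroy

Definition: __host__ __device__ cudaError_t cudaEventDestroy(cudaEvent_t event)

Purpose: destroys an event object.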
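Every function listed above returns a cudaError_t, so the full event lifecycle can be wrapped in error checks. The following is a minimal sketch, assuming a hypothetical `CUDA_CHECK` macro and a hypothetical helper `empty_interval_ms`; neither is part of the CUDA runtime or of the original article.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints the error string and exits on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Walks through create -> record -> synchronize -> elapsed -> destroy
// with no work between the two events, and returns the measured
// interval in milliseconds.
float empty_interval_ms() {
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start, 0)); // 0 selects the default stream
    CUDA_CHECK(cudaEventRecord(stop, 0));

    // Block the host until the stop event has actually been recorded.
    CUDA_CHECK(cudaEventSynchronize(stop));
    // After synchronizing, the event reports as complete.
    CUDA_CHECK(cudaEventQuery(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}
```

Note that cudaEventQuery is wrapped in the check here only because it follows a synchronize; on an event that has not completed yet it returns cudaErrorNotReady, which a macro like this would treat as fatal.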

1.2 An event timing example

Using the CUDA functions above, we write a simple CUDA program:

// time.cu
#include<stdio.h>
#include<stdlib.h>
#include<string.h> // memset
#include<time.h>

const int g_buffer_size = 1024; // number of array elements
const int g_buffer_bytes = g_buffer_size*sizeof(float);
const int calc_count = 1000*1000; // 1,000,000 iterations
const long s2ns = 1000000000; // nanoseconds per second
const long ms2ns = 1000000; // nanoseconds per millisecond

void init_data(float *p_host){
        for (int i = 0; i < g_buffer_size; i++){
                p_host[i] = float(rand() % g_buffer_size);
        }
}

__global__ void device_calc(float *p_device){
    const int idx = threadIdx.x + blockIdx.x*blockDim.x;
    p_device[idx] *= 1;
}

void host_calc(float * p_host){
        for (int i = 0; i < g_buffer_size; i++){
                p_host[i] *= 1;
        }
}

void host_time(float *p_host){
        struct timespec start, end;
        // CLOCK_MONOTONIC is not affected by wall-clock adjustments
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (int i = 0; i < calc_count; i++){
                host_calc(p_host);
        }

        clock_gettime(CLOCK_MONOTONIC, &end);

        // both timestamps are converted to nanoseconds
        long start_ns = start.tv_sec*s2ns+start.tv_nsec;
        long end_ns = end.tv_sec*s2ns+end.tv_nsec;
        printf("host_time diff: %ld ms\n", (end_ns-start_ns)/ms2ns);
}

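void device_time(float *p_device){
        cudaEvent_t start, stop;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        // record the start event in the default stream
        cudaEventRecord(start);
        // return value deliberately ignored; this call is commonly added
        // to flush the command queue under Windows WDDM drivers
        cudaEventQuery(start);

        for (int i = 0; i < calc_count; i++){
                device_calc<<<1, g_buffer_size>>>(p_device);
        }

        // record the stop event, then wait until the device reaches it
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float elapsed_time; // milliseconds
        cudaEventElapsedTime(&elapsed_time, start, stop);

        printf("device_time: %g ms\n", elapsed_time);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
}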

int main(){
        // allocate and zero the host array
        float *p_host = (float*)malloc(g_buffer_bytes);
        if (p_host == NULL) {
                printf("malloc failed\n");
                exit(-1);
        }
        memset(p_host, 0, g_buffer_bytes);

        // fill the host array with data
        init_data(p_host);

        // allocate and zero the device array; check the return code,
        // because the pointer is not guaranteed to be set on failure
        float *p_device = NULL;
        if (cudaMalloc((void**)&p_device, g_buffer_bytes) != cudaSuccess) {
                printf("cudaMalloc failed\n");
                free(p_host);
                exit(-1);
        }
        cudaMemset(p_device, 0, g_buffer_bytes);

        // copy the host data to the device
        cudaMemcpy(p_device, p_host, g_buffer_bytes, cudaMemcpyHostToDevice);

        // time the computation on the host
        host_time(p_host);

        // time the computation on the device
        device_time(p_device);

        // release memory
        free(p_host);
        cudaFree(p_device);

        return 0;
}

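In the program above, both the CPU and the GPU perform a trivial computation on a 1024-element float array 1,000,000 times, and we measure how long each takes. Note that the GPU side is timed with CUDA events. Compile and run: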

$ nvcc time.cu -o time
$ ./time
host_time diff: 997 ms
device_time: 6058.68 ms

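In this run the CPU finishes about six times faster than the GPU (997 ms versus 6058.68 ms). With so little data the CPU has the advantage: a single CPU core is faster than a single GPU core, and the GPU figure works out to roughly 6 µs per launch (6058.68 ms / 1,000,000 launches) for a kernel that does almost no arithmetic, so most of that time is likely launch and scheduling overhead. Note that the host-to-device copy over PCIe happens before the start event is recorded, so it is not part of the measured time, although it is a real cost in any end-to-end comparison. Once the data volume and the computational complexity go up, the GPU has far more room to show its strength.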
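One way to see how much of the GPU time is per-launch cost is to move the iteration loop inside the kernel so that a single launch does all the work. The sketch below is an illustration under assumptions, not the original author's method: `device_calc_fused` and `time_fused_ms` are hypothetical names, and the arithmetic is changed from `*= 1` to `v * 0.5f + 1.0f` because a compiler would likely remove a loop that multiplies by one.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread repeats a small computation iters
// times on its own element, all within a single launch.
__global__ void device_calc_fused(float *p_device, int n, int iters) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;

    float v = p_device[idx];
    for (int i = 0; i < iters; i++) {
        v = v * 0.5f + 1.0f; // stand-in arithmetic, converges toward 2
    }
    p_device[idx] = v;
}

// Times one fused launch with CUDA events; returns milliseconds.
float time_fused_ms(float *p_device, int n, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    device_calc_fused<<<(n + 255) / 256, 256>>>(p_device, n, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop); // wait until the kernel has finished

    float ms = -1.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Comparing `time_fused_ms(p_device, 1024, 1000000)` against the looped-launch version gives a rough sense of the launch overhead. The two kernels do different arithmetic, so the comparison is indicative only.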

2. nvprof

Besides explicitly adding timing code to a program as above, NVIDIA also provides a profiling tool called nvprof. Let's modify the code above slightly to see how nvprof is used:

// nvprof.cu
#include<stdio.h>
#include<stdlib.h>
#include<string.h> // memset
#include<time.h>

const int g_buffer_size = 1024; // number of array elements
const int g_buffer_bytes = g_buffer_size*sizeof(float);

void init_data(float *p_host){
        for (int i = 0; i < g_buffer_size; i++){
                p_host[i] = float(rand() % g_buffer_size);
        }
}

__global__ void device_calc(float *p_device){
    const int idx = threadIdx.x + blockIdx.x*blockDim.x;
    p_device[idx] *= 1;
}

int main(){
        // allocate and zero the host array
        float *p_host = (float*)malloc(g_buffer_bytes);
        if (p_host == NULL) {
                printf("malloc failed\n");
                exit(-1);
        }
        memset(p_host, 0, g_buffer_bytes);

        // fill the host array with data
        init_data(p_host);

        // allocate and zero the device array; check the return code,
        // because the pointer is not guaranteed to be set on failure
        float *p_device = NULL;
        if (cudaMalloc((void**)&p_device, g_buffer_bytes) != cudaSuccess) {
                printf("cudaMalloc failed\n");
                free(p_host);
                exit(-1);
        }
        cudaMemset(p_device, 0, g_buffer_bytes);

        // copy the host data to the device
        cudaMemcpy(p_device, p_host, g_buffer_bytes, cudaMemcpyHostToDevice);

        device_calc<<<1, g_buffer_size>>>(p_device);

        // release memory
        free(p_host);
        cudaFree(p_device);

        return 0;
}

Compile the CUDA program into an executable:

$ nvcc nvprof.cu -o prof
$ ll prof
-rwxr-xr-x 1 nuczzz nuczzz 989504 Mar 31 21:48 prof*

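Running nsys nvprof ./prof produces the output below. Almost all of the API time (99.8%) is spent in cudaMalloc; as the first CUDA runtime call in the program, it most likely also absorbs the one-time cost of initializing the CUDA context.

On some older versions you can run nvprof ./prof directly; newer versions require the nsys prefix.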

$ nsys nvprof ./prof
WARNING: prof and any of its children processes will be profiled.

Generating '/tmp/nsys-report-795c.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /home/nuczzz/c/day4/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)           Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------
     99.8        180894728          1  180894728.0  180894728.0  180894728  180894728          0.0  cudaMalloc
      0.1           175601          1     175601.0     175601.0     175601     175601          0.0  cudaFree
      0.1           120538          1     120538.0     120538.0     120538     120538          0.0  cudaLaunchKernel
      0.0            62816          1      62816.0      62816.0      62816      62816          0.0  cudaMemcpy
      0.0            12792          1      12792.0      12792.0      12792      12792          0.0  cudaMemset
      0.0              804          1        804.0        804.0        804        804          0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: /home/nuczzz/c/day4/report1.sqlite does not contain CUDA kernel data.
[6/7] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: /home/nuczzz/c/day4/report1.sqlite does not contain GPU memory data.
[7/7] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: /home/nuczzz/c/day4/report1.sqlite does not contain GPU memory data.
Generated:
    /home/nuczzz/c/day4/report1.nsys-rep
    /home/nuczzz/c/day4/report1.sqlite
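The stats output above reports no CUDA kernel data, and nvprof.cu never checks whether the kernel launch succeeded. The article does not say why the kernel summary is empty, so the following is only a general-purpose sketch of how a launch can be verified before profiling. `device_calc_checked` and `launch_and_verify` are hypothetical names, not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical bounds-checked variant of the article's kernel.
__global__ void device_calc_checked(float *p_device, int n) {
    const int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        p_device[idx] *= 1;
    }
}

// Launches the kernel and reports both launch-time and run-time errors.
// Returns true only if the kernel was launched and ran to completion.
bool launch_and_verify(float *p_device, int n) {
    device_calc_checked<<<(n + 255) / 256, 256>>>(p_device, n);

    // Errors in the launch itself (for example an invalid
    // configuration) are reported by cudaGetLastError.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    // Errors raised while the kernel runs surface at the next
    // synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

If `launch_and_verify(p_device, g_buffer_size)` returns true and the profiler still shows no kernel data, the cause lies outside the program itself.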
