CUDA Graph (build a static graph once to speed up repeated launches)


License: CC 4.0 BY-SA


Original blog post

Use case: many iterations, where every iteration executes the same task graph.

Assumption: shortKernel is a kernel with a very short run time.

1. Initial version:

#define NSTEP 1000
#define NKERNEL 20

// start CPU wallclock timer
for(int istep=0; istep<NSTEP; istep++){
  for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
    shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    cudaStreamSynchronize(stream);
  }
}
//end CPU wallclock time

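Average total time per kernel: 9.6 μs; kernel execution time: 2.9 μs.

Drawback: every kernel goes through launch --> execute --> wait for completion in strict sequence, so the launch overhead of each kernel is fully exposed.

2. Optimized version: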

// start wallclock timer
for(int istep=0; istep<NSTEP; istep++){
  for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
    shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
  }
  cudaStreamSynchronize(stream);
}
//end wallclock timer

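Average total time per kernel: 3.8 μs; kernel execution time: 2.9 μs.

Advantage: the CPU launches the next kernel while the GPU is still executing the previous one, so launch overhead overlaps with execution.

Drawback: every kernel still has to be launched individually.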
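The two loops above leave out the kernel and the timer. A minimal sketch of a complete harness for comparing them could look like the following. The body of shortKernel, the array length N, the block size of 512, and the use of std::chrono as the CPU wallclock timer are all assumptions made for illustration; the measured numbers depend on the GPU and driver.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

#define N 500000   // assumed array length (not given in the text)
#define NSTEP 1000
#define NKERNEL 20

// Assumed body for shortKernel: scale every element by a constant.
__global__ void shortKernel(float *out_d, const float *in_d) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) out_d[idx] = 1.23f * in_d[idx];
}

int main() {
  const int threads = 512;                         // assumed block size
  const int blocks = (N + threads - 1) / threads;
  float *in_d = nullptr, *out_d = nullptr;
  cudaMalloc(&in_d, N * sizeof(float));
  cudaMalloc(&out_d, N * sizeof(float));
  cudaMemset(in_d, 0, N * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Warm-up launch so one-time initialization is not timed.
  shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
  cudaStreamSynchronize(stream);

  using clk = std::chrono::steady_clock;

  // Version 1: synchronize after every kernel.
  auto t0 = clk::now();
  for (int istep = 0; istep < NSTEP; istep++) {
    for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++) {
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
      cudaStreamSynchronize(stream);
    }
  }
  auto t1 = clk::now();

  // Version 2: synchronize once per step.
  for (int istep = 0; istep < NSTEP; istep++) {
    for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++) {
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    }
    cudaStreamSynchronize(stream);
  }
  auto t2 = clk::now();

  // Report the average wallclock time per kernel launch.
  const double launches = double(NSTEP) * NKERNEL;
  double us1 = std::chrono::duration<double, std::micro>(t1 - t0).count() / launches;
  double us2 = std::chrono::duration<double, std::micro>(t2 - t1).count() / launches;
  printf("sync per kernel: %.2f us/kernel\n", us1);
  printf("sync per step:   %.2f us/kernel\n", us2);
  printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaStreamDestroy(stream);
  cudaFree(in_d);
  cudaFree(out_d);
  return 0;
}
```

Dividing the elapsed time by NSTEP * NKERNEL gives a per-kernel figure that can be compared directly with the kernel execution time reported by a profiler.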

3. CUDA Graph version:

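bool graphCreated=false;
cudaGraph_t graph;
cudaGraphExec_t instance;
for(int istep=0; istep<NSTEP; istep++){
  if(!graphCreated){
    // first iteration only: record the launches into a graph instead of running them
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for(int ikrnl=0; ikrnl<NKERNEL; ikrnl++){
      shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
    }
    cudaStreamEndCapture(stream, &graph);
    // build the executable instance; CUDA 12 also offers cudaGraphInstantiate(&instance, graph, 0)
    cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
    graphCreated=true;
  }
  // every iteration: submit all NKERNEL kernels with a single launch
  cudaGraphLaunch(instance, stream);
  cudaStreamSynchronize(stream);
}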

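Average total time per kernel: 3.4 μs; kernel execution time: 2.9 μs.

Advantage: the whole graph is submitted with a single launch. Capturing and instantiating the graph makes the first iteration slow, but every later iteration reuses the same graph instance.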
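For reference, here is a hedged sketch of the same capture-and-replay pattern as a complete program, with error checking and cleanup added. The CHECK macro, the kernel body, N, and the block size are assumptions for illustration. Unlike the loop above, this sketch captures the graph before the timed loop, so the reported time excludes the one-time build cost. The CUDART_VERSION guard is there because the signature of cudaGraphInstantiate changed in CUDA 12.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 500000   // assumed array length
#define NSTEP 1000
#define NKERNEL 20

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(err_));                               \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Assumed body for shortKernel.
__global__ void shortKernel(float *out_d, const float *in_d) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) out_d[idx] = 1.23f * in_d[idx];
}

int main() {
  const int threads = 512;
  const int blocks = (N + threads - 1) / threads;
  float *in_d = nullptr, *out_d = nullptr;
  CHECK(cudaMalloc(&in_d, N * sizeof(float)));
  CHECK(cudaMalloc(&out_d, N * sizeof(float)));
  CHECK(cudaMemset(in_d, 0, N * sizeof(float)));
  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));

  // Capture the NKERNEL launches once; nothing executes during capture.
  cudaGraph_t graph;
  cudaGraphExec_t instance;
  CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
  for (int ikrnl = 0; ikrnl < NKERNEL; ikrnl++) {
    shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
  }
  CHECK(cudaStreamEndCapture(stream, &graph));
#if CUDART_VERSION >= 12000
  CHECK(cudaGraphInstantiate(&instance, graph, 0));
#else
  CHECK(cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0));
#endif

  // Replay the instantiated graph: one launch per step.
  auto t0 = std::chrono::steady_clock::now();
  for (int istep = 0; istep < NSTEP; istep++) {
    CHECK(cudaGraphLaunch(instance, stream));
    CHECK(cudaStreamSynchronize(stream));
  }
  auto t1 = std::chrono::steady_clock::now();
  double us = std::chrono::duration<double, std::micro>(t1 - t0).count() /
              (double(NSTEP) * NKERNEL);
  printf("graph replay: %.2f us/kernel\n", us);

  // Release the graph objects and device resources.
  CHECK(cudaGraphExecDestroy(instance));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  CHECK(cudaFree(in_d));
  CHECK(cudaFree(out_d));
  return 0;
}
```

Whether to count the capture and instantiation cost depends on what is being measured: with enough iterations it is amortized, but for a graph that is replayed only a few times it can dominate.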

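The simpleCudaGraphs sample follows the same pattern: it builds a DAG once and then launches it repeatedly.

A DAG consists of nodes and the dependencies between them. A node can be a GPU kernel, a CPU<-->GPU memory copy, or a CPU function; a graph can span multiple streams and multiple GPUs.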
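Besides stream capture, a DAG can also be assembled node by node through the graph API. The following is a minimal sketch, not taken from the original post: scaleKernel, reportDone, the CHECK macro, and the buffer sizes are hypothetical names and values chosen for illustration. It chains the three node types mentioned above (memory copy, kernel, CPU function) with explicit dependencies, then launches the instantiated graph several times.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,               \
              cudaGetErrorString(err_));                               \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Illustrative kernel: out[i] = 2 * in[i].
__global__ void scaleKernel(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * in[i];
}

// CPU function node: counts completed graph runs (must not call the CUDA API).
void CUDART_CB reportDone(void *userData) {
  ++*static_cast<int *>(userData);
}

int main() {
  int n = 1 << 16;
  const size_t bytes = n * sizeof(float);
  float *in_h, *out_h, *in_d, *out_d;
  CHECK(cudaMallocHost(&in_h, bytes));   // pinned host buffers
  CHECK(cudaMallocHost(&out_h, bytes));
  CHECK(cudaMalloc(&in_d, bytes));
  CHECK(cudaMalloc(&out_d, bytes));
  for (int i = 0; i < n; i++) in_h[i] = 1.0f;

  cudaGraph_t graph;
  CHECK(cudaGraphCreate(&graph, 0));
  cudaGraphNode_t h2d, kernel, d2h, host;

  // Node 1: host-to-device copy, no dependencies.
  CHECK(cudaGraphAddMemcpyNode1D(&h2d, graph, nullptr, 0, in_d, in_h, bytes,
                                 cudaMemcpyHostToDevice));

  // Node 2: kernel, runs after the copy.
  void *args[] = {&out_d, &in_d, &n};
  cudaKernelNodeParams kp = {};
  kp.func = (void *)scaleKernel;
  kp.gridDim = dim3((n + 255) / 256);
  kp.blockDim = dim3(256);
  kp.sharedMemBytes = 0;
  kp.kernelParams = args;
  kp.extra = nullptr;
  CHECK(cudaGraphAddKernelNode(&kernel, graph, &h2d, 1, &kp));

  // Node 3: device-to-host copy, runs after the kernel.
  CHECK(cudaGraphAddMemcpyNode1D(&d2h, graph, &kernel, 1, out_h, out_d, bytes,
                                 cudaMemcpyDeviceToHost));

  // Node 4: CPU function, runs after the result is back on the host.
  int runs = 0;
  cudaHostNodeParams hp = {};
  hp.fn = reportDone;
  hp.userData = &runs;
  CHECK(cudaGraphAddHostNode(&host, graph, &d2h, 1, &hp));

  cudaGraphExec_t instance;
#if CUDART_VERSION >= 12000
  CHECK(cudaGraphInstantiate(&instance, graph, 0));
#else
  CHECK(cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0));
#endif

  // Build once, launch repeatedly.
  cudaStream_t stream;
  CHECK(cudaStreamCreate(&stream));
  for (int i = 0; i < 3; i++) CHECK(cudaGraphLaunch(instance, stream));
  CHECK(cudaStreamSynchronize(stream));
  printf("graph runs: %d, out_h[0] = %f\n", runs, out_h[0]);

  CHECK(cudaGraphExecDestroy(instance));
  CHECK(cudaGraphDestroy(graph));
  CHECK(cudaStreamDestroy(stream));
  CHECK(cudaFree(in_d));
  CHECK(cudaFree(out_d));
  CHECK(cudaFreeHost(in_h));
  CHECK(cudaFreeHost(out_h));
  return 0;
}
```

The dependency arrays passed to each cudaGraphAdd*Node call are what define the edges of the DAG. Nodes that share no path between them are free to run concurrently, which is how a graph can express work that would otherwise need several streams.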

References:

CUDA Graphs section of the Programming Guide

CUDA: New Features and Beyond
