*Learning CUDA Programming* — Reading Notes (Part 4)

This installment looks at the role of CUDA streams in GPU computing: kernel execution, data-copy management, the conditions for pipeline parallelism and how to meet them (pinned memory, non-blocking memory copies, multiple streams), concurrency between user streams and the default stream, kernel preemption, and the use of cudaEvent_t, dynamic parallelism, and grid-level cooperative groups. It also covers how the Multi-Process Service (MPS) improves the efficiency of multi-process execution on one GPU.

Kernel execution and data copies between the host and the GPU are both managed through CUDA streams.

By default, the default stream cannot run concurrently with other streams: it starts only after all other streams have drained, and the other streams resume only after it finishes. (A stream created with the cudaStreamNonBlocking flag, e.g. via cudaStreamCreateWithPriority, can run concurrently with the default stream.)

Conditions for pipeline parallelism:

1. The host buffers are allocated as pinned memory;

2. Copies use the non-blocking cudaMemcpyAsync;

3. The work is spread across multiple streams.
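The three conditions above can be sketched as follows; the `scale` kernel and the chunk sizes are illustrative, not from the book:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int N = 1 << 22, NSTREAM = 4, CHUNK = N / NSTREAM;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // condition 1: pinned host memory
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[NSTREAM];
    for (int i = 0; i < NSTREAM; ++i)
        cudaStreamCreate(&s[i]);             // condition 3: multiple streams

    for (int i = 0; i < NSTREAM; ++i) {
        int off = i * CHUNK;
        // Condition 2: async copies, so copy/compute of different chunks overlap.
        cudaMemcpyAsync(d + off, h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(d + off, CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, d + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAM; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

In the profiler timeline, chunk i's device-to-host copy overlaps chunk i+1's host-to-device copy and kernel, which is exactly the pipeline effect.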

Whether kernels from multiple streams actually run concurrently on one card depends on resource occupancy: if the current kernel saturates all the cores, the next kernel will not be scheduled to run alongside it.

cudaStreamAddCallback lets a host-side function run after certain work in a stream completes; the stream resumes its subsequent work only after the callback returns.

Inside such a callback you must not make any CUDA API call, whether directly or indirectly.
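A minimal callback sketch (the `work` kernel and `stage` payload are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Host-side callback: runs once all prior work in the stream has completed.
// Per the rule above, it must not call any CUDA API, directly or indirectly.
void CUDART_CB on_done(cudaStream_t stream, cudaError_t status, void *data) {
    printf("stage %d done, status = %d\n", *(int *)data, (int)status);
}

int main() {
    float *d; cudaMalloc(&d, 1024 * sizeof(float));
    cudaStream_t s; cudaStreamCreate(&s);
    int stage = 1;

    work<<<4, 256, 0, s>>>(d, 1024);
    cudaStreamAddCallback(s, on_done, &stage, 0);  // flags must currently be 0
    work<<<4, 256, 0, s>>>(d, 1024);               // blocked until on_done returns

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s); cudaFree(d);
    return 0;
}
```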

cudaStreamCreateWithPriority creates a stream with a priority and also lets you specify whether the stream synchronizes with the default stream (if you pass cudaStreamNonBlocking, i.e. an asynchronous relationship, the new stream and the default stream can execute concurrently).
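A sketch of creating one high- and one low-priority stream; note that numerically *lower* values mean *higher* priority:

```cuda
#include <cuda_runtime.h>

int main() {
    // Query the valid priority range for this device.
    int lowest, highest;
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);

    cudaStream_t hi_s, lo_s;
    // cudaStreamNonBlocking: these streams do not synchronize with the
    // default stream, so they can run concurrently with it.
    cudaStreamCreateWithPriority(&hi_s, cudaStreamNonBlocking, highest);
    cudaStreamCreateWithPriority(&lo_s, cudaStreamNonBlocking, lowest);

    // ... enqueue latency-critical kernels on hi_s, bulk work on lo_s ...

    cudaStreamDestroy(hi_s);
    cudaStreamDestroy(lo_s);
    return 0;
}
```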

Surprisingly, kernels in a high-priority stream can preempt those in a low-priority stream. In the profiler timeline the low-priority kernel appears to take longer, but during that extra time it was preempted and not actually executing.

Advantages of timing with cudaEvent_t: 1. no need to synchronize the host and device; 2. higher precision; 3. you can time multiple streams.
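The usual event-timing pattern, sketched with a placeholder `busy` kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *d; cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s; cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);           // timestamped on the GPU, in-stream
    busy<<<(N + 255) / 256, 256, 0, s>>>(d, N);
    cudaEventRecord(stop, s);

    cudaEventSynchronize(stop);          // wait for this event only, not the device
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaStreamDestroy(s); cudaFree(d);
    return 0;
}
```

Because the events are recorded into the stream itself, no host/device synchronization is needed around the kernel; recording a pair of events per stream is how you time several streams at once.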

Dynamic parallelism: launching child kernels from inside a kernel. Recursion is possible, but the nesting depth is limited.
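A minimal sketch of a device-side launch; dynamic parallelism requires relocatable device code, so this must be built with `nvcc -rdc=true ... -lcudadevrt`:

```cuda
#include <cuda_runtime.h>

__global__ void child(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

__global__ void parent(float *d, int n) {
    // One thread launches the child grid from the device; the same
    // triple-angle-bracket syntax works inside a kernel.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(d, n);
}
```

A kernel can launch itself this way, giving recursion, but only up to the device-runtime nesting-depth limit mentioned above.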

Grid-level cooperative groups avoid the overhead of the old pattern "finish the kernel, then launch a second kernel to do the reduction": grid-level synchronization becomes possible without launching a new kernel.

The restriction is that the number of launched blocks must not exceed the maximum number of active blocks, which is determined jointly by the hardware and the kernel (an API call can query this number). "Active" here means the maximum the SM's registers and shared memory can sustain, not the core count.

Because the block count is capped, performance may actually drop: with fewer resident blocks, the ability to hide memory-access latency decreases.
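A sketch of the grid-sync pattern described above; the kernel body is elided and the occupancy query shows how to obtain the "max active blocks" cap:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void grid_sync_demo(float *partial, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    // ... each block writes its partial result to partial[blockIdx.x] ...
    grid.sync();  // grid-wide barrier: no second kernel launch needed
    // ... block 0 combines the partials into *out ...
}

int main() {
    int dev = 0, numSM, blocksPerSM, threads = 256, n = 1 << 20;
    cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, dev);
    // Query the max active blocks per SM for this kernel/config.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  grid_sync_demo, threads, 0);
    int blocks = numSM * blocksPerSM;  // the grid must fit on the device

    float *partial, *out;
    cudaMalloc(&partial, blocks * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    void *args[] = { &partial, &out, &n };
    // grid.sync() is valid only when launched cooperatively.
    cudaLaunchCooperativeKernel((void *)grid_sync_demo,
                                dim3(blocks), dim3(threads), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(partial); cudaFree(out);
    return 0;
}
```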

Multi-Process Service(MPS):

By default, when multiple processes share one GPU, the system time-slices the GPU between them even if no single kernel can fill it (round-robin time slices, each process holding the GPU exclusively during its slice).

With MPS, kernels from multiple processes can execute on one GPU at the same time: unless one kernel occupies the entire GPU, the SMs it leaves idle may be assigned to another process's kernel.

You first start a CUDA daemon process and then leave it alone. (Essentially, the client processes submit their kernels to this daemon, which in turn submits the work to the GPU; that is why in NV Profiler it looks like multiple streams of a single process executing in parallel.)
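A typical way to start and stop the daemon on a single-GPU box (the pipe/log paths are illustrative choices, not required values):

```shell
# Start the MPS control daemon
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d           # -d: run as a daemon

# Run the client processes as usual; their kernels can now share the GPU
./app &
./app &
wait

# Shut the daemon down
echo quit | nvidia-cuda-mps-control
```

Client processes must see the same CUDA_MPS_PIPE_DIRECTORY so they connect to the daemon rather than opening their own GPU context.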

Another benefit of running multiple kernels on one GPU at the same time is that the memory-access latency of some of those kernels can be hidden.

The book demonstrates how NV Profiler displays multi-process profiling results (asynchronous MPI calls are not yet supported).

For the simple task y[i] = a*x[i] + b: launching the kernel once with an internal loop of N iterations is about 10x faster than launching the kernel N times (the overhead being kernel-launch time plus re-reading global memory into registers), which in turn is about 10x faster than recursively launching kernels N levels deep from inside a kernel (why is that one so slow?).
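The fast variant can be sketched as below; the kernel name and shapes are illustrative. The point is that `x[i]` is read from global memory into a register once and the launch overhead is paid once:

```cuda
#include <cuda_runtime.h>

// One launch, N iterations inside the kernel.
__global__ void saxpb_fused(float *y, const float *x, float a, float b,
                            int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];              // one global-memory read
    float yi = 0.0f;
    for (int k = 0; k < iters; ++k)
        yi = a * xi + b;          // stays in registers across iterations
    y[i] = yi;
}

// The slow alternative, for contrast: N host-side launches, each paying
// launch latency plus a fresh global read of x[i]:
//   for (int k = 0; k < iters; ++k)
//       saxpb_once<<<grid, block>>>(y, x, a, b, n);
```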

