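The previous three sections gave a brief introduction to CUDA; starting with this section, we get into actual programming.
First, beginners should have a solid understanding of the device they are using, which helps a great deal later when learning to optimize parallel programs. Detailed hardware parameters can be found in the books introduced in the previous section and in the official documentation, but if that still does not feel concrete enough, we can query them ourselves.
Using the sample program from Section 2 as a template, the slightly modified part of the code is as follows: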
- cudaError_t cudaStatus;
- int num = 0;
- cudaDeviceProp prop;
- // Query how many CUDA-capable devices are present.
- cudaStatus = cudaGetDeviceCount(&num);
- for (int i = 0; i < num; i++)
- {
- // Fill prop with the properties of device i.
- cudaStatus = cudaGetDeviceProperties(&prop, i);
- }
- // Add vectors in parallel.
- cudaStatus = addWithCuda(c, a, b, arraySize);
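The purpose of this change is to have the program obtain the number of devices and their properties automatically by calling CUDA API functions. As the saying goes, "know yourself and know your enemy, and you will never be defeated."
cudaError_t is the CUDA error type, an enumeration with integer values; cudaSuccess indicates that a call succeeded.
cudaDeviceProp is the device-properties structure. Its definition can be found in the CUDA Toolkit installation directory; on my machine the path is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include\driver_types.h, where it is defined as: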
- /**
- * CUDA device properties
- */
- struct __device_builtin__ cudaDeviceProp
- {
- char name[256]; /**< ASCII string identifying device */
- size_t totalGlobalMem; /**< Global memory available on device in bytes */
- size_t sharedMemPerBlock; /**< Shared memory available per block in bytes */
- int regsPerBlock; /**< 32-bit registers available per block */
- int warpSize; /**< Warp size in threads */
- size_t memPitch; /**< Maximum pitch in bytes allowed by memory copies */
- int maxThreadsPerBlock; /**< Maximum number of threads per block */
- int maxThreadsDim[3]; /**< Maximum size of each dimension of a block */
- int maxGridSize[3]; /**< Maximum size of each dimension of a grid */
- int clockRate; /**< Clock frequency in kilohertz */
- size_t totalConstMem; /**< Constant memory available on device in bytes */
- int major; /**< Major compute capability */
- int minor; /**< Minor compute capability */
- size_t textureAlignment; /**< Alignment requirement for textures */
- size_t texturePitchAlignment; /**< Pitch alignment requirement for texture references bound to pitched memory */
- int deviceOverlap; /**< Device can concurrently copy memory and execute a kernel. Deprecated. Use instead asyncEngineCount. */
- int multiProcessorCount; /**< Number of multiprocessors on device */
- int kernelExecTimeoutEnabled; /**< Specified whether there is a run time limit on kernels */
- int integrated; /**< Device is integrated as opposed to discrete */
- int canMapHostMemory; /**< Device can map host memory with cudaHostAlloc/cudaHostGetDevicePointer */
- int computeMode; /**< Compute mode (See ::cudaComputeMode) */
- int maxTexture1D; /**< Maximum 1D texture size */
- int maxTexture1DMipmap; /**< Maximum 1D mipmapped texture size */
- int maxTexture1DLinear; /**< Maximum size for 1D textures bound to linear memory */
- int maxTexture2D[2]; /**< Maximum 2D texture dimensions */
- int maxTexture2DMipmap[2]; /**< Maximum 2D mipmapped texture dimensions */
- int maxTexture2DLinear[3]; /**< Maximum dimensions (width, height, pitch) for 2D textures bound to pitched memory */
- int maxTexture2DGather[2]; /**< Maximum 2D texture dimensions if texture gather operations have to be performed */
- int maxTexture3D[3]; /**< Maximum 3D texture dimensions */
- int maxTextureCubemap; /**< Maximum Cubemap texture dimensions */
- int maxTexture1DLayered[2]; /**< Maximum 1D layered texture dimensions */
- int maxTexture2DLayered[3]; /**< Maximum 2D layered texture dimensions */
- int maxTextureCubemapLayered[2];/**< Maximum Cubemap layered texture dimensions */
- int maxSurface1D; /**< Maximum 1D surface size */
- int maxSurface2D[2]; /**< Maximum 2D surface dimensions */
- int maxSurface3D[3]; /**< Maximum 3D surface dimensions */
- int maxSurface1DLayered[2]; /**< Maximum 1D layered surface dimensions */
- int maxSurface2DLayered[3]; /**< Maximum 2D layered surface dimensions */
- int maxSurfaceCubemap; /**< Maximum Cubemap surface dimensions */
- int maxSurfaceCubemapLayered[2];/**< Maximum Cubemap layered surface dimensions */
- size_t surfaceAlignment; /**< Alignment requirements for surfaces */
- int concurrentKernels; /**< Device can possibly execute multiple kernels concurrently */
- int ECCEnabled; /**< Device has ECC support enabled */
- int pciBusID; /**< PCI bus ID of the device */
- int pciDeviceID; /**< PCI device ID of the device */
- int pciDomainID; /**< PCI domain ID of the device */
- int tccDriver; /**< 1 if device is a Tesla device using TCC driver, 0 otherwise */
- int asyncEngineCount; /**< Number of asynchronous engines */
- int unifiedAddressing; /**< Device shares a unified address space with the host */
- int memoryClockRate; /**< Peak memory clock frequency in kilohertz */
- int memoryBusWidth; /**< Global memory bus width in bits */
- int l2CacheSize; /**< Size of L2 cache in bytes */
- int maxThreadsPerMultiProcessor;/**< Maximum resident threads per multiprocessor */
- };
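The comments already explain what each field means. Some of the terms may still be hard for beginners to understand, but that is fine; for now we only need to pay attention to the following fields:
name: the device name;
totalGlobalMem: the size of global memory, that is, the amount of video memory, in bytes;
major, minor: the major and minor numbers of the device's compute capability; versions include 1.1, 1.2, 1.3, 2.0, 2.1, and so on;
clockRate: the GPU clock frequency, in kHz;
multiProcessorCount: the number of "big cores" on the GPU. A big core (properly called a streaming multiprocessor, SM) contains multiple "small cores" (streaming processors, SP).
Build and run. In the VS2008 project, set a breakpoint at the cudaGetDeviceProperties() call and step over it; then, in the Watch window, switch to the Auto tab and expand the + node. On my laptop I got the following result:
As you can see, the device name is GeForce 610M, with 1 GB of video memory, compute capability 2.1 (fairly high-end, haha), a clock frequency of 950 MHz (note that the value 950000 is in kHz), and 1 SM. On high-performance GPUs (such as the Tesla product line or Kepler-architecture cards), the number of SMs is much larger, ranging from a dozen or so to, on newer architectures, more than a hundred, which allows much larger-scale parallel processing.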
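If you would rather print these values than read them in the debugger, the unit conversions can be sketched in plain C++ as follows. This is only a rough sketch: `DeviceSummary`, `bytesToMiB`, `kHzToMHz` and `describe` are names I made up for illustration and are not part of CUDA. In a real program the fields would be copied from the `cudaDeviceProp` filled in by cudaGetDeviceProperties().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the handful of cudaDeviceProp fields discussed
// above; a real program would copy them from the struct CUDA fills in.
struct DeviceSummary {
    std::string name;
    std::size_t totalGlobalMem;   // bytes
    int major;                    // compute capability, major part
    int minor;                    // compute capability, minor part
    int clockRate;                // kHz
    int multiProcessorCount;      // number of SMs
};

// totalGlobalMem is reported in bytes; convert to MiB for display.
double bytesToMiB(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// clockRate is reported in kHz; convert to MHz for display.
double kHzToMHz(int kHz) {
    assert(kHz >= 0);
    return kHz / 1000.0;
}

// Build a one-line, human-readable description of the device.
std::string describe(const DeviceSummary& d) {
    char buf[512];
    std::snprintf(buf, sizeof(buf),
                  "%s: %.0f MiB, compute capability %d.%d, %.0f MHz, %d SM(s)",
                  d.name.c_str(), bytesToMiB(d.totalGlobalMem),
                  d.major, d.minor, kHzToMHz(d.clockRate),
                  d.multiProcessorCount);
    return std::string(buf);
}
```

Passing the values seen on my laptop (assuming the 1 GB is exactly 1073741824 bytes) to describe() should give a line reporting 1024 MiB, compute capability 2.1, 950 MHz and 1 SM.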
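PS: While reading the SDK code today, I found a function in helper_cuda.h that maps the CUDA compute capability to the number of small cores (SPs) in each big core (SM). It looks very useful and worth borrowing in future programs, so I have copied it here: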
- // Beginning of GPU Architecture definitions
- inline int _ConvertSMVer2Cores(int major, int minor)
- {
- // Defines for GPU Architecture types (using the SM version to determine the # of cores per SM
- typedef struct
- {
- int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version
- int Cores;
- } sSMtoCores;
- sSMtoCores nGpuArchCoresPerSM[] =
- {
- { 0x10, 8 }, // Tesla Generation (SM 1.0) G80 class
- { 0x11, 8 }, // Tesla Generation (SM 1.1) G8x class
- { 0x12, 8 }, // Tesla Generation (SM 1.2) G9x class
- { 0x13, 8 }, // Tesla Generation (SM 1.3) GT200 class
- { 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class
- { 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class
- { 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class
- { 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class
- { -1, -1 }
- };
- int index = 0;
- while (nGpuArchCoresPerSM[index].SM != -1)
- {
- if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
- {
- return nGpuArchCoresPerSM[index].Cores;
- }
- index++;
- }
- // If we don't find the values, we default use the previous one to run properly
- printf("MapSMtoCores for SM %d.%d is undefined. Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[7].Cores);
- return nGpuArchCoresPerSM[7].Cores;
- }
- // end of GPU Architecture definitions
As we can see, an SM of compute capability 2.1 has 48 SPs, while in this table an SM of compute capability 3.0 or 3.5 has 192 SPs!
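Combining this table with multiProcessorCount gives the total number of small cores on a device. The sketch below is my own simplified version, not the SDK function: `coresPerSM` and `totalCores` are hypothetical names, the values are copied from the table quoted above, and unlike the SDK helper (which falls back to the last table entry) it returns -1 for a compute capability it does not know.

```cpp
#include <cassert>

// Cores (SPs) per SM for a given compute capability, using the values from
// the SDK table quoted above. Returns -1 for versions not in that table.
int coresPerSM(int major, int minor) {
    // Same encoding as the SDK helper: 0xMm, e.g. 2.1 -> 0x21.
    const int sm = (major << 4) + minor;
    switch (sm) {
        case 0x10: case 0x11: case 0x12: case 0x13: return 8;    // Tesla generation
        case 0x20: return 32;                                    // Fermi, SM 2.0
        case 0x21: return 48;                                    // Fermi, SM 2.1
        case 0x30: case 0x35: return 192;                        // Kepler
        default:   return -1;                                    // not in the table
    }
}

// Total SP count = SMs on the device * SPs per SM, or -1 if unknown.
int totalCores(int major, int minor, int multiProcessorCount) {
    assert(multiProcessorCount >= 0);
    const int perSM = coresPerSM(major, minor);
    return perSM < 0 ? -1 : perSM * multiProcessorCount;
}
```

For my GeForce 610M (compute capability 2.1, 1 SM) this works out to 1 × 48 = 48 small cores.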
As mentioned earlier, when the computer has several CUDA-capable graphics cards, how do we decide which one to run on? Let us see how the addWithCuda function does it.
- cudaError_t cudaStatus;
- // Choose which GPU to run on, change this on a multi-GPU system.
- cudaStatus = cudaSetDevice(0);
- if (cudaStatus != cudaSuccess) {
- fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
- goto Error;
- }
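It calls cudaSetDevice(0), where 0 is the index of the first device found; if there are multiple devices, they are numbered 0, 1, 2, and so on.
Now look at the code we added in this section. The function cudaGetDeviceCount(&num) returns the total number of devices, so the valid device indices for running a CUDA program are 0, 1, ..., num-1. We can therefore enumerate the devices one by one, call cudaGetDeviceProperties(&prop, i) to get the properties of each, apply some sorting and filtering logic to find the device index opt that best suits our application, and finally call cudaSetDevice(opt) to select it. The selection criteria can be based on processing power, compute capability version, device name, and so on. These APIs will be used again later when we discuss concurrency with streams.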
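The "sorting and filtering" step can be sketched in plain C++ as below. This is just one possible rule, chosen for illustration: prefer the higher compute capability, and break ties by SM count times clock rate. `Candidate`, `roughThroughput`, `isBetter` and `chooseDevice` are hypothetical names of mine; in a real program each `Candidate` would be filled from cudaGetDeviceProperties(&prop, i) for i = 0 to num-1, and the returned index would be passed to cudaSetDevice().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the cudaDeviceProp fields used for ranking.
struct Candidate {
    int major;                 // compute capability, major part
    int minor;                 // compute capability, minor part
    int multiProcessorCount;   // number of SMs
    int clockRate;             // kHz
};

// Very rough processing-power estimate: SM count * clock rate.
long long roughThroughput(const Candidate& c) {
    assert(c.multiProcessorCount >= 0 && c.clockRate >= 0);
    return 1LL * c.multiProcessorCount * c.clockRate;
}

// True if a should be preferred over b: higher compute capability wins,
// then the larger rough throughput.
bool isBetter(const Candidate& a, const Candidate& b) {
    if (a.major != b.major) return a.major > b.major;
    if (a.minor != b.minor) return a.minor > b.minor;
    return roughThroughput(a) > roughThroughput(b);
}

// Returns the index of the preferred device, or -1 if the list is empty.
// On a tie the lowest index is kept.
int chooseDevice(const std::vector<Candidate>& devices) {
    int best = -1;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (best < 0 ||
            isBetter(devices[i], devices[static_cast<std::size_t>(best)])) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Any other criterion (for example filtering by device name, or requiring a minimum amount of global memory) could be swapped in by changing isBetter.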
For more about the hardware, see http://www.geforce.cn/hardware.