Some Tips for Using CUDA

1. Querying the GPU's configuration parameters

Run the code below to print the device's main properties:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    int dev;
    for (dev = 0; dev < deviceCount; dev++)
    {
        int driver_version(0), runtime_version(0);
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        if (dev == 0)
            if (deviceProp.major == 9999 && deviceProp.minor == 9999)
                printf("There is no device supporting CUDA.\n");
        printf("\nDevice %d: \"%s\"\n", dev, deviceProp.name);
        cudaDriverGetVersion(&driver_version);
        printf("CUDA Driver Version:                            %d.%d\n", driver_version / 1000, (driver_version % 1000) / 10);
        cudaRuntimeGetVersion(&runtime_version);
        printf("CUDA Runtime Version:                           %d.%d\n", runtime_version / 1000, (runtime_version % 1000) / 10);
        printf("Device compute capability:                      %d.%d\n", deviceProp.major, deviceProp.minor);
        printf("Total amount of Global Memory:                  %zu bytes\n", deviceProp.totalGlobalMem);
        printf("Number of SMs:                                  %d\n", deviceProp.multiProcessorCount);
        printf("Total amount of Constant Memory:                %zu bytes\n", deviceProp.totalConstMem);
        printf("Total amount of Shared Memory per block:        %zu bytes\n", deviceProp.sharedMemPerBlock);
        printf("Total number of registers available per block:  %d\n", deviceProp.regsPerBlock);
        printf("Warp size:                                      %d\n", deviceProp.warpSize);
        printf("Maximum number of threads per SM:               %d\n", deviceProp.maxThreadsPerMultiProcessor);
        printf("Maximum number of threads per block:            %d\n", deviceProp.maxThreadsPerBlock);
        printf("Maximum size of each dimension of a block:      %d x %d x %d\n", deviceProp.maxThreadsDim[0],
            deviceProp.maxThreadsDim[1],
            deviceProp.maxThreadsDim[2]);
        printf("Maximum size of each dimension of a grid:       %d x %d x %d\n", deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]);
        printf("Maximum memory pitch:                           %zu bytes\n", deviceProp.memPitch);
        printf("Texture alignment:                              %zu bytes\n", deviceProp.texturePitchAlignment);
        printf("Clock rate:                                     %.2f GHz\n", deviceProp.clockRate * 1e-6f);
        printf("Memory Clock rate:                              %.0f MHz\n", deviceProp.memoryClockRate * 1e-3f);
        printf("Memory Bus Width:                               %d-bit\n", deviceProp.memoryBusWidth);
        
    }

    return 0;
}

2. Tips for improving performance

1. Disable GPU debug information

In the project's property pages, open the Device settings under CUDA C/C++ and uncheck Generate GPU Debug Information.

2. Match the compute capability

First check which compute capability your device supports (mine is 6.0), then, on the same property page, change Code Generation to compute_60, sm_60, with the numbers matching the highest compute capability your device supports. This lets the compiler emit some higher-performance instructions.

A note on sm_60: the higher the compute capability, the more streams a streaming multiprocessor (SM) can process at the same time, as the figure below illustrates (an SM can host multiple processing streams concurrently).

Note: I tried setting versions above 6.0, for example compute_80, sm_80. The code still ran, but the GPU produced incorrect results. So set this value according to your own device.
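As a guard against this kind of mismatch, the host code can query the device's compute capability at startup and refuse to run if the binary was built for a newer architecture. A minimal sketch using the runtime API (the REQUIRED_MAJOR/REQUIRED_MINOR constants are placeholders; keep them in sync with your compute_XX, sm_XX setting):

```cuda
#include "cuda_runtime.h"
#include <stdio.h>

// Architecture the binary was built for (placeholder values;
// must match the Code Generation setting, e.g. compute_60,sm_60).
#define REQUIRED_MAJOR 6
#define REQUIRED_MINOR 0

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Compare the device's capability against the build target.
    if (prop.major * 10 + prop.minor < REQUIRED_MAJOR * 10 + REQUIRED_MINOR) {
        fprintf(stderr, "Device %s is CC %d.%d, but this binary needs %d.%d\n",
                prop.name, prop.major, prop.minor, REQUIRED_MAJOR, REQUIRED_MINOR);
        return 1;
    }
    printf("Device %s (CC %d.%d) is compatible.\n", prop.name, prop.major, prop.minor);
    return 0;
}
```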

3. Measuring GPU execution time

The usual CPU timing functions cannot be used to time GPU kernels. For example, the following does not work:

DWORD s1 = GetTickCount();
vecAdd<<<(N + 255) / 256, 256>>>(N, devA, devB, devC);
DWORD e1 = GetTickCount();
double cputime = (double)(e1 - s1);
printf("time for GPU is %f ms \n", cputime);

The kernel launch on the second line only has the CPU tell the GPU to start computing; it returns immediately, so the two timestamps around it differ by almost nothing, no matter how much work the GPU is about to do.
The correct approach is to use the CUDA event API: create events, record timestamps around the launch, and compute the difference. The complete code is as follows:

	cudaEvent_t time1, time2;
	cudaEventCreate(&time1);	cudaEventCreate(&time2);	// create the events

	cudaEventRecord(time1, 0);	// record a timestamp on the default stream
	vecAdd<<<(N + 255) / 256, 256>>>(N, devA, devB, devC);
	cudaEventRecord(time2, 0);

	cudaEventSynchronize(time1);	cudaEventSynchronize(time2);	// wait until the events have completed

	float kernelExecutionTime;
	cudaEventElapsedTime(&kernelExecutionTime, time1, time2);	// elapsed time in ms
	printf("time for GPU is %f ms \n", kernelExecutionTime);

	cudaEventDestroy(time1);	cudaEventDestroy(time2);	// destroy the events
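If you do want to use a host-side timer, the launch must be followed by cudaDeviceSynchronize() so the CPU blocks until the kernel has actually finished before taking the second timestamp. A self-contained sketch using std::chrono (the vecAdd kernel here is a stand-in for whatever kernel you are timing, and the buffers are left uninitialized since only timing matters):

```cuda
#include "cuda_runtime.h"
#include <chrono>
#include <cstdio>

__global__ void vecAdd(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    float *devA, *devB, *devC;
    cudaMalloc(&devA, N * sizeof(float));
    cudaMalloc(&devB, N * sizeof(float));
    cudaMalloc(&devC, N * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();
    vecAdd<<<(N + 255) / 256, 256>>>(N, devA, devB, devC);
    cudaDeviceSynchronize();   // block until the kernel has really finished
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("time for GPU is %f ms\n", ms);

    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    return 0;
}
```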

4. Calling a kernel from inside a kernel

In the project properties, set Generate Relocatable Device Code under CUDA C/C++ to Yes (-rdc=true). This feature, called dynamic parallelism, also requires a device of compute capability 3.5 or higher.
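With -rdc=true in effect, a parent kernel can launch a child kernel directly from device code. A minimal sketch (the kernel names and the doubling workload are illustrative; when building from the command line, link with nvcc -rdc=true and the cudadevrt library):

```cuda
#include "cuda_runtime.h"
#include <cstdio>

// Child kernel: performs the actual per-element work.
__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent kernel: a single thread launches the child grid from the device.
// Child grids are guaranteed to complete before the parent grid completes.
__global__ void parentKernel(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}

int main() {
    const int n = 256;
    int h[n], *d;
    for (int i = 0; i < n; i++) h[i] = i;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    parentKernel<<<1, 1>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[10] = %d\n", h[10]);   // every element was doubled by the child
    cudaFree(d);
    return 0;
}
```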
