Understanding CUDA

Latest recommended article published on 2024-03-06 17:00:00

BreakTheRules

Licensed under CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/buptjz/article/details/11787663

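(1) Understanding the GPU

To increase computing power, the preferred approach today is to use "more, simpler compute units".
The CPU optimizes for latency: the shortest time in which a single task can finish.
The GPU optimizes for throughput: how many tasks can be finished per unit of time.
The GPU is good at efficient concurrency.
It executes a large number of threads in parallel.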
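
To make "more, simpler compute units" concrete, the following is a minimal sketch that asks the CUDA runtime how many threads a device can keep resident at once. The helper name `maxResidentThreads` is hypothetical and not part of the original article; it only relies on the runtime call `cudaGetDeviceProperties` and the `multiProcessorCount` and `maxThreadsPerMultiProcessor` fields of `cudaDeviceProp`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption for illustration): returns the number of
// threads that can be resident on the given device at the same time, or -1 if
// the device cannot be queried (for example, when no CUDA device is present).
long long maxResidentThreads(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    // Each streaming multiprocessor (SM) keeps many simple threads in flight.
    printf("%s: %d SMs, up to %d resident threads per SM\n",
           prop.name, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor);
    return (long long)prop.multiProcessorCount *
           prop.maxThreadsPerMultiProcessor;
}
```

Calling `maxResidentThreads(0)` on a machine with a CUDA GPU typically reports tens of thousands of threads, far more than a CPU has hardware threads; the exact figure depends on the device.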

(2) The CUDA Computing Model

(3) A Typical GPU Program

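1. The CPU allocates memory on the GPU (cudaMalloc).
2. The CPU copies the input data from CPU to GPU (cudaMemcpy).
3. The CPU launches a kernel on the GPU to process the data (kernel launch).
4. The CPU issues a copy request to bring the results back from GPU to CPU (cudaMemcpy).

Steps 2 and 4 both move data between host and device memory, and both are very expensive, so a program should do as much computation as possible per byte transferred.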
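
One way to see the cost of steps 2 and 4 is to time each step with CUDA events. The sketch below is an illustration only: the kernel `doubleElements`, the struct `StepTimes`, the function `timeSteps`, and the block size of 256 are all assumptions introduced here, not part of the original article. The event calls (`cudaEventCreate`, `cudaEventRecord`, `cudaEventSynchronize`, `cudaEventElapsedTime`) are standard CUDA runtime API.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel, present only so that step 3 has something to time.
__global__ void doubleElements(float *d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        d_data[idx] *= 2.0f;
    }
}

// Hypothetical container for the three measurements, in milliseconds.
struct StepTimes {
    float hostToDevice;  // step 2
    float kernel;        // step 3
    float deviceToHost;  // step 4
};

// Runs the four steps on n floats and times steps 2, 3 and 4.
// Returns false if allocation fails or the result is wrong.
bool timeSteps(int n, StepTimes *out) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    if (h_data == NULL) return false;
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {  // step 1
        free(h_data);
        return false;
    }

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // step 2
    cudaEventRecord(t1);
    doubleElements<<<(n + 255) / 256, 256>>>(d_data, n);        // step 3
    cudaEventRecord(t2);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // step 4
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);  // wait until all recorded work has finished

    cudaEventElapsedTime(&out->hostToDevice, t0, t1);
    cudaEventElapsedTime(&out->kernel, t1, t2);
    cudaEventElapsedTime(&out->deviceToHost, t2, t3);

    // Every input was 1.0f, so every output should be exactly 2.0f.
    bool ok = (cudaGetLastError() == cudaSuccess) && (h_data[0] == 2.0f);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cudaFree(d_data);
    free(h_data);
    return ok;
}
```

For a kernel this simple, the two copies will often take longer than the computation itself, though the actual numbers depend on the GPU and on the bus that connects it to the host.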

(4) BIG IDEA
Kernels look like serial programs. Write your program as if it will run on one thread. The GPU will run that program on many threads.
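
The idea can be sketched by putting a serial CPU loop next to the corresponding kernel. The functions `addSerial`, `addKernel`, and `addOnGpu` below are hypothetical names chosen for this illustration. The sketch assumes `n` is small enough to fit in a single thread block (at most 1024 threads on current GPUs).

```cuda
#include <cuda_runtime.h>

// Serial CPU version: one thread walks over every element in a loop.
void addSerial(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

// Kernel version: the loop is gone. The body is written as if a single thread
// handled a single element; the launch supplies a different idx per thread.
__global__ void addKernel(const float *d_a, const float *d_b, float *d_out) {
    int idx = threadIdx.x;
    d_out[idx] = d_a[idx] + d_b[idx];
}

// Hypothetical host wrapper: copies the inputs to the GPU, launches one block
// of n threads, and copies the result back. Returns true on success.
bool addOnGpu(const float *h_a, const float *h_b, float *h_out, int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_out = NULL;
    bool ok = cudaMalloc((void **)&d_a, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, bytes) == cudaSuccess
           && cudaMalloc((void **)&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        addKernel<<<1, n>>>(d_a, d_b, d_out);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts a null pointer, so this is safe after a partial failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return ok;
}
```

The body of `addKernel` is the body of the loop in `addSerial` with the loop counter replaced by `threadIdx.x`, which is what "write your program as if it will run on one thread" amounts to in practice.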

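(5) The Simplest CUDA Program

This is a CUDA program that computes the squares of a set of numbers. The __global__ qualifier marks a function as a CUDA function that runs on the GPU and is launched from the host; such a function is also called a kernel function.

The program is short and its comments are clear, so it needs little further explanation.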

#include <stdio.h>

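// kernel: each thread squares one element of the input array
__global__ void square(float * d_out, float * d_in){
    int idx = threadIdx.x;   // index of this thread within its block
    float f = d_in[idx];
    d_out[idx] = f * f;
}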

int main(int argc, char ** argv) {
    const int ARRAY_SIZE = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);
    
    // generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];
    
    // declare GPU memory pointers
    float * d_in;
    float * d_out;
    
    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);
    
    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
    
    // launch the kernel: 1 block of ARRAY_SIZE threads, one thread per element
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);
    
    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    
    // print out the resulting array
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }
    cudaFree(d_in);
    cudaFree(d_out);
    
    return 0;
}
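
The program above launches a single block, so the array has to fit within one block's thread limit. As a possible next step, the sketch below generalizes the same squaring kernel to arrays of any length by launching several blocks, and checks the return codes of the CUDA calls. The names `squareLarge`, `check`, and `squareOnGpu`, as well as the block size of 256, are assumptions made for this illustration and are not from the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same squaring kernel, generalized: the global index combines the block
// index with the thread index, and the bounds check keeps the surplus
// threads of the last block from touching memory past the end of the array.
__global__ void squareLarge(float *d_out, const float *d_in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float f = d_in[idx];
        d_out[idx] = f * f;
    }
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical host wrapper: squares n floats on the GPU.
bool squareOnGpu(const float *h_in, float *h_out, int n) {
    const int THREADS_PER_BLOCK = 256;  // assumed block size
    // Round up so that every element is covered by some thread.
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    size_t bytes = (size_t)n * sizeof(float);

    float *d_in = NULL, *d_out = NULL;
    bool ok = check(cudaMalloc((void **)&d_in, bytes), "cudaMalloc d_in")
           && check(cudaMalloc((void **)&d_out, bytes), "cudaMalloc d_out")
           && check(cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice),
                    "copy to device");
    if (ok) {
        squareLarge<<<blocks, THREADS_PER_BLOCK>>>(d_out, d_in, n);
        ok = check(cudaGetLastError(), "kernel launch")
          && check(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost),
                   "copy to host");
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

With `n = 1000` this launches 4 blocks of 256 threads; the last 24 threads fail the `idx < n` test and do nothing.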