Broadening My Horizons 8: A First Look at CUDA

Preface

Today I am starting to learn CUDA. My first stop is NVIDIA's official introductory post: https://developer.nvidia.com/blog/even-easier-introduction-cuda/

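Introduction to CUDA

CUDA C++ is one way to build parallel applications with CUDA. It lets us write high-performance algorithms in C++ and run them across thousands of threads on the GPU.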

Starting Simple

We have two arrays, x and y, each holding 1M (1<<20) floats. We want to add them element-wise and store the result back in y. The code is as follows:

#include <iostream>
#include <math.h>
 
// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}
 
int main(void)
{
    int N = 1<<20; // 1M elements
    
    float *x = new float[N];
    float *y = new float[N];
    
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
      x[i] = 1.0f;
      y[i] = 2.0f;
    }
    
    // Run kernel on 1M elements on the CPU
    add(N, x, y);
    
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
      maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;
    
    // Free memory
    delete [] x;
    delete [] y;
    
    return 0;
}

I saved this file as add.cpp. Compiling it produces the executable add:

g++ add.cpp -o add

Run add:

./add

Output:

Max error: 0

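Creating a CUDA kernel

Next, we want to run those 1M additions in parallel.
The first step is to turn the add function into a kernel that CUDA can run. To do this, add the specifier __global__ to the function. It tells the CUDA C++ compiler that this function runs on the GPU and can be called from CPU code: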

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

A __global__ function is a CUDA kernel and runs on the GPU. (Side note: code that runs on the GPU is called device code, and code that runs on the CPU is called host code.)

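Memory allocation in CUDA

To compute on the GPU, the data must live in memory the GPU can access. cudaMallocManaged() allocates Unified Memory, a single memory space accessible from both the CPU and the GPU, and returns a pointer that host code and device code can both use. To release the memory, pass that pointer to cudaFree(). These two calls replace new and delete [] from the CPU version: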

// Allocate Unified Memory -- accessible from CPU or GPU
float *x, *y;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&y, N*sizeof(float));
 
...
 
// Free memory
cudaFree(x);
cudaFree(y);
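
The allocation above ignores the return value of cudaMallocManaged(). As a minimal sketch of how the call could be checked, the helper below wraps it and reports failures. The helper name allocManagedFloats is hypothetical and not part of the original tutorial. cudaMallocManaged, cudaGetErrorString, and cudaSuccess come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the tutorial): allocate n floats of
// Unified Memory and return the pointer, or nullptr if allocation fails.
float *allocManagedFloats(size_t n)
{
    float *ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, n * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed: %s\n",
                     cudaGetErrorString(err));
        return nullptr;
    }
    return ptr;
}
```

With this helper, the two allocations would read x = allocManagedFloats(N); and y = allocManagedFloats(N);, followed by a null check on each pointer. The memory is still released with cudaFree().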

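Launching the add() kernel

A kernel is launched with the triple angle bracket syntax <<< >>>, placed between the function name and the argument list, for example add<<<1, 1>>>(N, x, y). The two values are the number of thread blocks and the number of threads per block, so <<<1, 1>>> runs the kernel on the GPU with a single thread.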

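Synchronization

A kernel launch does not block the calling CPU thread, so the CPU has to wait for the GPU to finish before it reads the results. The function to call is cudaDeviceSynchronize().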
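
The launch and synchronization steps can be sketched together with basic error checking. This is a sketch under assumptions, not the tutorial's code: the names addKernel and launchAndWait are hypothetical, and the kernel is the same single-thread loop used in this post. cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same single-thread loop as the add kernel in this post
// (hypothetical name, to keep the sketch self-contained).
__global__ void addKernel(int n, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Hypothetical helper: launch one block of one thread, then wait for it.
// Returns true if both the launch and the execution succeeded.
bool launchAndWait(int n, const float *x, float *y)
{
    addKernel<<<1, 1>>>(n, x, y);

    // The launch itself returns nothing, so launch errors are queried here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }

    // Block the CPU until the GPU has finished all preceding work.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel execution failed: %s\n",
                     cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Only after launchAndWait returns true is it safe for host code to read y.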

Complete code

I saved this file as add.cu. The code is as follows:

#include <iostream>
#include <math.h>
 
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
 for (int i = 0; i < n; i++)
   y[i] = x[i] + y[i];
}
 
int main(void)
{
 int N = 1<<20;
 float *x, *y;
 
 // Allocate Unified Memory -- accessible from CPU or GPU
 cudaMallocManaged(&x, N*sizeof(float));
 cudaMallocManaged(&y, N*sizeof(float));
 
 // initialize x and y arrays on the host
 for (int i = 0; i < N; i++) {
   x[i] = 1.0f;
   y[i] = 2.0f;
 }
 
 // Run kernel on 1M elements on the GPU
 add<<<1, 1>>>(N, x, y);
 
 // Wait for GPU to finish before accessing on host
 cudaDeviceSynchronize();
 
 // Check for errors (all values should be 3.0f)
 float maxError = 0.0f;
 for (int i = 0; i < N; i++) {
   maxError = fmax(maxError, fabs(y[i]-3.0f));
 }
 std::cout << "Max error: " << maxError << std::endl;
 
 // Free memory
 cudaFree(x);
 cudaFree(y);
 return 0;
}

Compile with nvcc, then run:

> nvcc add.cu -o add_cuda
> ./add_cuda
Max error: 0

A max error of 0 means the kernel ran to completion on the GPU and every element of y equals 3.0f!