一、CUDA and OpenCL
From the older versions of TBB to the new OneAPI framework, parallel computing has been supported as broadly as possible. This includes CUDA and OpenCL. Most developers have probably heard of these two frameworks, but few have actually used them. They see much wider use in AI applications, where image processing is mostly done on the GPU.
This article does not attempt a detailed explanation of the two; if you are interested, consult the relevant documentation yourself. In brief, both are frameworks for parallel computing on heterogeneous platforms: CUDA is vendor-specific (NVIDIA), while OpenCL exists as an open standard and can, in theory, target any platform.
二、Application in TBB
OneAPI, of course, supports both as well; it would be strange for a parallelism framework not to interoperate with other parallelism frameworks. It is all about performance: if stacking them can produce a 1+1>2 effect, so much the better. This is reflected in Supra. Let's look at the relevant code:
1、Using CUDA
#include "ImageProcessingCuda.h"

#include <thrust/transform.h>
#include <thrust/execution_policy.h>

using namespace std;

namespace supra
{
	namespace ImageProcessingCudaInternal
	{
		typedef ImageProcessingCuda::WorkType WorkType;

		// here the actual processing happens!
		template <typename InputType, typename OutputType>
		__global__ void processKernel(const InputType* inputImage, vec3s size, WorkType factor, OutputType* outputImage)
		{
			size_t x = blockDim.x*blockIdx.x + threadIdx.x;
			size_t y = blockDim.y*blockIdx.y + threadIdx.y;
			size_t z = blockDim.z*blockIdx.z + threadIdx.z;

			size_t width = size.x;
			size_t height = size.y;
			size_t depth = size.z;

			if (x < width && y < height && z < depth)
			{
				// Perform a pixel-wise operation on the image

				// Get the input pixel value and cast it to our working type.
				// As this should in general be a type with wider range / precision, this cast does not lose anything.
				WorkType inPixel = inputImage[x + y*width + z*width*height];

				// Perform operation, in this case multiplication
				WorkType value = inPixel * factor;

				// Store the output pixel value.
				// Because this is templated, we need to cast from "WorkType" to "OutputType".
				// This should happen in a sane way, that is with clamping. There is a helper for that!
				outputImage[x + y*width + z*width*height] = clampCast<OutputType>(value);
			}
		}
	}
template