Parallel Programming in Practice with TBB, Part 4: SUPRA's Support for CUDA

I. CUDA and OpenCL

From the older TBB releases to the new oneAPI framework, parallel computing has been supported as broadly as possible. That support includes CUDA and OpenCL, two frameworks that most developers have heard of often but rarely used in earnest. They see the widest use in AI applications, where image processing is typically offloaded to the GPU.
This article will not cover the two in detail; interested readers can consult the relevant documentation. Briefly, both are frameworks for parallel computing on heterogeneous platforms: CUDA is vendor-specific (NVIDIA), while OpenCL exists as an open standard that can, in theory, target any platform.

II. Usage in TBB

oneAPI naturally supports both as well; a parallel framework that could not interoperate with other parallel frameworks would hardly be worth the name. It is all about performance: if stacking frameworks can yield a 1 + 1 > 2 effect, so much the better. SUPRA demonstrates this; let us look at the relevant code:
1. Using CUDA

#include "ImageProcessingCuda.h"

#include <thrust/transform.h>
#include <thrust/execution_policy.h>

using namespace std;

namespace supra
{
	namespace ImageProcessingCudaInternal
	{
		typedef ImageProcessingCuda::WorkType WorkType;

		// here the actual processing happens!

		template <typename InputType, typename OutputType>
		__global__ void processKernel(const InputType* inputImage, vec3s size, WorkType factor, OutputType* outputImage)
		{
			size_t x = blockDim.x*blockIdx.x + threadIdx.x;
			size_t y = blockDim.y*blockIdx.y + threadIdx.y;
			size_t z = blockDim.z*blockIdx.z + threadIdx.z;

			size_t width = size.x;
			size_t height = size.y;
			size_t depth = size.z;

			if (x < width && y < height && z < depth)
			{
				// Perform a pixel-wise operation on the image

				// Get the input pixel value and cast it to our working type.
				// As this should in general be a type with wider range / precision, this cast does not lose anything.
				WorkType inPixel = inputImage[x + y*width + z *width*height];

				// Perform operation, in this case multiplication
				WorkType value = inPixel * factor;

				// Store the output pixel value.
				// Because this is templated, we need to cast from "WorkType" to "OutputType".
				// This should happen in a sane way, that is with clamping. There is a helper for that!
				outputImage[x + y*width + z *width*height] = clampCast<OutputType>(value);
			}
		}
	}

	template <typename InputType, typename OutputType>
	shared_ptr<Container<OutputType> > ImageProcessingCuda::process(const shared_ptr<const Container<InputType>>& imageData, vec3s size, WorkType factor)
	{
		// here we prepare the buffers and call the cuda kernel

		size_t width = size.x;
		size_t height = size.y;
		size_t depth = size.z;

		// make sure the data is in gpu memory
		auto inImageData = imageData;
		if (!inImageData->isGPU() && !inImageData->isBoth())
		{
			inImageData = make_shared<Container<InputType> >(LocationGpu, *inImageData);
		}

		// prepare the output memory
		auto outImageData = make_shared<Container<OutputType> >(LocationGpu, inImageData->getStream(), width*height*depth);

		// call the kernel for the heavy-lifting
		dim3 blockSize(32, 4, 1);
		dim3 gridSize(
			static_cast<unsigned int>((size.x + blockSize.x - 1) / blockSize.x),
			static_cast<unsigned int>((size.y + blockSize.y - 1) / blockSize.y),
			static_cast<unsigned int>((size.z + blockSize.z - 1) / blockSize.z));
		ImageProcessingCudaInternal::processKernel<<<gridSize, blockSize, 0, inImageData->getStream()>>>(
			inImageData->get(),
			size,
			factor,
			outImageData->get());
		// check for cuda launch errors
		cudaSafeCall(cudaPeekAtLastError());
		// You should NOT synchronize the device or the stream we are working on!!

		// return the result!
		return outImageData;
	}

...
}

The main purpose here is simply to process the image on the GPU.
2. Using OpenCL
The code is as follows:


template <typename InputType, typename OutputType>
shared_ptr<Container<OutputType>> ScanConverter::convert(const shared_ptr<USImage> &inImage) {
...
  auto p = make_shared<Container<OutputType>>(LocationGpu, pScanlineData->getStream(),
                                                  m_imageSize.x * m_imageSize.y * m_imageSize.z);

  if (m_is2D) {
    sycl::range<3> blockSize(1, 256, 1);
    sycl::range<3> gridSize(1, static_cast<unsigned int>((m_imageSize.y + blockSize[1] - 1) / blockSize[1]),
                            static_cast<unsigned int>((m_imageSize.x + blockSize[2] - 1) / blockSize[2]));

    static long scan_call_count = 0;

    sycl::event scan_event = pScanlineData->getStream()->submit([&](sycl::handler &c) {
      auto m_imageSize_x_2 = (uint32_t)m_imageSize.x;
      auto m_imageSize_y_3 = (uint32_t)m_imageSize.y;
...

      c.parallel_for(sycl::nd_range<3>(gridSize * blockSize, blockSize), [=](sycl::nd_item<3> item_1) {
        scanConvert2D(numScanlines, numSamples, m_imageSize_x_2, m_imageSize_y_3, m_mask_get_4,
                      m_sampleIdx_get_5, m_weightX_get_6, m_weightY_get_7, pScanlineData_get_8, p_get_9,
                      item_1);
      });
    });

...

  } else {
    sycl::range<3> blockSize(1, 256, 1);
    sycl::range<3> gridSize(static_cast<unsigned int>((m_imageSize.z + blockSize[0] - 1) / blockSize[0]),
                            static_cast<unsigned int>((m_imageSize.y + blockSize[1] - 1) / blockSize[1]),
                            static_cast<unsigned int>((m_imageSize.x + blockSize[2] - 1) / blockSize[2]));

    pScanlineData->getStream()->submit([&](sycl::handler &c) {
      auto m_imageSize_x_3 = (uint32_t)m_imageSize.x;
      auto m_imageSize_y_4 = (uint32_t)m_imageSize.y;
      auto m_imageSize_z_5 = (uint32_t)m_imageSize.z;
...

      c.parallel_for(sycl::nd_range<3>(gridSize * blockSize, blockSize), [=](sycl::nd_item<3> item_1) {
        scanConvert3D((uint32_t)scanlineLayout.x, (uint32_t)scanlineLayout.y, numSamples, m_imageSize_x_3,
                      m_imageSize_y_4, m_imageSize_z_5, m_mask_get_6, m_sampleIdx_get_7, m_weightX_get_8,
                      m_weightY_get_9, m_weightZ_get_10, pScanlineData_get_11, p_get_12, item_1);
      });
    });
  }
  return p;
}

Note, however, that this "OpenCL" path actually goes through Intel's own library (Data Parallel C++, DPC++, the oneAPI implementation of SYCL), which can in turn target OpenCL devices; examine it carefully before relying on it.

III. Summary

Having followed foreign frameworks and libraries over the years, one rough observation stands out: developers abroad tend to reference and build on one another's work rather than reinvent the wheel. This is especially obvious in Java, where pulling in one dependency brings a whole cluster of libraries or frameworks, or a single framework bundles a set of other libraries that in turn depend on still more; the same holds in C++. Domestic development, by contrast, is relatively closed: most teams work in isolation, reusing only internal libraries or a handful of very well-known foreign ones.
The point is that the domestic software development environment still lags noticeably behind and needs further improvement. Of course, this situation did not arise overnight, nor is it the fault of any one person; improving it will take a collective effort.
