Question
-
Recently we did some research on image processing using GPGPU, and we tried both OpenCL and C++ AMP. We found that AMP performs worse than OpenCL, consuming more time for the same workload, and that the bottleneck is the memory transfer between the host and the GPU. So we wrote a test that uses AMP or OpenCL simply to copy the values of one vector to another. The test has three phases: copying data from the host to the GPU, execution on the GPU, and copying the data back from the GPU to the host. The time spent in each phase was recorded:
Platform: IvyBridge, I5-3450, with HD2500 GPU
Data length: 3264*2448*sizeof(int) bytes
OpenCL config: global_size = 3264*2448, local_size = 1; the copies are implemented via clEnqueueWriteBuffer and clEnqueueReadBuffer.
AMP config: an array<int, 1> a(3264*2448) with no tiling; the copies are implemented via the copy() function.
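For reference, the C++ AMP side of this baseline test (an untiled array<int, 1> with the copies done via copy()) looks roughly like the sketch below; the wrapper function and names are ours, and the timing code is omitted:

#include <amp.h>
using namespace concurrency;

// Rough sketch of the AMP baseline described above (no tiling, timing omitted).
void vec_copy_amp(const int *a, int *b, int length)
{
    array<int, 1> a_arr(length);           // device source
    array<int, 1> b_arr(length);           // device destination

    copy(a, a_arr);                        // copy to GPU

    parallel_for_each(b_arr.extent, [&](index<1> idx) restrict(amp)
    {
        b_arr[idx] = a_arr[idx];           // execution in GPU
    });
    b_arr.accelerator_view.wait();         // make sure the kernel has finished

    copy(b_arr, b);                        // copy back to CPU
}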
          Copy to GPU    Execution in GPU    Copy back to CPU
OpenCL    7.45 ms        100.57 ms           7.68 ms
AMP       98.64 ms       39.4 ms             89.11 ms
As can be seen, AMP spends much more time on the memory transfers between the host and the GPU, but less time on GPU execution. Our guess was that AMP copies the data into a deeper memory buffer than OpenCL does, so that during GPU execution, fetching the data from the GPU buffer costs OpenCL much more time.
However, increasing the local_size in OpenCL can help shorten the GPU Execution time:
Local_size                 1           4          8          16         32         64
GPU Execution in OpenCL    100.57 ms   35.34 ms   18.13 ms   10.37 ms   10.15 ms   10.21 ms
The time spent on the memory copies remains almost the same, since it does not depend on local_size.
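For reference, only the local_work_size argument of clEnqueueNDRangeKernel changes in the sweep above; a rough sketch, assuming a valid cl_command_queue queue and cl_kernel kernel have been created elsewhere:

// Sketch of the local_size sweep (requires <CL/cl.h>; error checking and timing omitted).
size_t global_size = 3264 * 2448;
size_t local_sizes[] = { 1, 4, 8, 16, 32, 64 };

for (size_t i = 0; i < sizeof(local_sizes) / sizeof(local_sizes[0]); ++i)
{
    size_t local_size = local_sizes[i];   // global_size is divisible by each of these values
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);                      // block until the kernel completes so it can be timed
}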
But in AMP the bottleneck is the memory transfer, not the GPU execution, and changing the tile size does not help (in fact, it increased the GPU execution time in our test). So after increasing the local_size in OpenCL, OpenCL outperforms AMP by a wide margin.
Has anyone come across similar issues? What was your solution? I would appreciate your advice.
Thanks a lot.
-
The IvyBridge integrated GPU shares the same physical memory as the CPU. Hence copying from host memory to device memory ideally only requires a simple "memcpy" from the source host memory to the device buffer (which is accessible from the CPU). The OpenCL driver for IvyBridge takes advantage of this ability.

In the VS2012 release, C++ AMP is unable to utilize this ability (due to the DirectX restriction that device buffers are inaccessible on the CPU). This is something we are actively working on addressing for future versions of C++ AMP. The current implementation needs to first copy data from the host buffer to a "staging buffer", followed by a copy from the staging buffer to the device buffer. By itself, this difference only makes C++ AMP copies on IvyBridge GPUs ~50% slower than OpenCL (our test benchmarks validate this expectation). C++ AMP offers staging arrays to improve copy performance: if a staging array is used as the CPU data container, copies to/from an accelerator array are much faster.

However, the results from your experiment indicate that the C++ AMP copy is ~13 times slower than OpenCL, which cannot be explained by the above. We would have to take a closer look at your code to understand the cause of this discrepancy.

A couple of our earlier blog posts provide guidance on measuring the performance of C++ AMP programs. I would encourage you to take a look at them.
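For illustration, a staging array is created by passing both a CPU accelerator_view and the target accelerator's view to the array constructor; a minimal sketch with example names:

#include <amp.h>
using namespace concurrency;

// Minimal staging-array sketch: the staging array lives in CPU-accessible memory
// associated with the target accelerator, which can make copies cheaper.
void staging_copy_example(const int *host_src, int length)    // host_src is an example host pointer
{
    accelerator gpu;                                          // default accelerator
    accelerator cpu(accelerator::cpu_accelerator);

    array<int, 1> staging(length, cpu.default_view, gpu.default_view);
    array<int, 1> device_arr(length, gpu.default_view);

    copy(host_src, staging);       // fill the staging array on the CPU side
    copy(staging, device_arr);     // staging -> device copy
}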
-
Hi Amit, thanks for your excellent, detailed information. I have tried the staging array and warm-up, but the results are still not satisfactory. This is my code:
#include <amp.h>
#include <windows.h>
#include <cstdio>
using namespace concurrency;

int vec_copy_amp_stage(int *a, int *b, size_t global_size)
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER startTimeQPC;
    LARGE_INTEGER endTimeQPC;

    int length = (int)global_size;

    accelerator default_device;
    accelerator cpuAcc = accelerator(accelerator::cpu_accelerator);

    // Staging array: CPU-accessible, associated with the default (GPU) accelerator.
    array<int, 1> a_stage(length, cpuAcc.default_view, default_device.default_view);
    copy(a, a_stage);   // staging array here, but this copy is not timed

    array<int, 1> a_arr(length);
    array<int, 1> b_arr(length);

    QueryPerformanceFrequency(&frequency);

    for (int i = 0; i < 10; i++)
    {
        // Copy to GPU: staging array -> device array (timed).
        QueryPerformanceCounter(&startTimeQPC);
        copy(a_stage, a_arr);
        a_arr.accelerator_view.wait();   // timing the copy from staging array to array
        QueryPerformanceCounter(&endTimeQPC);
        double copy_duration_a = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        // Execution on the GPU (timed).
        QueryPerformanceCounter(&startTimeQPC);
        parallel_for_each(b_arr.extent, [&](index<1> idx) restrict(amp)
        {
            b_arr[idx] = a_arr[idx];
        });
        b_arr.accelerator_view.wait();
        QueryPerformanceCounter(&endTimeQPC);
        double kernel_duration = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        // Copy back to CPU: device array -> host pointer (timed).
        QueryPerformanceCounter(&startTimeQPC);
        copy(b_arr, b);
        QueryPerformanceCounter(&endTimeQPC);
        double copy_duration_b = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        printf("a: %f, b: %f, kernel time: %f ms\n", copy_duration_a, copy_duration_b, kernel_duration);
    }

    return 0;
}
In the code above, a staging array a_stage is defined, and the contents pointed to by the pointer a are copied into it; this copy is not timed. Then, over 10 iterations, the copy from the staging array to the device array is timed. The first iteration should warm up the arrays and the kernel, so the following 9 iterations run in a warmed-up environment. The results of these 9 iterations are:
       Copy to GPU using staging arrays    Execution in GPU    Copy back to CPU
AMP    87.36 ms                            31.96 ms            89.07 ms
The time consumed by copying via the staging array is still too long compared with OpenCL.
Are there any problems in my code? I need your suggestions.
Thanks very much.
-
Your code looks fine. I executed your code and got expected results on a variety of discrete GPU cards (both from AMD and NVidia). The behavior you are observing seems to be specific to the IvyBridge GPU driver you are using.
Can you share the version of the driver you are using?
-Amit
-
You're right! After I changed the graphics driver to the Microsoft Basic Display Adapter in Windows 8, I got the expected results:
       Copy to GPU using staging arrays    Execution in GPU    Copy back to CPU
AMP    6.55 ms                             10.02 ms            10.01 ms
Thank you very much for your kind and patient help.