Question
-
Recently we did some research on image processing using GPGPU, and we tried both OpenCL and C++ AMP. We found that AMP performs worse than OpenCL, consuming more time for the same workload, and that the bottleneck is the memory transfer between the host and the GPU. So we wrote a test that uses AMP or OpenCL simply to copy the values of one vector to another. The test has three phases: copying data from the host to the GPU, execution on the GPU, and copying the data back from the GPU to the host. The time spent in each phase was recorded:
Platform: IvyBridge, I5-3450, with HD2500 GPU
Data length: 3264*2448*sizeof(int) bytes
OpenCL config: global_size = 3264*2448, local_size = 1; the copies are implemented via clEnqueueWriteBuffer and clEnqueueReadBuffer.
AMP config: an array<int, 1> a(3264*2448) with no tiling; the copies are implemented via the copy() function.
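For reference, the C++ AMP side of this baseline test (an untiled array<int, 1> with the copies done via copy()) looks roughly like the sketch below; the wrapper function and names are ours, and the timing code is omitted:

#include <amp.h>
using namespace concurrency;

// Rough sketch of the AMP baseline described above (no tiling, timing omitted).
void vec_copy_amp(const int *a, int *b, int length)
{
    array<int, 1> a_arr(length);           // device source
    array<int, 1> b_arr(length);           // device destination

    copy(a, a_arr);                        // copy to GPU

    parallel_for_each(b_arr.extent, [&](index<1> idx) restrict(amp)
    {
        b_arr[idx] = a_arr[idx];           // execution in GPU
    });
    b_arr.accelerator_view.wait();         // make sure the kernel has finished

    copy(b_arr, b);                        // copy back to CPU
}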
          Copy to GPU    Execution in GPU    Copy back to CPU
OpenCL    7.45 ms        100.57 ms           7.68 ms
AMP       98.64 ms       39.4 ms             89.11 ms
As can be seen, AMP spends much more time on the memory transfers between the host and the GPU, but less time on GPU execution. Our guess was that AMP copies the data into a deeper memory buffer than OpenCL does, so that during GPU execution, fetching the data from the GPU buffer costs OpenCL much more time.
However, increasing the local_size in OpenCL can help shorten the GPU Execution time:
Local_size                 1           4          8          16         32         64
GPU Execution in OpenCL    100.57 ms   35.34 ms   18.13 ms   10.37 ms   10.15 ms   10.21 ms
The time spent on the memory copies remains almost the same, since it does not depend on local_size.
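For reference, only the local_work_size argument of clEnqueueNDRangeKernel changes in the sweep above; a rough sketch, assuming a valid cl_command_queue queue and cl_kernel kernel have been created elsewhere:

// Sketch of the local_size sweep (requires <CL/cl.h>; error checking and timing omitted).
size_t global_size = 3264 * 2448;
size_t local_sizes[] = { 1, 4, 8, 16, 32, 64 };

for (size_t i = 0; i < sizeof(local_sizes) / sizeof(local_sizes[0]); ++i)
{
    size_t local_size = local_sizes[i];   // global_size is divisible by each of these values
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);                      // block until the kernel completes so it can be timed
}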
But in AMP the bottleneck is the memory transfer, not the GPU execution, and changing the tile size does not help (in fact, it increased the GPU execution time in our test). So after increasing the local_size in OpenCL, OpenCL outperforms AMP by a wide margin.
Has anyone come across similar issues? What was your solution? I would appreciate your advice.
Thanks a lot.
-
The IvyBridge integrated GPU shares the same physical memory as the CPU. Hence copying from host memory to device memory ideally only requires a simple "memcpy" from the source host memory to the device buffer (which is accessible from the CPU). The OpenCL driver for IvyBridge takes advantage of this ability.

In the VS2012 release, C++ AMP is unable to utilize this ability (due to the DirectX restriction that device buffers are inaccessible on the CPU). This is something we are actively working on addressing for future versions of C++ AMP. The current implementation needs to first copy data from the host buffer to a "staging buffer", followed by a copy from the staging buffer to the device buffer. By itself, this difference only makes C++ AMP copies on IvyBridge GPUs ~50% slower than OpenCL (our test benchmarks validate this expectation). C++ AMP offers staging arrays to improve copy performance: if a staging array is used as the CPU data container, copies to/from an accelerator array are much faster.

However, the results from your experiment indicate that the C++ AMP copy is ~13 times slower than OpenCL, which cannot be explained by the above. We would have to take a closer look at your code to understand the cause of this discrepancy.

A couple of our earlier blog posts provide guidance on measuring the performance of C++ AMP programs. I would encourage you to take a look at them.
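For illustration, a staging array is created by passing both a CPU accelerator_view and the target accelerator's view to the array constructor; a minimal sketch with example names:

#include <amp.h>
using namespace concurrency;

// Minimal staging-array sketch: the staging array lives in CPU-accessible memory
// associated with the target accelerator, which can make copies cheaper.
void staging_copy_example(const int *host_src, int length)    // host_src is an example host pointer
{
    accelerator gpu;                                          // default accelerator
    accelerator cpu(accelerator::cpu_accelerator);

    array<int, 1> staging(length, cpu.default_view, gpu.default_view);
    array<int, 1> device_arr(length, gpu.default_view);

    copy(host_src, staging);       // fill the staging array on the CPU side
    copy(staging, device_arr);     // staging -> device copy
}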
-
Hi Amit, thanks for your excellent, detailed information. I have tried the staging array and warm-up, but the results are still not satisfactory. This is my code:
#include <amp.h>
#include <windows.h>
#include <cstdio>
using namespace concurrency;

int vec_copy_amp_stage(int *a, int *b, size_t global_size)
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER startTimeQPC;
    LARGE_INTEGER endTimeQPC;

    int length = (int)global_size;

    accelerator default_device;
    accelerator cpuAcc = accelerator(accelerator::cpu_accelerator);

    // Staging array: CPU-accessible, associated with the default (GPU) accelerator.
    array<int, 1> a_stage(length, cpuAcc.default_view, default_device.default_view);
    copy(a, a_stage);   // staging array here, but this copy is not timed

    array<int, 1> a_arr(length);
    array<int, 1> b_arr(length);

    QueryPerformanceFrequency(&frequency);

    for (int i = 0; i < 10; i++)
    {
        // Copy to GPU: staging array -> device array (timed).
        QueryPerformanceCounter(&startTimeQPC);
        copy(a_stage, a_arr);
        a_arr.accelerator_view.wait();   // timing the copy from staging array to array
        QueryPerformanceCounter(&endTimeQPC);
        double copy_duration_a = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        // Execution on the GPU (timed).
        QueryPerformanceCounter(&startTimeQPC);
        parallel_for_each(b_arr.extent, [&](index<1> idx) restrict(amp)
        {
            b_arr[idx] = a_arr[idx];
        });
        b_arr.accelerator_view.wait();
        QueryPerformanceCounter(&endTimeQPC);
        double kernel_duration = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        // Copy back to CPU: device array -> host pointer (timed).
        QueryPerformanceCounter(&startTimeQPC);
        copy(b_arr, b);
        QueryPerformanceCounter(&endTimeQPC);
        double copy_duration_b = (double)((endTimeQPC.QuadPart - startTimeQPC.QuadPart) * 1000.0) / (double)frequency.QuadPart;

        printf("a: %f, b: %f, kernel time: %f ms\n", copy_duration_a, copy_duration_b, kernel_duration);
    }

    return 0;
}
In the code above, a staging array a_stage is defined, and the contents pointed to by the pointer a are copied into it; this copy is not timed. Then, over 10 iterations, the copy from the staging array to the device array is timed. The first iteration should warm up the arrays and the kernel, so the following 9 iterations run in a warmed-up environment. The results of these 9 iterations are:
       Copy to GPU using staging arrays    Execution in GPU    Copy back to CPU
AMP    87.36 ms                            31.96 ms            89.07 ms
The time consumed by copying via the staging array is still too long compared with OpenCL.
Are there any problems in my code? I need your suggestions.
Thanks very much.
-
Your code looks fine. I executed your code and got expected results on a variety of discrete GPU cards (both from AMD and NVidia). The behavior you are observing seems to be specific to the IvyBridge GPU driver you are using.
Can you share the version of the driver you are using?
-Amit
-
You're right! After I changed the graphics driver to the Microsoft Basic Display Adapter in Windows 8, I got the expected results:
       Copy to GPU using staging arrays    Execution in GPU    Copy back to CPU
AMP    6.55 ms                             10.02 ms            10.01 ms
Thank you very much for your kind and patient help.