How C++ AMP Tiling Optimization Works, Using Matrix Multiplication as an Example
Here is our plain, untiled AMP matrix multiplication:
#include <amp.h>

// vA, vB, vC hold the host-side matrix data; the dimensions are
// A: M x W, B: W x N, C: M x N.
Concurrency::array_view<const float, 2> a(M, W, vA);
Concurrency::array_view<const float, 2> b(W, N, vB);
Concurrency::array_view<float, 2> c(M, N, vC);
c.discard_data();  // c is write-only: skip copying vC to the GPU

Concurrency::parallel_for_each(
    c.extent,  // one GPU thread per element of c
    [=](Concurrency::index<2> idx) restrict(amp) {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int inner = 0; inner < W; ++inner) {
            sum += a(row, inner) * b(inner, col);
        }
        c[idx] = sum;
    });
c.synchronize();  // copy the result back to vC
This code was explained in the previous article, so I won't repeat it here:
Parallel matrix multiplication implementations compared: single-threaded serial CPU / dual-threaded parallel CPU / CPU with the PPL parallel library / GPU with C++ AMP / GPU with C++ AMP tiling
The problem with this version
See the diagram of how matrix multiplication works:
Analysis:
To compute the first element c(0,0), we need row 0 of A, a(0,0)~~a(0,n), and column 0 of B, b(0,0)~~b(n,0).
To compute the second element c(0,1), we need the same row a(0,0)~~a(0,n) again, and column 1 of B, b(0,1)~~b(n,1).
To compute the third element c(0,2), we need the same row a(0,0)~~a(0,n) yet again, and column 2 of B, b(0,2)~~b(n,2).
…
Clearly, the row a(0,0)~~a(0,n) is fetched from memory n times over, and this repeated memory traffic drags down the kernel's speed. That is why we turn to tiling.
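The waste is easy to quantify. Below is a minimal, AMP-free C++ sketch (the function name and sizes are mine, for illustration only) that counts how many times the naive triple loop touches each element of A:

```cpp
#include <vector>

// Count how many times each element of the M x W matrix A is read by
// the naive loop nest computing C = A * B (where B is W x N).
std::vector<int> count_a_reads(int M, int N, int W) {
    std::vector<int> reads(M * W, 0);
    for (int row = 0; row < M; ++row)
        for (int col = 0; col < N; ++col)
            for (int inner = 0; inner < W; ++inner)
                ++reads[row * W + inner];  // a(row, inner) is fetched here
    return reads;
}
```

Every element of A ends up being read N times, once per output column; the elements of B are re-read the same way, once per output row.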
How tiling works
The idea is to divide C into TileSize x TileSize tiles. The threads of each tile cooperatively load the blocks of A and B they all need into tile_static memory once, synchronize at a barrier, and then read from that fast on-chip copy instead of fetching the same global-memory values over and over.
The code:
static const int TileSize = 16;

Concurrency::array_view<const float, 2> a(M, W, vA);
Concurrency::array_view<const float, 2> b(W, N, vB);
Concurrency::array_view<float, 2> c(M, N, vC);
c.discard_data();

Concurrency::parallel_for_each(
    c.extent.tile<TileSize, TileSize>(),
    [=](Concurrency::tiled_index<TileSize, TileSize> tidx) restrict(amp) {
        int row = tidx.local[0];  // this thread's position within its tile
        int col = tidx.local[1];
        float sum = 0.0f;
        for (int inner = 0; inner < W; inner += TileSize) {
            // Per-tile shared memory, visible to all threads in the tile.
            tile_static float sA[TileSize][TileSize];
            tile_static float sB[TileSize][TileSize];
            // Each thread loads one element of the current A and B blocks.
            sA[row][col] = a(tidx.global[0], col + inner);
            sB[row][col] = b(row + inner, tidx.global[1]);
            tidx.barrier.wait();  // wait until the whole block is loaded
            for (int k = 0; k < TileSize; ++k) {
                sum += sA[row][k] * sB[k][col];
            }
            tidx.barrier.wait();  // don't overwrite sA/sB while others read
        }
        c[tidx.global] = sum;
    });
c.synchronize();
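The same blocking idea also works on a CPU, where the cache plays the role of tile_static memory. Here is a portable C++ sketch (the function name and the small tile size are my own choices, for illustration) that computes the same result as the naive multiply, assuming the dimensions are multiples of the tile size just as the AMP kernel above does:

```cpp
#include <vector>

static const int TILE = 2;  // small for illustration; the GPU version uses 16

// Blocked multiply of an M x W matrix A by a W x N matrix B, both stored
// row-major in flat vectors. Assumes M, N, W are multiples of TILE.
std::vector<float> tiled_multiply(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int M, int N, int W) {
    std::vector<float> C(M * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int j0 = 0; j0 < N; j0 += TILE)
            for (int k0 = 0; k0 < W; k0 += TILE)
                // Accumulate one TILE x TILE block product; the three
                // inner loops touch only small blocks that stay in cache.
                for (int i = i0; i < i0 + TILE; ++i)
                    for (int j = j0; j < j0 + TILE; ++j)
                        for (int k = k0; k < k0 + TILE; ++k)
                            C[i * N + j] += A[i * W + k] * B[k * N + j];
    return C;
}
```

The k0 loop mirrors the kernel's `inner` loop, and the three inner loops mirror the per-tile work that the GPU threads perform in parallel.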
The tile size of 16 was chosen after testing: sizes below 16 give little benefit and can even be slower than the untiled version above. Feel free to experiment with other sizes yourself.
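One caveat the code leaves implicit: the tiled kernel indexes global memory without bounds checks, so M, N, and W must all be multiples of TileSize. A common workaround (my suggestion, not from the article) is to zero-pad each dimension up to the next multiple before uploading the data; padding rows and columns contribute only zeros to each dot product, so the result is unchanged. A tiny helper for the rounding:

```cpp
// Round a matrix dimension up to the next multiple of the tile size,
// so the tiled kernel's unchecked indexing stays in bounds.
int round_up_to_tile(int n, int tile_size) {
    return ((n + tile_size - 1) / tile_size) * tile_size;
}
```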