Windows C++ Parallel Programming: C++ AMP (Part 1)

程序员王马

Published on 2024-08-31 00:15:00

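C++ AMP (C++ Accelerated Massive Parallelism) accelerates the execution of C++ code by taking advantage of data-parallel hardware, typically a graphics processing unit (GPU) on a discrete graphics card. The C++ AMP programming model includes support for multidimensional arrays, indexing, memory transfer, and tiling. It also includes a mathematical function library. You can use the C++ AMP language extensions to control how data is moved between the CPU and the GPU.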

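Starting with Visual Studio 2022 version 17.0, the C++ AMP headers are deprecated. Including any AMP header generates build errors unless you define _SILENCE_AMP_DEPRECATION_WARNINGS before the first AMP header is included, which silences the deprecation warnings.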

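Using tiles

You can use tiling to maximize the acceleration of your app. Tiling divides threads into equal rectangular subsets, called tiles. If you use an appropriate tile size and a tiled algorithm, you can get even more acceleration from your C++ AMP code. The basic components of tiling are: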

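tile_static variables. The primary benefit of tiling is the performance gain from tile_static access. Access to data in tile_static memory can be significantly faster than access to data in the global space (array or array_view objects). An instance of a tile_static variable is created for each tile, and all threads in the tile have access to it. In a typical tiled algorithm, data is copied from global memory into tile_static memory once and then read from tile_static memory many times.
The tile_barrier::wait method. A call to tile_barrier::wait suspends execution of the current thread until all threads in the same tile have reached the call. The order in which threads run is not guaranteed; the only guarantee is that no thread in the tile executes past the call until every thread has reached it. This means that with tile_barrier::wait you can perform tasks tile by tile rather than thread by thread. A typical tiled algorithm has code that initializes the tile_static memory for the whole tile, followed by a call to tile_barrier::wait. The code after tile_barrier::wait contains the computations that need access to all of the tile_static values.
Local and global indexing. You have access to the index of the thread relative to the entire array_view or array object and to its index relative to the tile. Using the local index can make your code easier to read and debug. Typically, you use local indexing to access tile_static variables and global indexing to access array and array_view variables.
The tiled_extent class and the tiled_index class. In the parallel_for_each call, you use a tiled_extent object instead of an extent object, and a tiled_index object instead of an index object.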

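To take advantage of tiling, your algorithm must partition the compute domain into tiles and then copy the tile data into tile_static variables for faster access.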
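The pattern just described (partition into tiles, stage each tile's data in fast tile-local storage, wait, then compute from the staged copy) can be sketched on the CPU in plain C++. This is only an illustration under stated assumptions, not C++ AMP code: the function name tileSums is hypothetical, the small local array merely stands in for a tile_static array, and the two phases run sequentially here whereas on a GPU the threads of a tile would run them in parallel. The sketch also assumes the matrix divides evenly into tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not part of C++ AMP).
// Emulates a tiled algorithm sequentially: each TileRows x TileCols tile
// stages its data in a small local buffer, then computes from that buffer.
template <int TileRows, int TileCols>
std::vector<int> tileSums(const std::vector<int>& data, int rows, int cols) {
    // This sketch assumes the matrix divides evenly into tiles.
    assert(rows % TileRows == 0 && cols % TileCols == 0);
    assert(data.size() == static_cast<std::size_t>(rows) * cols);

    const int tilesDown = rows / TileRows;
    const int tilesAcross = cols / TileCols;
    std::vector<int> sums(static_cast<std::size_t>(tilesDown) * tilesAcross, 0);

    for (int tileRow = 0; tileRow < tilesDown; ++tileRow) {
        for (int tileCol = 0; tileCol < tilesAcross; ++tileCol) {
            // Stand-in for a tile_static array: one instance per tile.
            int local[TileRows][TileCols];

            // Phase 1: each "thread" copies its element from global memory
            // (global index) into the tile-local buffer (local index).
            for (int r = 0; r < TileRows; ++r) {
                for (int c = 0; c < TileCols; ++c) {
                    const int globalRow = tileRow * TileRows + r;
                    const int globalCol = tileCol * TileCols + c;
                    local[r][c] = data[globalRow * cols + globalCol];
                }
            }

            // In C++ AMP, t_idx.barrier.wait() would go here so that no thread
            // reads the buffer before every thread in the tile has written it.

            // Phase 2: compute using only the staged copy.
            int sum = 0;
            for (int r = 0; r < TileRows; ++r) {
                for (int c = 0; c < TileCols; ++c) {
                    sum += local[r][c];
                }
            }
            sums[tileRow * tilesAcross + tileCol] = sum;
        }
    }
    return sums;
}
```

For an 8x9 matrix holding the values 0 through 71 in row-major order, calling tileSums<2, 3>(data, 8, 9) yields 12 sums, one per tile; the first tile covers the values 0, 1, 2, 9, 10, 11 and therefore sums to 33.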

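Example of global, tile, and local indices

The following diagram represents an 8x9 matrix of data arranged in 2x3 tiles.

The following example displays the global, tile, and local indices of this tiled matrix. An array_view object is created from elements of type Description. Description holds the global, tile, and local indices of an element in the matrix. The code in the call to parallel_for_each sets the global, tile, and local index values of each element. The output displays the values in the Description structures.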

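// Must be defined before any AMP header: the headers are deprecated
// starting with Visual Studio 2022 version 17.0.
#define _SILENCE_AMP_DEPRECATION_WARNINGS
#include <iostream>
#include <iomanip>
#include <vector>
#include <Windows.h>
#include <amp.h>
using namespace concurrency;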

const int ROWS = 8;
const int COLS = 9;

// tileRow and tileColumn specify the tile that each thread is in.
// globalRow and globalColumn specify the location of the thread in the array_view.
// localRow and localColumn specify the location of the thread relative to the tile.
struct Description {
    int value;
    int tileRow;
    int tileColumn;
    int globalRow;
    int globalColumn;
    int localRow;
    int localColumn;
};

// A helper function for formatting the output.
void SetConsoleColor(int color) {
    // 4 is the red foreground attribute, 2 is the green one.
    int colorValue = (color == 0) ? 4 : 2;
    SetConsoleTextAttribute(GetStdHandle(STD_OUTPUT_HANDLE), colorValue);
}

// A helper function for formatting the output.
void SetConsoleSize(int height, int width) {
    COORD coord;

    coord.X = width;
    coord.Y = height;
    SetConsoleScreenBufferSize(GetStdHandle(STD_OUTPUT_HANDLE), coord);

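    // Use a stack object (the original heap allocation was never freed).
    // Window coordinates are inclusive, so the last cell is (width - 1, height - 1).
    SMALL_RECT rect;
    rect.Left = 0;
    rect.Top = 0;
    rect.Right = static_cast<SHORT>(width - 1);
    rect.Bottom = static_cast<SHORT>(height - 1);
    SetConsoleWindowInfo(GetStdHandle(STD_OUTPUT_HANDLE), TRUE, &rect);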
}

// This method creates an 8x9 matrix of Description structures.
// In the call to parallel_for_each, the structure is updated
// with tile, global, and local indices.
void TilingDescription() {
    // Create 72 (8x9) Description structures.
    std::vector<Description> descs;
    for (int i = 0; i < ROWS * COLS; i++) {
        Description d = {i, 0, 0, 0, 0, 0, 0};
        descs.push_back(d);
    }

    // Create an array_view from the Description structures.
    extent<2> matrix(ROWS, COLS);
    array_view<Description, 2> descriptions(matrix, descs);

    // Update each Description with the tile, global, and local indices.
    parallel_for_each(descriptions.extent.tile< 2, 3>(),
        [=] (tiled_index< 2, 3> t_idx) restrict(amp)
    {
        descriptions[t_idx].globalRow = t_idx.global[0];
        descriptions[t_idx].globalColumn = t_idx.global[1];
        descriptions[t_idx].tileRow = t_idx.tile[0];
        descriptions[t_idx].tileColumn = t_idx.tile[1];
        descriptions[t_idx].localRow = t_idx.local[0];
        descriptions[t_idx].localColumn= t_idx.local[1];
    });

    // Print out the Description structure for each element in the matrix.
    // Tiles are displayed in red and green to distinguish them from each other.
    SetConsoleSize(100, 150);
    for (int row = 0; row < ROWS; row++) {
        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Value: " << std::setw(2) << descriptions(row, column).value << "      ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Tile:   " << "(" << descriptions(row, column).tileRow << "," << descriptions(row, column).tileColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Global: " << "(" << descriptions(row, column).globalRow << "," << descriptions(row, column).globalColumn << ")  ";
        }
        std::cout << "\n";

        for (int column = 0; column < COLS; column++) {
            SetConsoleColor((descriptions(row, column).tileRow + descriptions(row, column).tileColumn) % 2);
            std::cout << "Local:  " << "(" << descriptions(row, column).localRow << "," << descriptions(row, column).localColumn << ")  ";
        }
        std::cout << "\n";
        std::cout << "\n";
    }
}

int main() {
    TilingDescription();
    char wait;
    std::cin >> wait;
}

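The main work of the example is in the definition of the array_view object and the call to parallel_for_each.

The vector of Description structures is wrapped in an 8x9 array_view object, which copies the data to the accelerator as needed.
The parallel_for_each method is called with a tiled_extent object as the compute domain. The tiled_extent object is created by calling the extent::tile() method on descriptions.extent. The type parameters of the call to extent::tile(), <2,3>, specify that 2x3 tiles are created. Thus, the 8x9 matrix is divided into 12 tiles (4 rows x 3 columns).
The parallel_for_each method is called with a tiled_index<2,3> object (t_idx) as the index. The type parameters of the index (t_idx) must match the type parameters of the compute domain (descriptions.extent.tile< 2, 3>()).
When each thread executes, the index t_idx reports which tile the thread is in (the tiled_index::tile property) and where the thread is within that tile (the tiled_index::local property).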
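The relationship between the three kinds of index can be checked with ordinary integer arithmetic. The following is a minimal sketch in plain C++, not C++ AMP code: the types Index2 and TileInfo and the functions describe and tileCount are hypothetical names introduced here, and they only mimic the tile, local, and global members of tiled_index. The sketch assumes row-major tiles and an extent that divides evenly by the tile size.

```cpp
#include <cassert>

// Hypothetical types for illustration; they mimic, but are not,
// the tile/local/global members of concurrency::tiled_index.
struct Index2 { int row; int col; };
struct TileInfo { Index2 tile; Index2 local; Index2 global; };

// Derive the tile and local indices of a thread from its global index,
// for tiles of tileRows x tileCols.
inline TileInfo describe(Index2 global, int tileRows, int tileCols) {
    assert(tileRows > 0 && tileCols > 0);
    TileInfo info;
    info.global = global;
    info.tile = { global.row / tileRows, global.col / tileCols };
    info.local = { global.row % tileRows, global.col % tileCols };
    return info;
}

// Number of tiles covering a rows x cols extent that divides evenly.
inline int tileCount(int rows, int cols, int tileRows, int tileCols) {
    assert(tileRows > 0 && tileCols > 0);
    assert(rows % tileRows == 0 && cols % tileCols == 0);
    return (rows / tileRows) * (cols / tileCols);
}
```

For the 8x9 matrix with 2x3 tiles, tileCount(8, 9, 2, 3) gives 12, and the thread at global index (5,7) lies in tile (2,2) at local index (1,1). In the other direction, the global index is the tile index times the tile size plus the local index: 2 * 2 + 1 = 5 and 2 * 3 + 1 = 7.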