Brook+ Programming (4)

License: CC 4.0 BY-SA

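This post presents an optimized matrix multiplication that preloads data and uses vectorized operations to reduce the number of memory accesses and improve computational efficiency.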


(Continued from the previous post.)

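Starting in the main function, three streams are created to hold the two input matrices (A and B) and the output matrix (streams are commonly used to represent matrices). Three memory buffers are then allocated (input_A, input_B, and input_C), after which streamRead() copies the data from input_A into stream A and from input_B into stream B.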

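The line simple_matmult((float) Width, A, B, C) passes Width, the input streams A and B, and the output stream C to the kernel function, and also triggers execution of the kernel on the stream processor. In a simple matrix multiplication, the kernel reads a row vector from one matrix and a column vector from the other, and writes their dot product to the result. In the example above, the kernel is run for every element position of the output stream. In summary: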

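1. Loop over the rows of matrix A;

2. Loop over the columns of matrix B;

3. Fetch one value from each matrix at a time;

4. Multiply the values to produce the result.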
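The four steps above can be sketched on the CPU in plain C. This is only an illustrative reference, not Brook+ code: the function name matmul_naive and the row-major flat-array layout are assumptions made for this sketch.

```c
#include <stddef.h>

/* CPU reference for the four steps listed above.
 * Assumption of this sketch: square width x width matrices stored
 * row-major in flat arrays. */
void matmul_naive(const float *a, const float *b, float *c, size_t width)
{
    for (size_t row = 0; row < width; ++row) {        /* 1. rows of A    */
        for (size_t col = 0; col < width; ++col) {    /* 2. columns of B */
            float accumulator = 0.0f;
            for (size_t k = 0; k < width; ++k) {
                /* 3. fetch one value from each matrix */
                float a_val = a[row * width + k];
                float b_val = b[k * width + col];
                /* 4. multiply and add to the running sum */
                accumulator += a_val * b_val;
            }
            c[row * width + col] = accumulator;
        }
    }
}
```

On the stream processor the two outer loops are not written by hand: the kernel body corresponds roughly to the innermost loop and is run once per output position.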

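The kernel here makes use of vector data types (float2 and float4). Brook+ supports data types with up to four elements (translator's note: that is, vectors with at most four components). The components can be accessed in any combination, which is known as swizzling.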
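To make swizzling concrete, here is a small C emulation. C has no swizzle syntax, so the float4 struct and the swizzle() helper below are hypothetical stand-ins invented for this sketch; in kernel code one would write something like v.xxxx or v.wzyx directly.

```c
/* Hypothetical stand-in for a four-component vector type. */
typedef struct { float x, y, z, w; } float4;

/* Return component i of v (0 = x, 1 = y, 2 = z, anything else = w). */
float component(float4 v, int i)
{
    switch (i) {
    case 0:  return v.x;
    case 1:  return v.y;
    case 2:  return v.z;
    default: return v.w;
    }
}

/* Emulate a swizzle: build a new vector from any combination of
 * source components, e.g. swizzle(v, 0, 0, 0, 0) plays the role of v.xxxx. */
float4 swizzle(float4 v, int i0, int i1, int i2, int i3)
{
    float4 r = { component(v, i0), component(v, i1),
                 component(v, i2), component(v, i3) };
    return r;
}
```

For v = (1, 2, 3, 4), swizzle(v, 0, 0, 0, 0) broadcasts the first component into all four slots, while swizzle(v, 3, 2, 1, 0) reverses the vector.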

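The kernel's input streams here differ slightly from those of the earlier sum kernel. They are declared with square brackets, which means the input stream is treated as a memory array whose elements can be addressed directly. This is known as a "gather stream". One important difference between kernel code and C code is that a gather stream can only be indexed with a vector type, not with multiple pairs of square brackets (translator's note: as one would with a C array).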

For example, A[x][y] is not allowed.
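The idea of addressing a 2D array with a single vector-valued index, instead of two bracket pairs, can be sketched in C as follows. The float2 struct and gather2d() are hypothetical names for this illustration, and the convention that x selects the column and y selects the row is an assumption of the sketch rather than something stated in the text.

```c
#include <stddef.h>

/* Hypothetical stand-in for a two-component vector index. */
typedef struct { float x, y; } float2;

/* Read one element of a row-major 2D array using a single vector index.
 * Assumption of this sketch: index.x is the column, index.y is the row. */
float gather2d(const float *data, size_t width, float2 index)
{
    size_t col = (size_t)index.x;
    size_t row = (size_t)index.y;
    return data[row * width + col];
}
```

With a 2 x 3 array, gather2d(data, 3, (float2){2.0f, 1.0f}) reads the element in row 1, column 2.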

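To decide which row and column the kernel should access, the position of the output (translator's note: the single output-matrix value being computed) must be known. This is done with the indexof() function, which returns an integer representing the position in the output domain.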

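In the while loop, values from a column of matrix A are multiplied by values from matrix B (translator's note: this looks like a mistake; for A x B it should be a row of A multiplied by a column of B), and the accumulator variable holds the running sum of the products.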

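As in the earlier sum example, the result is written without square brackets. Brook+ automatically writes the result back to the correct location, namely the position in the output stream reported by indexof().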
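The division of labour described here (the kernel only computes a value for "its" position, and the runtime decides where that value is stored) can be imitated in C with a function pointer. Everything in this sketch is hypothetical: element_kernel, dot_kernel, and run_over_output are invented names, and passing the position as explicit row and col arguments merely stands in for what indexof() provides inside a real kernel.

```c
#include <stddef.h>

/* Hypothetical kernel shape: given the output position, return its value. */
typedef float (*element_kernel)(const float *a, const float *b,
                                size_t width, size_t row, size_t col);

/* Dot product of row `row` of A with column `col` of B (row-major storage). */
float dot_kernel(const float *a, const float *b,
                 size_t width, size_t row, size_t col)
{
    float accumulator = 0.0f;
    for (size_t k = 0; k < width; ++k)
        accumulator += a[row * width + k] * b[k * width + col];
    return accumulator;
}

/* Stand-in for the runtime: call the kernel once per output position and
 * store the returned value at that same position. The kernel itself never
 * indexes the output array. */
void run_over_output(element_kernel kernel, const float *a, const float *b,
                     float *c, size_t width)
{
    for (size_t row = 0; row < width; ++row)
        for (size_t col = 0; col < width; ++col)
            c[row * width + col] = kernel(a, b, width, row, col);
}
```

A call such as run_over_output(dot_kernel, a, b, c, width) fills c without dot_kernel ever knowing how the output is laid out in memory.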

2.3.2 Optimized Matrix Multiply Example

(Translator's note: a more refined matrix multiplication example.)

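A drawback of the kernel above is that the same data is fetched again in different places. For example, at two adjacent output positions the kernel uses the same row vector or the same column vector. In general, reading data from memory costs more than reading data already held in the stream processor. A better approach is to do more computation per kernel invocation so that fewer reads are needed. The optimized matrix multiplication code is shown below: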

kernel void
optimized_matmult(int loopVar0,
        float4 A1[][], float4 A2[][], float4 A3[][], float4 A4[][],
        float4 A5[][], float4 A6[][], float4 A7[][], float4 A8[][],
        float4 B1[][], float4 B2[][], float4 B3[][], float4 B4[][],
        out float4 C1<>, out float4 C2<>, out float4 C3<>, out float4 C4<>,
        out float4 C5<>, out float4 C6<>, out float4 C7<>, out float4 C8<>)
{
    // Setting zero
    float4 zero = float4(0.0f, 0.0f, 0.0f, 0.0f);

    // Declaring and initializing accumulators
    float4 accumulator1 = zero;
    float4 accumulator2 = zero;
    float4 accumulator3 = zero;
    float4 accumulator4 = zero;
    float4 accumulator5 = zero;
    float4 accumulator6 = zero;
    float4 accumulator7 = zero;
    float4 accumulator8 = zero;

    // Row number of output position
    int i = instance().y;

    // Column number of output position
    int j = instance().x;

    int k = 0;
    for(; k < loopVar0; ++k)
    {
        // Fetching values from A
        float4 A11 = A1[i][k]; float4 A22 = A2[i][k];
        float4 A33 = A3[i][k]; float4 A44 = A4[i][k];
        float4 A55 = A5[i][k]; float4 A66 = A6[i][k];
        float4 A77 = A7[i][k]; float4 A88 = A8[i][k];

        // Fetching values from B
        float4 B11 = B1[k][j]; float4 B22 = B2[k][j];
        float4 B33 = B3[k][j]; float4 B44 = B4[k][j];

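        // Each accumulator adds one float4 of A (four consecutive k values,
        // broadcast one component at a time) times the four fetched float4s of B
        accumulator1 += A11.xxxx * B11.xyzw + A11.yyyy * B22.xyzw + A11.zzzz * B33.xyzw + A11.wwww * B44.xyzw;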
        accumulator2 += A22.xxxx * B11.xyzw + A22.yyyy * B22.xyzw + A22.zzzz * B33.xyzw + A22.wwww * B44.xyzw;
        accumulator3 += A33.xxxx * B11.xyzw + A33.yyyy * B22.xyzw + A33.zzzz * B33.xyzw + A33.wwww * B44.xyzw;
        accumulator4 += A44.xxxx * B11.xyzw + A44.yyyy * B22.xyzw + A44.zzzz * B33.xyzw + A44.wwww * B44.xyzw;
        accumulator5 += A55.xxxx * B11.xyzw + A55.yyyy * B22.xyzw + A55.zzzz * B33.xyzw + A55.wwww * B44.xyzw;
        accumulator6 += A66.xxxx * B11.xyzw + A66.yyyy * B22.xyzw + A66.zzzz * B33.xyzw + A66.wwww * B44.xyzw;
        accumulator7 += A77.xxxx * B11.xyzw + A77.yyyy * B22.xyzw + A77.zzzz * B33.xyzw + A77.wwww * B44.xyzw;
        accumulator8 += A88.xxxx * B11.xyzw + A88.yyyy * B22.xyzw + A88.zzzz * B33.xyzw + A88.wwww * B44.xyzw;
    }

    C1 = accumulator1;
    C2 = accumulator2;
    C3 = accumulator3;
    C4 = accumulator4;
    C5 = accumulator5;
    C6 = accumulator6;
    C7 = accumulator7;
    C8 = accumulator8;
}
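One accumulator update from the loop above can be checked on the CPU with a small C emulation. This is a sketch under assumptions: the float4 struct and the helpers f4_scale, f4_add, and accumulate_step are hypothetical names, and scaling by a single component stands in for the .xxxx-style broadcast swizzles.

```c
/* Hypothetical stand-in for a four-component vector type. */
typedef struct { float x, y, z, w; } float4;

/* Multiply every component of v by the scalar s (emulates a.xxxx * v). */
float4 f4_scale(float s, float4 v)
{
    float4 r = { s * v.x, s * v.y, s * v.z, s * v.w };
    return r;
}

/* Component-wise addition. */
float4 f4_add(float4 a, float4 b)
{
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Emulates: acc += a.xxxx * b1 + a.yyyy * b2 + a.zzzz * b3 + a.wwww * b4
 * Here a holds four consecutive values from one row of A, and b1..b4 are
 * the four matching rows of B, each covering four output columns. */
float4 accumulate_step(float4 acc, float4 a,
                       float4 b1, float4 b2, float4 b3, float4 b4)
{
    float4 sum = f4_scale(a.x, b1);
    sum = f4_add(sum, f4_scale(a.y, b2));
    sum = f4_add(sum, f4_scale(a.z, b3));
    sum = f4_add(sum, f4_scale(a.w, b4));
    return f4_add(acc, sum);
}
```

Each call is the product of a 1 x 4 slice of A with a 4 x 4 block of B, which yields four output values at once. Counting the fetches in the kernel above, one loop iteration reads 12 float4 values (8 from A and 4 from B, 48 floats in total) and performs 8 x 4 x 4 = 128 multiply-adds; computing the same 128 products one output element at a time would need 256 float reads, which is where the saving in memory traffic comes from.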