Chapter 12: The Compute Shader (notes on Introduction to 3D Game Programming with DirectX 11)

繁弱

Category: DirectX

Article link: https://blog.youkuaiyun.com/zzgyy123/article/details/88171731

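Whereas CPUs are designed for random memory access, GPUs are optimized for processing large amounts of memory from a single location or from sequential locations (a streaming operation). In addition, because vertices and pixels can be processed independently of one another, GPUs are built to be massively parallel. For example, the NVIDIA "Fermi" architecture [NVIDIA09] supports up to 16 streaming multiprocessors with 32 CUDA cores each, for a total of 512 CUDA cores.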

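This parallelism is a big advantage for graphics rendering, and non-graphics applications can exploit it too; using the GPU this way is called general purpose GPU (GPGPU) programming. Not every algorithm is a good fit for the GPU: it needs to be data-parallel, meaning the same operation is applied to many data elements that are independent of one another and can therefore be processed in parallel.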

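In the book's pipeline diagram, the compute shader (CS) sits off to the side rather than inside the rendering pipeline, and it can read from and write to GPU resources at any time. It gives us access to the GPU for implementing data-parallel algorithms without drawing anything.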

12.1 THREADS AND THREAD GROUPS

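In GPU programming, the threads to execute are divided into thread groups, and each thread group runs on a single multiprocessor. On a GPU with 16 multiprocessors you would therefore want at least 16 thread groups, so that every multiprocessor has work to do. In practice it is better to have at least two thread groups per multiprocessor, that is, at least 32 groups in this case: when one group stalls while waiting, the multiprocessor can switch to the second group and keep working.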

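Each thread group gets its own block of shared memory that all threads in the group can access. A thread cannot access the shared memory of a different group. Synchronization likewise happens only among the threads of one group; threads in different groups cannot be synchronized with each other.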

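A thread group consists of n threads. The hardware actually divides these threads into warps (the warp size on NVIDIA hardware is 32 threads; the ATI equivalent, the "wavefront", is 64 threads). A multiprocessor processes the threads of a warp together in SIMD32 fashion, where SIMD stands for Single Instruction, Multiple Data: the same instruction is executed for all 32 threads at once. A Fermi multiprocessor has 32 CUDA cores and each core processes one thread, which is why a warp holds 32 threads. Direct3D lets you specify a thread group size that is not a multiple of 32, but for performance it is best to stick to multiples of the warp size.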
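To make the cost of an awkward group size concrete, here is a small C++ sketch. It assumes a simplified model in which the hardware rounds each thread group up to a whole number of warps, so a partially filled warp still occupies a full one. The helper names `WarpsPerGroup` and `IdleLanes` are made up for this illustration and are not part of Direct3D.

```cpp
#include <cassert>
#include <cstdint>

// Number of warps (NVIDIA, 32 threads) or wavefronts (ATI, 64 threads)
// needed to cover one thread group, rounding up to whole warps.
// Hypothetical helper, not a Direct3D API.
inline std::uint32_t WarpsPerGroup(std::uint32_t threadsPerGroup,
                                   std::uint32_t warpSize)
{
    assert(warpSize > 0);
    return (threadsPerGroup + warpSize - 1) / warpSize;
}

// Lanes left without a thread in the last, partially filled warp
// (assumed model: a partial warp still occupies a full warp).
inline std::uint32_t IdleLanes(std::uint32_t threadsPerGroup,
                               std::uint32_t warpSize)
{
    return WarpsPerGroup(threadsPerGroup, warpSize) * warpSize
           - threadsPerGroup;
}
```

Under this model, a 16 x 16 group (256 threads) fills 8 warps of 32 or 4 wavefronts of 64 with nothing idle, while a group of 50 threads needs 2 warps of 32 and leaves 14 lanes unused.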

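In Direct3D, thread groups are launched with the following method, which dispatches a 3D grid of thread groups. The rest of the book is only concerned with 2D grids of thread groups.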

void ID3D11DeviceContext::Dispatch(
UINT ThreadGroupCountX,
UINT ThreadGroupCountY,
UINT ThreadGroupCountZ);
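A common question is how many groups to pass to `Dispatch` for a given texture. The sketch below shows one way to work it out, assuming one thread per texel and a 2D grid (Z count fixed at 1). The names `GroupCounts`, `DivideRoundUp`, and `GroupsFor2D` are hypothetical helpers for this illustration, not part of the Direct3D API.

```cpp
#include <cassert>
#include <cstdint>

// The three arguments that would be handed to Dispatch.
struct GroupCounts { std::uint32_t x, y, z; };

// Integer division that rounds up instead of down.
inline std::uint32_t DivideRoundUp(std::uint32_t n, std::uint32_t d)
{
    assert(d > 0);
    return (n + d - 1) / d;
}

// Thread group counts needed to cover a width x height texture with
// one thread per texel, for groups of threadsX x threadsY threads.
inline GroupCounts GroupsFor2D(std::uint32_t width, std::uint32_t height,
                               std::uint32_t threadsX, std::uint32_t threadsY)
{
    return GroupCounts{ DivideRoundUp(width, threadsX),
                        DivideRoundUp(height, threadsY),
                        1 };
}
```

For a 1000 x 600 texture with 16 x 16 threads per group, `GroupsFor2D(1000, 600, 16, 16)` gives 63 x 38 x 1 groups, which would then be passed along as `context->Dispatch(g.x, g.y, g.z)`, where `context` is assumed to be an `ID3D11DeviceContext*`. Because 63 * 16 = 1008 exceeds 1000, some threads land past the texture edge, and the shader has to tolerate those IDs.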

12.2 A SIMPLE COMPUTE SHADER

cbuffer cbSettings
{
    // Compute shader can access values in constant buffers.
};
// Data sources and outputs.
Texture2D gInputA;
Texture2D gInputB;
RWTexture2D<float4> gOutput;
// The number of threads in the thread group. The threads in a group can
// be arranged in a 1D, 2D, or 3D grid layout.
[numthreads(16, 16, 1)]
void CS(int3 dispatchThreadID : SV_DispatchThreadID) // Thread ID
{
    // Sum the xyth texels and store the result in the xyth texel of
    // gOutput.
    gOutput[dispatchThreadID.xy] =
        gInputA[dispatchThreadID.xy] +
        gInputB[dispatchThreadID.xy];
}
technique11 AddTextures
{
    pass P0
    {
        SetVertexShader(NULL);
        SetPixelShader(NULL);
        SetComputeShader(CompileShader(cs_5_0, CS()));
    }
}

A compute shader consists of the following components:

1. Global variable access via constant buffers.
2. Input and output resources, which are discussed in the next section (§12.3).
3. The [numthreads(X, Y, Z)] attribute, which specifies the number of threads in the thread group as a 3D grid of threads.
4. The shader body, which contains the instructions that every thread executes.
5. Thread identification system value parameters (discussed in §12.4).
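To see what the shader above computes, it can help to walk through the same work serially on the CPU. The following C++ sketch is only an illustration, not how the GPU executes it: it loops over groups and over the 16 x 16 threads inside each group, and forms the dispatch thread ID as group ID times group size plus the thread's ID within the group, which is how `SV_DispatchThreadID` is defined. `Float4` and `AddTexturesCpu` are hypothetical names, textures are assumed to be stored row by row, and threads that fall outside the texture are simply skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float4 { float x, y, z, w; };

// CPU stand-in for the AddTextures shader with [numthreads(16, 16, 1)].
// groupsX/groupsY play the role of the Dispatch arguments.
// Texels that no thread reaches stay at zero.
std::vector<Float4> AddTexturesCpu(const std::vector<Float4>& a,
                                   const std::vector<Float4>& b,
                                   unsigned width, unsigned height,
                                   unsigned groupsX, unsigned groupsY)
{
    const unsigned kThreadsX = 16, kThreadsY = 16;
    assert(a.size() == static_cast<std::size_t>(width) * height);
    assert(b.size() == a.size());

    std::vector<Float4> out(a.size(), Float4{0.0f, 0.0f, 0.0f, 0.0f});
    for (unsigned gy = 0; gy < groupsY; ++gy)             // group ID y
    for (unsigned gx = 0; gx < groupsX; ++gx)             // group ID x
    for (unsigned ty = 0; ty < kThreadsY; ++ty)           // thread ID in group
    for (unsigned tx = 0; tx < kThreadsX; ++tx)
    {
        // dispatchThreadID = groupID * numthreads + groupThreadID
        const unsigned x = gx * kThreadsX + tx;
        const unsigned y = gy * kThreadsY + ty;
        if (x >= width || y >= height) continue;          // outside the texture

        const std::size_t i = static_cast<std::size_t>(y) * width + x;
        out[i] = Float4{ a[i].x + b[i].x, a[i].y + b[i].y,
                         a[i].z + b[i].z, a[i].w + b[i].w };
    }
    return out;
}
```

On the GPU the four nested loops disappear: every (group, thread) combination runs as its own thread, and the loop body corresponds to the body of `CS`.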

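Note that X and Y in [numthreads(X, Y, 1)] can be varied, but as mentioned earlier, the number of threads in a thread group should be a multiple of the warp size (32 on NVIDIA) or the wavefront size (64 on ATI). A multiple of 64 is therefore the safest choice, since it works well on both vendors' cards.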