
A trip through the Graphics Pipeline 2011, part 13

by fgiesen

Welcome back to what’s going to be the last “official” part of this series – I’ll do more GPU-related posts in the future, but this series is long enough already. We’ve been touring all the regular parts of the graphics pipeline, down to different levels of detail. Which leaves one major new feature introduced in DX11 out: Compute Shaders. So that’s gonna be my topic this time around.

Execution environment

For this series, the emphasis has been on overall dataflow at the architectural level, not shader execution (which is explained well elsewhere). For the stages so far, that meant focusing on the input piped into and output produced by each stage; the way the internals work was usually dictated by the shape of the data. Compute shaders are different – they’re running by themselves, not as part of the graphics pipeline, so the surface area of their interface is much smaller.

In fact, on the input side, there are not really any buffers for input data at all. The only input Compute Shaders get, aside from API state such as the bound Constant Buffers and resources, is their thread index. There’s a tremendous potential for confusion here, so here’s the most important thing to keep in mind: a “thread” is the atomic unit of dispatch in the CS environment, and it’s a substantially different beast from the threads provided by the OS that you probably associate with the term. CS threads have their own identity and registers, but they don’t have their own Program Counter (Instruction Pointer) or stack, nor are they scheduled individually.

In fact, “threads” in CS take the place that individual vertices had during Vertex Shading, or individual pixels during Pixel Shading. And they get treated the same way: assemble a bunch of them (usually, somewhere between 16 and 64) into a “Warp” or “Wavefront” and let them run the same code in lockstep. CS threads don’t get scheduled – Warps and Wavefronts do (I’ll stick with “Warp” for the rest of this article; mentally substitute “Wavefront” for AMD). To hide latency, we don’t switch to a different “thread” (in CS parlance), but to a different Warp, i.e. a different bundle of threads. Single threads inside a Warp can’t take branches individually; if at least one thread in such a bundle wants to execute a certain piece of code, it gets processed by all the threads in the bundle – even if most threads then end up throwing the results away. In short, CS “threads” are more like SIMD lanes than like the threads you see elsewhere in programming; keep that in mind.
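To make that concrete, here’s a minimal HLSL sketch – the buffer, the values and the thread group size are made up for illustration, and the numthreads attribute gets explained in a moment. If v straddles the threshold within one Warp, both branches get executed for the whole Warp, with non-participating lanes masked out:

    RWStructuredBuffer<float> data : register(u0);  // illustrative buffer

    [numthreads(64, 1, 1)]
    void main(uint3 tid : SV_DispatchThreadID)
    {
        float v = data[tid.x];
        if (v > 0.5f)
            v = sqrt(v) * 2.0f;  // path A: runs for the whole Warp if any lane takes it
        else
            v = v * v;           // path B: likewise
        data[tid.x] = v;
    }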

That explains the “thread” and “warp” levels. Above that is the “thread group” level, which deals with – who would’ve thought? – groups of threads. The size of a thread group is specified during shader compilation. In DX11, a thread group can contain anywhere between 1 and 1024 threads, and the thread group size is specified not as a single number but as a 3-tuple giving thread x, y, and z coordinates. This numbering scheme is mostly for the convenience of shader code that addresses 2D or 3D resources, though it also allows for traversal optimizations. At the macro level, CS execution is dispatched in multiples of thread groups; thread group IDs in D3D11 are again 3D, same as thread IDs, and for pretty much the same reasons.

Thread IDs – which can be passed in in various forms, depending on what the shader prefers – are the only input to Compute Shaders that’s not the same for all threads; quite different from the other shader types we’ve seen before. This is just the tip of the iceberg, though.
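In HLSL, those forms show up as system-value semantics on the CS entry point. A minimal sketch (the group size is picked arbitrarily); on the API side, the matching call would be something like context->Dispatch(width/8, height/8, 1):

    [numthreads(8, 8, 1)]  // thread group size, fixed at compile time
    void main(uint3 groupID       : SV_GroupID,          // which group within the Dispatch grid
              uint3 groupThreadID : SV_GroupThreadID,    // position within the group
              uint3 dispatchID    : SV_DispatchThreadID, // groupID * group size + groupThreadID
              uint  groupIndex    : SV_GroupIndex)       // SV_GroupThreadID, flattened to 1D
    {
        // e.g. address a 2D resource directly via dispatchID.xy
    }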

Thread Groups

The above description makes it sound like thread groups are a fairly arbitrary middle level in this hierarchy. However, there’s one important bit missing that makes thread groups very special indeed: Thread Group Shared Memory (TGSM). On DX11 level hardware, compute shaders have access to 32k of TGSM, which is basically a scratchpad for communication between threads in the same group. This is the primary (and fastest) way by which different CS threads can communicate.
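In HLSL, TGSM is declared with the groupshared storage class. A minimal sketch, with the array size picked arbitrarily (well below the 32k limit):

    groupshared float cache[256];  // 1KB scratchpad, visible to all threads in the group

    [numthreads(256, 1, 1)]
    void main(uint gi : SV_GroupIndex)
    {
        cache[gi] = 0.0f;  // each thread touches its own slot, so no conflict here;
                           // reading *other* threads' slots needs a barrier (see below)
    }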

So how is this implemented in hardware? It’s quite simple: all threads (well, Warps really) within a thread group get executed by the same shader unit. The shader unit then simply has at least 32k (usually a bit more) of local memory. And because all grouped threads share the same shader unit (and hence the same set of ALUs etc.), there’s no need to include complicated arbitration or synchronization mechanisms for shared memory access: only one Warp can access memory in any given cycle, because only one Warp gets to issue instructions in any cycle! Now, of course this process will usually be pipelined, but that doesn’t change the basic invariant: per shader unit, we have exactly one piece of TGSM; accessing TGSM might require multiple pipeline stages, but actual reads from (or writes to) TGSM will only happen inside one pipeline stage, and the memory accesses during that cycle all come from within the same Warp.

However, this is not yet enough for actual shared-memory communication. The problem is simple: The above invariant guarantees that there’s only one set of accesses to TGSM per cycle even when we don’t add any interlocks to prevent concurrent access. This is nice since it makes the hardware simpler and faster. It does not guarantee that memory accesses happen in any particular order from the perspective of the shader program, however, since Warps can be scheduled more or less randomly; it all depends on who is runnable (not waiting for memory access / texture read completion) at certain points in time. Somewhat more subtly, precisely because the whole process is pipelined, it might take some cycles for writes to TGSM to become “visible” to reads; this happens when the actual read and write operations to TGSM occur in different pipeline stages (or different phases of the same stage). So we still need some kind of synchronization mechanism. Enter barriers. There are different types of barriers, but they’re composed of just three fundamental components:

  1. Group Synchronization. A Group Synchronization Barrier forces all threads inside the current group to reach the barrier before any of them may continue past it. Once a Warp reaches such a barrier, it will be flagged as non-runnable, same as if it was waiting for a memory or texture access to complete. Once the last Warp reaches the barrier, the remaining Warps will be reactivated. This all happens at the Warp scheduling level; it adds additional scheduling constraints, which may cause stalls, but there’s no need for atomic memory transactions or anything like that; other than lost utilization at the micro level, this is a reasonably cheap operation.
  2. Group Memory Barriers. Since all threads within a group run on the same shader unit, this basically amounts to a pipeline flush, to ensure that all pending shared memory operations are completed. There’s no need to synchronize with resources external to the current shader unit, which means it’s again reasonably cheap.
  3. Device Memory Barriers. This blocks all threads within a group until all memory accesses have completed – either direct or indirect (e.g. via texture samples). As explained earlier in this series, memory accesses and texture samples on GPUs have long latencies – think more than 600, and often above 1000 cycles – so this kind of barrier will really hurt.

DX11 offers different types of barriers that combine several of the above components into one atomic unit; the semantics should be obvious.
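Concretely, these are the HLSL intrinsics GroupMemoryBarrier, DeviceMemoryBarrier and AllMemoryBarrier, each with a WithGroupSync variant that adds component 1. Here’s a sketch of the classic use case, a parallel reduction over TGSM (buffer name and input data are made up for illustration):

    RWStructuredBuffer<float> partialSums : register(u0);  // illustrative output

    groupshared float sdata[256];

    [numthreads(256, 1, 1)]
    void main(uint gi : SV_GroupIndex, uint3 groupID : SV_GroupID,
              uint3 dtid : SV_DispatchThreadID)
    {
        sdata[gi] = (float)dtid.x;          // stand-in for a real data load
        GroupMemoryBarrierWithGroupSync();  // everyone's write now visible to everyone

        for (uint s = 128; s > 0; s >>= 1)  // halve the active range each step
        {
            if (gi < s)
                sdata[gi] += sdata[gi + s];
            GroupMemoryBarrierWithGroupSync();
        }

        if (gi == 0)
            partialSums[groupID.x] = sdata[0];  // one result per thread group
    }

Note that the barriers sit in uniform control flow: a WithGroupSync barrier inside a divergent branch is illegal, since some threads would never reach it.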

Unordered Access Views

We’ve now dealt with CS input and learned a bit about CS execution. But where do we put our output data? The answer has the unwieldy name “unordered access views”, or UAVs for short. A UAV seems somewhat similar to render targets in Pixel Shaders (and UAVs can in fact be used in addition to render targets in Pixel Shaders), but there are some very important semantic differences:

  • Most importantly, as the name suggests, access to UAVs is “unordered”, in the sense that the API does not guarantee accesses to become visible in any particular order. When rendering primitives, quads are guaranteed to be Z-tested, blended and written back in API order (as discussed in detail in part 9 of this series), or at least produce the same results as if they were – which takes substantial effort. UAVs make no such effort – UAV accesses happen immediately as they’re encountered in the shader, which may be very different from API order. They’re not completely unordered, though; while there’s no guaranteed order of operations within an API call, the API and driver will still collaborate to make sure that perceived sequential ordering is preserved across API calls. Thus, if you have a complex Compute Shader (or Pixel Shader) writing to an UAV immediately followed by a second (simpler) CS that reads from the same underlying resource, the second CS will see the finished results, never some partially-written output.
  • UAVs support random access. A Pixel Shader can only write to one location per render target – its corresponding pixel. The same Pixel Shader can write to arbitrary locations in whatever UAVs it has bound (see the sketch after this list).
  • UAVs support atomic operations. In the classic Pixel Pipeline, there’s no need; we guarantee there are never any collisions anyway. But with the free-form execution provided by UAVs, different threads might be trying to access a piece of memory at the same time, and we need synchronization mechanisms to deal with this.
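As a sketch of the random-access part – the resource name and the scatter pattern are made up for illustration; on the API side, the resource would be bound via CSSetUnorderedAccessViews:

    RWTexture2D<float4> outImage : register(u0);  // illustrative UAV

    [numthreads(8, 8, 1)]
    void main(uint3 dtid : SV_DispatchThreadID)
    {
        uint2 target = dtid.yx;  // scatter: write somewhere other than "our" pixel
        outImage[target] = float4(1, 0, 0, 1);
    }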

So from a “CPU programmer’s” point of view, UAVs correspond to regular RAM in a shared-memory multiprocessing system; they’re windows into memory. More interesting is the issue of atomic operations; this is one area where current GPUs diverge considerably from CPU designs.

Atomics

In current CPUs, most of the magic for shared memory processing is handled by the memory hierarchy (i.e. caches). To write to a piece of memory, the active core must first assert exclusive ownership of the corresponding cache line. This is accomplished using what’s called a “cache coherency protocol”, usually MESI and descendants. The details are tangential to this article; what matters is that because writing to memory entails acquiring exclusive ownership, there’s never a risk of two cores simultaneously trying to write to the same location. In such a model, atomic operations can be implemented by holding exclusive ownership for the duration of the operation; if we had exclusive ownership for the whole time, there’s no chance that someone else was trying to write to the same location while we were performing the atomic operation. Again, the actual details of this get hairy pretty fast (especially as soon as things like paging, interrupts and exceptions get involved), but the 30,000-feet view will suffice for the purposes of this article.

In this type of model, atomic operations are performed using the regular Core ALUs and load/store units, and most of the “interesting” work happens in the caches. The advantage is that atomic operations are (more or less) regular memory accesses, albeit with some extra requirements. There’s a couple of problems, though: most importantly, the standard implementation of cache coherency, “snooping”, requires that all agents in the protocol talk to each other, which has serious scalability issues. There are ways around this restriction (mainly using so-called Directory-based Coherency protocols), but they add additional complexity and latency to memory accesses. Another issue is that all locks and memory transactions really happen at the cache line level; if two unrelated but frequently-updated variables share the same cache line, it can end up “ping-ponging” between multiple cores, causing tons of coherency transactions (and associated slowdown). This problem is called “false sharing”. Software can avoid it by making sure unrelated fields don’t fall into the same cache line; but on GPUs, neither the cache line size nor the memory layout during execution is known or controlled by the application, so this problem would be more serious.

Current GPUs avoid this problem by structuring their memory hierarchy differently. Instead of handling atomic operations inside the shader units (which again raises the “who owns which memory” issue), there’s dedicated atomic units that directly talk to a shared lowest-level cache hierarchy. There’s only one such cache, so the issue of coherency doesn’t come up; either the cache line is present in the cache (which means it’s current) or it isn’t (which means the copy in memory is current). Atomic operations consist of first bringing the respective memory location into the cache (if it isn’t there already), then performing the required read-modify-write operation directly on the cache contents using a dedicated integer ALU on the atomic units. While an atomic unit is busy on a memory location, all other accesses to that location will stall. Since there’s multiple atomic units, it’s necessary to make sure they never try to access the same memory location at the same time; one easy way to accomplish this is to make each atomic unit “own” a certain set of addresses (statically – not dynamically as with cache line ownership). This is done by computing the index of the responsible atomic unit as some hash function of the memory address to be accessed. (Note that I can’t confirm that this is how current GPUs do it; I’ve found little detail on how the atomic units work in official docs.)

If a shader unit wants to perform an atomic operation to a given memory address, it first needs to determine which atomic unit is responsible, wait until it is ready to accept new commands, and then submit the operation (and potentially wait until it is finished if the result of the atomic operation is required). The atomic unit might only be processing one command at a time, or it might have a small FIFO of outstanding requests; and of course there’s all kinds of allocation and queuing details to get right so that atomic operation processing is reasonably fair and shader units always make progress. Again, I won’t go into further detail here.
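From the shader side, all of this hides behind the Interlocked* intrinsics. A sketch (buffer names made up for illustration) that also shows the “wait for the result” case: InterlockedAdd without the third argument can be fire-and-forget, but once the shader consumes the returned original value, it has to wait for the atomic unit’s reply:

    RWByteAddressBuffer counter : register(u0);     // illustrative: a 4-byte counter at offset 0
    RWStructuredBuffer<uint> slots : register(u1);  // illustrative output

    [numthreads(64, 1, 1)]
    void main(uint3 dtid : SV_DispatchThreadID)
    {
        uint myIndex;
        counter.InterlockedAdd(0, 1, myIndex);  // read-modify-write happens at the atomic unit
        slots[myIndex] = dtid.x;                // compact output using the returned old value
    }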

One final remark is that, of course, outstanding atomic operations count as “device memory” accesses, same as memory/texture reads and UAV writes; shader units need to keep track of their outstanding atomic operations and make sure they’re finished when they hit device memory access barriers.

Structured buffers and append/consume buffers

Unless I missed something, these two buffer types are the last CS-related features I haven’t talked about yet. And, well, from a hardware perspective, there’s not that much to talk about, really. Structured buffers are more of a hint to the driver-internal shader compiler than anything else; they tell the driver how they’re going to be used – namely, that they consist of elements with a fixed stride that are likely going to be accessed together – but they still compile down to regular memory accesses in the end. The structured buffer declaration may bias the driver’s decision about their position and layout in memory, but it does not add any fundamentally new functionality to the model.

Append/consume buffers are similar; they could be implemented using the existing atomic instructions. In fact, they kind of are, except the append/consume pointers aren’t at an explicit location in the resource, they’re side-band data outside the resource that are accessed using special atomic instructions. (And similarly to structured buffers, the fact that their usage is declared as append/consume buffer allows the driver to pick their location in memory appropriately).
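A sketch of both in HLSL – the struct layout and names are made up for illustration, and note that the append/consume pointers never appear in the shader; they’re exactly the side-band counters described above:

    struct Particle { float3 pos; float3 vel; };  // fixed 24-byte stride

    ConsumeStructuredBuffer<Particle> liveIn  : register(u0);  // illustrative
    AppendStructuredBuffer<Particle>  liveOut : register(u1);  // illustrative

    [numthreads(64, 1, 1)]
    void main()
    {
        Particle p = liveIn.Consume();    // atomically bumps the hidden consume counter
        p.pos += p.vel * (1.0f / 60.0f);  // fixed timestep, purely illustrative
        if (p.pos.y > 0.0f)
            liveOut.Append(p);            // atomically bumps the hidden append counter
    }

On the API side, the hidden counter can be copied into a regular buffer with CopyStructureCount, e.g. to size the next Dispatch.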

Wrap-up

And… that’s it. No more previews for the next part, this series is done :), though that doesn’t mean I’m done with it. I have some restructuring and partial rewriting to do – these blog posts are raw and unproofed, and I intend to go over them and turn them into a single document. In the meantime, I’ll be writing about other stuff here. I’ll try to incorporate the feedback I got so far – if there are any other questions, corrections or comments, now’s the time to tell me! I don’t want to nail down an ETA for the final cleaned-up version of this series, but I’ll try to get it done well before the end of the year. We’ll see. Until then, thanks for reading!
