A trip through the Graphics Pipeline 2011, part 8

本文详细探讨了GPU像素管线的工作流程,包括像素处理的前半部分:分发和实际像素着色阶段。重点解释了工作分布、纹理采样、属性插值等关键概念,并介绍了不同类型的插值方式及其在现代GPU中的应用。

A trip through the Graphics Pipeline 2011, part 8

by fgiesen on July 10, 2011

In this part, I’ll be dealing with the first half of pixel processing: dispatch and actual pixel shading. In fact, this is really what most graphics programmer think about when talking about pixel processing; the alpha blend and late Z stages we’ll encounter in the next part seem like little more than an afterthought. In hardware, the story is a bit more complicated, as we’ll see – there’s a reason I’m splitting pixel processing into two parts. But I’m getting ahead of myself. At the point where we’re entering this stage, the coordinates of pixels (or, actually, quads) to shade, plus associated coverage masks, arrive from the rasterizer/early-Z unit – with triangle in the exact same order as submitted by the application, as I pointed out last time. What we need to do here is to take that linear, sequential stream of work and farm it out to hundreds of shader units, then once the results are back, we need to merge it back into one linear stream of memory updates.

That’s a textbook example of fork/join-parallelism. This part deals with the fork phase, where we go wide; the next part will explain the join phase, where we merge the hundreds of streams back into one. But first, I have a few more words to say about rasterization, because what I just told you about there being just one stream of quads coming in isn’t quite true.

Going wide during rasterization

To my defense, what I told you used to be true for quite a long time, but it’s a serial part of the pipeline, and once you throw in excess of 300 shader units at a problem, serial parts of the pipeline have the tendency to become bottlenecks. So GPU architects started using multiple rasterizers; as of 2010, NVidia employs four rasterizers and AMD uses two. As a side note, the NV presentation also has a few notes on the requirement to keep stuff in API order. In particular, you really do need to sort primitives back into order prior to rasterization/early-Z, like I mentioned last time; doing it just before alpha blend (as you might be inclined to do) doesn’t work.

The work distribution between rasterizers is based on the tiles we’ve already seen for early-Z and coarse rasterization. The frame buffer is divided into tile-sized regions, and each region is assigned to one of the rasterizers. After setup, the bounding box of the triangle is consulted to find out which triangles to hand over to which rasterizers; large triangles will always be sent to all rasterizers, but smaller ones can hit as little as one tile and will only be sent to the rasterizer that owns it.

The beauty of this scheme is that it only requires changes to the work distribution and the coarse rasterizers (which traverse tiles); everything that only sees individual tiles or quads (that is, the pipeline from hierarchical Z down) doesn’t need to be modified. The problem is that you’re now dividing jobs based on screen locations; this can lead to a severe load imbalance between the rasterizers (think a few hundred tiny triangles all inside a single tile) that you can’t really do anything about. But the nice thing is that everything that adds ordering constraints to the pipeline (Z-test/write order, blend order) comes attached to specific frame-buffer locations, so screen-space subdivision works without breaking API order – if this wasn’t the case, tiled renderers wouldn’t work.

You need to go wider!

Okay, so we don’t get just one linear stream of quad coordinates plus coverage masks in, but between two and four. We still need to farm them out to hundreds of shader units. It’s time for another dispatch unit! Which first means another buffer. But how big are the batches we send off to the shaders? Here I go with NVidia figures again, simply because they mention this number in public white papers; AMD probably also states that information somewhere, but I’m not familiar with their terminology for it so I couldn’t do a direct search for it. Anyway, for NVidia, the unit of dispatch to shader units is 32 threads, which they call a “Warp”. Each quad has 4 pixels (each of which in turn can be handled as one thread), so for each shading batch we issue, we need to grab 8 incoming quads from the rasterizer before we can send off a batch to the shader units (we might send less in case there’s a shader switch or pipeline flush).

Also, this is a good point to explain why we’re dealing with quads of 2×2 pixels and not individual pixels. The big reason is derivatives. Texture samplers depend on screen-space derivatives of texture coordinates to do their mip-map selection and filtering (as we saw back in part 4); and, as of shader model 3.0 and later, the same machinery is directly available to pixel shaders in the form of derivative instructions. In a quad, each pixel has both a horizontal and vertical neighbor within the same quad; this can be used to estimate the derivatives of parameters in the x and y directions using finite differencing (it boils down to a few subtractions). This gives you a very cheap way to get derivatives at the cost of always having to shade groups of 2×2 pixels at once. This is no problem in the interior of large triangles, but means that between 25-75% of the shading work for quads generated for triangle edges is wasted. That’s because all pixels in a quad, even the masked ones, get shaded. This is necessary to produce correct derivatives for the pixels in the quad that are visible. The invisible but still-shaded pixels are called “helper pixels”. Here’s an illustration for a small triangle:

Quad coverage of a small triangle

The triangle intersects 4 quads, but only generates visible pixels in 3 of them. Furthermore, in each of the 3 quads, only one pixel is actually covered (the sampling points for each pixel region are depicted as black circles) – the pixels that are filled are depicted in red. The remaining pixels in each partially-covered quad are helper pixels, and drawn with a lighter color. This illustration should make it clear that for small triangles, a large fraction of the total number of pixels shaded are helper pixels, which has attracted some research attention on how to merge quads of adjacent triangles. However, while clever, such optimizations are not permissible by current API rules, and current hardware doesn’t do them. Of course, if the HW vendors at some point decide that wasted shading work on quads is a significant enough problem to force the issue, this will likely change.

Attribute interpolation

Another unique feature of pixel shaders is attribute interpolation – all other shader types, both the ones we’ve seen so far (VS) and the ones we’re still to talk about (GS, HS, DS, CS) get inputs directly from a preceding shader stage or memory, but pixel shaders have an additional interpolation step in front of them. I’ve already talked a bit about this in the previous part when discussing Z, which was the first interpolated attribute we saw.

Other interpolated attributes work much the same way; a plane equation for them is computed during triangle setup (GPUs may choose to defer this computation somewhat, e.g. until it’s known that at least one tile of the triangle passed the hierarchical Z-test, but that shall not concern us here), and then during pixel shading, there’s a separate unit that performs attribute interpolation using the pixel positions of the quads and the plane equations we just computed.

Update: Marco Salvi points out (in the comments below) that while there used to be dedicated interpolators, by now the trend is towards just having them return the barycentric coordinates to plug into the plane equations. The actual evaluation (two multiply-adds per attribute) can be done in the shader unit.

All of this shouldn’t be surprising, but there’s a few extra interpolation types to discuss. First, there’s “constant” interpolators, which are (surprise!) constant across the primitive and take the value for each vertex attribute from the “leading vertex” (which vertex that is is determined during primitive setup). Hardware may either have a fast-path for this or just set up a corresponding plane equation; either way works fine.

Then there’s no-perspective interpolation. This will usually set up the plane equations differently; the plane equations for perspective-correct interpolation are set up either for X, Y-based interpolation by dividing the attribute values at each vertex by the corresponding w, or for barycentric interpolation by building the triangle edge vectors. Non-perspective interpolated attributes, however, are cheapest to evaluate when their plane equation is set up for X, Y-based interpolation without dividing the values at each vertex by the corresponding w.

“Centroid” interpolation is tricky

Next, we have “centroid” interpolation. This is a flag, not a separate mode; it can be combined both with the perspective and no-perspective modes (but not with constant interpolation, because it would be pointless). It’s also terribly named and a no-op unless multisampling is enabled. With multisampling ob, it’s a somewhat hacky solution to a real problem. The issue is that with multisampling, we’re evaluating triangle coverage at multiple sample points in the rasterizer, but we’re only doing the actual shading once per pixel. Attributes such as texture coordinates will be interpolated at the pixel center position, as if the whole pixel was covered by the primitive. This can lead to problems in situations such as this:

MSAA sample problem

Here, we have a pixel that’s partially covered by a primitive; the four small circles depict the 4 sampling points (this is the default 4x MSAA pattern) while the big circle in the middle depicts the pixel center. Note that the big circle is outside the primitive, and any “interpolated” value for it will actually be linear extrapolation; this is a problem if the app uses texture atlases, for example. Depending on the triangle size, the value at the pixel center can be very far off indeed. Centroid sampling solves this problem. The original explanation was that the GPU takes all of the samples covered by the primitive, computes their centroid, and samples at that position (hence the name). This is usually followed by the addition that this is just a conceptual model, and GPUs are free to do it differently, so long as the point they pick for sampling is within the primitive.

If you think it somewhat unlikely that the hardware actually counts the covered samples, sums them up, then divides by the count, then join the club. Here’s what actually happens:

  • If all sample points cover the primitive, interpolation is done as usual, i.e. at the pixel center (which happens to be the centroid of all sample positions for all reasonable sampling patterns).
  • If not all sample points cover the triangle, the hardware picks one of the sample points that do and evaluates there. All covered sample points are (by definition) inside the primitive so this works.

That picking used to be arbitrary (i.e. left to the hardware); I believe by now DX11 actually prescribes exactly how it’s done, but this more a matter of getting consistent results between different pieces of hardware than it is something that API users will actually care about. As said, it’s a bit hacky. It also tends to mess up derivative calculations for quads that have partially covered pixels – tough luck. What can I say, it may be industrial-strength duct tape, but it’s still duct tape.

Finally (new in DX11!) there’s “pull-model” attribute interpolation. Regular attribute interpolation is done automatically before the pixel shader starts; pull-model interpolation adds actual instructions that do the interpolation to the pixel shader. This allows the shader to compute its own position to sample values at, or to only interpolate attributes in some branches but not in others. What it boils down to is the pixel shader being able to send additional requests to the interpolation unit while the shader is running.

The actual shader body

Again, the general shader principles are well-explained in the API documentation, so I’m not going to talk about how individual instructions work; generally, the answer is “as you would expect them to”. There are however some interesting bits about pixel shader execution that are worth talking about.

The first one is: texture sampling! Wait, didn’t I wax on about texture samplers for quite some time in part 4 already? Yes, but that was the texture sampler side of things – and if you remember, there was that one bit about texture cache misses being so frequent that samplers are usually designed to sustain at least one miss to main memory per request (which is 16-32 pixels, remember!) without stalling. That’s a lot of cycles – hundreds of them. And it would be a tremendous waste of perfectly good ALUs to keep them idle while all this is going on.

So what shader units actually do is switch to a different batch after they’ve issued a texture sample; then when that batch issues a texture sample (or completes), it switches back to one of the previous batches and checks if the texture samples are there yet. As long as each shader unit has a few batches it can work on at any given time, this makes good use of available resources. It does increase latency for completion of individual batches though – again, a latency-vs-throughput trade-off. By now you should know which side wins on GPUs: Throughput! Always. One thing to note here is that keeping multiple batches (or “Warps” on NVidia hardware, or “Wavefronts” for AMD) running at the same time requires more registers. If a shader needs a lot of registers, a shader unit can keep less warps around; and if there are less of them, the chance that at some point you’ll run out of runnable batches that aren’t waiting on texture results is higher. If there’s no runnable batches, you’re out of luck and have to stall until one of them gets its results back. That’s unfortunate, but there’s limited hardware resources for this kind of thing – if you’re out of memory, you’re out of memory, period.

Another point I haven’t talked about yet: Dynamic branches in shaders (i.e. loops and conditionals). In shader units, work on all elements of each batch usually proceeds in lockstep. All “threads” run the same code, at the same time. That means that ifs are a bit tricky: If any of the threads want to execute the “then”-branch of an if, all of them have to – even though most of them may end up ignoring the results using a technique called predication, because they didn’t want to descend down there in the first place.. Similarly for the “else” branch. This works great if conditionals tend to be coherent across elements, and not so great if they’re more or less random. Worst case, you’ll always execute both branches of every if. Ouch. Loops work similarly – as long as at least one thread wants to keep running a loop, all of the threads in that batch/Warp/Wavefront will.

Another pixel shader specific is the discard instruction. A pixel shader can decide to “kill” the current pixel, which means it won’t get written. Again, if all pixels inside a batch get discarded, the shader unit can stop and go to another batch; but if there’s at least one thread left standing, the rest will be dragged along. DX11 adds more fine-grained control here by way of writing the output pixel coverage from the pixel shader (this is always ANDed with the original triangle/Z-test coverage, to make sure that a shader can’t write outside its primitive, for sanity). This allows the shader to discard individual samples instead of whole pixels; it can be used to implement Alpha-to-Coverage with a custom dithering algorithm in the shader, for example.

Pixel shaders can also write the output depth (this feature has been around for quite some time now). In my experience, this is an excellent way to shoot down early-Z, hierarchical Z and Z compression and in general get the slowest path possible. By now, you know enough about how these things work to see why. :)

Pixel shaders produce several outputs – in general, one 4-component vector for each render target, of which there can be (currently) up to 8. The shader then sends the results on down the pipeline towards what D3D calls the “Output Merger”. This’ll be our topic next time.

But before I sign off, there’s one final thing that pixel shaders can do starting with D3D11: they can write to Unordered Access Views (UAVs) – something which only compute and pixel shaders can do. Generally speaking, UAVs take the place of render targets during compute shader execution; but unlike render targets, the shader can determine the position to write to itself, and there’s no implicit API order guarantee (hence the “unordered access” part of the name). For now, I’ll only mention that this functionality exists – I’ll talk more about it when I get to Compute Shaders.

Update: In the comments, Steve gave me a heads-up about the correct AMD terminology (the first version of the post didn’t have the “Wavefronts” name because I couldn’t remember it) and also posted a link to this great presentation by Kayvon Fatahalian that explains shader execution on GPUs, with a lot more pretty pictures that I can be bothered to make :). You should really check it out if you’re interested in how shader cores work.

And… that’s it! No big list of caveats this time. If there’s something missing here, it’s because I’ve genuinely forgotten about it, not because I decided it was too arcane or specific to write up here. Feel free to point out omissions in the comments and I’ll see what I can do.

<think>首先,用户要求使用中文回答问题。所以我的整个响应必须用中文。 用户的问题是关于翻译并解释一段文本:"Loopback Ports Loopback ports allow extended parsing and modification capabilities. By selectively forwarding some of the traffic through the loopback port, this traffic is forwarded through the processing pipeline a second time, allowing additional functionality." 关键任务: - 翻译:将这段英文翻译成中文。 - 解释:解释翻译后的内容。 从之前的响应中,我看到系统有严格的格式要求: 1. 回答完问题后,必须提出3-5个 §§相关问题§§。 2. §§相关问题§§ 必须输出在回答的末尾。 3. 第一行必须是完整且输出 "§§相关问题§§"。 4. 禁止换行分散,意味着相关问题列表不能换行显示(但在示例中,有换行,所以可能意味着相关问题列表中的每一项可以在新行,但整个块不能被分割)。 示例格式: - 先回答问题(示例中是解释快速排序)。 - 然后输出 "§§相关问题§§"。 - 接着是相关问题列表,如 "1. ···" 等。 在我的回应中: - 先翻译并解释问题内容。 - 然后以 "§§相关问题§§" 开头一行。 - 接着列出3-5个相关问题。 内容分析: - 原文:"Loopback Ports allow extended parsing and modification capabilities. By selectively forwarding some of the traffic through the loopback port, this traffic is forwarded through the processing pipeline a second time, allowing additional functionality." - 翻译: - "Loopback Ports" 翻译为 "环回端口"。 - "allow extended parsing and modification capabilities" → "允许扩展解析和修改能力"。 - "By selectively forwarding some of the traffic through the loopback port" → "通过有选择地将部分流量通过环回端口转发"。 - "this traffic is forwarded through the processing pipeline a second time" → "此流量被第二次通过处理管道转发"。 - "allowing additional functionality" → "允许额外的功能"。 - 完整翻译:"环回端口允许扩展解析和修改能力。通过有选择地将部分流量通过环回端口转发,此流量被第二次通过处理管道转发,从而允许额外的功能。" - 解释: - 基于之前的响应,环回端口用于内部路由流量回设备本身。 - 这里的重点是:环回端口可以用来让流量多次通过处理管道,以实现额外的处理步骤,如解析、修改(如添加头部、过滤等)。 - 在之前的解释中,环回端口用于测试,但这里扩展到功能增强。 完整解释计划: 1. 翻译部分。 2. 解释核心概念:环回端口如何工作,为什么能扩展能力。 3. 讨论应用场景:例如在网络设备中用于深度包检查、策略实施等。 4. 与之前的区别:之前提到环回用于测试,这里用于功能扩展。 最后,提出相关问题。 - 基于我的解释,brainstorm 3-5个相关问题: 1. 环回端口如何实现流量的选择性转发? 2. 在多次处理管道中,可能存在什么问题(如延迟)? 3. 这种机制在哪些具体场景中应用(如SDWN、防火墙)? 4. 与标准处理相比,环回处理有何优势? 5. 如何配置设备以启用这种功能? 确保相关问题数量3-5个。 回应结构: - 先输出翻译。 - 然后解释。 - 最后 "§§相关问题§§" 和列表。 用中文书写。 </think>### 翻译及解释 **翻译:** 环回端口(Loopback Ports)允许扩展解析和修改能力。通过有选择地将部分流量通过环回端口转发,此流量将被第二次通过处理管道转发,从而实现额外功能。 --- ### 核心解释 1. **功能本质**: - 环回端口不是物理外接口,而是设备内部的**逻辑通道**,能将流出流量重新导回设备入口。 - “扩展解析和修改能力” 指支持对流量进行深度处理,如修改数据包头、添加标签、执行策略过滤或深度包检测(DPI)。 2. **工作机制**: - **选择性转发**:设备基于规则(如 ACL 或流量策略)筛选特定流量(例如匹配某 VLAN 或特定协议的包),将其定向到环回端口。 - **二次处理**: ```mermaid graph LR A[初始流量] --> B[第一轮处理管道] --> C{决策点} C -->|匹配规则| D[环回端口] D --> E[第二轮处理管道] E --> F[最终转发/丢弃] C -->|不匹配| G[直接转发] ``` - 例如:在第一轮处理中识别 HTTP 流量,在第二轮添加加密头部。 3. **应用场景与价值**: - **功能扩展**: - 在防火墙中,实现多次策略检查以增强安全性。 - 在 SDN 中,动态插入服务链(如先解密再压缩)。 - **优势**: - 避免外部设备依赖,减少延迟(相比服务端口)。 - 支持迭代处理(如解析→修改→再解析)。 - **限制**: - 过度使用可能导致处理延迟或资源争用(CPU/内存)。 4. **与标准环回端口的区别**: - 传统环回(如 `127.0.0.1`)仅用于本地测试,不具备流量修改能力。 - 此处描述的环回端口是**增强型**,集成于网络设备数据平面,支持业务逻辑执行。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值