To understand what HT is, you need to understand the pipeline. The pipeline is a system in the CPU that breaks each operation down into smaller chunks and executes each of those chunks separately. Think of it like doing your laundry: you can wash one load in the washing machine while your previous load dries in the dryer. If you have a lot of laundry to do, it would be silly to leave the washer empty just because the dryer is busy, right? But that’s exactly how pre-486 Intel CPUs worked.
Imagine this hypothetical 4-stage pipeline for something like the 8080 CPU:
- Retrieve the instruction from memory
- Retrieve data from memory (if necessary)
- Execute the instruction
- Write data back to memory (if necessary)
Some instructions, like “NOOP”, would take only 2 ticks to run: get the instruction, then execute it. An instruction like “store the value in the A register to the specified memory location” would take 5 ticks, since the CPU has to read two more bytes to get the memory address, then actually write the register value to memory.
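If you want to see that arithmetic in code, here’s a minimal C sketch of the hypothetical machine. The stage names and per-stage tick counts are just this article’s made-up example, not real 8080 timings:

Code:

#include <stdio.h>

/* The four hypothetical stages from the list above. */
enum stage { FETCH_INSN, FETCH_DATA, EXECUTE, WRITE_BACK };

/* Ticks each made-up instruction spends in each stage, indexed by
 * enum stage. NOOP = fetch + execute = 2 ticks; the store needs two
 * extra fetches for the address bytes plus a write-back = 5 ticks. */
struct insn {
    const char *name;
    int ticks[4];
};

int main(void) {
    struct insn program[] = {
        { "NOOP",          { 1, 0, 1, 0 } },
        { "STORE A, addr", { 1, 2, 1, 1 } },
    };
    for (int i = 0; i < 2; i++) {
        int total = 0;
        for (int s = 0; s < 4; s++)
            total += program[i].ticks[s];
        printf("%-13s takes %d ticks\n", program[i].name, total);
    }
    return 0;
}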
Older CPUs like the 8080 execute all of the steps in an operation before advancing to the next program step. Starting with the 80486 processor, Intel added a pipeline, which allows these steps to operate in parallel. Like the laundry example, a 486 chip can actually operate on parts of more than one instruction at a time. This gives the CPU a faster effective throughput, even though each individual instruction still goes through its stages in series.
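A quick back-of-envelope calculation shows how much the overlap buys you. Assuming an idealized 4-stage pipeline where every stage takes exactly one tick and nothing ever stalls:

Code:

#include <stdio.h>

#define STAGES 4

int main(void) {
    int n = 10; /* instructions in the program */

    /* Un-pipelined (8080-style): each instruction finishes all four
     * stages before the next one starts. */
    int serial = n * STAGES;

    /* Pipelined (486-style): the first instruction fills the pipe,
     * then one instruction completes every tick. */
    int pipelined = STAGES + (n - 1);

    printf("serial:    %d ticks\n", serial);    /* 40 */
    printf("pipelined: %d ticks\n", pipelined); /* 13 */
    return 0;
}

For 10 instructions that’s 40 ticks versus 13, even though any single instruction still takes 4 ticks from start to finish.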
But there are some problems with pipelining: one is that branch instructions can stall the pipeline. Imagine the following sequence of operations:
Code:
start:
1 load A from memory location 123
2 compare A and B
3 if they are equal, skip to equal
4 print “not equal”
5 jump to start
equal:
6 print “equal”
So steps 1-3 are running through our 4-stage pipeline. By the time step 3 hits the “execute” stage, step 4 should be in the “retrieve data from memory” stage, and step 5 should be in the “retrieve instruction from memory” stage.
But wait… A and B are equal, so we can’t actually execute the ‘print “not equal”’ statement! Instead, we have to jump down to step 6. What happens to the stuff already loaded in the pipeline? Basically, it gets thrown away, and the CPU sits idle for 3 ticks while it loads the new code sequence.
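Here’s a toy C model of that flush penalty. It simply charges each taken branch the 3 idle ticks described above; the numbers are illustrative, not measurements of any real CPU:

Code:

#include <stdio.h>

#define STAGES 4

/* Toy model: every instruction takes one tick per stage, and a taken
 * branch flushes the work behind it, costing STAGES - 1 idle ticks
 * while the branch target refills the pipe. For simplicity we charge
 * the same dynamic instruction count either way and just add the
 * flush penalty. */
static int run_ticks(int n_insns, int taken_branches) {
    int fill_and_drain = STAGES + (n_insns - 1);
    return fill_and_drain + taken_branches * (STAGES - 1);
}

int main(void) {
    printf("no taken branch:  %d ticks\n", run_ticks(6, 0)); /* 9  */
    printf("one taken branch: %d ticks\n", run_ticks(6, 1)); /* 12 */
    return 0;
}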
The way data moves through the pipeline is with the help of pipeline registers, which hold the output from one pipeline stage and feed it to the next stage. So what would happen if we split up the pipeline so that it was doing two different things at once? Essentially, we’d have this:
Code:
STAGE 1 - Task 2 Step 2
STAGE 2 - Task 1 Step 2
STAGE 3 - Task 2 Step 1
STAGE 4 - Task 1 Step 1
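As an aside, you can picture each of those pipeline registers as a little latch struct sitting on a stage boundary, and a branch flush as nothing more than clearing the valid bits behind the branch. This is just a sketch with made-up field names, not any real CPU’s layout:

Code:

#include <stdint.h>
#include <stdio.h>

/* A pipeline register holds one stage's output until the next stage
 * consumes it on the following tick. */
struct pipe_reg {
    uint16_t pc;        /* which instruction this work belongs to */
    uint8_t  opcode;    /* what has been decoded so far           */
    uint16_t operand;   /* address/data gathered so far           */
    uint8_t  thread_id; /* with HT: which task owns this slot     */
    uint8_t  valid;     /* 0 = bubble (an empty slot in the pipe) */
};

/* One latch between each adjacent pair of stages in a 4-stage pipe. */
static struct pipe_reg latch[3];

/* Flushing after a taken branch just invalidates the wrong-path work
 * sitting in the earlier latches. */
static void flush_behind_execute(void) {
    latch[0].valid = 0;
    latch[1].valid = 0;
}

int main(void) {
    latch[0].valid = latch[1].valid = latch[2].valid = 1;
    flush_behind_execute();
    printf("bubbles after flush: %d\n",
           !latch[0].valid + !latch[1].valid + !latch[2].valid);
    return 0;
}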
One of the benefits here is that since the two tasks alternate stages, we know the outcome of a test instruction before we fetch that task’s next instruction. So every time we have a branch that gets taken, we gain some time over the single-threaded architecture. Further refinements allow HT to schedule the threads dynamically, so things like reading data from main memory (rather than the local cache) can bump the other thread’s priority, allowing that thread to execute faster while the blocked thread waits for memory to return the data.
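Here’s a toy C simulation of that dynamic scheduling idea: two threads share one execute slot per tick, and whenever thread 0 is waiting on a cache miss, thread 1 issues instead of letting the slot sit idle. The miss latency and miss pattern are invented for illustration and aren’t Intel’s actual policy:

Code:

#include <stdio.h>

#define MISS_LATENCY 5 /* made-up ticks to fetch from main memory */

int main(void) {
    int remaining[2] = { 8, 8 }; /* instructions left per thread  */
    int blocked[2]   = { 0, 0 }; /* ticks until a thread is ready */
    int tick = 0, idle = 0;

    while (remaining[0] + remaining[1] > 0) {
        int issued = 0;
        /* advance any outstanding memory waits */
        for (int t = 0; t < 2; t++)
            if (blocked[t] > 0)
                blocked[t]--;
        /* issue from the first ready thread (one shared slot per tick) */
        for (int t = 0; t < 2 && !issued; t++) {
            if (blocked[t] == 0 && remaining[t] > 0) {
                remaining[t]--;
                issued = 1;
                /* pretend every 4th instruction of thread 0 misses cache */
                if (t == 0 && remaining[t] % 4 == 0)
                    blocked[t] = MISS_LATENCY;
            }
        }
        if (!issued)
            idle++;
        tick++;
    }
    printf("finished in %d ticks, %d idle\n", tick, idle);
    return 0;
}

With these made-up numbers, both threads retire all 16 instructions in 16 ticks with zero idle slots, because thread 1 soaks up every tick that thread 0 spends waiting on memory. That stall-hiding is where HT’s throughput gain comes from.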
Things like this are why Hyperthreading can actually speed up execution of two tasks, compared to serializing them.
Finally, I tested this when the first HT CPUs came out. I ran benchmarks with Hyperthreading enabled and disabled, and I compared those results to a non-Hyperthreaded CPU. Generally speaking, a single, high priority task would execute at the same speed, with or without HT. However, when running two or more tasks, the HT processor did a better job of multitasking, and the overall throughput of the HT CPU was better than the non-HT CPU. It was not double, however. It was probably closer to the 15-20% that other people have mentioned.