
Commit cb58ea3

instruction pipelining
1 parent e75c376 commit cb58ea3

File tree

4 files changed: +165 −74 lines changed


content/english/hpc/pipelining/_index.md

Lines changed: 117 additions & 46 deletions
@@ -3,74 +3,114 @@ title: Instruction-Level Parallelism
33
weight: 3
44
---
55

6-
In the previous version, we have an inherently sequential chain of operations in the innermost loop. We accumulate the minimum in variable v by a sequence of min operations. There is no way to start the second operation before we know the result of the first operation; there is no room for parallelism here:
7-
8-
...
9-
v = std::min(v, z0);
10-
v = std::min(v, z1);
11-
v = std::min(v, z2);
12-
v = std::min(v, z3);
13-
v = std::min(v, z4);
14-
...
15-
Independent operations
16-
There is a simple way to reorganize the operations so that we have more room for parallelism. Instead of accumulating one minimum, we could accumulate two minimums, and at the very end combine them:
17-
18-
...
19-
v0 = std::min(v0, z0);
20-
v1 = std::min(v1, z1);
21-
v0 = std::min(v0, z2);
22-
v1 = std::min(v1, z3);
23-
v0 = std::min(v0, z4);
24-
...
25-
v = std::min(v0, v1);
26-
The result will be clearly the same, but we are calculating the operations in a different order. In essence, we split the work in two independent parts, calculating the minimum of odd elements and the minimum of even elements, and finally combining the results. If we calculate the odd minimum v0 and even minimum v1 in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operation could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operation could be calculated simultaneously in parallel, etc. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating e.g. 4 minimums in an interleaved fashion.
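As a minimal self-contained sketch of this idea (my own illustration; the function name, the array `z`, and its length `n` are placeholders, not part of the original code):

```c++
#include <algorithm>

// Accumulate two interleaved minimums and combine them at the end.
// Assumes n >= 2; for odd n the last element is handled separately.
int min_interleaved(const int *z, int n) {
    int v0 = z[0], v1 = z[1];
    for (int i = 2; i + 1 < n; i += 2) {
        v0 = std::min(v0, z[i]);      // "even" chain
        v1 = std::min(v1, z[i + 1]);  // "odd" chain, independent of v0
    }
    if (n % 2 == 1)
        v0 = std::min(v0, z[n - 1]);  // leftover element when n is odd
    return std::min(v0, v1);          // combine the two partial results
}
```

The two `std::min` chains do not depend on each other, so the CPU is free to execute them in parallel.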
27-
28-
Instruction-level parallelism is automatic
29-
Now that we know how to reorganize calculations so that there is potential for parallelism, we will need to know how to realize the potential. For example, if we have these two operations in the C++ code, how do we tell the computer that the operations can be safely executed in parallel?
30-
31-
v0 = std::min(v0, z0);
32-
v1 = std::min(v1, z1);
33-
The delightful answer is that it happens completely automatically: there is nothing we need to do (and nothing we can do)!
6+
When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem.
347

35-
The magic takes place inside the CPU. The compiler just produces two machine language instructions, without any special annotation that indicates whether or not these instructions can be executed in parallel. The CPU will then automatically figure out which of the instructions can be executed in parallel.
8+
This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly important area, for now, we will consider using more than one CPU core to be cheating.
369

37-
A bit more precisely, the CPU will look at the instruction stream up to some distance in the future. If there are branches, it will do branch prediction to produce a sequential stream of instructions. Then it will see which of the instructions are ready for execution. For example, if it sees a future instruction X that only uses registers A and B, and there are no instructions before it that touch those registers, and none of the instructions that are currently in the pipeline modify those registers, either, then it is safe to start to execute X as soon as there is an execution unit that is available.
10+
But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*.
3811

39-
All of this happens in the hardware, all the time, fully automatically. The only thing that the programmer needs to do is to make sure there are sufficiently many independent instructions always available for execution.
12+
<!--
13+
14+
This technique only applies
15+
16+
Parallel hardware is now everywhere. When you opened this page in your browser, it was retrieved by a 50-core server CPU, then parsed by an 8-core desktop CPU, and then rendered by a 400-core GPU. Not all cores were involved with serving you this page at all times — they might have been doing something else.
17+
18+
Parallelism helps in reducing *latency*. It is important, but for now, our main concern is not *scalability*, but *efficiency* of algorithms.
19+
20+
Sharing computations is an art in itself, but for now, we want to learn how to use resources that we already have more efficiently.
21+
22+
While multi-core parallelism is "cheating", many forms of parallelism exist "for free".
23+
24+
Adapting algorithms for parallel hardware is important for achieving *scalability*. In the first part of this book, we will consider this technique "cheating". We only do optimizations that are truly free, and preferably don't take away resources from other processes that might be running concurrently.
25+
26+
-->
4027

4128
### Instruction Pipelining
4229

43-
The same thing applies to CPUs and other hardware: to increase utilization, instructions are processed in a pipeline.
30+
To execute *any* instruction, processors need to do a lot of preparatory work first, which includes:
31+
32+
- **fetching** a chunk of machine code from memory,
33+
- **decoding** it and splitting into instructions,
34+
- **executing** these instructions, which may involve doing some **memory** operations, and
35+
- **writing** the results back into registers.
36+
37+
This whole sequence of operations is *long*. It takes up to 15-20 CPU cycles even for something simple like `add`-ing two register-stored values together. To hide this latency, modern CPUs use *pipelining*: after an instruction passes through the first stage, they start processing the next one right away, without waiting for the previous one to fully complete.
38+
39+
![](img/pipeline.png)
40+
41+
Pipelining does not reduce the *actual* latency, but it functionally makes it seem as if only the execution and memory stages mattered. You still need to pay these 15-20 cycles, but you only need to pay them once, when the pipeline is first filled with the sequence of instructions you are going to execute.
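As a rough back-of-the-envelope estimate (my own addition, assuming a ~15-stage pipeline that can start one new instruction every cycle), executing a long stream of $n$ instructions takes about

$$
\text{cycles}(n) \approx 15 + (n - 1)
\qquad \Rightarrow \qquad
CPI = \frac{\text{cycles}(n)}{n} \to 1 \quad \text{as} \quad n \to \infty,
$$

so the cost of filling the pipeline is amortized as long as the instruction stream keeps flowing.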
42+
43+
### Latency and Throughput of Instructions
44+
45+
It also makes sense to duplicate the stages that are used most frequently, so that more than one instruction can pass through them at the same time. Processors that can do this are called *superscalar*.
46+
47+
![Pipeline of a superscalar CPU with the width of 2](img/superscalar.png)
48+
49+
Interleaving the stages of execution is a general idea in hardware, and it is applied not only in the general CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp).
50+
51+
Latency and throughput numbers are architecture-specific. Here are some sample values for my Zen 2, all specified for 32-bit integer operands:
54+
55+
| Instruction | Latency | RThroughput |
|-------------|---------|:------------|
| `jmp`       | -       | 2           |
| `mov r, r`  | -       | 1/4         |
| `mov r, m`  | 4       | 1/2         |
| `mov m, r`  | 3       | 1           |
| `add`       | 1       | 1/3         |
| `cmp`       | 1       | 1/4         |
| `popcnt`    | 1       | 1/4         |
| `mul`       | 3       | 1           |
| `div`       | 13-28   | 13-28       |
67+
[Integer division](/hpc/arithmetic/division) is an exception: it is either very poorly pipelined or not pipelined at all (like in this case).
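As an illustration (a hypothetical sketch of mine, not from the text): because the divider is not pipelined, it doesn't help much that the divisions below are independent of each other; they still pass through the divider one at a time, so the loop runs at roughly the latency of `div` per iteration:

```c++
// d, a, and n are placeholder names; s accumulates the quotients.
int sum_of_quotients(const int *a, int n, int d) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] / d;  // independent divisions, but a serialized divider
    return s;
}
```

(With a compile-time-constant divisor, the compiler would replace the division with a multiplication, which is exactly the trick covered in the division chapter linked above.)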
68+
69+
For the instructions marked with a dash (such as `jmp` and register-to-register `mov`), you can consider the latency to be zero or undefined. For memory operations, the latency is usually specified for the L1 cache.
70+
71+
There is also the *decode width*, the maximum number of instructions the CPU can decode per cycle, and you can't get a total throughput higher than that.
72+
73+
Some instructions have the same latency but different throughputs: for example, `add`, `cmp`, and `popcnt` all have a latency of 1 cycle but differ in reciprocal throughput.
74+
75+
Sometimes one operation has many forms with different characteristics: for example, `mov` with memory operands behaves differently from the register-to-register `mov`.
76+
77+
"RThroughput" is shorthand for "reciprocal throughput". Values less than one mean that.
78+
79+
Instructions are executed on *execution ports* (sometimes called "pipes"), each handling its own subset of operations. This is mostly relevant for SIMD.
80+
81+
Some instructions have a latency of 0: they only affect the scheduler and never reach the execution stage, which is possible by virtue of register renaming. But they still have a non-zero cost, because we first need to [process them](/hpc/architecture/layout).
82+
83+
You can get this data from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf).
84+
85+
You can schedule independent instructions separately, but only to some extent.
86+
87+
### Instruction Scheduling
4488

4589
Modern processors don’t actually execute instructions one-by-one, but maintain a *pipeline* of pending instructions so that two independent operations can be executed concurrently without waiting for each other to finish.
4690

47-
When I said that the `add` instruction only takes one cycle, I lied a little bit. Every instruction needs a bit more than that: the whole thing takes around 5-6 clock cycles. But still, when you use it, it appears and feels like a single-cycle instruction. How does the CPU achieve that?
91+
Instructions are internally split into *uOps* ("micro-ops"; the first letter is meant to be the Greek letter mu, as in "µs" for microsecond, but nobody cares enough to type it).
4892

49-
The thing is, most of the CPU isn't about computing.
93+
Each architecture has its own set of "ports", each capable of executing its own set of instructions (uOps, to be more exact).
5094

51-
![](img/pipeline.png)
95+
But still, when you use it, it appears and feels like a single instruction. How does the CPU achieve that?
96+
97+
The thing is, most of the CPU isn't about computing.
5298

5399
Although logically an instruction fundamentally takes only a few cycles, in real CPUs the whole process takes much longer.
54100

55-
### An Education Metaphor
101+
But there is much more that can benefit from parallel thinking.
56102

57-
As an everyday metaphor, consider how a university works. It could have one student at a time and around 50 professors who would take turns tutoring them, but this would be highly inefficient and result in one bachelor's degree every 4 years.
58103

59-
Maybe this is how the members of the British royal family study.
60-
61-
But for better or worse, education is scaled.
104+
Parallelism is usually associated with multiple cores or specialized hardware, but actually there is a lot of parallelism happening inside the CPU.
62105

63-
Instead, universities do two smart things:
106+
The magic takes place inside the CPU. The compiler just produces two machine language instructions, without any special annotation that indicates whether or not these instructions can be executed in parallel. The CPU will then automatically figure out which of the instructions can be executed in parallel.
64107

65-
1. They teach to large groups of students at once instead of individuals.
66-
2. They overlap their "classes" so that all professors are kept busy. This way you can increase throughput by 4x.
108+
A bit more precisely, the CPU will look at the instruction stream up to some distance in the future. If there are branches, it will do branch prediction to produce a sequential stream of instructions. Then it will see which of the instructions are ready for execution. For example, if it sees a future instruction X that only uses registers A and B, and there are no instructions before it that touch those registers, and none of the instructions that are currently in the pipeline modify those registers, either, then it is safe to start to execute X as soon as there is an execution unit that is available.
67109

68-
For the first trick, the CPU world analogue is SIMD, which we covered in the previous chapter. And for the second, it is the technique called pipelining, which we are going to discuss next.
110+
All of this happens in the hardware, all the time, fully automatically. The only thing that the programmer needs to do is to make sure there are sufficiently many independent instructions always available for execution.
69111

70112
### Latency and Throughput
71113

72-
![](img/superscalar.png)
73-
74114
Pipelined and superscalar execution adds a new level of complexity to reasoning about performance.
75115

76116
Programming pipelined and superscalar processors presents its own challenges, which we are going to address in this chapter.
@@ -93,3 +133,34 @@ You know that your documentation is good when people have to reverse engineer it
93133
There are reasons to believe that folks at Intel don't know that themselves.
94134

95135
llvm-mca
136+
137+
138+
### An Education Analogy
139+
140+
As an everyday metaphor, consider how a university works. It could have one student at a time and around 50 professors who would take turns tutoring them, but this would be highly inefficient and result in one bachelor's degree every 4 years.
141+
142+
Maybe this is how the members of the British royal family study.
143+
144+
But for better or worse, education is scaled.
145+
146+
Instead, universities do a few smart things:
147+
148+
1. They teach large groups of students at once instead of individuals, broadcasting the same material to everyone (SIMD).
149+
2. They might split work between different parallel groups (superscalar processing).
150+
3. They overlap their classes so that all professors are kept busy. This way you can increase throughput by 4x.
151+
152+
The CPU-world analogue of the first trick is SIMD, which we covered in the previous chapter; the second corresponds to superscalar processing, and the third is the technique called pipelining, which we are going to discuss next.
153+
154+
The scales kind of match, too:
155+
156+
1. SIMD to process 16, 32, or 64 bytes of data at a time.
157+
2. Superscalar processing to handle 2 to 4 SIMD blocks at a time.
158+
3. Pipelining with a depth of ~15 stages (roughly equal to the number of years between kindergarten and a PhD).
159+
160+
Other aspects of the analogy also hold: execution paths become more divergent, some instructions get stalled at various stages, some are interrupted, and some are executed speculatively before it is known whether they are needed at all.
161+
162+
There are many such aspects, and in this chapter we are going to explore them.
163+
164+
You might fail a course, but proceed somewhere else.
165+
166+
Similar to education, these also cause problems, and the first thing we will do in this chapter is learn how to avoid them.

content/english/hpc/pipelining/branching.md

Lines changed: 0 additions & 12 deletions
@@ -72,18 +72,6 @@ for (int i = 0; i < N; i++)
7272

7373
N = 1e6, the loop is run many times, and the variable sum is marked as volatile, which means the compiler can't vectorize the loop, merge adjacent iterations into one, or cheat in any other way.
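For reference, a compilable sketch of the kind of benchmark described here (my reconstruction; the array contents and the threshold of 50 are assumptions, not the repository's exact code):

```c++
#include <cstdlib>

const int N = 1000000;  // 1e6, as in the text
int a[N];               // assumed: filled with rand() % 100 before measuring

volatile int sum;       // volatile: no vectorization, no merging of iterations

void benchmark() {
    for (int i = 0; i < N; i++)
        if (a[i] < 50)  // an unpredictable 50/50 branch
            sum += a[i];
}
```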
7474

75-
---
76-
77-
Explanation: to execute any instruction, the processor first needs to do a lot of preparatory work: read the machine code from memory, figure out what kind of instruction it is and where it should be executed, find the required data and place it in the operand registers, write the result where it belongs… This whole process takes a considerable amount of time (~15 cycles), but each step involves only a particular module of the CPU, which is why modern processors use *pipelining*: when instruction N starts executing, on the next cycle the processor immediately takes instruction (N + 1) into processing, without waiting for N to complete.
78-
79-
Analogy: the education system, split into grades and courses. The professors teaching first-year university students will teach the next cohort the following year instead of waiting 4 years for the first one to graduate.
80-
81-
This technique allows many instructions to be processed in the queue simultaneously and hides their latency, but if a situation arises where the processor is, for example, waiting for data from some instruction, or cannot determine in advance which instruction to execute next, a "bubble" appears in the pipeline.
82-
83-
---
84-
85-
There are two main types of bubbles: a relatively light one, when the processor is waiting for data from a previous operation (its cost depends on the latency of that operation, but in our case it is ~5 cycles), and a heavy one, when the processor is waiting for new instructions (~15 cycles).
86-
8775
"if" и любой другой control flow — это как раз второй тип, и в случае с исходным циклом он проверяется на каждой итерации, создавая пробку из инструкций. Так как ифы исполняются часто, и с увеличением пайплайна эти пузыри стали серьёзной проблемой, производители процессоров добавили специальную инструкцию cmov ("conditional move"), которая позволяет по произвольному условию записать в переменную либо одно значение, либо другое — но не исполнять произвольный код. Когда мы заменяем явный if на тернарное условие в духе s += (cond ? x : 0), компилятор делает подобную оптимизацию, заменяя ветку на cmov.
8876

8977
This is roughly equivalent to the following algebraic trick:

content/english/hpc/pipelining/hazards.md

Lines changed: 15 additions & 16 deletions
@@ -3,6 +3,21 @@ title: Pipeline Hazards
33
weight: 1
44
---
55

6+
This technique allows many instructions to be processed in the queue simultaneously and hides their latency, but if a situation arises where the processor is, for example, waiting for data from some instruction, or cannot determine in advance which instruction to execute next, a "bubble" appears in the pipeline.
7+
8+
9+
The name comes from the analogy with an air bubble in a fluid pipe: it propagates through the pipeline.
10+
11+
![Pipeline stall on the execution stage](../img/bubble.png)
12+
13+
There are two main types of bubbles: a relatively light one, when the processor is waiting for data from a previous operation (its cost depends on the latency of that operation, but in our case it is ~5 cycles), and a heavy one, when the processor is waiting for new instructions (~15 cycles).
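As a rough illustration (my own sketch; `a`, `n`, and the data distribution are assumptions), the first loop below mostly produces the light kind of bubble and the second the heavy kind:

```c++
volatile int s = 0;

// Light bubbles: each iteration waits a few cycles for the previous
// volatile read-add-store of s to complete (a data dependency).
void data_stalls(const int *a, int n) {
    for (int i = 0; i < n; i++)
        s = s + a[i];
}

// Heavy bubbles: with random data the branch is mispredicted about half the
// time, and every misprediction flushes the pipeline (~15 cycles).
void control_stalls(const int *a, int n) {
    for (int i = 0; i < n; i++)
        if (a[i] % 2 == 1)
            s = s + a[i];
}
```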
14+
15+
16+
17+
Let's talk more about instruction scheduling and what can go wrong in the pipeline.
18+
19+
20+
621
*Hazards* are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.
722

823
Let's dive deeper into microarchitecture.
@@ -37,22 +52,6 @@ $\text{# of instructions} \to \infty,\; CPI \to 1$
3752
3. Execute: send it to a separate execution unit
3853
4. Write: write the data back to registers or set some flags
3954

40-
## Latency and Throughput

| Operation | Latency | $\frac{1}{\text{throughput}}$ |
| --------- | ------- |:------------ |
| MOV    | 1     | 1/3  |
| JMP    | 0     | 2    |
| ADD    | 1     | 1/3  |
| SUM    | 1     | 1/3  |
| CMP    | 1     | 1/3  |
| POPCNT | 3     | 1    |
| MUL    | 3     | 1    |
| DIV    | 11-21 | 7-11 |

"Nehalem" (Intel i7) op tables: https://www.agner.org/optimize/instruction_tables.pdf
55-
5655
### Superscalar Processors
5756

5857
![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Superscalarpipeline.svg/2880px-Superscalarpipeline.svg.png =500x)
