In the previous version, we have an inherently sequential chain of operations in the innermost loop: we accumulate the minimum in the variable `v` by a sequence of `min` operations. There is no way to start the second operation before we know the result of the first one; there is no room for parallelism here:
```c++
// ...
v = std::min(v, z0);
v = std::min(v, z1);
v = std::min(v, z2);
v = std::min(v, z3);
v = std::min(v, z4);
// ...
```
### Independent operations
There is a simple way to reorganize the operations so that we have more room for parallelism. Instead of accumulating one minimum, we could accumulate two minimums, and at the very end combine them:
```c++
// ...
v0 = std::min(v0, z0);
v1 = std::min(v1, z1);
v0 = std::min(v0, z2);
v1 = std::min(v1, z3);
v0 = std::min(v0, z4);
// ...
v = std::min(v0, v1);
```
The result will clearly be the same, but we are calculating the operations in a different order. In essence, we split the work into two independent parts: calculating the minimum of the even-indexed elements and the minimum of the odd-indexed elements, and finally combining the results. If we calculate the two minimums `v0` and `v1` in an interleaved manner, as shown above, we will have more opportunities for parallelism. For example, the 1st and 2nd operations could be calculated simultaneously in parallel (or they could be executed in a pipelined fashion in the same execution unit). Once these results are available, the 3rd and 4th operations could be calculated simultaneously in parallel, and so on. We could potentially obtain a speedup of a factor of 2 here, and naturally the same idea could be extended to calculating, e.g., 4 minimums in an interleaved fashion, as sketched below.
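To make the idea concrete, here is a minimal sketch of the four-accumulator version (the function and array names are made up for illustration, and it assumes the element count is a multiple of 4):

```c++
#include <algorithm>
#include <limits>

// Accumulate four independent minimums and combine them at the end.
// The four std::min chains do not depend on each other, so the CPU can
// execute them in parallel or keep them flowing through a pipelined unit.
int min_of(const int *z, int n) {
    int v0 = std::numeric_limits<int>::max();
    int v1 = v0, v2 = v0, v3 = v0;
    for (int i = 0; i < n; i += 4) {
        v0 = std::min(v0, z[i]);
        v1 = std::min(v1, z[i + 1]);
        v2 = std::min(v2, z[i + 2]);
        v3 = std::min(v3, z[i + 3]);
    }
    return std::min(std::min(v0, v1), std::min(v2, v3));
}
```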
### Instruction-level parallelism is automatic
Now that we know how to reorganize calculations so that there is potential for parallelism, we need to know how to realize that potential. For example, if we have these two operations in the C++ code, how do we tell the computer that they can be safely executed in parallel?
```c++
v0 = std::min(v0, z0);
v1 = std::min(v1, z1);
```
The delightful answer is that it happens completely automatically: there is nothing we need to do (and nothing we can do)!
When programmers hear the word *parallelism*, they mostly think about *multi-core parallelism*, the practice of explicitly splitting a computation into semi-independent *threads* that work together to solve a common problem.
The magic takes place inside the CPU. The compiler just produces two machine language instructions, without any special annotation that indicates whether or not these instructions can be executed in parallel. The CPU will then automatically figure out which of the instructions can be executed in parallel.
This type of parallelism is mainly about reducing *latency* and achieving *scalability*, but not about improving *efficiency*. You can solve a problem ten times as big with a parallel algorithm, but it would take at least ten times as many computational resources. Although parallel hardware is becoming [ever more abundant](/hpc/complexity/hardware), and parallel algorithm design is becoming an increasingly more important area, for now, we will consider the use of more than one CPU core cheating.
A bit more precisely, the CPU will look at the instruction stream up to some distance in the future. If there are branches, it will do branch prediction to produce a sequential stream of instructions. Then it will see which of the instructions are ready for execution. For example, if it sees a future instruction X that only uses registers A and B, and there are no instructions before it that touch those registers, and none of the instructions that are currently in the pipeline modify those registers, either, then it is safe to start to execute X as soon as there is an execution unit that is available.
But there are other types of parallelism, already existing inside a CPU core, that you can use *for free*.
All of this happens in the hardware, all the time, fully automatically. The only thing that the programmer needs to do is to make sure there are sufficiently many independent instructions always available for execution.
<!--

This technique only applies

Parallel hardware is now everywhere. When you opened this page in your browser, it was retrieved by a 50-core server CPU, then parsed by an 8-core desktop CPU, and then rendered by a 400-core GPU. Not all cores were involved with serving you this page at all times — they might have been doing something else.

Parallelism helps in reducing *latency*. It is important, but for now, our main concern is not *scalability*, but *efficiency* of algorithms.

Sharing computations is an art in itself, but for now, we want to learn how to use resources that we already have more efficiently.

While multi-core parallelism is "cheating", many forms of parallelism exist "for free".

Adapting algorithms for parallel hardware is important for achieving *scalability*. In the first part of this book, we will consider this technique "cheating". We only do optimizations that are truly free, and preferably don't take away resources from other processes that might be running concurrently.

-->
### Instruction Pipelining

The same thing applies to CPUs and other hardware: to increase utilization, instructions are processed in a pipeline.
To execute *any* instruction, processors need to do a lot of preparatory work first, which includes:

- **fetching** a chunk of machine code from memory,
- **decoding** it and splitting into instructions,
- **executing** these instructions, which may involve doing some **memory** operations, and
- **writing** the results back into registers.
This whole sequence of operations is *long*. It takes up to 15-20 CPU cycles even for something simple like `add`-ing two register-stored values together. To hide this latency, modern CPUs use *pipelining*: after an instruction passes through the first stage, they start processing the next one right away, without waiting for the previous one to fully complete.

Pipelining does not reduce *actual* latency, but functionally it makes it seem as if the pipeline consisted of only the execution and memory stages. You still need to pay these 15-20 cycles, but you only need to do it once, after you've found the sequence of instructions you are going to execute.
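A rough back-of-the-envelope model (a simplification that ignores stalls and superscalar execution): if the pipeline has $d$ stages and we execute $n$ instructions, one entering the pipeline per cycle, the total time is about

$$
T \approx d + (n - 1) \;\text{cycles},
$$

rather than $d \cdot n$, so for long instruction sequences the 15-20 cycle front-end cost is paid essentially once.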
### Latency and Throughput of Instructions
It also makes sense to duplicate frequently used stages so that more than one instruction can be processed per cycle; such processors are called *superscalar*.

Interleaving the stages of execution is a general idea in hardware, and it is applied not only in the general CPU pipeline, but also on the level of separate instructions and [memory](/hpc/cpu-cache/mlp).
The latency and throughput numbers are architecture-specific. Some samples for my Zen 2 (all specified for 32-bit integers):

| Instruction | Latency | RThroughput |
|-------------|---------|:------------|
| `jmp`       | -       | 2           |
| `mov r, r`  | -       | 1/4         |
| `mov r, m`  | 4       | 1/2         |
| `mov m, r`  | 3       | 1           |
| `add`       | 1       | 1/3         |
| `cmp`       | 1       | 1/4         |
| `popcnt`    | 1       | 1/4         |
| `mul`       | 3       | 1           |
| `div`       | 13-28   | 13-28       |
[Integer division](/hpc/arithmetic/division) is an exception: it is either very poorly pipelined or not pipelined at all (like in this case).
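As a hedged illustration of latency versus reciprocal throughput (the functions below are made up for this sketch, and the cycle counts refer to the Zen 2 numbers in the table above): a single chain of dependent multiplications is limited by the 3-cycle latency of `mul`, while several independent chains can approach its reciprocal throughput of about one multiplication per cycle.

```c++
#include <cstdint>

// One dependency chain: every multiplication needs the previous result,
// so the loop runs at roughly the latency of mul (~3 cycles) per iteration.
uint32_t mul_chain(uint32_t x, uint32_t m, int n) {
    for (int i = 0; i < n; i++)
        x *= m;
    return x;
}

// Four independent chains: the CPU can overlap them,
// approaching the reciprocal throughput of about one mul per cycle.
uint32_t mul_interleaved(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                         uint32_t m, int n) {
    for (int i = 0; i < n; i++) {
        x0 *= m;
        x1 *= m;
        x2 *= m;
        x3 *= m;
    }
    return x0 ^ x1 ^ x2 ^ x3; // combine the results so nothing is optimized away
}
```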
For the instructions listed with a dash, you could consider the latency to be zero or undefined. For memory operations, latency is usually specified for the L1 cache.

The decode width limits how many instructions the CPU can take in per cycle; you can't get a total throughput higher than that.

Some instructions, such as `add` and `cmp`, have the same latency but different throughputs.

Sometimes operations have many forms: for example, `mov` with memory operands behaves differently from the register-register `mov`.

"RThroughput" is shorthand for "reciprocal throughput". Values less than one mean that several such instructions can be executed in a single cycle.

Instructions are issued to execution ports (or sometimes "pipes"). This is mostly relevant for SIMD.

Some instructions have a latency of 0. This means that these instructions are only used to control the scheduler and don't reach the execution stage; this is possible by virtue of register renaming. But they are still not entirely free, because we first need to [process them](/hpc/architecture/layout).

You can get this data from special documents called [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf).

You can schedule independent instructions separately (on different execution units), but only up to some extent.
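A useful rule of thumb (an estimate, not an exact law): to keep an execution unit busy, you need roughly

$$
\text{independent chains} \approx \text{latency} \times \text{throughput}
$$

separate dependency chains in flight. For the `mul` from the table above, that is about $3 \times 1 = 3$; adding more independent chains past that point doesn't help much, because the relevant execution ports are already saturated.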
### Instruction Scheduling
Modern processors don’t actually execute instructions one-by-one, but maintain a *pipeline* of pending instructions so that two independent operations can be executed concurrently without waiting for each other to finish.

When I said that the `add` instruction only takes one cycle, I lied a little bit. Every instruction needs a bit more than that: the whole thing takes around 5-6 clock cycles. But still, when you use it, it appears and feels like a single instruction. How does the CPU achieve that?

uOps ("micro-ops"; the first letter is meant to be the Greek letter mu, as in µs for microsecond, but nobody cares enough to type it).

Each architecture has its own set of "ports", each capable of executing its own set of instructions (uOps, to be more exact).

The thing is, most of the CPU isn't about computing.

Although it logically takes only 3 cycles, in real CPUs it takes much more.
But there is much more that can benefit from parallel thinking.
Parallelism is usually associated with multiple cores or specialized hardware, but actually there is a lot of parallelism happening inside a single CPU core.
### Latency and Throughput
All of this adds a new level of complexity.

Programming pipelined and superscalar processors presents its own challenges, which we are going to address in this chapter.

You know that your documentation is good when people have to reverse engineer it.

There are reasons to believe that folks at Intel don't know that themselves.

`llvm-mca` (the LLVM Machine Code Analyzer) is a tool that can statically estimate how a snippet of assembly would be scheduled and executed on a given microarchitecture.
### An Education Analogy
As an everyday metaphor, consider how a university works. It could have one student at a time and around 50 professors who take turns tutoring them, but this would be highly inefficient and result in one bachelor's degree every 4 years.

Maybe this is how the members of the British royal family study.

But for better or worse, education is scaled.

Instead, universities do a few smart things:

1. They teach large groups of students at once instead of individuals, broadcasting the same thing (SIMD).
2. They might split work between different parallel groups (superscalar processing).
3. They overlap their classes so that all professors are kept busy. This way you can increase throughput by 4x.

For the first trick, the CPU world analogue is SIMD, which we covered in the previous chapter. And for the last one, it is the technique called pipelining, which we are going to discuss next.

These tricks roughly match what a CPU does:

1. SIMD to process 16, 32, or 64 bytes of data at a time.
2. Superscalar processing to handle 2 to 4 SIMD blocks at a time.
3. Pipelining, with roughly 15 stages (about the number of years between kindergarten and a PhD).

In addition to that, other aspects of the analogy also hold: execution paths become more divergent, some instructions are stalled at various stages, some are interrupted, and some are executed speculatively without knowing whether they will be needed.

There are many such aspects, and in this chapter we are going to explore them.

You might fail a course, but proceed somewhere else.

As in education, these things cause problems, and the first thing we will do in this chapter is learn how to avoid them.
content/english/hpc/pipelining/branching.md
N = 1e6, the loop is run many times, and the variable `sum` is marked `volatile`, which means the compiler can't vectorize the loop, merge adjacent iterations into one, or cheat in any other way.
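The exact benchmark is not included in this diff; a hypothetical reconstruction of the kind of loop being discussed (the threshold and array are made up for illustration) could look like this:

```c++
// A data-dependent branch inside a hot loop; sum is volatile so the compiler
// can't vectorize the loop, merge adjacent iterations, or optimize it away.
int sum_below_threshold(const int *a, int n) {
    volatile int sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] < 50)        // assumed, unpredictable condition
            sum = sum + a[i];
    return sum;
}
```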
---
Explanation: to execute any instruction, the processor first needs to do a lot of preparatory work: read the machine code from memory, figure out what instruction it is and where it should be executed, find the required data and place it into the operand registers, write the result where it belongs, and so on. This whole process takes considerable time (~15 cycles), but each step involves only some separate module of the CPU, which is why modern processors use *pipelining*: when instruction N starts executing, on the next cycle the processor immediately takes instruction (N + 1) into processing, without waiting for N to complete.
An analogy: the education system, divided into grades and courses. First-year university professors will teach the next cohort the following year instead of waiting 4 years for the first one to graduate.
---
"if" и любой другой control flow — это как раз второй тип, и в случае с исходным циклом он проверяется на каждой итерации, создавая пробку из инструкций. Так как ифы исполняются часто, и с увеличением пайплайна эти пузыри стали серьёзной проблемой, производители процессоров добавили специальную инструкцию cmov ("conditional move"), которая позволяет по произвольному условию записать в переменную либо одно значение, либо другое — но не исполнять произвольный код. Когда мы заменяем явный if на тернарное условие в духе s += (cond ? x : 0), компилятор делает подобную оптимизацию, заменяя ветку на cmov.
This is roughly equivalent to the following algebraic trick:
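The snippet that originally followed is not included here; a minimal sketch of the kind of algebraic trick meant (assuming `cond` is a 0/1 value) would be:

```c++
// Branchless accumulation: cond is either 0 or 1, so the multiplication
// selects either x or 0 without executing a branch.
// This is equivalent to: s += (cond ? x : 0);
int accumulate(int s, bool cond, int x) {
    return s + cond * x;
}
```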
content/english/hpc/pipelining/hazards.md
This technique allows many instructions to be in the queue simultaneously and hides their latency, but if a situation arises where the processor, for example, is waiting for data from some instruction, or cannot determine in advance which instruction to execute next, a "bubble" appears in the pipeline,
by analogy with an air bubble in a fluid pipe: it propagates through the pipeline.

There are two main types of bubbles: a relatively light one, when the processor is waiting for data from a previous operation (it depends on that operation's latency, but in our case it is ~5 cycles), and a heavy one, when the processor is waiting for new instructions (~15 cycles).
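As a rough illustration of the two kinds of bubbles (hypothetical snippets, not taken from the original text):

```c++
// "Light" bubble, a data hazard: every multiplication depends on the previous
// one, so each iteration waits for the latency of the previous operation.
long long product(const int *a, int n) {
    long long x = 1;
    for (int i = 0; i < n; i++)
        x *= a[i];
    return x;
}

// "Heavy" bubble, a control hazard: if this branch is unpredictable, a
// misprediction forces the processor to discard the speculatively fetched
// instructions and refill the pipeline.
long long sum_below(const int *a, int n, int threshold) {
    long long s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] < threshold)
            s += a[i];
    return s;
}
```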
Let's talk more about instruction scheduling and what can go wrong in the pipeline.
*Hazards* are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle.