As we established in [the previous section](../branching), branches that can't be effectively predicted by the CPU are expensive, as they may cause a long pipeline stall to fetch new instructions after a branch mispredict. In this section, we discuss ways of removing branches in the first place.
### Predication
We are going to continue the same case study we've started before — we create an array of random numbers and sum up all its elements below 50:

```c++
for (int i = 0; i < N; i++)
    a[i] = rand() % 100;

volatile int s = 0;

for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];
```
Our goal is to eliminate the branch caused by the `if` statement. We can try to get rid of it like this:

```c++
for (int i = 0; i < N; i++)
    s += (a[i] < 50) * a[i];
```
Suddenly, the loop now takes ~7 cycles per element, instead of the original ~14. Also, the performance remains constant if we change `50` to some other threshold, so it doesn't depend on the branch probability.
But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly?
There are no boolean types in assembly, nor any instructions that yield either one or zero based on the result of the comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then the highest bit of the result will be set to one, which we can then extract using a right-shift.
```nasm
mov  ebx, eax   ; t = x
sub  ebx, 50    ; t -= 50
shr  ebx, 31    ; t >>= 31 (logical shift: t is now 1 if x < 50 and 0 otherwise)
imul eax, ebx   ; x *= t
```
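
For clarity, the same trick can be sketched in plain C++ (our own illustration, not the compiler's output; we shift an unsigned value so that extracting the sign bit is well-defined):

```c++
#include <cstdint>

// Branchless version of (x < 50 ? x : 0): if (x - 50) is negative,
// its sign bit is 1; a logical right-shift extracts that bit,
// and multiplying by it either keeps or zeroes out x
int keep_if_below_50(int x) {
    uint32_t t = (uint32_t) (x - 50) >> 31; // t = 1 if x < 50, else 0
    return x * (int) t;
}
```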
But the compiler actually produced something different. Instead of going with this arithmetic trick, it used a special `cmov` ("conditional move") instruction that assigns a value based on a condition (which is computed and checked using the flags register, the same way as for jumps):
```nasm
mov    ebx, 0   ; cmov doesn't support immediate values, so we need a zero register
cmp    eax, 50
cmovge eax, ebx ; eax = (eax >= 50 ? ebx : eax), i.e., eax is kept only if it is less than 50
```
So the code above is actually closer to using a ternary operator like this:

```c++
for (int i = 0; i < N; i++)
    s += (a[i] < 50 ? a[i] : 0);
```

Both variants are optimized by the compiler and produce the same `cmov`-based assembly.

This general technique is called *predication*, and it is roughly equivalent to this algebraic trick:
$$
x = c \cdot a + (1 - c) \cdot b
$$
This way you can eliminate branching, but this comes at the cost of evaluating *both* branches and the `cmov` itself. Because evaluating the ">=" branch costs nothing, the performance is exactly equal to [the "always yes" case](branching/#branch-prediction) in the branchy version.
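
For instance, the formula translates directly into a generic branchless select (a hypothetical helper for illustration; `c` must be zero or one):

```c++
// x = c * a + (1 - c) * b: yields a when c == 1 and b when c == 0,
// touching both operands but taking no branch
int select(int c, int a, int b) {
    return c * a + (1 - c) * b;
}
```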
### When It Is Beneficial
Using predication eliminates [a control hazard](../hazard) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for the `cmov` to be resolved, and not flush the entire pipeline in case of a mispredict.
However, there are many situations when it is more efficient to leave branchy code as it is. This is the case when the cost of computing *both* branches instead of just *one* outweighs the penalty for the potential branch mispredictions.
In our example, the branchy code wins when the branch can be predicted with a probability of more than ~75%.
This 75% threshold is commonly used by compilers as a heuristic for determining whether to use a `cmov` or not. Unfortunately, this probability is usually unknown at compile time, so it needs to be provided in one of several ways:
- We can use [profile-guided optimization](/hpc/compilation/pgo) which will decide for itself whether to use predication or not.
- We can use [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likelihood of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang.
- We can rewrite branchy code using the ternary operator or various arithmetic tricks, which act as a sort of implicit contract between programmers and compilers: if the programmer wrote the code this way, then it was probably meant to be branchless.
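
As an example of the second option, a likelihood hint for our loop might look like this (a GCC/Clang-specific sketch; the intrinsic takes the expression, its expected value, and the probability of that value):

```c++
// Tell the compiler the condition holds ~50% of the time,
// nudging it towards a branchless (cmov) lowering
long sum_below_50(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        if (__builtin_expect_with_probability(a[i] < 50, 1, 0.5))
            s += a[i];
    return s;
}
```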
The "right way" is to use branching hints, but unfortunately, the support for them is lacking. Right now [these hints seem to be lost](https://bugs.llvm.org/show_bug.cgi?id=40027) by the time the compiler back-end decides whether a `cmov` is more beneficial. Currently, there is no good way of forcing the compiler to generate branch-free code, so sometimes the best hope is to just write a small snippet in assembly.
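
As a sketch of that last resort, a `cmov`-based minimum can be forced with a couple of lines of inline assembly (x86-64 with GCC/Clang extended-asm syntax; other platforms fall back to the ternary):

```c++
int select_min(int a, int b) {
#if defined(__x86_64__)
    // compare a with b and conditionally move b into a: a = (a > b ? b : a)
    asm("cmp %1, %0\n\t"
        "cmovg %1, %0"
        : "+r"(a)
        : "r"(b)
        : "cc");
    return a;
#else
    return (a < b ? a : b);
#endif
}
```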
<!--
Because this is very architecture-specific.
in the absence of branch likeliness hints
While any program that uses a ternary operator is equivalent to a program that uses an `if` statement
The codes seem equivalent. My guess is that the compiler doesn't know that `s + a[i]` does not cause integer overflow.
This is a legal optimization, but I guess an implicit contract has evolved between application programmers and compiler engineers that if you write a ternary operator, then you kind of telling that it is likely going to be an unpredictable branch.
The general technique is called *branchless* or *branch-free* programming. Predication is the main tool of it, but there are more complicated ways.
-->
<!--
Let's do a few more examples as an exercise.

```c++
int max(int a, int b) {
    return (a > b) * a + (a <= b) * b;
}
```

```c++
int max(int a, int b) {
    return (a > b ? a : b);
}
```

```c++
int abs(int a, int b) {
    int diff = a - b;
    return max(diff, -diff);
}
```

```c++
int abs(int a, int b) {
    int diff = a - b;
    return (diff < 0 ? -diff : diff);
}
```

```c++
int abs(int a) {
    return (a > 0 ? a : -a);
}
```

```c++
int abs(int a) {
    int mask = a >> 31;
    a ^= mask;
    a -= mask;
    return a;
}
```
-->
### Larger Examples
**Strings.** Oversimplifying things, an `std::string` is composed of a pointer to a null-terminated char array (also known as a "C-string") allocated somewhere on the heap and one integer containing the string size.
A very common value for strings is the empty string, which is also their default value. It needs to be handled somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check if the pointer is null or if the size is zero at the beginning of every procedure involving strings.
However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
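
A minimal sketch of this idea (a toy struct of our own, not the actual `std::string` layout):

```c++
#include <cstddef>

// one shared zero byte that all empty strings point to
static const char zero_cstring[1] = {'\0'};

struct TinyString {
    const char *data;
    size_t size;
    // an empty string points at the shared zero byte instead of nullptr,
    // so no null check is needed before dereferencing the pointer
    TinyString() : data(zero_cstring), size(0) {}
    char first_char() const { return data[0]; } // safe even when empty
};
```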
**Binary search.** The standard binary search [can be implemented](/hpc/data-structures/binary-search) without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy `std::lower_bound`:

```c++
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        base = (base[half] < x ? &base[half] : base);
        len -= half;
    }
    return *(base + (*base < x));
}
```
Other than being more complex, it has another slight drawback in that it potentially does more comparisons (constant $\lceil \log_2 n \rceil$ instead of either $\lfloor \log_2 n \rfloor$ or $\lceil \log_2 n \rceil$) and can't speculate on future memory reads (which acts as prefetching, so it loses on very large arrays).
In general, data structures are made branchless by implicitly or explicitly *padding* them, so that their operations take a constant number of iterations. Refer to [the article](/hpc/data-structures/binary-search) for more complex examples.
<!--
The only downside of the branchless implementation is that it potentially does more memory reads:
There are typically two ways to achieve this:
And in general, data structures can be "padded" to be made constant size or height.
That there are no substantial reasons why compilers can't do this on their own, but unfortunately this is just how it is right now.
-->
**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications, including GPU programming, because they don't have branching in the first place.
In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/autovectorization) the loop:

```c++
/* volatile */ int s = 0;

for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];
```

It now works in ~0.3 cycles per element, which is mainly [bottlenecked by the memory](/hpc/cpu-cache/bandwidth).
The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/permutation).
<!--
**Binary exponentiation.** However, when it is constant
When we can iterate in small batches, [autovectorization](/hpc/simd/autovectorization) speeds it up 13x.