Commit de9bb58: "branchless programming"
1 parent 1e14ae6

2 files changed: +1339 −44 lines

content/english/hpc/pipelining/branchless.md (174 additions, 44 deletions)

@@ -3,35 +3,105 @@ title: Branchless Programming
weight: 3
---

As we established in [the previous section](../branching), branches that can't be effectively predicted by the CPU are expensive, as a mispredict may cause a long pipeline stall while new instructions are fetched. In this section, we discuss ways of removing branches in the first place.

### Predication

We are going to continue the same case study we started before: we create an array of random numbers and sum up all of its elements below 50:

```c++
for (int i = 0; i < N; i++)
    a[i] = rand() % 100;

volatile int s = 0;

for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];
```

Our goal is to eliminate the branch caused by the `if` statement. We can try to get rid of it like this:

```c++
for (int i = 0; i < N; i++)
    s += (a[i] < 50) * a[i];
```

Suddenly, the loop now takes ~7 cycles per element instead of the original ~14. Also, the performance remains the same if we change `50` to some other threshold, so it no longer depends on the branch probability.

But wait… shouldn't there still be a branch? How does `(a[i] < 50)` map to assembly?

There are no boolean types in assembly, nor any instructions that yield either one or zero based on the result of a comparison, but we can compute it indirectly like this: `(a[i] - 50) >> 31`. This trick relies on the [binary representation of integers](/hpc/arithmetic/integer), specifically on the fact that if the expression `a[i] - 50` is negative (implying `a[i] < 50`), then its highest bit is set to one, which we can then extract using a (logical) right shift:

```nasm
mov  ebx, eax  ; t = x
sub  ebx, 50   ; t -= 50
shr  ebx, 31   ; t >>= 31 (logical shift, so t is 1 if x < 50 and 0 otherwise)
imul eax, ebx  ; x *= t
```
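
For illustration, here is a rough plain-C++ equivalent of this trick (our own sketch, not the compiler's output); the difference is reinterpreted as unsigned so that the shift yields exactly one or zero:

```c++
for (int i = 0; i < N; i++) {
    int t = unsigned(a[i] - 50) >> 31; // sign bit of (a[i] - 50): 1 if a[i] < 50, else 0
    s += t * a[i];                     // adds either a[i] or 0, without a branch
}
```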

But the compiler actually produced something different. Instead of going with this arithmetic trick, it used a special `cmov` ("conditional move") instruction that assigns a value based on a condition (which is computed and checked using the flags register, the same way as for jumps):

```nasm
mov     ebx, 0   ; cmov doesn't support immediate values, so we need a zero register
cmp     eax, 50
cmovge  eax, ebx ; eax = (eax >= 50 ? ebx=0 : eax)
```

So the code above is actually closer to using a ternary operator like this:

```c++
for (int i = 0; i < N; i++)
    s += (a[i] < 50 ? a[i] : 0);
```

Both variants are optimized by the compiler and produce the following assembly:

```nasm
    mov     eax, 0
    mov     rdx, -4000000
loop:
    mov     esi, dword ptr [rdx + a + 4000000]  ; load a[i]
    cmp     esi, 50
    cmovge  esi, eax                            ; esi = (esi >= 50 ? eax=0 : esi)
    add     dword ptr [rsp + 12], esi           ; s += esi
    add     rdx, 4
    jnz     loop                                ; "iterate while rdx is not zero"
```

This general technique is called *predication*, and it is roughly equivalent to this algebraic trick:

$$
x = c \cdot a + (1 - c) \cdot b
$$
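
To make the identity concrete, here is a tiny illustrative helper (our own sketch; the name is made up, and `c` is assumed to be either zero or one):

```c++
// Predication as arithmetic: both a and b are always evaluated,
// and the 0/1 condition selects which one contributes to the result.
int select(int c, int a, int b) {
    return c * a + (1 - c) * b;
}
```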

This way you can eliminate branching, but it comes at the cost of evaluating *both* branches and the `cmov` itself. Because evaluating the ">=" branch costs nothing, the performance is exactly equal to [the "always yes" case](../branching/#branch-prediction) in the branchy version.

### When It Is Beneficial

Using predication eliminates [a control hazard](../hazard) but introduces a data hazard. There is still a pipeline stall, but it is a cheaper one: you only need to wait for the `cmov` to be resolved rather than flush the entire pipeline in case of a mispredict.

However, there are many situations when it is more efficient to leave the branchy code as it is: this is the case when the cost of computing *both* branches instead of just *one* outweighs the penalty for the potential branch mispredictions.

In our example, the branchy code wins when the branch can be predicted with a probability of more than ~75%.

![](../img/branchy-vs-branchless.svg)

This 75% threshold is commonly used by compilers as a heuristic for determining whether to use `cmov` or not. Unfortunately, this probability is usually unknown at compile time, so it needs to be provided in one of several ways:

- We can use [profile-guided optimization](/hpc/compilation/pgo), which will decide for itself whether to use predication or not.
- We can use [compiler-specific intrinsics](/hpc/compilation/situational) to hint at the likeliness of branches: `__builtin_expect_with_probability` in GCC and `__builtin_unpredictable` in Clang (see the sketch below).
- We can rewrite branchy code using the ternary operator or various arithmetic tricks, which acts as a sort of implicit contract between programmers and compilers: if the programmer wrote the code this way, then it was probably meant to be branchless.
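
As a sketch of how such hints look in code (assuming a recent GCC or Clang; the 0.5 probability is just an illustration meaning "essentially unpredictable"):

```c++
// GCC 9+: annotate the condition with an explicit probability
for (int i = 0; i < N; i++)
    if (__builtin_expect_with_probability(a[i] < 50, 1, 0.5))
        s += a[i];

// Clang: mark the condition as unpredictable
for (int i = 0; i < N; i++)
    if (__builtin_unpredictable(a[i] < 50))
        s += a[i];
```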

The "right way" is to use branching hints, but unfortunately, the support for them is lacking: right now, [these hints seem to be lost](https://bugs.llvm.org/show_bug.cgi?id=40027) by the time the compiler back-end decides whether a `cmov` is more beneficial. Currently, there is no good way of forcing the compiler to generate branch-free code, so sometimes the best hope is to just write a small snippet in assembly.

<!--

Because this is very architecture-specific.

in the absence of branch likeliness hints

While any program that uses a ternary operator is equivalent to a program that uses an `if` statement

The codes seem equivalent. My guess is that the compiler doesn't know that `s + a[i]` does not cause integer overflow.

@@ -43,55 +113,115 @@ The `cmov` variant doesn't care about probabilities of branches. It only wins if

This is a legal optimization, but I guess an implicit contract has evolved between application programmers and compiler engineers: if you write a ternary operator, you are kind of telling the compiler that it is likely going to be an unpredictable branch.

The general technique is called *branchless* or *branch-free* programming. Predication is its main tool, but there are more complicated techniques as well.

-->


<!--

Let's do a few more examples as an exercise.

```c++
int max(int a, int b) {
    return (a > b) * a + (a <= b) * b;
}
```

```c++
int max(int a, int b) {
    return (a > b ? a : b);
}
```

```c++
int abs(int a, int b) {
    int diff = a - b;
    return max(diff, -diff);
}
```

```c++
int abs(int a, int b) {
    int diff = a - b;
    return (diff < 0 ? -diff : diff);
}
```

```c++
int abs(int a) {
    return (a > 0 ? a : -a);
}
```

```c++
int abs(int a) {
    int mask = a >> 31;
    a ^= mask;
    a -= mask;
    return a;
}
```

-->

### Larger Examples

**Strings.** Oversimplifying things, an `std::string` is composed of a pointer to a null-terminated char array (also known as a "C-string") allocated somewhere on the heap and an integer containing the string size.

A very common value for strings is the empty string, which is also their default value. Empty strings need to be handled somehow, and the idiomatic thing to do is to assign `nullptr` as the pointer and `0` as the string size, and then check whether the pointer is null or the size is zero at the beginning of every procedure involving strings.

However, this requires a separate branch, which is costly unless most strings are empty. What we can do to get rid of it is to allocate a "zero C-string," which is just a zero byte allocated somewhere, and then simply point all empty strings there. Now all string operations with empty strings have to read this useless zero byte, but this is still much cheaper than a branch misprediction.
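
As a sketch of this idea (simplified, with made-up names; real `std::string` implementations are more involved):

```c++
#include <cstddef>

char zero_byte = '\0'; // a shared "zero C-string" that all empty strings point to

struct my_string {
    char   *data = &zero_byte; // never nullptr, so no emptiness check is needed
    size_t  size = 0;
};

// any operation can dereference s.data unconditionally:
char first_char_or_zero(const my_string &s) {
    return *s.data; // reads the shared zero byte for empty strings
}
```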

**Binary search.** The standard binary search [can be implemented](/hpc/data-structures/binary-search) without branches, and on small arrays (that fit into cache) it works ~4x faster than the branchy `std::lower_bound`:

```c++
int lower_bound(int x) {
    int *base = t, len = n;      // t is the (sorted) array, n is its size
    while (len > 1) {
        int half = len / 2;
        base = (base[half] < x ? &base[half] : base); // will likely compile to a cmov
        len -= half;
    }
    return *(base + (*base < x));
}
```

Other than being more complex, it has another slight drawback: it potentially does more comparisons (a constant $\lceil \log_2 n \rceil$ instead of either $\lfloor \log_2 n \rfloor$ or $\lceil \log_2 n \rceil$) and it can't speculate on future memory reads (which act as prefetching, so it loses on very large arrays).

In general, data structures are made branchless by implicitly or explicitly *padding* them, so that their operations take a constant number of iterations. Refer to [the article](/hpc/data-structures/binary-search) for more complex examples.

<!--

The only downside of the branchless implementation is that it potentially does more memory reads:

There are typically two ways to achieve this:

And in general, data structures can be "padded" to be made constant size or height.

There are no substantial reasons why compilers can't do this on their own, but unfortunately this is just how it is right now.

-->

**Data-parallel programming.** Branchless programming is very important for [SIMD](/hpc/simd) applications, including GPU programming, because they don't have branching in the first place.

In our array sum example, if you remove the `volatile` type qualifier from the accumulator, the compiler becomes able to [vectorize](/hpc/simd/autovectorization) the loop:

```c++
/* volatile */ int s = 0;

for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];
```

It now works in ~0.3 cycles per element, which is mainly [bottlenecked by the memory](/hpc/cpu-cache/bandwidth).

The compiler is usually able to vectorize any loop that doesn't have branches or dependencies between the iterations — and some specific deviations from that, such as [reductions](/hpc/simd/reduction) or simple loops that contain just one if-without-else. Vectorization of anything more complex is a very nontrivial problem, which may involve various techniques such as [masking](/hpc/simd/masking) and [in-register permutations](/hpc/simd/permutation).
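
To make the masking idea a bit more tangible, here is a rough sketch of how the same sum could be written by hand with AVX2 intrinsics (assuming AVX2 is available and the array length is a multiple of 8; this is our illustration, not the compiler's actual output):

```c++
#include <immintrin.h>

int sum_below_50(const int *a, int n) {
    __m256i sum   = _mm256_setzero_si256();
    __m256i fifty = _mm256_set1_epi32(50);
    for (int i = 0; i < n; i += 8) {
        __m256i x    = _mm256_loadu_si256((const __m256i*) &a[i]);
        __m256i mask = _mm256_cmpgt_epi32(fifty, x);             // all ones where a[i] < 50
        sum = _mm256_add_epi32(sum, _mm256_and_si256(x, mask));  // add only the masked values
    }
    int buf[8], s = 0;
    _mm256_storeu_si256((__m256i*) buf, sum);
    for (int i = 0; i < 8; i++) // horizontal sum of the 8 lanes
        s += buf[i];
    return s;
}
```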

<!--

**Binary exponentiation.** However, when it is constant

When we can iterate in small batches, [autovectorization](/hpc/simd/autovectorization) speeds it up 13x.

-->
