Commit 12bd964

b-tree draft
1 parent 34b02ef commit 12bd964

4 files changed, +1215 -1047 lines

content/english/hpc/data-structures/b-tree.md

Lines changed: 180 additions & 63 deletions
weight: 3
draft: true
---

In the [previous article](../s-tree), we designed *static* B-trees (S-trees and S+ trees) to speed up binary search in sorted arrays. In its last section, we [briefly discussed](../s-tree/#as-a-dynamic-tree) how to turn them *dynamic* while retaining the performance gains from [SIMD](/hpc/simd), and made a proof of concept by simply adding child pointers to the S+ tree.

In this article, we follow up on that idea and design a minimally functional search tree for integer keys, called the *B− tree*, that achieves significant improvements over `std::set`: as we will see in the [evaluation](#evaluation), it is 7-18 times faster for searches on large arrays and 3-8 times faster for inserts. Compared to [absl::btree](https://abseil.io/blog/20190812-btree), it is 3-7 times faster for searches and 1.5-2 times faster for inserts, with ample room for improvement left.

The memory overhead of the structure is around 30%, and the [final implementation](https://github.com/sslotin/amh-code/blob/main/b-tree/btree-final.cc) is around 150 lines of C.

We give more details in the [evaluation](#evaluation) section.

## B− Tree

Instead of making small incremental changes, we will design just one data structure in this article. It is based on the [B+ tree](../s-tree/#b-tree-layout-1), with a few minor differences:

- We do not store any pointers except for the children (a B+ tree also stores a pointer to the next leaf node in each leaf).
- We define key $i$ to be the *maximum* key in the subtree of child $i$ instead of the *minimum* key in the subtree of child $(i + 1)$. This way, a node with $c$ children stores exactly $c$ keys, removing the need for a special case for the last child.
- We use a small node size, $B = 32$, which is needed for SIMD to be efficient (we will discuss other node sizes later).

There is some per-node overhead, so it makes sense to use more than one cache line per node.

Analogous to the B+ tree, we call this modification the *B− tree*.

### Layout

To simplify memory management, we rely on arena allocation: all nodes live in one large statically allocated array.

```c++
const int B = 32; // node size

const int R = 1e8; // how much memory to reserve
alignas(64) int tree[R];

int root = 0;   // where the tree root starts
int n_tree = B; // where the allocated memory ends
int H = 1;      // tree height
```

To further simplify the implementation, we initialize all array cells with infinities:

```c++
for (int i = 0; i < R; i++)
    tree[i] = INT_MAX;
```
6448

49+
We can do this — this does not affect performance. Memory allocation and initialization is not the bottleneck.
50+
51+
To save precious cache space, we use [indices instead of pointers](/hpc/cpu-cache/pointers/).
52+
Despite that they are in separate cache lines, it still [makes sense](/hpc/cpu-cache/aos-soa/) to store them close to keys.
53+
54+
This way, leaf nodes occupy 2 cache lines and waste 1 slot, while internal nodes occupy 4 cache lines and waste 2+1=3 slots.
55+
6556
To "allocate" a new node, we simply increase `n_tree` by $B$ if it is a data node or by $2 \cdot B$ if it is an internal node.

### Searching

When we implemented [S-trees](../s-tree/#optimization), we stored the keys of each node in permuted order. Storing values in permuted order would make inserts much harder, so this time we change the approach and use `popcnt` instead of `tzcnt`: the index $i$ of the child to descend into is equal to the number of keys less than $x$, so we can compare $x$ against all keys, combine the vector masks any way we want, extract a bitmask with `movemask`, and then calculate the number of set bits with `popcnt`. This removes the need to store the keys in any particular order, which lets us skip the permutation step and also use the same procedure on the last (leaf) level.
```c++
7065
typedef __m256i reg;
7166

@@ -89,9 +84,12 @@ unsigned rank32(reg x, int *node) {
8984
}
9085
```
This is also the reason why the "key area" of a node must not be contaminated: it should only store real keys padded with infinities, or the garbage must be masked out.

To implement `lower_bound`, we use the same procedure, but fetch the pointer after we compute the child number:

```c++
int lower_bound(int _x) {
    unsigned k = root;
    reg x = _mm256_set1_epi32(_x);
    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank32(x, &tree[k]);
        k = tree[k + B + i];
    }
    unsigned i = rank32(x, &tree[k]);
    return tree[k + i];
}
```

Implementing `lower_bound` is easy, and it doesn't introduce much overhead. The hard part is implementing insertion.

### Insertion

Insertion requires a lot of logic, but the good news is that most of it rarely has to execute.

Most of the time, all we need to do is reach a leaf node and insert a key into it, moving some of the other keys one position to the right to make room.

Occasionally, we also need to split the node and/or update some of its ancestors, but this is relatively rare, so let's focus on the common path first.

To insert a key into a node efficiently, we vectorize the shifting of keys: the `Precalc` structure precomputes the masks needed to shift a suffix of a node one position to the right:

```c++
struct Precalc {
    // ... (precomputed shift masks)
};

void insert(int *node, int i, int x) {
    // ... (shift the keys in [i, B - 1) one position to the right)
    node[i] = x;
}
```

Next, let's deal with node splitting. To split a node, we need to move its upper half into a newly allocated node, so let's write another primitive:

```c++
// move the second half of a node into another node and fill it with infinities
void move(int *from, int *to) {
    const reg infs = _mm256_set1_epi32(INT_MAX);
    for (int i = 0; i < B / 2; i += 8) {
        reg t = _mm256_load_si256((reg*) &from[B / 2 + i]);
        _mm256_store_si256((reg*) &to[i], t);
        _mm256_store_si256((reg*) &from[B / 2 + i], infs);
    }
}
```

Now we need to (very carefully) combine these primitives into the full insertion procedure:

```c++
void insert(int _x) {
    // we save the path we visited in case we need to update some of our ancestors
    unsigned sk[10], si[10];

    unsigned k = root;
    reg x = _mm256_set1_epi32(_x);

    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank32(x, &tree[k]);

        // check if we need to update the key right away
        tree[k + i] = (_x > tree[k + i] ? _x : tree[k + i]);
        sk[h] = k, si[h] = i; // and save the path

        k = tree[k + B + i];
    }

    unsigned i = rank32(x, &tree[k]);

    // we can start computing this check ahead of the insertion
    bool filled = (tree[k + B - 2] != INT_MAX);

    insert(tree + k, i, _x);

    if (filled) {
        // create a new leaf node
        move(tree + k, tree + n_tree);
        // ... (compute the separator key v and the pointer p to the new node)
        n_tree += B;

        for (int h = H - 2; h >= 0; h--) {
            // for each parent node, we repeat this process until
            // we reach the root or determine that the node is not split
            k = sk[h], i = si[h];

            filled = (tree[k + B - 3] != INT_MAX);

            // the node already has a correct key (the right one)
            // and a correct pointer (the left one)
            insert(tree + k, i, v);
            insert(tree + k + B, i + 1, p);

            if (!filled)
                return; // we're done

            // create a new internal node
            move(tree + k, tree + n_tree); // move the keys
            // ... (move the pointers and update v and p)
            n_tree += 2 * B;
        }

        // if we've reached here, the root was split, and the tree grows
        tree[n_tree] = v;

        tree[n_tree + B] = root;
        tree[n_tree + B + 1] = p;

        root = n_tree;
        n_tree += 2 * B;
        H++;
    }
}
```

There are many inefficiencies here, but luckily the branches containing them are rarely taken, so they barely affect overall performance.

## Evaluation

We only need to implement `insert` and `lower_bound`; deletions, iteration, and other operations are not our concern for now. We compare against `std::set` and `absl::btree_set` (technically, against `std::multiset` and `absl::btree_multiset`, to support repeated keys). Of course, this comparison is not fair, as implementing a full dynamic search tree is a much more high-dimensional problem.

For each run, we first build a tree of a given size and then execute $10^6$ `lower_bound` queries, with a different random key set each time. We measure around 250 sizes between $10^4$ and $10^7$, with the $k$-th size sampled between $1.17^k$ and $1.17^{k+1}$. All data is generated uniformly in the range $[0, 2^{30})$ and independently between the stages. The keys are uniform, but we deliberately do not rely on that fact.

This lookup-heavy workload is representative: it is common that >90% of operations are lookups, and optimizing searches is important anyway because every other operation starts with locating a key. Still, it may or may not be representative of your use case.

As predicted, the performance is much better:

![](../img/btree-absolute.svg)

When the data set is small, the latency increases in discrete steps: 3.5ns for under 32 elements, then 6.5ns, then 12ns, until the tree stops fitting in the L2 cache (not shown on the graphs) and the latency starts increasing more smoothly, yet still with noticeable spikes whenever the tree grows in height.

![](../img/btree-relative.svg)

I apologize to everyone else, but this is sort of your fault for not using a public benchmark.

![](../img/btree-absl.svg)

I don't know (yet) why insertions are *that* slow. My guess is that it has something to do with data dependencies between queries.

### Possible Optimizations

The maximum tree height in this benchmark was 6, so one promising optimization is to specialize the procedures for each possible height and remove the loop over $h$. Compiling a separate version per height sounds like a job for templates; I tried it, but couldn't get the compiler to generate optimal code directly, so instead we dispatch between height-specialized versions manually. The idiomatic C++ way would be to use virtual functions, but we will be explicit and use raw function pointers:

```c++
void (*insert_ptr)(int);
int (*lower_bound_ptr)(int);

void insert(int x) {
    insert_ptr(x);
}

int lower_bound(int x) {
    return lower_bound_ptr(x);
}
```

The implementations themselves are templated by the tree height; whenever the tree grows, `insert` switches both pointers to the instantiations for the next height:

```c++
template <int H>
int lower_bound_impl(int _x) {
    // ...
}

template <int H>
void insert_impl(int _x) {
    // ...
    if (/* the tree grows */) {
        // ...
        insert_ptr = &insert_impl<H + 1>;
        lower_bound_ptr = &lower_bound_impl<H + 1>;
    }
}

template <>
void insert_impl<10>(int x) {
    std::cerr << "This depth was not supposed to be reached" << std::endl;
    exit(1);
}
```

Initially, the pointers are set to the versions for a tree of height one:

```c++
insert_ptr = &insert_impl<1>;
lower_bound_ptr = &lower_bound_impl<1>;
```

This effectively unrolls the recursion over the tree height.

### Other Operations

To support iteration, we need a way to get from one leaf node to the next. Instead of storing in each node a pointer to its parent or to the next leaf, we can maintain a stack of ancestors during traversal: going to the father and fetching $B$ pointers at a time is faster, as it negates [pointer chasing](/hpc/cpu-cache/latency/).

Nodes are always at least ½ full (because they are created ½ full), except for the root, and they are ¾ full on average, assuming random inserts. A B*-style split, which spreads keys between two sibling nodes instead of immediately allocating a new one, could improve the occupancy, but note that we can't store junk in the key area: it must remain padded with infinities.

Deletions can be implemented similarly. After removing a key from a leaf, if the node is still at least half-full, we're done. Otherwise, we try to borrow keys from a sibling (no expensive two-pointer merging is necessary: we can just append them to the end/beginning and swap the key of the parent). If that fails, we can merge the two nodes together and iteratively delete the key in the parent.

One interesting use case is the *rope*, also known as a *cord*, which wraps a string in a tree to support mass operations: for example, editing a very large text file. This is a topic of its own.


Another alternative is the [skip list](https://en.wikipedia.org/wiki/Skip_list), for which there have been [some attempts at vectorization](https://doublequan.github.io/). It may achieve higher total throughput in concurrent settings, but I have low hopes that its sequential performance can be substantially improved.

## Acknowledgements

Thanks to [Danila Kutenin](https://danlark.org/) from Google for meaningful discussions of the applicability of this approach and of possibly replacing the B-tree implementation in Abseil.
