In the [previous article](../s-tree), we designed *static* B-trees (S-trees and S+ trees) to speed up binary search in sorted arrays. In its last section, we [briefly discussed](../s-tree/#as-a-dynamic-tree) how to make them *dynamic* while retaining the performance gains from [SIMD](/hpc/simd), and made a proof of concept by simply adding child pointers to the S+ tree.

In this article, we follow up on that promise and design a minimally functional search tree for integer keys, which we call *B− tree*. It achieves [significant improvements](#evaluation) over `std::set`: 7-18 times faster for searches on large arrays and 3-8 times faster for inserts. Compared to [absl::btree](https://abseil.io/blog/20190812-btree), it is 3-7 times faster for searches and 1.5-2 times faster for inserts, with yet ample room for improvement.

The memory overhead of the structure is around 30%. The [final implementation](https://github.com/sslotin/amh-code/blob/main/b-tree/btree-final.cc) is around 150 lines of C.

We give more details in the evaluation section.

## B− Tree

Instead of making small incremental changes, we will design just one data structure in this article. It is based on the [B+ tree](../s-tree/#b-tree-layout-1), with a few minor differences:

- We do not store any pointers except for the children (while a B+ tree also stores a pointer to the next leaf node).
- We define key $i$ to be the *maximum* key in the subtree of child $i$ instead of the *minimum* key in the subtree of child $(i + 1)$. This lets us compute the child number with the same rank procedure that we use on the leaf nodes.
- We use a small node size, $B = 32$. This is needed for the SIMD procedures to be efficient (we will discuss other node sizes later).

There is some overhead per node, so it makes sense to use more than one cache line for it.

Analogous to the B+ tree, we call this modification *B− tree* ("B minus tree").

### Layout

To simplify memory management, we rely on arena allocation: all nodes are stored in one large preallocated array.

```c++
const int B = 32; // node size

const int R = 1e8; // reserve
alignas(64) int tree[R];

int root = 0;   // where the tree root starts
int n_tree = B; // the first unallocated cell
int H = 1;      // tree height
```

To further simplify the implementation, we initially fill the whole array with infinities:

```c++
for (int i = 0; i < R; i++)
    tree[i] = INT_MAX;
```

We can afford to do this: it does not affect query performance, and memory allocation and initialization are not the bottleneck.

To save precious cache space, we use [indices instead of pointers](/hpc/cpu-cache/pointers/).
Even though they end up in separate cache lines, it still [makes sense](/hpc/cpu-cache/aos-soa/) to store the child indices close to the keys.

This way, leaf nodes occupy 2 cache lines and waste 1 slot, while internal nodes occupy 4 cache lines and waste 2+1=3 slots.

To "allocate" a new node, we simply increase `n_tree` by $B$ if it is a data node or by $2 \cdot B$ if it is an internal node.
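
In code, allocation is just a bump of the arena counter. A minimal sketch (the helper name is ours, not from the final implementation; the globals are repeated so the snippet is self-contained):

```cpp
const int B = 32; // node size, as above
int n_tree = B;   // the arena starts with one empty leaf node (the root)

// "allocate" a node by bumping the arena counter and returning the node's start;
// in the real structure, the cells are already pre-filled with infinities
int alloc_node(bool internal) {
    int k = n_tree;
    n_tree += (internal ? 2 * B : B); // internal nodes also store child indices
    return k;
}
```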

### Searching

We used a permuted key layout when we optimized [S-trees](../s-tree/#optimization). Storing keys in permuted order would make inserts much harder, so here we change the approach.

We now use `popcnt` instead of `tzcnt`: the child index $i$ is equal to the number of keys less than $x$, so we can compare $x$ against all keys, combine the vector masks any way we want, call `movemask`, and then calculate the number of set bits with `popcnt`. This removes the need to store the keys in any particular order, which lets us skip the permutation step and also use the same procedure on the last layer.

```c++
typedef __m256i reg;

// returns the number of keys in the node that are less than x
// (one straightforward way to implement it)
unsigned rank32(reg x, int *node) {
    unsigned mask = 0;
    for (int i = 0; i < B; i += 8) {
        reg y = _mm256_load_si256((reg*) &node[i]);
        reg m = _mm256_cmpgt_epi32(x, y); // per-lane: x > key?
        mask |= _mm256_movemask_ps(_mm256_castsi256_ps(m)) << i;
    }
    return __builtin_popcount(mask);
}
```

This is also the reason why the "key area" of the nodes should not be contaminated: it must contain only real keys padded with infinities, or the garbage values would have to be masked out.

To implement `lower_bound`, we just use the same procedure, but fetch the pointer after we computed the child number:
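
A sketch of how the whole procedure could look. To keep the snippet self-contained and portable, a scalar `rank` function stands in for the vectorized `rank32`, and the arena is tiny:

```cpp
#include <climits>

const int B = 32;              // node size, as above
alignas(64) int tree[1 << 12]; // a small arena for illustration
int root = 0, H = 1;

// scalar stand-in for rank32: the number of keys in the node less than x
unsigned rank(int x, int *node) {
    unsigned i = 0;
    for (int j = 0; j < B; j++)
        i += (node[j] < x);
    return i;
}

int lower_bound(int x) {
    unsigned k = root;
    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank(x, &tree[k]);
        k = tree[k + B + i]; // child indices are stored right after the keys
    }
    unsigned i = rank(x, &tree[k]);
    return tree[k + i]; // INT_MAX if there is no key >= x
}
```

The same `rank` call works on internal and leaf nodes precisely because keys are defined as subtree maxima and empty slots are padded with infinities.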

Insertion is more complex. Here is the part that performs the insertion itself and handles node splits; the first half of the procedure, which descends to the leaf node `k` while recording the path in the `sk`/`si` stacks and computes the insertion position `i`, is omitted here:

```c++
void insert(int _x) {
    // ...
    // (the descent is omitted; v is the key that moves up to the parent
    //  and p is the index of the newly created node)

    // we can compute this check ahead of the insertion itself
    bool filled = (tree[k + B - 2] != INT_MAX);

    insert(tree + k, i, _x);

    if (filled) {
        // create a new leaf node
        move(tree + k, tree + n_tree);
        // ...
        n_tree += B;

        for (int h = H - 2; h >= 0; h--) {
            // for each parent node, we repeat this process until we reach
            // the root or determine that the node is not split
            k = sk[h], i = si[h];

            filled = (tree[k + B - 3] != INT_MAX);

            // the node already has a correct key (right one)
            // and a correct pointer (left one)
            insert(tree + k,     i,     v);
            insert(tree + k + B, i + 1, p);

            if (!filled)
                return; // we're done

            // create a new internal node
            move(tree + k, tree + n_tree); // move keys
            // ...
            n_tree += 2 * B;
        }

        // if we've reached here, this means we've reached the root,
        // and it was split
        tree[n_tree] = v;

        tree[n_tree + B] = root;
        tree[n_tree + B + 1] = p;

        root = n_tree;
        n_tree += 2 * B;
        H++;
    }
}
```
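
The procedure above relies on two small node-level helpers: `insert(node, i, x)`, which puts a key into a non-full node at position `i`, and `move(from, to)`, which moves the upper half of a full node into a fresh one. Their bodies are not shown above; here are scalar sketches of how they could look (they are also good candidates for vectorization):

```cpp
#include <climits>

const int B = 32; // node size, as above

// insert x at position i, shifting the rest of the node to the right
// (the node must have a free slot at the end)
void insert(int *node, int i, int x) {
    for (int j = B - 1; j > i; j--)
        node[j] = node[j - 1];
    node[i] = x;
}

// move the upper half of a full node into a new, infinity-filled one
void move(int *from, int *to) {
    for (int j = 0; j < B / 2; j++) {
        to[j] = from[j + B / 2];
        from[j + B / 2] = INT_MAX;
    }
}
```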

There are many inefficiencies here, but luckily the splitting procedures are rarely called, so they hardly affect overall performance.

## Evaluation

For the benchmarks, we only implement `insert` and `lower_bound`; deletions, iteration, and other operations are not our concern for now.

Of course, this comparison is not fair, as implementing a dynamic search tree is a more high-dimensional problem.

Technically, we use `std::multiset` and `absl::btree_multiset` to support repeated keys.

The keys are uniformly distributed, but we should not rely on that fact.
244
+
245
+
It is common that >90% of operations are lookups. Optimizing searches is important because every other operation starts with locating a key.

We use around 250 different array sizes between $10^4$ and $10^7$, generating a different key set for each size. After constructing the tree, we run $10^6$ queries (independently random each time). All data is generated uniformly in the range $[0, 2^{30})$ and independently between stages.

Each size is chosen between consecutive powers $1.17^k$ and $1.17^{k+1}$, so the sizes are spread roughly evenly on a logarithmic scale.

It may or may not be representative of your use case.

As predicted, the performance is much better:

![](../img/search-set-relative.svg)

When the data set is small, the latency increases in discrete steps: 3.5ns for under 32 elements, then 6.5ns, then 12ns, until it hits the L2 cache (not shown on the graphs) and the latency starts increasing more smoothly, yet still with noticeable spikes when the tree grows in height.

![](../img/btree-absolute.svg)

I apologize to everyone else, but this is sort of your fault for not using a public benchmark.

![](../img/btree-relative.svg)

I don't know (yet) why insertions are *that* slow. My guess is that it has something to do with data dependencies between queries.

### Possible Optimizations

The maximum tree height observed in these benchmarks was 6.

Since the height is small and changes rarely, we can specialize the procedures for it at compile time. I tried it, but couldn't get the compiler to generate optimal code on its own, so we do the dispatch manually.

The idiomatic C++ way is to use virtual functions, but we will be explicit:

```c++
void (*insert_ptr)(int);
int (*lower_bound_ptr)(int);

void insert(int x) {
    insert_ptr(x);
}

int lower_bound(int x) {
    return lower_bound_ptr(x);
}
```

```c++
template <int H>
int lower_bound_impl(int _x) {
    // ...
}

template <int H>
void insert_impl(int _x) {
    // ...
    if (/* tree grows */) {
        // ...
        insert_ptr = &insert_impl<H + 1>;
        lower_bound_ptr = &lower_bound_impl<H + 1>;
    }
}

template <>
void insert_impl<10>(int x) {
    std::cerr << "This depth was not supposed to be reached" << std::endl;
    exit(1);
}
```

Initially, the pointers are set to the implementations for a tree of height one:

```c++
insert_ptr = &insert_impl<1>;
lower_bound_ptr = &lower_bound_impl<1>;
```

This way, the recursion over the tree levels is effectively unrolled at compile time.

### Other Operations

For iteration, going to the parent and fetching all $B$ pointers at a time is faster, as it negates the effects of [pointer chasing](/hpc/cpu-cache/latency/).

To support this, each node would store a pointer to either its parent or the next node.

Alternatively, the iterator itself can maintain a stack of ancestors.

Nodes are at least ½ full (because they are created ½ full), except for the root, and, on average, ¾ full assuming random inserts.

Note that we can't store junk in the key slots: deleted keys have to be replaced with infinities.

We could also adopt the B*-tree splitting scheme, where an overflowing node first tries to share keys with a sibling, and two full nodes are split into three.

For deletions: if the node is at least half-full, we're done. Otherwise, we try to borrow keys from the siblings (no expensive two-pointer merging is necessary: we can just append them to the end or beginning and update the corresponding key of the parent).

If that fails, we can merge the two nodes together, and iteratively delete the key in the parent.
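
For illustration, merging a node into its left sibling could look like this (a scalar sketch with a hypothetical name, not taken from the implementation):

```cpp
#include <climits>

const int B = 32; // node size, as above

// append the keys of "right" to its underfull left sibling
// (together they hold at most B - 1 keys, so everything fits)
void merge_leaves(int *left, int *right) {
    int j = 0;
    while (left[j] != INT_MAX)
        j++; // find the end of the left node's keys
    for (int t = 0; right[t] != INT_MAX; t++) {
        left[j++] = right[t]; // append the key
        right[t] = INT_MAX;   // and clear the slot
    }
    // the parent should then remove the key and the pointer of "right"
}
```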

One interesting use case is the *rope*, also known as *cord*, which wraps a string in a tree to support mass operations, such as editing a very large text file.

Another alternative is the [skip list](https://en.wikipedia.org/wiki/Skip_list). There have been [some attempts to vectorize it](https://doublequan.github.io/), and it may achieve higher total throughput in a concurrent setting, but I have low hope that its sequential performance can be improved to this level.

## Acknowledgements

Thanks to [Danila Kutenin](https://danlark.org/) from Google for meaningful discussions of applicability and possible replacement in Abseil.