Commit 0d811cc

add s-tree rank example
1 parent da216d6 commit 0d811cc

1 file changed

+15
-3
lines changed

content/english/hpc/data-structures/s-tree.md

Lines changed: 15 additions & 3 deletions
@@ -102,7 +102,19 @@ int i = __builtin_ffs(mask) - 1;
 // now i is the number of the correct child node
 ```
 
-Unfortunately, the compilers are not smart enough yet to auto-vectorize this code, so we need to manually vectorize it with intrinsics:
+Unfortunately, the compilers are not smart enough to [auto-vectorize](/hpc/simd/auto-vectorization/) this code yet, so we have to optimize it manually. In AVX2, we can load 8 elements, compare them against the search key, producing a [vector mask](/hpc/simd/masking/), and then extract the scalar mask from it with `movemask`. Here is a minimized illustrated example of what we want to do:
+
+```center
+       y = 4        17       65       103
+       x = 42       42       42       42
+   y ≥ x = 00000000 00000000 11111111 11111111
+           ├┬┬┬─────┴────────┴────────┘
+movemask = 0011
+           ┌─┘
+     ffs = 3
+```
+
+Since we are limited to processing 8 elements at a time (half our block / cache line size), we have to split the elements into two groups and then combine the two 8-bit masks. To do this, it will be slightly easier to swap the condition for `x > y` and compute the inverted mask instead:
 
 ```c++
 typedef __m256i reg;
@@ -114,7 +126,7 @@ int cmp(reg x_vec, int* y_ptr) {
 }
 ```
 
-This function works for 8-element vectors, which is half our block / cache line size. To process the entire block, we need to call it twice and then combine the masks:
+Now, to process the entire block, we need to call it twice and combine the masks:
 
 ```c++
 int mask = ~(
@@ -123,7 +135,7 @@ int mask = ~(
 );
 ```
 
-Now, to descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier:
+To descend down the tree, we use `ffs` on that mask to get the correct child number and just call the `go` function we defined earlier:
 
 ```c++
 int i = __builtin_ffs(mask) - 1;

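Because the hunks above show only fragments of the vectorized code, a scalar model may help when checking what the mask arithmetic is meant to compute. The sketch below is not code from this commit. `cmp_scalar` and `rank16` are hypothetical names, the loop stands in for the AVX2 compare followed by `movemask`, the two halves are combined with `|` as an assumption, and `__builtin_ffs` assumes GCC or Clang.

```cpp
// Scalar stand-in for "compare 8 keys, then movemask":
// bit k of the result is set when x > y[k], for k in [0, 8).
// Hypothetical helper; the article's cmp uses AVX2 intrinsics instead.
int cmp_scalar(int x, const int* y) {
    int mask = 0;
    for (int k = 0; k < 8; k++)
        if (x > y[k])
            mask |= 1 << k;
    return mask;
}

// Index of the first key in a sorted 16-element block that is >= x,
// or 16 if every key in the block is smaller than x.
int rank16(int x, const int* block) {
    // Low half fills bits 0..7, high half fills bits 8..15.
    // Inverting turns the "x > y" bits into "y >= x" bits,
    // and also sets bits 16..31, which yields 16 when no key matches.
    int mask = ~(cmp_scalar(x, block) | (cmp_scalar(x, block + 8) << 8));
    // __builtin_ffs returns the 1-based position of the lowest set bit.
    return __builtin_ffs(mask) - 1;
}
```

For a block starting with 4, 17, 65, 103 and `x = 42`, the low-half mask has bits 0 and 1 set, so the inverted mask has its lowest set bit at position 2 and `rank16` returns 2, which corresponds to `ffs = 3` in the diagram.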