
Commit a0707a4 (parent 1305a25)

reorganize simd reduction

1 file changed: 35 additions, 29 deletions

content/english/hpc/simd/reduction.md

You can use this approach for other reductions, such as finding the minimum or the xor-sum of an array.
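For instance, a minimum reduction follows the same pattern. Here is a sketch using only GCC vector extensions (the `min_simd` name is ours; it assumes the same `v8si` typedef as before and, for brevity, that `n` is a multiple of 8):

```c++
#include <climits>

typedef int v8si __attribute__ (( vector_size(32) ));

// hypothetical min-reduction; assumes n is a multiple of 8
int min_simd(v8si *a, int n) {
    v8si m = {INT_MAX, INT_MAX, INT_MAX, INT_MAX,
              INT_MAX, INT_MAX, INT_MAX, INT_MAX};

    for (int i = 0; i < n / 8; i++) {
        v8si lt = (a[i] < m);        // lanewise mask: -1 where a[i] is smaller
        m = (lt & a[i]) | (~lt & m); // keep the smaller element in each lane
    }

    int r = INT_MAX; // horizontal min of the 8 lanes
    for (int i = 0; i < 8; i++)
        r = (m[i] < r ? m[i] : r);

    return r;
}
```

The lanewise compare-and-blend is what `_mm256_min_epi32` does in a single instruction, but writing it with vector extensions keeps the snippet target-independent.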

### Instruction-Level Parallelism

Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, [we have to wait](/hpc/pipelining/throughput) one cycle between the loop iterations for a vector addition to complete, while the [throughput](/hpc/pipelining/tables/) of the corresponding instruction is 2 on this microarchitecture.

If we again divide the array into $B \geq 2$ parts and use a *separate* accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold:
```c++
const int B = 2; // how many vector accumulators to use

int sum_simd(v8si *a, int n) {
    v8si b[B] = {0};

    for (int i = 0; i + (B - 1) < n / 8; i += B)
        for (int j = 0; j < B; j++)
            b[j] += a[i + j];

    // sum all vector accumulators into one
    for (int i = 1; i < B; i++)
        b[0] += b[i];

    int s = 0;

    // sum 8 scalar accumulators into one
    for (int i = 0; i < 8; i++)
        s += b[0][i];

    // add the remainder of a
    for (int i = n / (8 * B) * (8 * B); i < n; i++)
        s += ((int*) a)[i]; // view the vectors as plain ints for the tail

    return s;
}
```
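As a sanity check, the function can be compared against a naive scalar sum on a length that is not a multiple of $8B$, so that the tail path is exercised (the `v8si` typedef and the function body are restated here so the snippet compiles on its own):

```c++
typedef int v8si __attribute__ (( vector_size(32) ));

const int B = 2; // how many vector accumulators to use

int sum_simd(v8si *a, int n) {
    v8si b[B] = {0};

    for (int i = 0; i + (B - 1) < n / 8; i += B)
        for (int j = 0; j < B; j++)
            b[j] += a[i + j];

    for (int i = 1; i < B; i++) // sum all vector accumulators into one
        b[0] += b[i];

    int s = 0;

    for (int i = 0; i < 8; i++) // sum 8 scalar accumulators into one
        s += b[0][i];

    for (int i = n / (8 * B) * (8 * B); i < n; i++) // scalar tail
        s += ((int*) a)[i];

    return s;
}

int sum_naive(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

With `n = 21` and `B = 2`, the vectorized loop covers the first 16 elements and the scalar loop handles the remaining 5.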
If you have more than 2 relevant execution ports, you can increase the `B` constant accordingly, but the $B$-fold performance increase will only apply to arrays that fit into the L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger.

### Horizontal Summation
The part where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum is called "horizontal summation."

Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register.

![Horizontal summation in SSE/AVX. Note how the output is stored: the (a b a b) interleaving is common for reducing operations](../img/hsum.png)

Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway:
```c++
int hsum(__m256i x) {
    __m128i l = _mm256_extracti128_si256(x, 0); // lower 128 bits (4 ints)
    __m128i h = _mm256_extracti128_si256(x, 1); // upper 128 bits (4 ints)
    l = _mm_add_epi32(l, h);  // 8 partial sums -> 4
    l = _mm_hadd_epi32(l, l); // 4 partial sums -> 2
    return _mm_extract_epi32(l, 0) + _mm_extract_epi32(l, 1);
}
```
There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).

There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.
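A minimal illustration of how its packed result is unpacked (the `minpos` wrapper name is ours; the instruction is part of SSE4.1, hence the target attribute):

```c++
#include <immintrin.h>
#include <utility>

// returns {minimum, its index} among eight 16-bit unsigned integers
__attribute__((target("sse4.1")))
std::pair<int, int> minpos(const unsigned short *a) {
    __m128i x = _mm_loadu_si128((const __m128i*) a);
    __m128i r = _mm_minpos_epu16(x);
    int mn  = _mm_extract_epi16(r, 0); // bits 0..15 hold the minimum
    int idx = _mm_extract_epi16(r, 1); // bits 16..18 hold its index
    return {mn, idx};
}
```

For the input `{5, 3, 9, 7, 2, 8, 6, 4}` it reports the minimum 2 at index 4.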
