content/english/hpc/simd/reduction.md
You can use this approach for other reductions, such as finding the minimum or the xor-sum of an array.
### Instruction-Level Parallelism
Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, [we have to wait](/hpc/pipelining/throughput) one cycle between the loop iterations for a vector addition to complete, while the [throughput](/hpc/pipelining/tables/) of the corresponding instruction is 2 on this microarchitecture.
If we again divide the array into $B \geq 2$ parts and use a *separate* accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold:
```c++
const int B = 2; // how many vector accumulators to use

int sum_simd(v8si *a, int n) {
    v8si b[B] = {0};

    for (int i = 0; i + (B - 1) < n / 8; i += B)
        for (int j = 0; j < B; j++)
            b[j] += a[i + j];

    // sum all vector accumulators into one
    for (int i = 1; i < B; i++)
        b[0] += b[i];

    int s = 0;

    // sum 8 scalar accumulators into one
    for (int i = 0; i < 8; i++)
        s += b[0][i];

    // add the remainder of a, reinterpreting the vector array as scalars
    for (int i = n / (8 * B) * (8 * B); i < n; i++)
        s += ((int*) a)[i];

    return s;
}
```
If you have more than 2 relevant execution ports, you can increase the `B` constant accordingly, but the $n$-fold performance increase will only apply to arrays that fit into L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger.
### Horizontal Summation
The part where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum is called "horizontal summation."
Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register.
Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway.
There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.