content/english/hpc/simd/reduction.md
You can use this approach for other reductions, such as finding the minimum or the xor-sum of an array.
### Instruction-Level Parallelism
Our implementation matches what the compiler produces automatically, but it is actually suboptimal: when we use just one accumulator, [we have to wait](/hpc/pipelining/throughput) one cycle between the loop iterations for a vector addition to complete, while the [throughput](/hpc/pipelining/tables/) of the corresponding instruction is 2 on this microarchitecture.
If we again divide the array into $B \geq 2$ parts and use a *separate* accumulator for each, we can saturate the throughput of vector addition and increase the performance twofold:
```c++
const int B = 2; // how many vector accumulators to use

int sum_simd(v8si *a, int n) {
    v8si b[B] = {0};

    for (int i = 0; i + (B - 1) < n / 8; i += B)
        for (int j = 0; j < B; j++)
            b[j] += a[i + j];

    // sum all vector accumulators into one
    for (int i = 1; i < B; i++)
        b[0] += b[i];

    int s = 0;

    // sum 8 scalar accumulators into one
    for (int i = 0; i < 8; i++)
        s += b[0][i];

    // add the remainder of a, reinterpreting the vector array as scalars
    for (int i = n / (8 * B) * (8 * B); i < n; i++)
        s += ((int*) a)[i];

    return s;
}
```
If you have more than 2 relevant execution ports, you can increase the `B` constant accordingly, but the $n$-fold performance increase will only apply to arrays that fit into L1 cache — [memory bandwidth](/hpc/cpu-cache/bandwidth) will be the bottleneck for anything larger.
### Horizontal Summation
The part where we sum up the 8 accumulators stored in a vector register into a single scalar to get the total sum is called "horizontal summation."
Although extracting and adding every scalar one by one only takes a constant number of cycles, it can be computed slightly faster using a [special instruction](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&text=_mm256_hadd_epi32&expand=2941) that adds together pairs of adjacent elements in a register.
Since it is a very specific operation, it can only be done with SIMD intrinsics — although the compiler probably emits roughly the same procedure for the scalar code anyway.
There are [other similar instructions](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX,AVX2&ig_expand=3037,3009,5135,4870,4870,4872,4875,833,879,874,849,848,6715,4845&text=horizontal), e.g., for integer multiplication or calculating absolute differences between adjacent elements (used in image processing).
There is also one specific instruction, `_mm_minpos_epu16`, that calculates the horizontal minimum and its index among eight 16-bit integers. This is the only horizontal reduction that works in one go: all others are computed in multiple steps.