Unsurprisingly, large fractions of computations in [modular arithmetic](../modular) are often spent on calculating the modulo operation, which is as slow as general integer division and typically takes 15-20 cycles, depending on the operand size.
This means that, after we normally multiply two numbers in the Montgomery space, we need to *reduce* the result — multiply it by $r^{-1}$ and take it modulo $n$ — and there is an efficient way to do this particular operation.
### Montgomery reduction
Assume that $r=2^{32}$, the modulo $n$ is 32-bit, and the number $x$ we need to reduce is 64-bit (the product of two 32-bit numbers). Our goal is to calculate $y = x \cdot r^{-1} \bmod n$.
Since $r$ is coprime with $n$, we know that there are two numbers $r^{-1}$ and $n^\prime$ in the $[0, n)$ range such that
$$
r \cdot r^{-1} + n \cdot n^\prime = 1
$$
and both $r^{-1}$ and $n^\prime$ can be computed, e.g., using the [extended Euclidean algorithm](../euclid-extended).
Using this identity, we can express $r \cdot r^{-1}$ as $(1 - n \cdot n^\prime)$ and write $x \cdot r^{-1}$ as
$$
\begin{aligned}
x \cdot r^{-1} &= x \cdot r \cdot r^{-1} / r
\\             &= x \cdot (1 - n \cdot n^{\prime}) / r
\\             &= (x - x \cdot n \cdot n^{\prime}) / r
\\             &\equiv (x - x \cdot n \cdot n^{\prime} + k \cdot r \cdot n) / r &\pmod n &\;\;\text{(for any integer $k$)}
\\             &\equiv (x - (x \cdot n^{\prime} - k \cdot r) \cdot n) / r &\pmod n
\end{aligned}
$$
Now, if we choose $k$ to be $\lfloor x \cdot n^\prime / r \rfloor$ (the upper 64 bits of the $x \cdot n^\prime$ product), the $k \cdot r$ term cancels out the upper bits, and $(x \cdot n^{\prime} - k \cdot r)$ will simply be equal to $x \cdot n^{\prime} \bmod r$ (the lower 32 bits of $x \cdot n^\prime$), implying:
$$
x \cdot r^{-1} \equiv (x - (x \cdot n^{\prime} \bmod r) \cdot n) / r \pmod n
$$
The algorithm itself just evaluates this formula: it performs two multiplications to calculate $q = x \cdot n^{\prime} \bmod r$ and $m = q \cdot n$, then subtracts $m$ from $x$ and right-shifts the result to divide it by $r$.
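To make the formula concrete, here is a small sanity check with toy parameters, which are purely an assumption of this sketch ($r = 2^8$ and $n = 239$ instead of the 32-bit values used later); `toy_inverse` is a hypothetical helper that finds $n^\prime$ by brute force:

```cpp
typedef unsigned int u32;

const u32 R = 256, N = 239; // toy parameters: r = 2^8, n = 239 (odd, so coprime with r)

// find n' = n^{-1} mod r by brute force (fine for an 8-bit toy example;
// a real implementation would use the tricks described in this article)
u32 toy_inverse() {
    for (u32 i = 1; i < R; i++)
        if (N * i % R == 1)
            return i;
    return 0; // unreachable for odd N
}

// evaluate (x - (x * n' mod r) * n) / r; the division is always exact
// because (x - q * n) is divisible by r by construction
u32 toy_reduce(u32 x) {
    u32 q = x * toy_inverse() % R;
    int a = ((int) x - (int) (q * N)) / (int) R;
    return a < 0 ? a + N : a; // bring the result to the [0, n) range
}
```

For every $x < n^2$, `toy_reduce(x)` equals $x \cdot r^{-1} \bmod n$, which can be verified by checking that `toy_reduce(x) * R % N == x % N`.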
The only remaining thing to handle is that the result may not be in the $[0, n)$ range; but since

$$
x < n \cdot n < r \cdot n \implies x / r < n
$$

and

$$
m = q \cdot n < r \cdot n \implies m / r < n
$$

it is guaranteed that

$$
-n < (x - m) / r < n
$$

Therefore, we can simply check if the result is negative and, in that case, add $n$ to it, giving the following algorithm:
```c++
typedef __uint32_t u32;
typedef __uint64_t u64;

const u32 n = 1e9 + 7, nr = inverse(n, 1ull << 32);

u32 reduce(u64 x) {
    u32 q = u32(x) * nr;      // q = x * n' mod r
    u64 m = (u64) q * n;      // m = q * n
    u32 y = (x - m) >> 32;    // y = (x - m) / r
    return x < m ? y + n : y; // if y < 0, add n to make it be in the [0, n) range
}
```
This last check is relatively cheap, but it is still on the critical path. If we are fine with the result being in the $[0, 2 \cdot n - 2]$ range instead of $[0, n)$, we can remove it and add $n$ to the result unconditionally:
```c++
u32 reduce(u64 x) {
    u32 q = u32(x) * nr;
    u64 m = (u64) q * n;
    u32 y = (x - m) >> 32;
    return y + n;
}
```
We can also move the `>> 32` operation one step earlier in the computation graph and compute $\lfloor x / r \rfloor - \lfloor m / r \rfloor$ instead of $(x - m) / r$. This is correct because the lower 32 bits of $x$ and $m$ are equal anyway, since

$$
m \equiv x \cdot n^\prime \cdot n \equiv x \pmod r
$$
But why would we voluntarily choose to perform two right-shifts instead of just one? This is beneficial because for `((u64) q * n) >> 32` we need to do a 32-by-32 multiplication and take the upper 32 bits of the result (which the x86 `mul` instruction [already writes](../hpc/arithmetic/integer/#128-bit-integers) in a separate register, so it doesn't cost anything), and the other right-shift `x >> 32` is not on the critical path.
```c++
u32 reduce(u64 x) {
    u32 q = u32(x) * nr;
    u32 m = ((u64) q * n) >> 32;
    return (x >> 32) + n - m;
}
```
One of the main advantages of Montgomery multiplication over other modular reduction methods is that it doesn't require very large data types: it only needs an $r \times r$ multiplication that extracts the lower and higher $r$ bits of the result, which [has special support](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=7395,7392,7269,4868,7269,7269,1820,1835,6385,5051,4909,4918,5051,7269,6423,7410,150,2138,1829,1944,3009,1029,7077,519,5183,4462,4490,1944,5055,5012,5055&techs=AVX,AVX2&text=mul) on most hardware and also makes it easily generalizable to [SIMD](../hpc/simd/) and larger data types:
```c++
typedef __uint128_t u128;

u64 reduce(u128 x) {
    u64 q = u64(x) * nr;
    u64 m = ((u128) q * n) >> 64;
    return (x >> 64) + n - m;
}
```
Note that a 128-by-64 modulo is not possible with general integer division tricks: the compiler [falls back](https://godbolt.org/z/fbEE4v4qr) to calling a slow [long arithmetic library function](https://github.com/llvm-mirror/compiler-rt/blob/69445f095c22aac2388f939bedebf224a6efcdaf/lib/builtins/udivmodti4.c#L22) to support it.
### Faster Inverse and Transform
Montgomery multiplication itself is fast, but it requires some precomputation:
- inverting $n$ modulo $r$ to compute $n^\prime$,
- transforming a number *to* the Montgomery space,
- transforming a number *from* the Montgomery space.
The last operation is already efficiently performed with the `reduce` procedure we just implemented, but the first two can be slightly optimized.

**Computing the inverse** $n^\prime = n^{-1} \bmod r$ can be done faster than with the extended Euclidean algorithm by taking advantage of the fact that $r$ is a power of two and using the following identity:
$$
a \cdot x \equiv 1 \bmod 2^k
\implies
a \cdot x \cdot (2 - a \cdot x) \equiv 1 \bmod 2^{2k}
$$
Proof (substituting $a \cdot x = 1 + m \cdot 2^k$ for some integer $m$):
$$
\begin{aligned}
a \cdot x \cdot (2 - a \cdot x)
   &= 2 \cdot a \cdot x - (a \cdot x)^2
\\ &= 2 \cdot (1 + m \cdot 2^k) - (1 + m \cdot 2^k)^2
\\ &= 2 + 2 \cdot m \cdot 2^k - 1 - 2 \cdot m \cdot 2^k - m^2 \cdot 2^{2k}
\\ &= 1 - m^2 \cdot 2^{2k}
\\ &\equiv 1 \pmod{2^{2k}}
\end{aligned}
$$
We can start with $x = 1$ as the inverse of $a$ modulo $2^1$ and apply this identity exactly $\log_2 \log_2 r$ times, each time doubling the number of correct bits in the inverse, somewhat reminiscent of [Newton's method](../hpc/arithmetic/newton/).
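As an illustrative sketch (assuming $r = 2^{32}$, so that unsigned 32-bit overflow performs the reduction modulo $r$ for free; `inverse_mod_pow2` is a hypothetical name):

```cpp
typedef unsigned int u32;

// compute n^{-1} mod 2^32 for an odd n:
// x = 1 is correct modulo 2^1, and each x *= (2 - n * x) step
// doubles the number of correct bits,
// so 5 iterations (1 -> 2 -> 4 -> 8 -> 16 -> 32 bits) suffice
u32 inverse_mod_pow2(u32 n) {
    u32 x = 1;
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;
    return x;
}
```

For example, `inverse_mod_pow2(1000000007)` multiplied by `1000000007` in `u32` arithmetic wraps around to exactly `1`.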
**Transforming** a number into the Montgomery space can be done by multiplying it by $r$ and computing modulo [the usual way](../hpc/arithmetic/division/), but we can also take advantage of this relation:
$$
\bar{x} = x \cdot r \bmod n = x * r^2
$$
Transforming a number into the space is just a Montgomery multiplication by $r^2$ (this is what the $*$ above denotes). Therefore, we can precompute $r^2 \bmod n$ and perform a multiplication and reduction instead, which may or may not be actually faster, because multiplying a number by $r=2^{k}$ can be implemented with a left shift, while multiplication by $r^2 \bmod n$ cannot.
### Complete Implementation

It is convenient to wrap everything into a single `constexpr` structure:

```c++
struct Montgomery {
    u32 n, nr;
    
    constexpr Montgomery(u32 n) : n(n), nr(1) {
        // 5 iterations double the precision from 1 bit to 32
        for (int i = 0; i < 5; i++)
            nr *= 2 - n * nr;
    }

    u32 reduce(u64 x) const {
        u32 q = u32(x) * nr;
        u32 m = ((u64) q * n) >> 32;
        return (x >> 32) + n - m;
        // returns a number in the [0, 2 * n - 2] range
        // (add a "x < n ? x : x - n" type of check if you need a proper modulo)
    }

    u32 multiply(u32 x, u32 y) const {
        return reduce((u64) x * y);
    }

    u32 transform(u32 x) const {
        return (u64(x) << 32) % n;
        // can also be implemented as multiply(x, r^2 mod n)
    }
};
```
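A typical round-trip through the Montgomery space then looks as follows. This is a self-contained sketch that restates the structure for completeness; the modulo `M` and the `modmul` wrapper with its final range fix-up are assumptions of this example, not part of the article's code:

```cpp
typedef unsigned int u32;
typedef unsigned long long u64;

struct Montgomery {
    u32 n, nr;

    constexpr Montgomery(u32 n) : n(n), nr(1) {
        for (int i = 0; i < 5; i++)
            nr *= 2 - n * nr; // Newton-like inversion of n modulo 2^32
    }

    u32 reduce(u64 x) const {
        u32 q = u32(x) * nr;
        u32 m = ((u64) q * n) >> 32;
        return (x >> 32) + n - m; // not a proper modulo: may be as large as ~2n
    }

    u32 multiply(u32 x, u32 y) const { return reduce((u64) x * y); }
    u32 transform(u32 x) const { return (u64(x) << 32) % n; }
};

const u32 M = 1e9 + 7; // example modulo (an assumption of this sketch)
constexpr Montgomery space(M);

// compute (a * b) mod M by a round-trip through the Montgomery space
u32 modmul(u32 a, u32 b) {
    u32 x = space.transform(a);
    u32 y = space.transform(b);
    u32 c = space.reduce(space.multiply(x, y));
    return c >= M ? c - M : c; // fix-up, since reduce may overshoot by M
}
```

In real applications, the `transform`/`reduce` pair is done once, and many `multiply` calls happen in between.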
To test its performance, we can plug Montgomery multiplication into the [binary exponentiation](../hpc/number-theory/exponentiation/):
```c++
constexpr Montgomery space(M);

int inverse(int _a) {
    u64 a = space.transform(_a);
    u64 r = space.transform(1);
    
    int n = M - 2;
    while (n) {
        if (n & 1)
            r = space.multiply(r, a);
        a = space.multiply(a, a);
        n >>= 1;
    }

    return space.reduce(r);
}
```
While vanilla binary exponentiation with a compiler-generated fast modulo trick requires ~170ns per `inverse` call, this implementation takes ~166ns, going down to ~158ns if we omit `transform` and `reduce` (a reasonable use case: in modular arithmetic, `inverse` is often just a subprocedure in a bigger computation). This is a small improvement, but Montgomery multiplication becomes much more advantageous for SIMD applications and larger data types.