Do unsigned magic division #10996

mikedn · 2017-04-15T18:44:23Z

Fixes #10970

russellhadley · 2017-04-15T20:05:07Z

src/jit/lower.cpp

+//   and Peter L. Montgomery in PLDI 94
+
+template <typename U>
+U GetUnsignedMagicNumberForDivide(U denom, int* shift /*out*/, bool* add /*out*/)


I'm interested to see if this shows up in throughput. Just reading through, I was wondering if it would be worth pre-computing some set of common cases and leaving them in a table?

There's only one occurrence in corelib and a couple of others in System.Runtime.Numerics and System.Xml.ReaderWriter so there's not much to measure.

FWIW I measured how much time this function takes for denom = 10 and it's ~12ns on my machine. The time seems to reach ~70ns for larger denominators.

The book from which this code was adapted actually suggests the possibility of using a precomputed table but then the book was written in 1996, computers where kind of slow back then :)

Interesting, the VC++ compiler has a table for 3-12.

This is the classic algorithm, isn't it?
Did you consider Mölller and Granlund's updated one from 2011: Improved Division by Invariant Integers?

@Tornhoof Yes, it's the original algorithm. I've seen "Improved Division by Invariant Integers" but it seems that it skewed towards multi word division so I didn't pay too much attention to it. Also, one of its basic premises is that that MULHI is slightly slower than a normal MUL but that doesn't seem to be the case anymore on current CPUs. Granlund's own instruction timing tables show that the latency dropped from 10 cycles (on Nehalem) to 3 cycles (on Haswell).

I was actually looking at another potential improvement: http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf. It simplifies a bit the adjustment code required by stubborn divisors like 7. But I haven't seen any compiler using that and I have no idea if it's actually correct.

Thanks for the explanation

danmoseley · 2017-04-18T00:07:59Z

@dotnet-bot test Tizen armel Cross Release Build please (break)
@dotnet-bot test Ubuntu arm Cross Release Build please (break)

mikedn · 2017-05-27T10:38:33Z

Well, this is pretty much done but there are a few questions:

This implements the "classic" algorithm that everyone uses. There is another algorithm for unsigned division that generates slightly shorter code for certain divisors (e.g. 7) and, interesting in the context of JIT, avoids the need for creating a temporary lclvar. But no native compiler I looked at (cl, clang, gcc, icc) uses this algorithm and I have no idea why. Perhaps it's actually incorrect, perhaps there are legal issues (though the reference code is in public domain). Should we looking into this more?
Should we use a precomputed table? This applies to the signed version as well.
Should this optimization be done in minopts/debug mode? This applies to the signed version as well.

russellhadley · 2017-05-30T15:54:29Z

1 - This seems like a good place to start - I think we should get this in before taking the next step.
2 - I think the pre-computed table is a good idea - for the "we're a JIT" reason.
3 - My thought is that we should only do this under opts. (not minopt) It doesn't reduce IR node count so it won't likely make a speedup.

mikedn · 2017-06-06T16:25:06Z

I've added precomputed tables and disabled magic division optimization under minopts (for both signed and unsigned division). The tables cover the range 3-12, they're identical to the ones found in "The PowerPC Compiler Writer's Guide" so they're easy to double check.

russellhadley · 2017-06-06T16:34:35Z

Remove the WIP tag on the PR? @pgavlin do you have any feedback here? LGTM.

pgavlin

LGTM aside from a few small comments.

pgavlin · 2017-06-06T17:04:24Z

src/jit/lower.cpp


-        delta = absDenom - r2;
-    } while (q1 < delta || (q1 == delta && r1 == 0));
+        // Insert a new GT_MULHI node in front of the existing GT_UDIV/GT_UMOD node.


Nit: s/in front of/before/

pgavlin · 2017-06-06T17:16:45Z

src/jit/lower.cpp

+            divMod->gtOp1 = dividend;
+            divMod->gtOp2 = mul;
+
+            BlockRange().InsertBefore(divMod, dividend, div, divisor, mul);


Nit: I think this is probably better ordered as BlockRange().InsertBefore(divMod, div, divisor, mul, dividend) in order to keep the lifetime of dividend shorter in the case its lclVar does not end up tracked.

in the case its lclVar does not end up tracked

Heh, that would be quite unfortunate. I'll update.

pgavlin · 2017-06-06T17:20:13Z

src/jit/utils.cpp

+template <>
+const UnsignedMagic<uint32_t>* TryGetUnsignedMagic(uint32_t divisor)
+{
+    // clang-format off


What sort of results were you getting from clang-format here?

static const UnsignedMagic<uint32_t> table[]{ {0xaaaaaaab, false, 1}, // 3 {}, {0xcccccccd, false, 2}, // 5 {0xaaaaaaab, false, 2}, // 6 {0x24924925, true, 3}, // 7 {}, {0x38e38e39, false, 1}, // 9 {0xcccccccd, false, 3}, // 10 {0xba2e8ba3, false, 3}, // 11 {0xaaaaaaab, false, 3}, // 12 };

I suppose it's reasonable, though I hate the misplaced opening brace :)

Got rid of custom formatting, it's not worth it.

pgavlin · 2017-06-06T17:22:00Z

src/jit/utils.cpp

+    typedef typename jitstd::make_signed<T>::type ST;
+
+    const unsigned bits         = sizeof(T) * 8;
+    const unsigned bits_minus_1 = bits - 1;


Nit: bitsMinus1, twoNMinus1, etc.

Yeah, I should change those. I tried to be consistent with the signed version but there's no point in doing that.

mikedn · 2017-06-07T16:41:35Z

jit-diff summary:

Total bytes of diff: 143 (0.00 % of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         133 : System.Runtime.Numerics.dasm (0.23 % of base)
          10 : System.Private.CoreLib.dasm (0.00 % of base)
2 total files with size differences (0 improved, 2 regressed), 77 unchanged.
Top method regessions by size (bytes):
         122 : System.Runtime.Numerics.dasm - BigNumber:FormatBigInteger(struct,ref,ref):ref
          11 : System.Runtime.Numerics.dasm - Number:Int32ToDecChars(long,byref,int,int)
          10 : System.Private.CoreLib.dasm - TimerQueue:get_TickCount():int
3 total methods with size differences (0 improved, 3 regressed), 65548 unchanged.

All diffs come from the newly added unsigned magic division optimization (that is, the signed magic division changes don't generate any diffs as expected).

1 occurrence in corelib and a few more in System.Runtime.Numerics, some BigInteger formatting code.

pgavlin · 2017-06-07T16:42:33Z

All diffs come from the newly added unsigned magic division optimization (that is, the signed magic division changes don't generate any diffs as expected).

Can you provide an example of one of the diffs?

mikedn · 2017-06-07T16:46:18Z

A x % 10 from BigInteger code:

; before
       mov      r15d, 10
       mov      eax, ecx
       xor      rdx, rdx
       div      edx:eax, r15d
       add      edx, 48
; after
       mov      eax, ecx
       mov      edx, 0xD1FFAB1E
       mov      dword ptr [rbp-64H], eax
       mul      edx:eax, edx
       shr      edx, 3
       imul     eax, edx, 10
       mov      edx, dword ptr [rbp-64H]
       sub      edx, eax
       add      edx, 48

The temporary lclvar makes an appearance...

pgavlin · 2017-06-07T16:53:44Z

The temporary lclvar makes an appearance...

Yeah, it would be nice if that reload was just a copy from ecx. Seems okay to me, though.

cc @CarolEidt @BruceForstall @dotnet/jit-contrib

pgavlin · 2017-06-07T19:45:16Z

OS X failure looks infrastructural in nature.

@dotnet-bot test OSX10.12 x64 Checked Build and Test

CarolEidt

LGTM

CarolEidt · 2017-06-08T00:16:13Z

src/jit/lower.cpp

-        q2 *= 2; // update q2 = 2^p / abs(denom)
-        r2 *= 2; // update r2 = rem(2^p / abs(denom))
+        // Depending on the "add" flag returned by GetUnsignedMagicNumberForDivide we need to generate:
+        // add == false (when divisor == 3 for example):


Minor: I would probably have named the flag needsAdd or useAdd or something that implies that it is a boolean.

"add" is the name used in "The PowerPC Compiler Writer's Guide". I don't like it but I don't think any variations around "add" are any better, IMO it should somehow indicate why that addition needs to be performed. Maybe I'll change it to "overflow" or something similar, it really indicates that the magic number is too large to fit in 32/64 bit.

dnfclas added the cla-already-signed label Apr 15, 2017

russellhadley reviewed Apr 15, 2017

View reviewed changes

mikedn force-pushed the magic-udiv branch 3 times, most recently from 0c0561f to 2d57e68 Compare April 17, 2017 11:43

mikedn force-pushed the magic-udiv branch from 2d57e68 to 6ca8389 Compare April 23, 2017 07:58

mikedn force-pushed the magic-udiv branch from 6ca8389 to c3ebcd7 Compare May 6, 2017 06:26

mikedn force-pushed the magic-udiv branch 2 times, most recently from 1431022 to c29c529 Compare May 27, 2017 08:32

mikedn force-pushed the magic-udiv branch 2 times, most recently from 725eb24 to 0cce79d Compare May 31, 2017 17:52

mikedn changed the title ~~[WIP] Do unsigned magic division~~ Do unsigned magic division Jun 6, 2017

pgavlin approved these changes Jun 6, 2017

View reviewed changes

mikedn added 4 commits June 7, 2017 19:29

Do unsigned magic division

64a4d49

Move GetSignedMagicNumberForDivide to utils.cpp

0435fbb

Add signed magic division tables

a09c598

Disable signed magic division in minopts mode

1a5d5e5

mikedn force-pushed the magic-udiv branch from 0cce79d to 1a5d5e5 Compare June 7, 2017 16:35

CarolEidt approved these changes Jun 8, 2017

View reviewed changes

pgavlin merged commit e3c4bd2 into dotnet:master Jun 9, 2017

karelz modified the milestone: 2.1.0 Aug 28, 2017

mikedn deleted the magic-udiv branch September 28, 2019 19:27

pentp mentioned this pull request Mar 13, 2021

Faster unsigned division by constants dotnet/runtime#49585

Merged

Do unsigned magic division #10996

Do unsigned magic division #10996

Uh oh!

Conversation

mikedn commented Apr 15, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danmoseley commented Apr 18, 2017

Uh oh!

mikedn commented May 27, 2017

Uh oh!

russellhadley commented May 30, 2017

Uh oh!

mikedn commented Jun 6, 2017

Uh oh!

russellhadley commented Jun 6, 2017

Uh oh!

pgavlin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mikedn commented Jun 7, 2017

Uh oh!

pgavlin commented Jun 7, 2017

Uh oh!

mikedn commented Jun 7, 2017

Uh oh!

pgavlin commented Jun 7, 2017

Uh oh!

pgavlin commented Jun 7, 2017

Uh oh!

CarolEidt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants