Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@mikedn
Copy link

@mikedn mikedn commented Apr 15, 2017

Fixes #10970

// and Peter L. Montgomery in PLDI 94

template <typename U>
U GetUnsignedMagicNumberForDivide(U denom, int* shift /*out*/, bool* add /*out*/)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested to see if this shows up in throughput. Just reading through, I was wondering if it would be worth pre-computing some set of common cases and leaving them in a table?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one occurrence in corelib and a couple of others in System.Runtime.Numerics and System.Xml.ReaderWriter so there's not much to measure.

FWIW I measured how much time this function takes for denom = 10 and it's ~12ns on my machine. The time seems to reach ~70ns for larger denominators.

The book from which this code was adapted actually suggests the possibility of using a precomputed table but then the book was written in 1996, computers where kind of slow back then :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, the VC++ compiler has a table for 3-12.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the classic algorithm, isn't it?
Did you consider Mölller and Granlund's updated one from 2011: Improved Division by Invariant Integers?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tornhoof Yes, it's the original algorithm. I've seen "Improved Division by Invariant Integers" but it seems that it skewed towards multi word division so I didn't pay too much attention to it. Also, one of its basic premises is that that MULHI is slightly slower than a normal MUL but that doesn't seem to be the case anymore on current CPUs. Granlund's own instruction timing tables show that the latency dropped from 10 cycles (on Nehalem) to 3 cycles (on Haswell).

I was actually looking at another potential improvement: http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf. It simplifies a bit the adjustment code required by stubborn divisors like 7. But I haven't seen any compiler using that and I have no idea if it's actually correct.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation

@mikedn mikedn force-pushed the magic-udiv branch 3 times, most recently from 0c0561f to 2d57e68 Compare April 17, 2017 11:43
@danmoseley
Copy link
Member

@dotnet-bot test Tizen armel Cross Release Build please (break)
@dotnet-bot test Ubuntu arm Cross Release Build please (break)

@mikedn mikedn force-pushed the magic-udiv branch 2 times, most recently from 1431022 to c29c529 Compare May 27, 2017 08:32
@mikedn
Copy link
Author

mikedn commented May 27, 2017

Well, this is pretty much done but there are a few questions:

  1. This implements the "classic" algorithm that everyone uses. There is another algorithm for unsigned division that generates slightly shorter code for certain divisors (e.g. 7) and, interesting in the context of JIT, avoids the need for creating a temporary lclvar. But no native compiler I looked at (cl, clang, gcc, icc) uses this algorithm and I have no idea why. Perhaps it's actually incorrect, perhaps there are legal issues (though the reference code is in public domain). Should we looking into this more?
  2. Should we use a precomputed table? This applies to the signed version as well.
  3. Should this optimization be done in minopts/debug mode? This applies to the signed version as well.

@russellhadley
Copy link

1 - This seems like a good place to start - I think we should get this in before taking the next step.
2 - I think the pre-computed table is a good idea - for the "we're a JIT" reason.
3 - My thought is that we should only do this under opts. (not minopt) It doesn't reduce IR node count so it won't likely make a speedup.

@mikedn mikedn force-pushed the magic-udiv branch 2 times, most recently from 725eb24 to 0cce79d Compare May 31, 2017 17:52
@mikedn
Copy link
Author

mikedn commented Jun 6, 2017

I've added precomputed tables and disabled magic division optimization under minopts (for both signed and unsigned division). The tables cover the range 3-12, they're identical to the ones found in "The PowerPC Compiler Writer's Guide" so they're easy to double check.

@russellhadley
Copy link

Remove the WIP tag on the PR? @pgavlin do you have any feedback here? LGTM.

@mikedn mikedn changed the title [WIP] Do unsigned magic division Do unsigned magic division Jun 6, 2017
Copy link

@pgavlin pgavlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from a few small comments.


delta = absDenom - r2;
} while (q1 < delta || (q1 == delta && r1 == 0));
// Insert a new GT_MULHI node in front of the existing GT_UDIV/GT_UMOD node.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/in front of/before/

divMod->gtOp1 = dividend;
divMod->gtOp2 = mul;

BlockRange().InsertBefore(divMod, dividend, div, divisor, mul);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this is probably better ordered as BlockRange().InsertBefore(divMod, div, divisor, mul, dividend) in order to keep the lifetime of dividend shorter in the case its lclVar does not end up tracked.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case its lclVar does not end up tracked

Heh, that would be quite unfortunate. I'll update.

template <>
const UnsignedMagic<uint32_t>* TryGetUnsignedMagic(uint32_t divisor)
{
// clang-format off
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What sort of results were you getting from clang-format here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    static const UnsignedMagic<uint32_t> table[]{
        {0xaaaaaaab, false, 1}, // 3
        {},
        {0xcccccccd, false, 2}, // 5
        {0xaaaaaaab, false, 2}, // 6
        {0x24924925, true, 3},  // 7
        {},
        {0x38e38e39, false, 1}, // 9
        {0xcccccccd, false, 3}, // 10
        {0xba2e8ba3, false, 3}, // 11
        {0xaaaaaaab, false, 3}, // 12
    };

I suppose it's reasonable, though I hate the misplaced opening brace :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of custom formatting, it's not worth it.

typedef typename jitstd::make_signed<T>::type ST;

const unsigned bits = sizeof(T) * 8;
const unsigned bits_minus_1 = bits - 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: bitsMinus1, twoNMinus1, etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I should change those. I tried to be consistent with the signed version but there's no point in doing that.

@mikedn
Copy link
Author

mikedn commented Jun 7, 2017

jit-diff summary:

Total bytes of diff: 143 (0.00 % of base)
    diff is a regression.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file regressions by size (bytes):
         133 : System.Runtime.Numerics.dasm (0.23 % of base)
          10 : System.Private.CoreLib.dasm (0.00 % of base)
2 total files with size differences (0 improved, 2 regressed), 77 unchanged.
Top method regessions by size (bytes):
         122 : System.Runtime.Numerics.dasm - BigNumber:FormatBigInteger(struct,ref,ref):ref
          11 : System.Runtime.Numerics.dasm - Number:Int32ToDecChars(long,byref,int,int)
          10 : System.Private.CoreLib.dasm - TimerQueue:get_TickCount():int
3 total methods with size differences (0 improved, 3 regressed), 65548 unchanged.

All diffs come from the newly added unsigned magic division optimization (that is, the signed magic division changes don't generate any diffs as expected).

1 occurrence in corelib and a few more in System.Runtime.Numerics, some BigInteger formatting code.

@pgavlin
Copy link

pgavlin commented Jun 7, 2017

All diffs come from the newly added unsigned magic division optimization (that is, the signed magic division changes don't generate any diffs as expected).

Can you provide an example of one of the diffs?

@mikedn
Copy link
Author

mikedn commented Jun 7, 2017

A x % 10 from BigInteger code:

; before
       mov      r15d, 10
       mov      eax, ecx
       xor      rdx, rdx
       div      edx:eax, r15d
       add      edx, 48
; after
       mov      eax, ecx
       mov      edx, 0xD1FFAB1E
       mov      dword ptr [rbp-64H], eax
       mul      edx:eax, edx
       shr      edx, 3
       imul     eax, edx, 10
       mov      edx, dword ptr [rbp-64H]
       sub      edx, eax
       add      edx, 48

The temporary lclvar makes an appearance...

@pgavlin
Copy link

pgavlin commented Jun 7, 2017

The temporary lclvar makes an appearance...

Yeah, it would be nice if that reload was just a copy from ecx. Seems okay to me, though.

cc @CarolEidt @BruceForstall @dotnet/jit-contrib

@pgavlin
Copy link

pgavlin commented Jun 7, 2017

OS X failure looks infrastructural in nature.

@dotnet-bot test OSX10.12 x64 Checked Build and Test

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

q2 *= 2; // update q2 = 2^p / abs(denom)
r2 *= 2; // update r2 = rem(2^p / abs(denom))
// Depending on the "add" flag returned by GetUnsignedMagicNumberForDivide we need to generate:
// add == false (when divisor == 3 for example):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I would probably have named the flag needsAdd or useAdd or something that implies that it is a boolean.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"add" is the name used in "The PowerPC Compiler Writer's Guide". I don't like it but I don't think any variations around "add" are any better, IMO it should somehow indicate why that addition needs to be performed. Maybe I'll change it to "overflow" or something similar, it really indicates that the magic number is too large to fit in 32/64 bit.

@pgavlin pgavlin merged commit e3c4bd2 into dotnet:master Jun 9, 2017
@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017
@mikedn mikedn deleted the magic-udiv branch September 28, 2019 19:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants