Do unsigned magic division #10996
src/jit/lower.cpp
Outdated
| // and Peter L. Montgomery in PLDI 94 | ||
|
|
||
| template <typename U> | ||
| U GetUnsignedMagicNumberForDivide(U denom, int* shift /*out*/, bool* add /*out*/) |
I'm interested to see if this shows up in throughput. Just reading through, I was wondering if it would be worth pre-computing some set of common cases and leaving them in a table?
There's only one occurrence in corelib and a couple of others in System.Runtime.Numerics and System.Xml.ReaderWriter so there's not much to measure.
FWIW I measured how much time this function takes for denom = 10 and it's ~12ns on my machine. The time seems to reach ~70ns for larger denominators.
The book from which this code was adapted actually suggests the possibility of using a precomputed table, but then the book was written in 1996; computers were kind of slow back then :)
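For readers who want to see what such a function has to produce, here is a minimal brute-force sketch. It is not the incremental algorithm used in this PR: it simply tries shifts in increasing order and takes the first multiplier that satisfies the usual Granlund-Montgomery error bound. `MagicU32` and `ComputeMagicU32` are hypothetical names, and the sketch assumes a 32-bit divisor in the range 3 to 2^31 that is not a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type; the PR's own UnsignedMagic<T> layout may differ.
struct MagicU32
{
    uint32_t magic; // low 32 bits of the multiplier
    bool     add;   // true when the real multiplier needs 33 bits
    int      shift; // right shift applied after the high multiply
};

// Brute-force search for the smallest usable shift.
// Assumes 3 <= d <= 2^31 and that d is not a power of two.
MagicU32 ComputeMagicU32(uint32_t d)
{
    assert(d >= 3 && d <= 0x80000000u && (d & (d - 1)) != 0);

    for (int s = 0; s <= 31; s++)
    {
        const uint64_t pow = uint64_t(1) << (32 + s); // 2^(32+s)
        const uint64_t e   = d - (pow % d);           // distance to the next multiple of d

        // The multiplier m = ceil(2^(32+s) / d) is exact for all 32-bit
        // dividends once the rounding error e is at most 2^s.
        if (e <= (uint64_t(1) << s))
        {
            const uint64_t m = (pow + e) / d; // always below 2^33
            return {uint32_t(m), (m >> 32) != 0, s};
        }
    }

    assert(!"unreachable for valid input");
    return {0, false, 0};
}
```

For divisors 10 and 7 this yields the same multiplier, flag and shift as the precomputed table entries quoted later in the conversation.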
Interesting, the VC++ compiler has a table for 3-12.
This is the classic algorithm, isn't it?
Did you consider Möller and Granlund's updated one from 2011: Improved Division by Invariant Integers?
@Tornhoof Yes, it's the original algorithm. I've seen "Improved Division by Invariant Integers" but it seems to be skewed towards multi-word division so I didn't pay too much attention to it. Also, one of its basic premises is that MULHI is slightly slower than a normal MUL but that doesn't seem to be the case anymore on current CPUs. Granlund's own instruction timing tables show that the latency dropped from 10 cycles (on Nehalem) to 3 cycles (on Haswell).
I was actually looking at another potential improvement: http://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf. It somewhat simplifies the adjustment code required by stubborn divisors like 7. But I haven't seen any compiler using that and I have no idea if it's actually correct.
Thanks for the explanation
0c0561f to
2d57e68
Compare
|
@dotnet-bot test Tizen armel Cross Release Build please (break) |
1431022 to
c29c529
Compare
|
Well, this is pretty much done but there are a few questions:
|
|
1 - This seems like a good place to start - I think we should get this in before taking the next step. |
725eb24 to
0cce79d
Compare
|
I've added precomputed tables and disabled the magic division optimization under minopts (for both signed and unsigned division). The tables cover the range 3-12; they're identical to the ones found in "The PowerPC Compiler Writer's Guide" so they're easy to double check.
|
Remove the WIP tag on the PR? @pgavlin do you have any feedback here? LGTM. |
pgavlin
left a comment
LGTM aside from a few small comments.
src/jit/lower.cpp
Outdated
|
|
||
| delta = absDenom - r2; | ||
| } while (q1 < delta || (q1 == delta && r1 == 0)); | ||
| // Insert a new GT_MULHI node in front of the existing GT_UDIV/GT_UMOD node. |
Nit: s/in front of/before/
src/jit/lower.cpp
Outdated
| divMod->gtOp1 = dividend; | ||
| divMod->gtOp2 = mul; | ||
|
|
||
| BlockRange().InsertBefore(divMod, dividend, div, divisor, mul); |
Nit: I think this is probably better ordered as BlockRange().InsertBefore(divMod, div, divisor, mul, dividend) in order to keep the lifetime of dividend shorter in the case its lclVar does not end up tracked.
in the case its lclVar does not end up tracked
Heh, that would be quite unfortunate. I'll update.
src/jit/utils.cpp
Outdated
| template <> | ||
| const UnsignedMagic<uint32_t>* TryGetUnsignedMagic(uint32_t divisor) | ||
| { | ||
| // clang-format off |
What sort of results were you getting from clang-format here?
static const UnsignedMagic<uint32_t> table[]{
{0xaaaaaaab, false, 1}, // 3
{},
{0xcccccccd, false, 2}, // 5
{0xaaaaaaab, false, 2}, // 6
{0x24924925, true, 3}, // 7
{},
{0x38e38e39, false, 1}, // 9
{0xcccccccd, false, 3}, // 10
{0xba2e8ba3, false, 3}, // 11
{0xaaaaaaab, false, 3}, // 12
};
I suppose it's reasonable, though I hate the misplaced opening brace :)
Got rid of custom formatting, it's not worth it.
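As a rough illustration of how a table like the one quoted above might be consulted, here is a sketch of a lookup that indexes by `divisor - 3` and treats the empty entries (4 and 8, which are powers of two and need only a shift) as "no magic number". The struct name, its field names and `TryGetUnsignedMagic32` are assumptions made for this example, not the PR's actual declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout; field order follows the quoted initializers.
struct UnsignedMagic32
{
    uint32_t magic;
    bool     add;
    int      shift;
};

// Returns the table entry for divisors 3..12, or nullptr when the divisor
// is outside the table or is a power of two (empty entry).
const UnsignedMagic32* TryGetUnsignedMagic32(uint32_t divisor)
{
    static const UnsignedMagic32 table[] = {
        {0xaaaaaaab, false, 1}, // 3
        {},                     // 4
        {0xcccccccd, false, 2}, // 5
        {0xaaaaaaab, false, 2}, // 6
        {0x24924925, true, 3},  // 7
        {},                     // 8
        {0x38e38e39, false, 1}, // 9
        {0xcccccccd, false, 3}, // 10
        {0xba2e8ba3, false, 3}, // 11
        {0xaaaaaaab, false, 3}, // 12
    };
    const size_t count = sizeof(table) / sizeof(table[0]);

    if (divisor < 3 || divisor - 3 >= count)
    {
        return nullptr;
    }

    const UnsignedMagic32* entry = &table[divisor - 3];
    return (entry->magic == 0) ? nullptr : entry;
}
```

A caller would fall back to computing the magic number whenever this returns nullptr.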
src/jit/utils.cpp
Outdated
| typedef typename jitstd::make_signed<T>::type ST; | ||
|
|
||
| const unsigned bits = sizeof(T) * 8; | ||
| const unsigned bits_minus_1 = bits - 1; |
Nit: bitsMinus1, twoNMinus1, etc.
Yeah, I should change those. I tried to be consistent with the signed version but there's no point in doing that.
|
jit-diff summary: All diffs come from the newly added unsigned magic division optimization (that is, the signed magic division changes don't generate any diffs as expected). 1 occurrence in corelib and a few more in System.Runtime.Numerics, some |
Can you provide an example of one of the diffs? |
|
A ; before
mov r15d, 10
mov eax, ecx
xor rdx, rdx
div edx:eax, r15d
add edx, 48
; after
mov eax, ecx
mov edx, 0xD1FFAB1E
mov dword ptr [rbp-64H], eax
mul edx:eax, edx
shr edx, 3
imul eax, edx, 10
mov edx, dword ptr [rbp-64H]
sub edx, eax
add edx, 48
The temporary lclvar makes an appearance...
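In C++ terms, the "after" sequence corresponds roughly to the sketch below. It assumes the multiplier for 10 is 0xcccccccd with a shift of 3, as in the precomputed table quoted earlier in this conversation; the constant printed in the disassembly above is kept as-is and is not used here. `DigitChar` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Computes (x % 10) + 48 without a div instruction.
uint32_t DigitChar(uint32_t x)
{
    const uint64_t product   = uint64_t(x) * 0xcccccccdu;   // mul edx:eax, edx
    const uint32_t quotient  = uint32_t(product >> 32) >> 3; // shr edx, 3
    const uint32_t remainder = x - quotient * 10;            // imul eax, edx, 10 + sub
    return remainder + 48;                                   // add edx, 48
}
```

The dividend is needed twice (once for the multiply, once for the subtract), which is why the generated code keeps it in a temporary.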
Yeah, it would be nice if that reload was just a copy from cc @CarolEidt @BruceForstall @dotnet/jit-contrib |
|
OS X failure looks infrastructural in nature. @dotnet-bot test OSX10.12 x64 Checked Build and Test |
CarolEidt
left a comment
LGTM
| q2 *= 2; // update q2 = 2^p / abs(denom) | ||
| r2 *= 2; // update r2 = rem(2^p / abs(denom)) | ||
| // Depending on the "add" flag returned by GetUnsignedMagicNumberForDivide we need to generate: | ||
| // add == false (when divisor == 3 for example): |
Minor: I would probably have named the flag needsAdd or useAdd or something that implies that it is a boolean.
"add" is the name used in "The PowerPC Compiler Writer's Guide". I don't like it but I don't think any variations around "add" are any better; IMO it should somehow indicate why that addition needs to be performed. Maybe I'll change it to "overflow" or something similar, since it really indicates that the magic number is too large to fit in 32/64 bits.
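To make the "too large to fit" case concrete, here is a sketch for divisor 7, using the table entry {0x24924925, true, 3} quoted earlier. The full multiplier is 2^32 + 0x24924925, so only its low 32 bits are multiplied and the missing `x * 2^32` term is folded back in with a subtract, a halving shift and an add that cannot overflow. `DivideBy7` is a hypothetical name and this is the textbook sequence, which may differ in detail from what the JIT emits.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned x / 7 with a 33-bit magic number (add == true, shift == 3).
uint32_t DivideBy7(uint32_t x)
{
    // High 32 bits of x * (low 32 bits of the multiplier); t <= x always holds.
    const uint32_t t = uint32_t((uint64_t(x) * 0x24924925u) >> 32);

    // ((x - t) >> 1) + t equals (x + t) >> 1 without overflowing 32 bits;
    // the remaining shift is shift - 1 == 2.
    return (((x - t) >> 1) + t) >> 2;
}
```

When the flag is false, the whole sequence collapses to a single high multiply followed by the shift.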
Fixes #10970