Skip to content

Conversation

@AntonLapounov
Copy link
Member

Similar to #74325, uniformly exclude the g_GCShadowEnd, g_ephemeral_high, and g_highest_address upper bounds in the corresponding range checks.

Note that JNB, JAE, and JNC mnemonics correspond to the same processor instruction (jump if CF = 0). I used JNB for amd64 and JAE for i386 for "local consistency" with other range checks in the corresponding files.

The Windows ARM64 version would trash the x16 register in the CheckCardTable block. That was not reflected in the comments — neither in this file nor in src\coreclr\jit\targetarm64.h. I am not convinced that ldp improves performance here as it requires an additional adr instruction and increases the average number of retired instructions. The Linux ARM64 version uses two ldr instructions instead and I decided to use the same for Windows.

@ghost ghost added the area-VM-coreclr label Sep 12, 2022
@ghost ghost assigned AntonLapounov Sep 12, 2022
@AntonLapounov
Copy link
Member Author

@VSadov Please take a look.

bhi Exit
ldr x12, wbs_ephemeral_high
cmp x15, x12
bhs Exit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .S version is using ccmp here. I think we should use the same pattern in both cases.

I wonder which would be better though. In theory the ephemeral check should more often fail than pass, statistically. In a large heap most objects are tenured. Ephemeral set is supposed to be small.

On the other hand both patterns probably work the same on a speculative CPU, but the predicated pattern is shorter by one instruction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And on server GC ephemeral check will always pass. I think, since we are touching this, it should be switched to the predicated form like in .S

@mangod9
Copy link
Member

mangod9 commented Mar 6, 2023

@AntonLapounov @VSadov is this something we need to continue to fix?

@VSadov
Copy link
Member

VSadov commented Mar 6, 2023

The changes from edge inclusive to edge exclusive compare are bug fixes. These kind of issues would be difficult to hit, but it makes sense to have correct comparisons.

I am not very concerned about NoShadow code paths as that is debug-only code and I do not know if WRITE_BARRIER_CHECK is used regularly or at all.
The changes in CheckCardTable are in the "real" code.

My only comment on that was that since we are using predicated form in unix ARM64 helper, perhaps we should use predicated form in the windows counterpart as well. It likely does not matter much which form is used, that would be mostly for consistency and predicated form is one instruction shorter.

I could do the change if @AntonLapounov is ok with that.

Otherwise this change LGTM

@AntonLapounov
Copy link
Member Author

I will try to complete it this week.

@trylek
Copy link
Member

trylek commented Apr 10, 2023

Now that Anton has been reassigned to other work outside of .NET, can someone on @dotnet/gc and / or @dotnet/jit-contrib chime in who would be best suited to finalize and merge this PR?

@VSadov
Copy link
Member

VSadov commented Apr 10, 2023

There was only one actionable suggestion - to use similar code in win-arm64 as in unix-arm64 in one place. There is no reason for the difference.
I will make the change.

@trylek trylek assigned VSadov and unassigned AntonLapounov Apr 10, 2023
@Maoni0
Copy link
Member

Maoni0 commented Apr 10, 2023

I think @cshung was making use of GC shadow at one point when he was debugging some WB stuff and jit folks probably use it for the most part.

@VSadov
Copy link
Member

VSadov commented Apr 10, 2023

I've made the change to address my earlier concern - no need to use different ways to compare on windows and on unix.
With that the change looks good to me.

@VSadov
Copy link
Member

VSadov commented Apr 10, 2023

I may need another signoff on this, or it would look like I am signing off on my own PR

Copy link
Member

@Maoni0 Maoni0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@VSadov
Copy link
Member

VSadov commented Apr 11, 2023

Thanks!!

@VSadov VSadov merged commit dbe9f2a into dotnet:main Apr 11, 2023
@trylek
Copy link
Member

trylek commented Apr 11, 2023

Thanks to everyone involved for getting this finished!

@EgorBo
Copy link
Member

EgorBo commented Apr 20, 2023

@VSadov
Copy link
Member

VSadov commented Apr 20, 2023

Switching to predicated bounds check resulted in measurable improvement. I mostly expected just more compact code.

I was considering making the same change to NativeAOT barriers, but just shortening the code was not enough motivation.
It is a trivial change. I will apply the same to NativeAOT.

@VSadov
Copy link
Member

VSadov commented Apr 20, 2023

@EgorBo - in addition to 14 improvements there was one regression in CastingPerf2.CastingPerf.FooObjIsNull. Was that a one time outlier or it stayed in later runs? Is there a way to check?

@EgorBo
Copy link
Member

EgorBo commented Apr 20, 2023

@EgorBo - in addition to 14 improvements there was one regression in CastingPerf2.CastingPerf.FooObjIsNull. Was that a one time outlier or it stayed in later runs? Is there a way to check?

We can revise that one in a week since there are not enough data point to judge yet

image

@ghost ghost locked as resolved and limited conversation to collaborators May 20, 2023
@DrewScoggins
Copy link
Member

This ended up being a lasting regression.

image

@VSadov
Copy link
Member

VSadov commented Aug 28, 2023

This ended up being a lasting regression.

The changed code performs two comparisons (like in logical &&).
The change appears to penalize slightly the cases where the first compare would fail and the second would not happen, while cases when both compares would be performed are slightly faster. Having a null for the object is one case when compare could short-circuit.
The difference is relatively small and only seen because this is such a tight loop benchmark.

Either flavor of this code would penalize one case or another. Considering we have many more improvements and the new variant is shorter, I think this regression is acceptable.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants