src/coreclr/pal/src/include/pal/virtual.h (9 additions, 2 deletions)
@@ -180,17 +180,24 @@ class ExecutableMemoryAllocator
int32_t GenerateRandomStartOffset();

private:

    // There does not seem to be an easy way to find the size of a library on Unix,
    // so this constant represents an approximation of the libcoreclr size (on a debug build)
    // that can be used to calculate an approximate location of memory that
    // is in the 2GB range from the coreclr library. A precise size of libcoreclr
    // is not necessary for these calculations.
-    static const int32_t CoreClrLibrarySize = 100 * 1024 * 1024;
+    static const int32_t CoreClrLibrarySize = 32 * 1024 * 1024;

#ifdef TARGET_XARCH
    // This constant represents the max size of the virtual memory that this allocator
    // will try to reserve during initialization. We want all JIT-ed code and the
    // entire libcoreclr to be located in a 2GB range.
-    static const int32_t MaxExecutableMemorySize = 0x7FFF0000;
+    static const int32_t MaxExecutableMemorySize = 0x7FFF0000; // 2GB - 64KB
#else
    // Smaller size for ARM64, where relative calls/jumps only work within 128MB
    static const int32_t MaxExecutableMemorySize = 0x7FF0000; // 128MB - 64KB
Member:
How much of this range gets consumed in a real-world ASP.NET app?

Note that we map all sorts of stuff into this range, including R2R images. I would expect that 128MB gets exhausted fairly quickly given how things work today.

I agree that this fix works well for micro-benchmarks that are very unlikely to exhaust the 128MB range.

Member Author (@EgorBo, Jan 16, 2022):

Will try to investigate; I guess the idea is that hot code (tier 1) should be closer to the VM's FCalls by default?
For large apps, I guess we still want to address your suggestion of emitting direct addresses without jump-stubs in tier 1: #62302 (I have a draft)

For R2R-only, could it be improved by PGO + method sorting?

Member:

I do not think R2R code today benefits from being close to coreclr as all 'external' calls/branches have to go through indirection cells anyway. This may have been different in the days of fragile ngen?
Please correct me if I'm wrong @jkotas.

Member:

> For large apps I guess we still want to try to address your suggestion with emitting direct addresses without jump-stubs

I do not think it is just for large apps. With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as jump stub and so it will introduce similar bottleneck as what you have identified.

> For R2R-only it can be improved by pgo + method sorting?

R2R images are generally smaller than 128MB. You can only sort within the image, so the sorting won't help with jump stubs. (Sorting within image is still good for locality.)

Also, once we get this all fixed, we may want to look at retuning the inliner. My feeling is that the inliner expands the code too much these days. Some of it may be just compensating for the extra method call overhead that we are paying today.

Member:

> I do not think R2R code today benefits from being close to coreclr as all 'external' calls/branches have to go through indirection cells anyway.

Calls from runtime generated stubs and JITed code to R2R code still benefit from the two being close.

Member:

> JITed code to R2R code

Do these not go through an indirection when tiering is enabled?

Member:

> Do these not go through an indirection when tiering is enabled?

Yes - when tiering is enabled. No - when tiering is disabled.

Member Author:

> With TC enabled, all managed->managed method calls go through a precode that has the exact same instructions as jump stub and so it will introduce similar bottleneck as what you have identified.

I'll start with your suggestion to emit direct calls for "T1 caller calls T1 callee" (not as part of this PR)

#endif

static const int32_t MaxExecutableMemorySizeNearCoreClr = MaxExecutableMemorySize - CoreClrLibrarySize;

// Start address of the reserved virtual address space
...
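For context on the constants in this hunk: an x64 rel32 call/jmp encodes a signed 32-bit byte displacement (roughly a 2GB reach), while an AArch64 B/BL encodes a signed 26-bit displacement counted in 4-byte instructions, i.e. plus or minus 2^27 bytes = 128MB. A minimal sketch of the resulting reachability checks (hypothetical helper names, not the runtime's actual code):

```cpp
#include <cstdint>

// Hypothetical helpers (illustration only, not the runtime's actual code)
// showing where the 2GB and 128MB limits come from.

// x64: call/jmp rel32 carries a signed 32-bit byte displacement, so the
// target must be within roughly +/-2GB of the call site
// (the next-instruction adjustment is ignored here for simplicity).
static bool FitsInRel32(intptr_t from, intptr_t to)
{
    intptr_t delta = to - from;
    return delta >= INT32_MIN && delta <= INT32_MAX;
}

// AArch64: B/BL carry a signed 26-bit displacement counted in 4-byte
// instructions, i.e. +/-2^27 bytes = +/-128MB.
static bool FitsInArm64Branch(intptr_t from, intptr_t to)
{
    intptr_t delta = to - from;
    return delta >= -(intptr_t)0x8000000 && delta < (intptr_t)0x8000000;
}
```

When such a check fails, the runtime has to route the call through a jump stub (a small trampoline that can reach the full address space), which is the overhead this PR tries to avoid by reserving JIT memory near libcoreclr.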
src/coreclr/utilcode/executableallocator.cpp (5 additions, 0 deletions)
@@ -66,7 +66,12 @@ void ExecutableAllocator::InitLazyPreferredRange(size_t base, size_t size, int r
// to coreclr.dll. This avoids having to create jump stubs for calls to
// helpers and R2R images loaded close to coreclr.dll.
//
+#ifdef TARGET_XARCH
    SIZE_T reach = 0x7FFF0000u;
+#else
+    // Smaller size for ARM64, where relative calls/jumps only work within 128MB
+    SIZE_T reach = 0x7FF0000u;
Member:

This may collide with other ASLR assigned addresses and lead to non-trivial private working set hits for native .dlls or R2R images that are loaded after coreclr is initialized.

We try to start as far as possible from coreclr base address to avoid that situation on x64. Look for the comment below: "We try to occupy the space as far as possible to minimize collisions with other ASLR assigned addresses."

Member Author (@EgorBo, Jan 17, 2022):

Are you saying that win-arm64 is fine as is?

I am still trying to understand the problem. We reserve 2GB via, I assume, VirtualAlloc; how do we end up with a large distance between coreclr code and jitted code in memory, even for a hello-world?

Member Author:

Just checked: a win-arm64 app emits fewer jump-stubs for a completely empty app: only 4 (two inside StelemRef and two inside IndexOf)

Member (@jkotas, Jan 17, 2022):

> Are you saying that win-arm64 is fine as is?

I am saying that this change is potentially replacing one performance problem with a different performance problem on win-arm64.

All modern OSes have address space layout randomization. Our attempts to allocate near coreclr library are going against that. So we have to be careful not to be on collision course with what the OSes are trying to do.

> win-arm64 app emits less jump-stubs for a completely empty app

Before this change, or only with this change?

Member Author:

> Before this change, or only with this change?

Before (Main)

Member:

Do you understand why that is the case? The executable space should typically be allocated more than 128MB away from coreclr.dll on win-arm64, if I am reading this code correctly.

Member Author (@EgorBo, Jan 17, 2022):

@jkotas oops, nvm, I forgot that on M1 I used DOTNET_ReadyToRun=0. I've just tried it with R2R=0 on win-arm64 and got exactly the same list of methods requesting jump stubs as in #62302 (comment)

+#endif

// We will choose the preferred code region based on the address of coreclr.dll. The JIT helpers
// in coreclr.dll are the most heavily called functions.
...
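The ASLR-collision concern raised above is about where, within the reachable window, the preferred region is placed. A simplified sketch of that placement idea (hypothetical names and signature; the real logic in executableallocator.cpp is more involved and also considers addresses below the image):

```cpp
#include <cstddef>

// Hypothetical sketch (not the runtime's actual code): choose a preferred
// executable region that stays within branch reach of the coreclr image but
// sits at the far end of that reachable window. Leaving the space right next
// to coreclr free makes collisions with images that ASLR loads near coreclr
// later less likely.
static void ChoosePreferredRange(size_t coreclrBase, size_t reach,
                                 size_t regionSize, size_t* start, size_t* end)
{
    // Farthest address still reachable by a relative branch from the image base.
    size_t top = coreclrBase + reach;
    *end = top;
    *start = top - regionSize;
}
```

This mirrors the comment quoted in the review: "We try to occupy the space as far as possible to minimize collisions with other ASLR assigned addresses."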