Performance: Refactors query prefetch mechanism #4361
Merged: microsoft-github-policy-service merged 30 commits into Azure:master from kevin-montrose:parallelPrefetchRework on Apr 1, 2024
Conversation
Contributor (Author) commented: @microsoft-github-policy-service agree company="Microsoft"
neildsh reviewed Mar 28, 2024
neildsh reviewed Mar 28, 2024
neildsh previously approved these changes Mar 29, 2024
Member commented: /azp run
Azure Pipelines successfully started running 1 pipeline(s).
ealsur previously approved these changes Mar 29, 2024
Contributor commented: /azp run
Azure Pipelines successfully started running 1 pipeline(s).
neildsh approved these changes Apr 1, 2024
sboshra approved these changes Apr 1, 2024
sboshra (Contributor) left a comment.
Contributor commented: /azp run
Azure Pipelines successfully started running 1 pipeline(s).
Description
Reworks `ParallelPrefetch.PrefetchInParallelAsync` to reduce allocations.

This came out of profiling an application, and discovering that this method is allocating approximately as many bytes' worth of `Task[]` as the whole application is creating in `byte[]` for IO. This is because `Task.WhenAny(...)` is (a) used in a loop and (b) makes a defensive copy of the passed `Task`s.

This version is substantially more complicated, and accordingly there are a lot of tests in this PR (code coverage is 100% of lines and blocks). Special attention was paid to exception and cancellation cases.
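To make the allocation problem concrete, here is a minimal sketch of the `Task.WhenAny`-in-a-loop pattern being described. This is a hypothetical reconstruction, not the SDK's actual code; `IPrefetcher` is simplified here to a one-method, `Task`-returning interface (the real SDK interface also involves an `ITrace` parameter):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the SDK's IPrefetcher (assumption: real interface also takes an ITrace).
public interface IPrefetcher
{
    Task PrefetchAsync(CancellationToken cancellationToken);
}

public static class OldStylePrefetchSketch
{
    // Hypothetical reconstruction of the pattern described above, NOT the SDK's actual code.
    public static async Task PrefetchInParallelAsync(
        IEnumerable<IPrefetcher> prefetchers,
        int maxConcurrency,
        CancellationToken cancellationToken)
    {
        List<Task> running = new List<Task>();

        // Note: the enumerator is never disposed here, mirroring the nit called out below.
        IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        while (enumerator.MoveNext())
        {
            running.Add(enumerator.Current.PrefetchAsync(cancellationToken));
            if (running.Count < maxConcurrency)
            {
                continue;
            }

            // Every pass allocates: Task.WhenAny defensively copies `running` into a fresh Task[].
            Task completed = await Task.WhenAny(running);
            await completed; // propagate any exception
            running.Remove(completed);
        }

        await Task.WhenAll(running);
    }
}
```

Each trip through the `while` loop pays for a fresh `Task[]` copy inside `Task.WhenAny`, which is where the surprising allocation volume comes from.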
Improvements
Greatly Reduced Allocations

In my benchmarking, anywhere from 30% to 99% depending on the total number of `IPrefetcher`s used. More benchmarking discussion is at the bottom of this PR.
Special Casing For `maxConcurrency`

When `== 0` we do no work, more efficiently than current code. When `== 1` we devolve to a `foreach`, which is just about ideal.

Special Casing When Only 1 `IPrefetcher`

We accept an `IEnumerable<IPrefetcher>`, but when that is only going to yield one `IPrefetcher` a lot of work (even with the old code) is pointless. New code detects this case (generically, it doesn't look for specific types) and devolves into a single `await`. Both fast paths are sketched below.
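Reusing the simplified `IPrefetcher` from the earlier sketch, the fast paths might look roughly like this (a hypothetical shape, not the PR's actual code):

```csharp
public static class FastPathSketch
{
    // Hypothetical shape of the special cases, not the PR's actual code.
    public static async Task PrefetchInParallelAsync(
        IEnumerable<IPrefetcher> prefetchers,
        int maxConcurrency,
        CancellationToken cancellationToken)
    {
        if (maxConcurrency == 0)
        {
            return; // no work at all; nothing is even enumerated
        }

        if (maxConcurrency == 1)
        {
            // strictly sequential: a plain foreach is just about ideal
            foreach (IPrefetcher prefetcher in prefetchers)
            {
                await prefetcher.PrefetchAsync(cancellationToken);
            }
            return;
        }

        using IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        if (!enumerator.MoveNext())
        {
            return; // empty: nothing to prefetch
        }

        IPrefetcher first = enumerator.Current;
        if (!enumerator.MoveNext())
        {
            // exactly one IPrefetcher: all the batching machinery is
            // pointless, so devolve into a single await
            await first.PrefetchAsync(cancellationToken);
            return;
        }

        // two or more prefetchers: fall through to the general batched
        // path (sketched after the Outline section below)
    }
}
```

Note the `using` declaration on the enumerator, which also addresses the disposal nit described below.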
Prompter Starting Of Next Task

Old code starts at most one task per pass through the while loop, so if multiple `Task`s are sitting there completed, there's a fair amount of work done before they are all replaced with active `Task`s. New code has the completed `Task` start its replacement, which should keep us closer to `maxConcurrency` active `Task`s (see the worker loop in the sketch after the Outline).
`IEnumerator<IPrefetcher>` Disposed

Small nit, but the old code doesn't dispose the `IEnumerator<IPrefetcher>`. While unlikely, this can put more load on the finalizer thread or potentially leak resources.
Outline

- `maxConcurrency == 0` just returns
- `maxConcurrency == 1` is just a `foreach`
- `maxConcurrency <= BatchSize` is more complicated (a rough sketch follows this list):
  - `BatchSize` `IPrefetcher`s are loaded into a rented array
  - `Task`s are then started for each of those `IPrefetcher`s
  - `Task`s grab and start the next `IPrefetcher` of the `IEnumerator<IPrefetcher>` when they finish with their last one
  - each `Task` is then awaited in order
- `maxConcurrency > BatchSize` reuses a lot of the above case, but is still more complicated:
  - `BatchSize` `IPrefetcher`s are loaded and started as above
  - `Task`s grab and start the next `IPrefetcher` when they finish with one
  - we keep loading `IPrefetcher`s (up to `maxConcurrency`) while there are active `Task`s
  - the started `Task`s are held in an `object[]`, which is awaited in turn once `maxConcurrency` is reached (or the `IEnumerator<T>` finishes)

We distinguish between the two `maxConcurrency > 1` cases to avoid allocating very large arrays, and to make sure we start some prefetches fairly promptly even when `maxConcurrency` is very large. `BatchSize` is, somewhat arbitrarily, `512` - any value `> 1` and `< 8,192` would be valid.
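For concreteness, here is a rough sketch of the `maxConcurrency <= BatchSize` case, continuing the simplified `IPrefetcher` from the earlier sketches (add `using System;` and `using System.Buffers;`). This is hypothetical illustration only; the real implementation is much more careful about exceptions and cancellation:

```csharp
using System;
using System.Buffers;

public static class BatchedSketch
{
    // Somewhat-arbitrary batch size, per the description above.
    private const int BatchSize = 512;

    // Rough sketch of the maxConcurrency <= BatchSize case, NOT the PR's actual code.
    public static async Task PrefetchBatchedAsync(
        IEnumerator<IPrefetcher> enumerator,
        int maxConcurrency,
        CancellationToken cancellationToken)
    {
        object sync = new object();
        int limit = Math.Min(maxConcurrency, BatchSize);
        Task[] tasks = ArrayPool<Task>.Shared.Rent(limit);
        try
        {
            // load up to `limit` IPrefetchers into the rented array and start them
            int started = 0;
            while (started < limit && TryTakeNext(sync, enumerator, out IPrefetcher prefetcher))
            {
                tasks[started++] = RunWorkerAsync(prefetcher, sync, enumerator, cancellationToken);
            }

            // each Task is then awaited in order; workers replace themselves by
            // pulling the next IPrefetcher as soon as they finish their last one
            for (int i = 0; i < started; i++)
            {
                await tasks[i];
            }
        }
        finally
        {
            ArrayPool<Task>.Shared.Return(tasks, clearArray: true);
        }
    }

    private static async Task RunWorkerAsync(
        IPrefetcher current,
        object sync,
        IEnumerator<IPrefetcher> enumerator,
        CancellationToken cancellationToken)
    {
        do
        {
            await current.PrefetchAsync(cancellationToken);
        }
        while (TryTakeNext(sync, enumerator, out current));
    }

    private static bool TryTakeNext(object sync, IEnumerator<IPrefetcher> enumerator, out IPrefetcher prefetcher)
    {
        // IEnumerator<T> isn't thread-safe, so MoveNext/Current must be serialized
        lock (sync)
        {
            if (enumerator.MoveNext())
            {
                prefetcher = enumerator.Current;
                return true;
            }

            prefetcher = null;
            return false;
        }
    }
}
```

The key difference from the old pattern: there is no `Task.WhenAny` loop at all. The rented `Task[]` is the only per-call array, and a completed worker immediately starts its own replacement rather than waiting for a central loop to notice.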
Type of change

Sort of a bug, I guess? Current code allocates a lot more than you'd expect.
Benchmarking
Standard caveats about micro-benchmarking apply, but I did some benchmarking to see how this stacks up versus the old code.
TL;DR - across-the-board improvements in allocations, with no wall-clock regressions in what I believe is the common case. There are some narrow, less common cases where small wall-clock regressions are observed.
I consider the primary case here to be when the `IPrefetcher` actually goes async and takes some non-trivial time to do its work. My expectation is that the two versions of the code should have about the same wall-clock time when `# tasks > maxConcurrency`, with the new code edging out the old as `# tasks` increases.

That said, I did also test the synchronous completion case, and the "goes async, but then completes immediately" case, to make sure performance wasn't terrible.
In all cases I expect the new code to perform fewer allocations than the old.
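Concretely, the three benchmark bodies described correspond to prefetchers like these (a hypothetical sketch reusing the simplified `IPrefetcher` from the earlier sketches; the actual benchmark code lives in the PR):

```csharp
// Hypothetical benchmark prefetchers for the three cases described above.
sealed class DelayPrefetcher : IPrefetcher
{
    // primary case: actually goes async and takes non-trivial time
    public async Task PrefetchAsync(CancellationToken cancellationToken) =>
        await Task.Delay(1);
}

sealed class SynchronousPrefetcher : IPrefetcher
{
    // the "return default;" analogue under this Task-returning interface:
    // completes synchronously, so no async machinery runs
    public Task PrefetchAsync(CancellationToken cancellationToken) =>
        Task.CompletedTask;
}

sealed class YieldPrefetcher : IPrefetcher
{
    // goes async but completes almost immediately, forcing all the
    // async completion machinery to run
    public async Task PrefetchAsync(CancellationToken cancellationToken) =>
        await Task.Yield();
}
```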
Summarizing some results (the full data is in the previous link)...
Here's old vs. new on .NET 6 where the `IPrefetcher` is just an `await Task.Delay(1)` (`< 1` is an improvement):

[benchmark results table: old vs. new, await Task.Delay(1) prefetcher]

As expected, wall-clock time is basically unaffected (the delay dominates) but allocations are improved across the board. The benefits of the improved replacement-`Task` starting logic are visible at the very extreme ends of max concurrency and prefetcher counts.
Taskstarting logic are visible at the very extreme ends of max concurrency and prefetcher counts.Again, but this time

IPrefetcherjustreturn default;s so everything completes synchronously:We see here that between 2 and 8 tasks there are configurations with wall clock regressions. I could try and improve that, but I believe "all synchronous completions" is fantastically rare, so it's not worth the extra code complications.
And finally, the `IPrefetcher` is just `await Task.Yield();`, so everything completes almost immediately but forces all the async completion machinery to run:

[benchmark results table: old vs. new, Task.Yield prefetcher]

Similarly, between 4 and 8 tasks there are some wall-clock regressions. While more realistic than the "all synchronous" case, I think this would still be pretty rare - most `IPrefetcher`s should be doing real work after some asynchronous operation.

Since we target netstandard, I also benchmarked under a Framework version (4.6.2) and the results are basically the same:
