Tweak parallelism and the instantiation benchmark #3775
Merged
Currently the "sequential" and "parallel" benchmarks report somewhat
different timings. For sequential it's time-to-instantiate, but for
parallel it's time-to-instantiate-10k instances. The parallelism in the
parallel benchmark can also theoretically be affected by rayon's
work-stealing. For example, if rayon doesn't actually do any work
stealing at all then this ends up being a sequential test again.
Otherwise, though, it's possible for some threads to finish much earlier,
as rayon isn't guaranteed to keep threads busy.
This commit applies a few updates to the benchmark:
First, an `InstancePre<T>` is now used instead of a `Linker<T>` to
front-load type-checking and avoid repeating it on each instantiation
(this is generally the fastest path to instantiate right now).
Next, the instantiation benchmark is changed to perform one
instantiation per iteration, so that it measures per-instance
instantiation time and compares better with the sequential numbers.
Finally, rayon is removed in favor of manually creating background
threads that do work in an infinite loop until we tell them to stop.
These background threads are guaranteed to be working for the entire
time the benchmark is executing, and should theoretically exhibit the
situation where N units of work are all happening at once.
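The background-thread approach described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the name `measure_under_load` and its parameters are hypothetical, and the `work` and `measured` closures stand in for whatever instantiation call the real benchmark makes.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: times `measured` once on the calling thread while
/// `background_threads` other threads call `work` in a loop. Returns the
/// elapsed time of `measured` and the total number of background iterations.
pub fn measure_under_load<W, M>(
    background_threads: usize,
    work: W,
    measured: M,
) -> (Duration, u64)
where
    W: Fn() + Send + Sync + 'static,
    M: FnOnce(),
{
    let stop = Arc::new(AtomicBool::new(false));
    let started = Arc::new(AtomicU64::new(0));
    let iterations = Arc::new(AtomicU64::new(0));
    let work = Arc::new(work);

    let handles: Vec<_> = (0..background_threads)
        .map(|_| {
            let stop = Arc::clone(&stop);
            let started = Arc::clone(&started);
            let iterations = Arc::clone(&iterations);
            let work = Arc::clone(&work);
            thread::spawn(move || {
                started.fetch_add(1, Ordering::SeqCst);
                // Do at least one unit of work, then keep going until
                // the main thread asks us to stop.
                loop {
                    (*work)();
                    iterations.fetch_add(1, Ordering::SeqCst);
                    if stop.load(Ordering::SeqCst) {
                        break;
                    }
                }
            })
        })
        .collect();

    // Wait until every background thread is running before timing starts.
    while started.load(Ordering::SeqCst) < background_threads as u64 {
        thread::yield_now();
    }

    let start = Instant::now();
    measured();
    let elapsed = start.elapsed();

    // Signal the workers to stop and wait for them to exit.
    stop.store(true, Ordering::SeqCst);
    for handle in handles {
        handle.join().expect("background thread panicked");
    }

    (elapsed, iterations.load(Ordering::SeqCst))
}
```

Unlike a work-stealing pool, every spawned thread here keeps looping until the stop flag is set, so the measured call always runs alongside the requested number of busy threads (assuming the machine has enough cores to schedule them).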
I also applied some minor updates here such as having the parallel
instantiation defined conditionally for multiple modules as well as
upping the limits of the pooling allocator to handle a large module
(rustpython.wasm) that I threw at it.