This repository was archived by the owner on Jan 23, 2023. It is now read-only.

@4creators

Related issue #16854

Implementation of the NBody algorithm with some of the data layout modifications
introduced in C++ g++ #3. The benchmark is based on a hand-tuned procedural
implementation of the AoS form of the algorithm. Because the data set is small
(only 5 objects) and the required data layout changes during the
calculations, an SoA implementation may provide only limited benefits, 15-18 % at most.

On the Haswell architecture the Avx2/Fma vectorized benchmark is almost 2x faster
than the partially Sse2/Sse vectorized C++ #3 benchmark. The speedup should be
significantly higher on any architecture with more than 16 ymm registers,
as some register spills currently impact performance.
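
For readers less familiar with the AoS/SoA terminology used above, here is a minimal scalar sketch of the two layouts. It is written in C++ for illustration only and is not the code from this PR; the names `Body`, `BodiesAoS`, `BodiesSoA`, `to_soa`, `dist2_aos` and `dist2_soa` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBodies = 5;  // the benchmark simulates 5 bodies

// AoS (array of structures): one record per body, so the x, y, z of a
// single body are adjacent in memory.
struct Body {
    double x, y, z;     // position
    double vx, vy, vz;  // velocity
    double mass;
};
using BodiesAoS = std::array<Body, kBodies>;

// SoA (structure of arrays): one array per field, so the same coordinate
// of all bodies is adjacent in memory.
struct BodiesSoA {
    std::array<double, kBodies> x, y, z;
    std::array<double, kBodies> vx, vy, vz;
    std::array<double, kBodies> mass;
};

// Convert AoS to SoA. A kernel that needs both layouts at different
// stages has to pay for this kind of reshuffling.
BodiesSoA to_soa(const BodiesAoS& aos) {
    BodiesSoA soa{};
    for (std::size_t i = 0; i < kBodies; ++i) {
        soa.x[i] = aos[i].x;   soa.y[i] = aos[i].y;   soa.z[i] = aos[i].z;
        soa.vx[i] = aos[i].vx; soa.vy[i] = aos[i].vy; soa.vz[i] = aos[i].vz;
        soa.mass[i] = aos[i].mass;
    }
    return soa;
}

// Squared distance between bodies i and j, reading the AoS layout.
double dist2_aos(const BodiesAoS& b, std::size_t i, std::size_t j) {
    assert(i < kBodies && j < kBodies);
    const double dx = b[i].x - b[j].x;
    const double dy = b[i].y - b[j].y;
    const double dz = b[i].z - b[j].z;
    return dx * dx + dy * dy + dz * dz;
}

// The same computation, reading the SoA layout.
double dist2_soa(const BodiesSoA& b, std::size_t i, std::size_t j) {
    assert(i < kBodies && j < kBodies);
    const double dx = b.x[i] - b.x[j];
    const double dy = b.y[i] - b.y[j];
    const double dz = b.z[i] - b.z[j];
    return dx * dx + dy * dy + dz * dz;
}
```

Both functions return the same value. Only the memory access pattern differs, and that pattern is what a vectorized kernel is sensitive to.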

@fiigii

fiigii commented Sep 28, 2018

Thanks for the work. Do you have detailed perf data (like VTune)?

@tannergooding
Member

We already have a C# implementation of the n-body algorithm here: https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/BenchmarksGame/n-body/n-body-3.cs

Ideally, you would first submit an updated version to the benchmark games site and then we could pull it back into the repo from there.

@4creators
Author

> We already have a C# implementation of the n-body algorithm

I know this implementation and have compared the results. However, I need to move it outside of the coreclr benchmark harness to make more reliable comparisons.

> Do you have detailed perf data (like VTune)?

Yes, I tuned the implementation with the help of VTune. I will post detailed info soon.

> Ideally, you would first submit an updated version to the benchmark games site and then we could pull it back into the repo from there.

That is not possible for this implementation because it is based on Avx2/Fma, and it seems the Benchmarks Game processor does not support anything higher than Sse41/Sse3.

@4creators
Author

I forgot to add that I will now work on an Sse2/Sse3 implementation for submission to the Benchmarks Game. In that case it may be beneficial to implement the SoA form of the algorithm instead of AoS.

@4creators
Author

4creators commented Sep 29, 2018

Performance difference in milliseconds between nbody-3 (the current benchmark) and NBodySimdAvxFma. 11 measurements were taken, with 50 000 000 integration steps per run.

On Windows 10 x64

| Program | NBody3 | NBodySimd |
| --- | --- | --- |
| Avg ms | 6 410.60 | 2 343.05 |
| StdDev | 18.46 | 7.63 |

On WSL Ubuntu 18.04

| Program | NBody3 | NBodySimd |
| --- | --- | --- |
| Avg ms | 6 469.73 | 2 389.73 |
| StdDev | 13.75 | 3.89 |

Hardware: i7-4700MQ. Configuration: COMPlus_TieredCompilation=0, Microsoft.NETCore.App 3.0.0-preview1-26928-03, Windows 10 Pro.
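
To turn the averages above into a speedup figure, the arithmetic can be sketched as follows. This is an illustrative C++ snippet, not part of the benchmark; `speedup` and `rel_stddev_percent` are hypothetical helper names.

```cpp
#include <cassert>

// Speedup of the candidate over the baseline: ratio of average run times.
double speedup(double baseline_ms, double candidate_ms) {
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Standard deviation expressed as a percentage of the mean.
double rel_stddev_percent(double stddev_ms, double mean_ms) {
    assert(mean_ms > 0.0);
    return 100.0 * stddev_ms / mean_ms;
}
```

With the reported numbers, `speedup(6410.60, 2343.05)` is roughly 2.74 on Windows 10 and `speedup(6469.73, 2389.73)` is roughly 2.71 on WSL. In all four measurements the standard deviation is well under 1 % of the mean.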

@adamsitnik
Member

Hi @4creators

Could you please close this PR and add the new benchmark to our dotnet/performance repository? This is the place where we keep all the benchmarks now.

Thanks,
Adam

@adamsitnik added the tenet-performance-benchmarks (Issue from performance benchmark) label on Jul 31, 2019
@4creators
Author

@adamsitnik Closing; I will add a PR to the dotnet/performance repo.

@4creators closed this on Aug 1, 2019