HW intrinsic Avx2/Fma accelerated NBody benchmark based on C++ g++ #3 implementation #20184

4creators · 2018-09-28T18:07:45Z

Related issue #16854

Implementation of the NBody algorithm with some of the data layout modifications
introduced in C++ g++ #3. The benchmark is a hand-tuned procedural
implementation of the AoS (array of structures) form of the algorithm. Because
the data set is small (only 5 objects) and the required data layout changes
during the calculation, an SoA (structure of arrays) implementation may provide
only limited benefit, at most 15-18%.
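
To make the AoS/SoA distinction above concrete, here is a minimal C++ sketch of the two layouts for a 5-body system, plus the conversions between them. It is not taken from the PR: all type and function names (`BodyAoS`, `SystemSoA`, `to_soa`, `body_at`) are hypothetical, and the field set (position, velocity, mass) is assumed from the usual n-body benchmark state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBodies = 5;  // the benchmark simulates 5 objects

// AoS: one struct per body, all components of a body are adjacent in memory.
struct BodyAoS {
    double x, y, z;
    double vx, vy, vz;
    double mass;
};
using SystemAoS = std::array<BodyAoS, kBodies>;

// SoA: one array per component, the same component of all bodies is adjacent.
struct SystemSoA {
    std::array<double, kBodies> x, y, z;
    std::array<double, kBodies> vx, vy, vz;
    std::array<double, kBodies> mass;
};

// Scatter an AoS system into SoA form (hypothetical helper).
SystemSoA to_soa(const SystemAoS& aos) {
    SystemSoA soa{};
    for (std::size_t i = 0; i < kBodies; ++i) {
        soa.x[i] = aos[i].x;
        soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z;
        soa.vx[i] = aos[i].vx;
        soa.vy[i] = aos[i].vy;
        soa.vz[i] = aos[i].vz;
        soa.mass[i] = aos[i].mass;
    }
    return soa;
}

// Gather a single body back out of the SoA form (hypothetical helper).
BodyAoS body_at(const SystemSoA& soa, std::size_t i) {
    assert(i < kBodies);
    return BodyAoS{soa.x[i], soa.y[i], soa.z[i],
                   soa.vx[i], soa.vy[i], soa.vz[i], soa.mass[i]};
}
```

With only 5 bodies, the cost of shuffling data between the two forms is not negligible compared to the arithmetic itself, which is one plausible reading of why the SoA benefit is expected to be limited here.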

On the Haswell architecture the Avx2/Fma vectorized benchmark is almost 2x faster
than the partially Sse2/Sse vectorized C++ #3 benchmark. The speedup should be
significantly higher on any architecture with more than 16 ymm registers,
since some register spills currently hurt performance.
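
For readers unfamiliar with the Fma part: a fused multiply-add computes `a * b + c` in one operation with a single rounding. The sketch below shows that pattern in scalar form using the standard `std::fma`, applied to a position update of the kind an n-body integrator performs (`position += dt * velocity`). It is only an illustration under that assumption, not the PR's code; the names `Vec3` and `advance_position` are hypothetical, and the actual benchmark uses AVX2/FMA hardware intrinsics operating on several doubles per ymm register rather than a scalar loop.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec3 = std::array<double, 3>;  // hypothetical helper type

// Advance a position by dt * velocity, one fused multiply-add per component.
void advance_position(Vec3& position, const Vec3& velocity, double dt) {
    assert(std::isfinite(dt));
    for (std::size_t k = 0; k < position.size(); ++k) {
        // std::fma(a, b, c) evaluates a * b + c with a single rounding step.
        position[k] = std::fma(dt, velocity[k], position[k]);
    }
}
```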

fiigii · 2018-09-28T18:24:38Z

Thanks for the work. Do you have detailed perf data (like VTune)?

tannergooding · 2018-09-28T19:33:48Z

We already have a C# implementation of the n-body algorithm here: https://github.com/dotnet/coreclr/blob/master/tests/src/JIT/Performance/CodeQuality/BenchmarksGame/n-body/n-body-3.cs

Ideally, you would first submit an updated version to the benchmark games site and then we could pull it back into the repo from there.

4creators · 2018-09-28T20:05:26Z

> We already have a C# implementation of the n-body algorithm

I know this implementation and have compared results. However, I need to move it outside of the coreclr benchmark harness to make more reliable comparisons.

> Do you have detailed perf data (like VTune)?

Yes, I tuned the implementation with the help of VTune. I will post detailed info soon.

> Ideally, you would first submit an updated version to the benchmark games site and then we could pull it back into the repo from there.

That would be impossible for this implementation because it is based on Avx2/Fma, and it seems the Benchmarks Game processor does not support anything higher than Sse41/Sse3.

4creators · 2018-09-28T20:09:04Z

I forgot to add that I will now work on an Sse2/Sse3 implementation for submission to the Benchmarks Game. In that case it may be beneficial to implement the SoA form of the algorithm instead of the AoS form.


4creators · 2018-09-29T15:37:11Z

Performance difference in milliseconds between nbody-3 (the current benchmark) and NBodySimdAvxFma. 11 measurements were taken, each run with 50 000 000 integration steps.

On Windows 10 x64

Program	NBody3	NBodySimd
Avg ms	6 410.60	2 343.05
StdDev	18.46	7.63

On WSL Ubuntu 18.04

Program	NBody3	NBodySimd
Avg ms	6 469.73	2 389.73
StdDev	13.75	3.89

Hardware: i7-4700MQ, COMPlus_TieredCompilation=0, Microsoft.NETCore.App 3.0.0-preview1-26928-03, Windows 10 Pro.
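
The speedup implied by the two tables above can be worked out directly from the averages. The following is a small C++ sketch of that arithmetic; the function and constant names are hypothetical, and the numbers are simply copied from the tables.

```cpp
#include <cassert>

// Speedup of a candidate over a baseline, both given as run times in ms.
double speedup(double baseline_ms, double candidate_ms) {
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Average run times (ms) copied from the tables above.
constexpr double kWindowsNBody3Ms = 6410.60;
constexpr double kWindowsNBodySimdMs = 2343.05;
constexpr double kWslNBody3Ms = 6469.73;
constexpr double kWslNBodySimdMs = 2389.73;

// Speedup of NBodySimd over NBody3 on Windows 10 x64.
double windows_speedup() {
    return speedup(kWindowsNBody3Ms, kWindowsNBodySimdMs);
}

// Speedup of NBodySimd over NBody3 on WSL Ubuntu 18.04.
double wsl_speedup() {
    return speedup(kWslNBody3Ms, kWslNBodySimdMs);
}
```

Both ratios come out at roughly 2.7, i.e. NBodySimd needs a bit more than a third of the time of the current nbody-3 C# benchmark on this machine. Note that this baseline is the existing C# benchmark, not the C++ #3 program referred to in the PR description.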

adamsitnik · 2019-07-31T07:29:53Z

Hi @4creators

Could you please close this PR and add the new benchmark to our dotnet/performance repository? This is the place where we keep all the benchmarks now.

Thanks,
Adam

4creators · 2019-08-01T17:17:30Z

@adamsitnik Closing and will add PR to dotnet/perf repo

4creators force-pushed the NBodySimdAvxFma branch from 0820a10 to 7b6b490 on September 28, 2018 20:25

adamsitnik added the tenet-performance-benchmarks label (Issue from performance benchmark) Jul 31, 2019

4creators closed this Aug 1, 2019
