Skip to content

Conversation

@echesakov
Copy link
Contributor

@echesakov echesakov commented Mar 12, 2020

Implements Store Arm64 hardware intrinsic

Fixes #24771

Also addresses Brian's feedback and partly Bruce's feedback received on #33461

I decided to have a separate PR for @BruceForstall suggestions concerning finding a proper names for emitDispVectorElemList and other functions.

@echesakov echesakov added arch-arm64 area-System.Runtime.Intrinsics area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 12, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@echesakov echesakov changed the title [Arm64] Implement Store Arm64 Hardware Intrinsic [Arm64] Implement Store Hardware Intrinsic Mar 12, 2020
@echesakov echesakov marked this pull request as ready for review March 17, 2020 00:28
@echesakov
Copy link
Contributor Author

…argument base type in hwintrinsic.cpp"

Also update Compiler::getBaseTypeFromArgIfNeeded and annotate Store* methods in hwintrinsiclistxarch.h with HW_Flag_BaseTypeFromSecondArg
@echesakov
Copy link
Contributor Author

Any other feedback here?

@TamarChristinaArm
Copy link
Contributor

Any other feedback here?

So for the single element versions of vst1 and vld1 intrinsics we never generate ST1 and LD1 on little-endian. There we always generate str and ldr since they provide more addressing modes.

e.g.

#include <arm_neon.h>

void foo (int32x4_t v, int32_t *a)
{
  return vst1q_s32 (a, v);
}

will generate a str q0, [x0]

For big-endian we do generate the ST1 and LD1 though. This is not to say that this patch is wrong or needs changing. but just a headsup that you have less flexibility. and the single structure single reg intrinsics are the most common ones.

@echesakov
Copy link
Contributor Author

So for the single element versions of vst1 and vld1 intrinsics we never generate ST1 and LD1 on little-endian. There we always generate str and ldr since they provide more addressing modes.

e.g.

#include <arm_neon.h>

void foo (int32x4_t v, int32_t *a)
{
  return vst1q_s32 (a, v);
}

will generate a str q0, [x0]

For big-endian we do generate the ST1 and LD1 though. This is not to say that this patch is wrong or needs changing. but just a headsup that you have less flexibility. and the single structure single reg intrinsics are the most common ones.

Thank you for your feedback @TamarChristinaArm
Is there any performance differences between ld1/st1 and ldr/str in this case?

I am leaning towards keeping ld1/st1 for Load/Store. We can always switch to ldr/str if/when we decide to utilize other addressing modes.

Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - nice to see so much of it being table-driven. Most of my time reviewing was reading the manual!

@echesakov echesakov merged commit beb6a26 into dotnet:master Mar 19, 2020
@echesakov echesakov deleted the Arm64-Store branch March 19, 2020 01:31
@TamarChristinaArm
Copy link
Contributor

Thank you for your feedback @TamarChristinaArm
Is there any performance differences between ld1/st1 and ldr/str in this case?

I am leaning towards keeping ld1/st1 for Load/Store. We can always switch to ldr/str if/when we decide to utilize other addressing modes.

No there isn't a performance difference in instructions themselves, so that's fine for now.

@echesakov
Copy link
Contributor Author

So for the single element versions of vst1 and vld1 intrinsics we never generate ST1 and LD1 on little-endian. There we always generate str and ldr since they provide more addressing modes.

e.g.

#include <arm_neon.h>

void foo (int32x4_t v, int32_t *a)
{
  return vst1q_s32 (a, v);
}

will generate a str q0, [x0]

For big-endian we do generate the ST1 and LD1 though. This is not to say that this patch is wrong or needs changing. but just a headsup that you have less flexibility. and the single structure single reg intrinsics are the most common ones.

Interestingly, Visual C++ 2019 compiles the following functions

__declspec(noinline)
void foo(int32x4_t v, int32_t* a)
{
	return vst1q_s32(a, v);
}

__declspec(noinline)
void bar(int32x2_t v, int32_t* a)
{
	return vst1_s32(a, v);
}

into

?foo@@YAXT__n128@@PEAH@Z:
  0000000140001158: 4C007800  st1         {v0.4s},[x0]
  000000014000115C: D65F03C0  ret
?bar@@YAXT__n64@@PEAH@Z:
  0000000140001160: 0C007800  st1         {v0.2s},[x0]
  0000000140001164: D65F03C0  ret

@TamarChristinaArm
Copy link
Contributor

Interestingly, Visual C++ 2019 compiles the following functions

hmm yeah I only checked armclang and gcc :)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI area-System.Runtime.Intrinsics new-api-needs-documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API Proposal : Arm64 Load & Store

6 participants