Summary
With #49397 we approved and exposed cross-platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.
This was done by mirroring the surface area exposed by Vector<T>. However, due to their fixed size there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors that are commonly used with hardware intrinsics and for which cross-platform helpers would be beneficial.
The APIs exposed would include the following:
- `ExtractMostSignificantBits`
  - On x86/x64 this would be emitted as `MoveMask` and performs exactly as expected
  - On ARM64, this would be emulated via `and`, element-wise shift-right, 64-bit pairwise add, and extract. The JIT could optionally detect if the input is the result of a `Compare` instruction and elide the shift-right.
  - On WASM, this is called `bitmask` and works identically to `MoveMask`
  - This API and its emulation are used throughout the BCL
- `Load`/`Store`
  - These are the basic load/store operations already in use for x86, x64, and ARM64
- `LoadAligned`/`StoreAligned`
  - These work exactly like the same-named APIs on x86/x64
  - When optimizations are disabled, the alignment is verified
  - When optimizations are enabled, this alignment checking may be skipped due to the load being folded into an instruction on modern hardware
  - This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store aligned instructions exist)
- `LoadAlignedNonTemporal`/`StoreAlignedNonTemporal`
  - These behave like `LoadAligned`/`StoreAligned` but may optionally treat the memory access as non-temporal and avoid polluting the cache
- `LoadUnsafe`/`StoreUnsafe`
  - These are "new APIs"; they cover a "gap" in the API surface that has been encountered and worked around in the BCL and which is semi-regularly requested by the community
  - The overload that just takes a `ref T` behaves exactly like the version that takes a pointer, just without requiring pinning
  - The overload that additionally takes an `nuint index` behaves like `ref Unsafe.Add(ref value, index)` and avoids needing to further bloat IL and hinder readability (see the usage sketch after this list)
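As a rough illustration (not part of the proposal text), here is a hedged sketch of how `LoadUnsafe` and `ExtractMostSignificantBits` might be used together, assuming the signatures proposed below and the comparison helpers mirrored from Vector<T>; the `IndexOfByte` helper name and the fallback handling are hypothetical:

```csharp
// Sketch only: assumes the proposed Vector128.LoadUnsafe/ExtractMostSignificantBits APIs
// and the Vector128.Equals helper mirrored from Vector<T>. IndexOfByte is a hypothetical
// helper used purely for illustration.
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
{
    ref byte start = ref MemoryMarshal.GetReference(span);
    nuint index = 0;

    if (span.Length >= Vector128<byte>.Count)
    {
        Vector128<byte> target = Vector128.Create(value);
        nuint lastBlock = (nuint)(span.Length - Vector128<byte>.Count);

        while (index <= lastBlock)
        {
            // LoadUnsafe(ref T, nuint) behaves like loading from ref Unsafe.Add(ref start, index),
            // without requiring the span's memory to be pinned.
            Vector128<byte> data = Vector128.LoadUnsafe(ref start, index);
            uint mask = Vector128.ExtractMostSignificantBits(Vector128.Equals(data, target));

            if (mask != 0)
            {
                // One bit per matching element; the lowest set bit is the first match.
                return (int)index + BitOperations.TrailingZeroCount(mask);
            }

            index += (nuint)Vector128<byte>.Count;
        }
    }

    // Scalar fallback for the remaining tail (and for spans shorter than one vector).
    for (int i = (int)index; i < span.Length; i++)
    {
        if (span[i] == value)
        {
            return i;
        }
    }

    return -1;
}
```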
API Proposal
namespace System.Runtime.Intrinsics
{
public static partial class Vector64
{
public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);
public static Vector64<T> Load<T>(T* address);
public static Vector64<T> LoadAligned<T>(T* address);
public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector64<T> LoadUnsafe<T>(ref T address);
public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector64<T> source);
public static void StoreAligned<T>(T* address, Vector64<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
}
public static partial class Vector128
{
public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);
public static Vector128<T> Load<T>(T* address);
public static Vector128<T> LoadAligned<T>(T* address);
public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector128<T> LoadUnsafe<T>(ref T address);
public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector128<T> source);
public static void StoreAligned<T>(T* address, Vector128<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
}
public static partial class Vector256
{
public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);
public static Vector256<T> Load<T>(T* address);
public static Vector256<T> LoadAligned<T>(T* address);
public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector256<T> LoadUnsafe<T>(ref T address);
public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector256<T> source);
public static void StoreAligned<T>(T* address, Vector256<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
}
}

Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of a single vector or of multiple vectors to be reordered:
- On x86/x64 these are referred to as `Shuffle` or `Permute` (generally takes two elements and one element, respectively; but that isn't always the case)
- On ARM64, these are referred to as `VectorTableLookup` (only takes two elements)
- On WASM, these are referred to as `Shuffle` (takes two elements) and `Swizzle` (takes one element)
- On LLVM, these are referred to as `VectorShuffle` and only take two elements

Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for Vector128<T> is consistent on all platforms, Vector64<T> is ARM64-specific and Vector256<T> is x86/x64-specific. The former behaves like Vector128<T>, while the latter generally behaves like 2x Vector128<T> (outside a few APIs called Permute#x#). For consistency, the Vector256<T> APIs exposed here would behave identically to Vector128<T> and allow "cross-lane permutation".
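For context (and not part of the proposal itself), a hedged sketch of the kind of platform-specific code a single-vector byte reordering requires today, which the proposed `Shuffle(vector, indices)` overloads would cover with one call; note that out-of-range index behavior differs between the underlying instructions:

```csharp
// Sketch only: shows the per-platform code a byte reordering requires today.
// The proposed Vector128.Shuffle(vector, indices) overload would cover all of these paths.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> ShuffleBytes(Vector128<byte> vector, Vector128<byte> indices)
{
    if (Ssse3.IsSupported)
    {
        // x86/x64: pshufb; an index byte with its high bit set zeroes that element.
        return Ssse3.Shuffle(vector, indices);
    }

    if (AdvSimd.Arm64.IsSupported)
    {
        // ARM64: tbl; an out-of-range index byte zeroes that element.
        return AdvSimd.Arm64.VectorTableLookup(vector, indices);
    }

    // Software fallback: treats out-of-range indices as producing zero (matching tbl;
    // pshufb instead only looks at the high bit and the low 4 bits).
    Vector128<byte> result = default;
    for (int i = 0; i < Vector128<byte>.Count; i++)
    {
        byte index = indices.GetElement(i);
        byte element = (index < Vector128<byte>.Count) ? vector.GetElement(index) : (byte)0;
        result = result.WithElement(i, element);
    }
    return result;
}
```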
For the single-vector reordering, the APIs are "trivial":
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices)

For the two-vector reordering, the APIs are generally the same:
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long> indices)

An upside of these APIs is that for common input scenarios involving constant indices, these can be massively simplified.
A downside of these APIs is that non-constant indices on older hardware, or certain Vector256<T> shuffles involving byte, sbyte, short, or ushort that cross the 128-bit lane boundary, can take a couple of instructions rather than being a single instruction.
This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.
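To illustrate the constant-index case (again, not from the proposal text), a hedged sketch of reversing the elements of a Vector128<int> with the proposed single-vector Shuffle, assuming the usual shuffle semantics where result element i takes input element indices[i]; because the indices are constants, the JIT could in principle recognize the pattern and emit a single instruction such as pshufd on x86/x64:

```csharp
// Sketch only: assumes the proposed Vector128.Shuffle(vector, indices) overloads and the
// conventional semantics result[i] = vector[indices[i]]. With the indices visible as
// constants, a JIT could collapse this to a single shuffle instruction.
using System.Runtime.Intrinsics;

static Vector128<int> ReverseElements(Vector128<int> vector)
{
    // Element 0 of the result takes input element 3, element 1 takes input element 2, etc.
    return Vector128.Shuffle(vector, Vector128.Create(3, 2, 1, 0));
}
```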