Summary
With #49397 we approved and exposed cross-platform APIs on Vector64/128/256 to help developers more easily support multiple platforms.
This was done by mirroring the surface area exposed by Vector<T>. However, due to their fixed size there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors that are commonly used with hardware intrinsics and for which cross-platform helpers would be beneficial.
The APIs exposed would include the following:
- `ExtractMostSignificantBits`
  - On x86/x64 this would be emitted as `MoveMask` and performs exactly as expected
  - On ARM64, this would be emulated via `and`, element-wise shift-right, 64-bit pairwise add, and extract. The JIT could optionally detect if the input is the result of a `Compare` instruction and elide the shift-right.
  - On WASM, this is called `bitmask` and works identically to `MoveMask`
  - This API and its emulation are used throughout the BCL
- `Load`/`Store`
  - These are the basic load/store operations already in use for x86, x64, and ARM64
- `LoadAligned`/`StoreAligned`
  - These work exactly like the same-named APIs on x86/x64
  - When optimizations are disabled, the alignment is verified
  - When optimizations are enabled, this alignment checking may be skipped due to the load being folded into an instruction on modern hardware
  - This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no load/store aligned instructions exist)
- `LoadAlignedNonTemporal`/`StoreAlignedNonTemporal`
  - These behave like `LoadAligned`/`StoreAligned` but may optionally treat the memory access as non-temporal and avoid polluting the cache
- `LoadUnsafe`/`StoreUnsafe`
  - These are "new APIs"; they cover a "gap" in the API surface that has been encountered and worked around in the BCL and which is semi-regularly requested by the community
  - The overload that just takes a `ref T` behaves exactly like the version that takes a pointer, just without requiring pinning
  - The overload that additionally takes an `nuint index` behaves like `ref Unsafe.Add(ref value, index)` and avoids needing to further bloat IL and hinder readability (see the usage sketch after this list)
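As a rough illustration (not part of the proposal text), here is a hedged sketch of how `LoadUnsafe` and `ExtractMostSignificantBits` might be used together, assuming the signatures proposed below and the comparison helpers mirrored from Vector<T>; the `IndexOfByte` helper name and the fallback handling are hypothetical:

```csharp
// Sketch only: assumes the proposed Vector128.LoadUnsafe/ExtractMostSignificantBits APIs
// and the Vector128.Equals helper mirrored from Vector<T>. IndexOfByte is a hypothetical
// helper used purely for illustration.
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static int IndexOfByte(ReadOnlySpan<byte> span, byte value)
{
    ref byte start = ref MemoryMarshal.GetReference(span);
    nuint index = 0;

    if (span.Length >= Vector128<byte>.Count)
    {
        Vector128<byte> target = Vector128.Create(value);
        nuint lastBlock = (nuint)(span.Length - Vector128<byte>.Count);

        while (index <= lastBlock)
        {
            // LoadUnsafe(ref T, nuint) behaves like loading from ref Unsafe.Add(ref start, index),
            // without requiring the span's memory to be pinned.
            Vector128<byte> data = Vector128.LoadUnsafe(ref start, index);
            uint mask = Vector128.ExtractMostSignificantBits(Vector128.Equals(data, target));

            if (mask != 0)
            {
                // One bit per matching element; the lowest set bit is the first match.
                return (int)index + BitOperations.TrailingZeroCount(mask);
            }

            index += (nuint)Vector128<byte>.Count;
        }
    }

    // Scalar fallback for the remaining tail (and for spans shorter than one vector).
    for (int i = (int)index; i < span.Length; i++)
    {
        if (span[i] == value)
        {
            return i;
        }
    }

    return -1;
}
```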
API Proposal
namespace System.Runtime.Intrinsics
{
public static partial class Vector64
{
public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);
public static Vector64<T> Load<T>(T* address);
public static Vector64<T> LoadAligned<T>(T* address);
public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector64<T> LoadUnsafe<T>(ref T address);
public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector64<T> source);
public static void StoreAligned<T>(T* address, Vector64<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
}
public static partial class Vector128
{
public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);
public static Vector128<T> Load<T>(T* address);
public static Vector128<T> LoadAligned<T>(T* address);
public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector128<T> LoadUnsafe<T>(ref T address);
public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector128<T> source);
public static void StoreAligned<T>(T* address, Vector128<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
}
public static partial class Vector256
{
public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);
public static Vector256<T> Load<T>(T* address);
public static Vector256<T> LoadAligned<T>(T* address);
public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
public static Vector256<T> LoadUnsafe<T>(ref T address);
public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);
public static void Store<T>(T* address, Vector256<T> source);
public static void StoreAligned<T>(T* address, Vector256<T> source);
public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
}
}

Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of a single vector or of multiple vectors to be reordered:
- On x86/x64 these are referred to as `Shuffle` or `Permute` (generally takes two elements and one element, respectively; but that isn't always the case)
- On ARM64, these are referred to as `VectorTableLookup` (only takes two elements)
- On WASM, these are referred to as `Shuffle` (takes two elements) and `Swizzle` (takes one element)
- On LLVM, these are referred to as `VectorShuffle` and only take two elements

Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for Vector128<T> is consistent on all platforms, Vector64<T> is ARM64-specific and Vector256<T> is x86/x64-specific. The former behaves like Vector128<T>, while the latter generally behaves like 2x Vector128<T> (outside a few APIs called Permute#x#). For consistency, the Vector256<T> APIs exposed here would behave identically to Vector128<T> and allow "cross-lane permutation".
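For context (and not part of the proposal itself), a hedged sketch of the kind of platform-specific code a single-vector byte reordering requires today, which the proposed `Shuffle(vector, indices)` overloads would cover with one call; note that out-of-range index behavior differs between the underlying instructions:

```csharp
// Sketch only: shows the per-platform code a byte reordering requires today.
// The proposed Vector128.Shuffle(vector, indices) overload would cover all of these paths.
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

static Vector128<byte> ShuffleBytes(Vector128<byte> vector, Vector128<byte> indices)
{
    if (Ssse3.IsSupported)
    {
        // x86/x64: pshufb; an index byte with its high bit set zeroes that element.
        return Ssse3.Shuffle(vector, indices);
    }

    if (AdvSimd.Arm64.IsSupported)
    {
        // ARM64: tbl; an out-of-range index byte zeroes that element.
        return AdvSimd.Arm64.VectorTableLookup(vector, indices);
    }

    // Software fallback: treats out-of-range indices as producing zero (matching tbl;
    // pshufb instead only looks at the high bit and the low 4 bits).
    Vector128<byte> result = default;
    for (int i = 0; i < Vector128<byte>.Count; i++)
    {
        byte index = indices.GetElement(i);
        byte element = (index < Vector128<byte>.Count) ? vector.GetElement(index) : (byte)0;
        result = result.WithElement(i, element);
    }
    return result;
}
```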
For the single-vector reordering, the APIs are "trivial":
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices)

For the two-vector reordering, the APIs are generally the same:
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long> indices)

An upside of these APIs is that for common input scenarios involving constant indices, these can be massively simplified.
A downside of these APIs is that non-constant indices on older hardware, or certain Vector256<T> shuffles involving byte, sbyte, short, or ushort that cross the 128-bit lane boundary, can take a couple of instructions rather than being a single instruction.
This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.
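To illustrate the constant-index case (again, not from the proposal text), a hedged sketch of reversing the elements of a Vector128<int> with the proposed single-vector Shuffle, assuming the usual shuffle semantics where result element i takes input element indices[i]; because the indices are constants, the JIT could in principle recognize the pattern and emit a single instruction such as pshufd on x86/x64:

```csharp
// Sketch only: assumes the proposed Vector128.Shuffle(vector, indices) overloads and the
// conventional semantics result[i] = vector[indices[i]]. With the indices visible as
// constants, a JIT could collapse this to a single shuffle instruction.
using System.Runtime.Intrinsics;

static Vector128<int> ReverseElements(Vector128<int> vector)
{
    // Element 0 of the result takes input element 3, element 1 takes input element 2, etc.
    return Vector128.Shuffle(vector, Vector128.Create(3, 2, 1, 0));
}
```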