Merged
1 change: 1 addition & 0 deletions src/libraries/System.Linq/src/System.Linq.csproj
@@ -98,6 +98,7 @@
<ItemGroup>
<Reference Include="System.Collections" />
<Reference Include="System.Memory" />
<Reference Include="System.Numerics.Vectors" />
<Reference Include="System.Runtime" />
<Reference Include="System.Runtime.Extensions" />
</ItemGroup>
115 changes: 114 additions & 1 deletion src/libraries/System.Linq/src/System/Linq/Max.cs
@@ -2,7 +2,7 @@
// The .NET Foundation licenses this file to you under the MIT license.

using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Numerics;

namespace System.Linq
{
@@ -15,6 +15,11 @@ public static int Max(this IEnumerable<int> source)
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
}

if (source.GetType() == typeof(int[]))
{
return Max((int[])source);
}

int value;
using (IEnumerator<int> e = source.GetEnumerator())
{
@@ -37,6 +42,57 @@ public static int Max(this IEnumerable<int> source)
return value;
}

private static int Max(int[] array)
Member

So what I'd really like to see here is for many of these methods to be available on Span<T> and for us to forward this to the Span implementation.

Exposing new public APIs is of course something that needs to go through API proposal, but I think it would be better long term, would also allow more logic sharing for various helper APIs, and would centralize most of the "array-like SIMD algorithms" in the same place in the BCL.
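A sketch of the forwarding being suggested here. Note the assumption: no span-based `MemoryExtensions.Max` overload exists at the time of this discussion, so the call below is purely hypothetical and only illustrates the shape such a delegation could take.

```csharp
// Hypothetical: assumes a future span-based MemoryExtensions.Max overload.
// No such API exists at the time of this review; this only sketches how the
// LINQ array fast path could delegate to a centralized Span<T> kernel.
private static int Max(int[] array)
{
    if (array.Length == 0)
    {
        ThrowHelper.ThrowNoElementsException();
    }

    // The "array-like SIMD algorithms" would then live in one place in the
    // BCL, and LINQ would just forward to them.
    return MemoryExtensions.Max((ReadOnlySpan<int>)array);
}
```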

Member Author

I'd be fine subsequently seeing such methods on MemoryExtensions and deleting the code from here to delegate there instead. Someone would need to do the work to figure out what that set of methods should be. I also don't think it precludes adding this now and then delegating later if/when the methods exist.

Member

Right, I just wanted to call out what I'd like to see here, and I think just about any method we think is worth special-casing for arrays in LINQ is applicable.

Sum/Min/Max are the "most obvious ones"

{
if (array.Length == 0)
{
ThrowHelper.ThrowNoElementsException();
}

// Vectorize the search if possible.
int index, value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
Member

Do we know how big the average input here is? This is going to give us different perf characteristics between Arm64 and x64, because the former will trigger for 8+ element arrays (32 bytes) while the latter will only trigger for 16+ element arrays (64 bytes).

I'd expect we want similar handling on both, and while manual unrolling may pipeline well on x64, that may not be the case on Arm64 or on all x64-based chipsets.

Generally, using Vector256 is beneficial for very large inputs where you can saturate the decoder and memory pipeline, but it is less beneficial for "small" inputs (particularly on older hardware, where cache-line splits and throughput may be more limited).
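For concreteness, the 8-vs-16 element thresholds mentioned above fall directly out of the PR's guard `array.Length >= Vector<int>.Count * 2`, since `Vector<int>.Count` tracks the widest supported SIMD register:

```csharp
// Vector<int>.Count depends on the widest supported SIMD width:
// 4 ints per 128-bit vector (Arm64 AdvSimd), 8 per 256-bit vector (x64 AVX2).
const int BitsPerByte = 8;
int count128 = 128 / (sizeof(int) * BitsPerByte); // 4
int count256 = 256 / (sizeof(int) * BitsPerByte); // 8

Console.WriteLine(count128 * 2); // vectorization threshold on Arm64: 8 elements
Console.WriteLine(count256 * 2); // vectorization threshold on AVX2 x64: 16 elements
```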

Member Author

> Do we know how big the average input here is?

I'm not sure.

Are you suggesting a default stance should be to use Vector128<T> instead of Vector<T> unless you know big inputs are common? The tradeoff you're describing is between potentially 2x throughput on larger sizes (if you use Vector<T> and end up employing 256-bit operations instead of 128-bit operations) vs vectorizing smaller sizes (if you use Vector<T> and can't vectorize the processing of 200 bits but could have a little with Vector128<T>)?

Member Author

(This is why I opened #64409. We need really clear guidance on when to use which.)

Member

> Are you suggesting a default stance should be to use Vector128 instead of Vector unless you know big inputs are common? The tradeoff you're describing is between potentially 2x throughput on larger sizes (if you use Vector and end up employing 256-bit operations instead of 128-bit operations) vs vectorizing smaller sizes (if you use Vector and can't vectorize the processing of 200 bits but could have a little with Vector128)?

Yes. There is a lot of nuance here, such as Arm64 not having Vector256<T> and often having different pipelining characteristics. Even considering just x64, there are various things to weigh, such as the expected target hardware: modern chips (Skylake and Zen2 or later) tend to perform really well with V256, whereas older ones (Haswell/Broadwell or Zen1) tend to behave worse and may incur downclocking, may not support dual dispatch, and may even emulate 256-bit support as 2x128-bit ops.

I'd generally expect that we do something like:

  • Vector128<T> is the default choice and is used for practically any input that is at least Count long
    • This is the path that will light up basically "everywhere" (Arm64, x64, x86; eventually WASM, etc)
    • It may or may not be worth it to specialize cases where we are doing fewer than 4-7 loop iterations, due to the default branch-prediction behavior
  • Vector256<T> is used for cases with at least 128 bytes. This is also the cutoff many memcpy algorithms use before they start considering non-temporal or other store sizes
    • This path will only light up on x86/x64 and can be impacted by things like alignment, whether it's going to split a cache line every other read/write, the instructions being used, and the underlying data size
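A rough dispatch skeleton for this tiering. This is a sketch under stated assumptions: the `Vector128`/`Vector256.IsHardwareAccelerated` helpers used here only shipped later (in .NET 7), the 128-byte cutoff is the heuristic stated above rather than a measured value, and the kernels are left scalar to keep the sketch short.

```csharp
using System;
using System.Runtime.Intrinsics;

static class MaxTiers
{
    // Dispatch skeleton for the tiering described above. In a real
    // implementation the first two branches would hold vectorized kernels;
    // here they fall through to the scalar loop so the sketch stays short.
    public static int Max(ReadOnlySpan<int> span)
    {
        if (span.IsEmpty)
        {
            throw new InvalidOperationException("Sequence contains no elements");
        }

        if (Vector256.IsHardwareAccelerated && span.Length * sizeof(int) >= 128)
        {
            // x86/x64 only: a 256-bit kernel would go here.
            return ScalarMax(span);
        }

        if (Vector128.IsHardwareAccelerated && span.Length >= Vector128<int>.Count)
        {
            // Arm64, x86/x64 (and eventually WASM): a 128-bit kernel would go here.
            return ScalarMax(span);
        }

        return ScalarMax(span);
    }

    private static int ScalarMax(ReadOnlySpan<int> span)
    {
        int value = span[0];
        for (int i = 1; i < span.Length; i++)
        {
            if (span[i] > value)
            {
                value = span[i];
            }
        }
        return value;
    }
}
```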

There are likely cases where microbenchmarks or modern hardware (roughly Skylake and Zen2 or newer) will show Vector256<T> is better. However, I expect that a good bit of hardware still in the wild (for both servers and consumers) isn't this, and we'll need to take that into consideration. Likewise, the newer Alder Lake CPUs bring in considerations where Vector256<T> support is slower on the energy-efficient cores (vpmaxud is 1 latency and 0.5 throughput for V128 and V256 on the performance cores, but is 1 latency, 0.33 throughput on the efficiency cores for V128 and 2 latency, 0.67 throughput for V256)

Member

Noting that I wouldn't block this PR on that doc being written up. I think this largely matches what we're already doing for other cases.

Just that I believe this is roughly what the writeup will follow based on my own knowledge/experience.

Member Author

> If you are trying to get good performance everywhere and don't need to support .NET Standard, Vector128 is a good option.

We have many vectorized methods now. From a cursory scan, they all either use Vector<T> or they have both 128-bit and 256-bit code paths. Can you please list which of these implementations should ideally be changed to just use Vector128<T>? If that is the recommended default choice, I assume the list will include at least half of our methods. Thanks.

Member (tannergooding), Feb 2, 2022

I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>. There may be some cases of new SIMD code where that is the right initial choice.

There are likely places where Vector256<T> will end up needing additional or modified checks due to the changing hardware ecosystem.

There are likely places where we would see perf benefits by adding a Vector128<T> path on top of Vector<T> to ensure that we get better perf for inputs smaller than 32 bytes (or inputs that are small enough that loop unrolling and other considerations would be better).

The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation for the typical user or individual OSS maintainer looking to accelerate their app.

Member

Put another way, we are the core foundation on which a lot of other code is built and run. We have no way of statically knowing how any given app will use our code, and since we are the baseline, the places where our code is used vary a lot.

This is different from many applications or smaller libraries out there, which can easily get away with simpler code and guidance and still see good benefits. That doesn't mean we shouldn't also make the extended guidance available, nor that other libraries may not have the same or similar considerations.

But there are multiple angles and target audiences here and no one single answer.

Member (tannergooding), Feb 2, 2022

Put yet another way, in an attempt at drawing a parallel: the BCL provides T[], Span<T>, and several other tools for working with memory (MemoryMarshal, Unsafe, etc).

Most apps/libraries only need to use T[] to be successful. They get indirect benefits from the BCL using Span<T>.

Some apps/libraries can then get additional perf benefits by providing additional overloads which use Span<T> allowing for reduced allocations and handling of unmanaged memory without needing to deal with unsafe.

You can then go a step further and "hyper-optimize" by doing things like casting int to uint and then zero-extending to nuint, or by using MemoryMarshal.GetReference to avoid bounds checking, etc. This is what provides the "best performance" and is also the most complicated.

All of these options are valid and we use all of them throughout the BCL depending on what the exact circumstances and needs are.


In the same way, it is fine for most user code to just use Vector128<T>, and it gives them the easiest path and best chance of "succeeding" (for some definition of success) on all platforms. Vector<T> can similarly be used to let them succeed and is almost as good a choice, but there is some complicated nuance around it that makes it more confusing (at least in my own opinion) to recommend as the default. One of these two is the equivalent of T[].

Vector256<T> and providing multiple paths is somewhat like the Span<T> case. This can be beneficial when you know what you're doing and most code can get indirect benefits by the BCL utilizing these techniques.

Using platform-specific intrinsics is a case of "hyper-optimization". You use these when you need the utmost control of the codegen and are trying to eke out every bit of performance, or where the xplat helpers happen to fall short.
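As an illustration of that last tier, a minimal sketch of what a platform-specific variant of this PR's Max kernel could look like. This is not part of the PR: it assumes x64 with SSE4.1, and the discussion above argues you should reach for the cross-platform helpers first.

```csharp
using System;
using System.Linq;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class PlatformSpecificMax
{
    // pmaxsd via Sse41.Max; falls back to LINQ when SSE4.1 is unavailable
    // or the input is too small to vectorize.
    public static unsafe int Max(int[] array)
    {
        if (!Sse41.IsSupported || array.Length < Vector128<int>.Count * 2)
        {
            return Enumerable.Max(array); // throws on empty, like the PR's path
        }

        fixed (int* p = array)
        {
            // Seed with the first vector, then fold in one vector at a time.
            Vector128<int> maxes = Sse2.LoadVector128(p);
            int i = Vector128<int>.Count;
            for (; i + Vector128<int>.Count <= array.Length; i += Vector128<int>.Count)
            {
                maxes = Sse41.Max(maxes, Sse2.LoadVector128(p + i));
            }

            // Reduce the vector of partial maxima to a scalar.
            int value = maxes.GetElement(0);
            for (int j = 1; j < Vector128<int>.Count; j++)
            {
                value = Math.Max(value, maxes.GetElement(j));
            }

            // Handle the tail that didn't fill a full vector.
            for (; i < array.Length; i++)
            {
                value = Math.Max(value, array[i]);
            }

            return value;
        }
    }
}
```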

Member Author (stephentoub), Feb 2, 2022

> I'm not aware of any existing SIMD code in the BCL where we need to or should only support Vector128<T>

Right

> The BCL is special from many angles and the guidance we have for our own usage may be more complex than the default recommendation

I do not believe the core libraries are special in this regard. If someone is reaching for Vector, they're already in the 1% who care about getting really great throughput and have already demonstrated an interest in writing additional code for performance. On top of that, most libraries are in a position where they can't predict an upper bound on input lengths.

If the guidance we have doesn't work for us, it doesn't work. What I outlined in #64470 (comment) is essentially what we do, where the starting point is writing one Vector<T> path and then getting more complicated from there if either the API surface is insufficient or the smaller input sizes are critical... I think that's guidance that will actually be followed, and by default it yields the most maintainable code that's also portable and also good perf as a starting point (yes, I've heard and understood all the nuances). Folks may reasonably start and end with Vector<T>... I do not believe folks will end with Vector128<T>.

{
// The array is at least two vectors long. Create a vector from the first N elements,
// and then repeatedly compare that against the next vector from the array. At the end,
// the resulting vector will contain the maximum values found, and we then need only
// to find the max of those.
var maxes = new Vector<int>(array);
index = Vector<int>.Count;
do
{
maxes = Vector.Max(maxes, new Vector<int>(array, index));
index += Vector<int>.Count;
}
while (index + Vector<int>.Count <= array.Length);

value = maxes[0];
for (int i = 1; i < Vector<int>.Count; i++)
{
if (maxes[i] > value)
{
value = maxes[i];
}
}
}
else
{
value = array[0];
index = 1;
}

// Iterate through the remaining elements, comparing against the max.
for (int i = index; (uint)i < (uint)array.Length; i++)
{
if (array[i] > value)
{
value = array[i];
}
}

return value;
}

public static int? Max(this IEnumerable<int?> source)
{
if (source == null)
@@ -106,6 +162,11 @@ public static long Max(this IEnumerable<long> source)
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
}

if (source.GetType() == typeof(long[]))
{
return Max((long[])source);
}

long value;
using (IEnumerator<long> e = source.GetEnumerator())
{
@@ -128,6 +189,58 @@ public static long Max(this IEnumerable<long> source)
return value;
}

private static long Max(long[] array)
{
if (array.Length == 0)
{
ThrowHelper.ThrowNoElementsException();
}

// Vectorize the search if possible.
int index;
long value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<long>.Count * 2)
{
// The array is at least two vectors long. Create a vector from the first N elements,
// and then repeatedly compare that against the next vector from the array. At the end,
// the resulting vector will contain the maximum values found, and we then need only
// to find the max of those.
var maxes = new Vector<long>(array);
index = Vector<long>.Count;
do
{
maxes = Vector.Max(maxes, new Vector<long>(array, index));
index += Vector<long>.Count;
}
while (index + Vector<long>.Count <= array.Length);

value = maxes[0];
for (int i = 1; i < Vector<long>.Count; i++)
{
if (maxes[i] > value)
{
value = maxes[i];
}
}
}
else
{
value = array[0];
index = 1;
}

// Iterate through the remaining elements, comparing against the max.
for (int i = index; (uint)i < (uint)array.Length; i++)
{
if (array[i] > value)
{
value = array[i];
}
}

return value;
}

public static long? Max(this IEnumerable<long?> source)
{
if (source == null)
115 changes: 114 additions & 1 deletion src/libraries/System.Linq/src/System/Linq/Min.cs
@@ -2,7 +2,7 @@
// The .NET Foundation licenses this file to you under the MIT license.

using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Numerics;

namespace System.Linq
{
@@ -15,6 +15,11 @@ public static int Min(this IEnumerable<int> source)
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
}

if (source.GetType() == typeof(int[]))
{
return Min((int[])source);
}

int value;
using (IEnumerator<int> e = source.GetEnumerator())
{
@@ -37,6 +42,57 @@ public static int Min(this IEnumerable<int> source)
return value;
}

private static int Min(int[] array)
{
if (array.Length == 0)
{
ThrowHelper.ThrowNoElementsException();
}

// Vectorize the search if possible.
int index, value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<int>.Count * 2)
{
// The array is at least two vectors long. Create a vector from the first N elements,
// and then repeatedly compare that against the next vector from the array. At the end,
// the resulting vector will contain the minimum values found, and we then need only
// to find the min of those.
var mins = new Vector<int>(array);
index = Vector<int>.Count;
do
{
mins = Vector.Min(mins, new Vector<int>(array, index));
index += Vector<int>.Count;
}
while (index + Vector<int>.Count <= array.Length);

value = mins[0];
for (int i = 1; i < Vector<int>.Count; i++)
{
if (mins[i] < value)
{
value = mins[i];
}
}
}
else
{
value = array[0];
index = 1;
}

// Iterate through the remaining elements, comparing against the min.
for (int i = index; (uint)i < (uint)array.Length; i++)
{
if (array[i] < value)
{
value = array[i];
}
}

return value;
}

public static int? Min(this IEnumerable<int?> source)
{
if (source == null)
@@ -88,6 +144,11 @@ public static long Min(this IEnumerable<long> source)
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
}

if (source.GetType() == typeof(long[]))
{
return Min((long[])source);
}

long value;
using (IEnumerator<long> e = source.GetEnumerator())
{
@@ -110,6 +171,58 @@ public static long Min(this IEnumerable<long> source)
return value;
}

private static long Min(long[] array)
{
if (array.Length == 0)
{
ThrowHelper.ThrowNoElementsException();
}

// Vectorize the search if possible.
int index;
long value;
if (Vector.IsHardwareAccelerated && array.Length >= Vector<long>.Count * 2)
{
// The array is at least two vectors long. Create a vector from the first N elements,
// and then repeatedly compare that against the next vector from the array. At the end,
// the resulting vector will contain the minimum values found, and we then need only
// to find the min of those.
var mins = new Vector<long>(array);
index = Vector<long>.Count;
do
{
mins = Vector.Min(mins, new Vector<long>(array, index));
index += Vector<long>.Count;
}
while (index + Vector<long>.Count <= array.Length);

value = mins[0];
for (int i = 1; i < Vector<long>.Count; i++)
{
if (mins[i] < value)
{
value = mins[i];
}
}
}
else
{
value = array[0];
index = 1;
}

// Iterate through the remaining elements, comparing against the min.
for (int i = index; (uint)i < (uint)array.Length; i++)
{
if (array[i] < value)
{
value = array[i];
}
}

return value;
}

public static long? Min(this IEnumerable<long?> source)
{
if (source == null)
16 changes: 16 additions & 0 deletions src/libraries/System.Linq/tests/MaxTests.cs
@@ -40,6 +40,8 @@ public void Max_Int_EmptySource_ThrowsInvalidOpertionException()
{
Assert.Throws<InvalidOperationException>(() => Enumerable.Empty<int>().Max());
Assert.Throws<InvalidOperationException>(() => Enumerable.Empty<int>().Max(x => x));
Assert.Throws<InvalidOperationException>(() => Array.Empty<int>().Max());
Assert.Throws<InvalidOperationException>(() => new List<int>().Max());
}

public static IEnumerable<object[]> Max_Int_TestData()
@@ -55,6 +57,12 @@ public static IEnumerable<object[]> Max_Int_TestData()
yield return new object[] { new int[] { 16, 9, 10, 7, 8 }, 16 };
yield return new object[] { new int[] { 6, 9, 10, 0, 50 }, 50 };
yield return new object[] { new int[] { -6, 0, -9, 0, -10, 0 }, 0 };

for (int length = 2; length < 33; length++)
{
yield return new object[] { Shuffler.Shuffle(Enumerable.Range(length, length)), length + length - 1 };
yield return new object[] { Shuffler.Shuffle(Enumerable.Range(length, length).ToArray()), length + length - 1 };
}
}

[Theory]
@@ -78,6 +86,12 @@ public static IEnumerable<object[]> Max_Long_TestData()
yield return new object[] { new long[] { 250, 49, 130, 47, 28 }, 250L };
yield return new object[] { new long[] { 6, 9, 10, 0, int.MaxValue + 50L }, int.MaxValue + 50L };
yield return new object[] { new long[] { 6, 50, 9, 50, 10, 50 }, 50L };

for (int length = 2; length < 33; length++)
{
yield return new object[] { Shuffler.Shuffle(Enumerable.Range(length, length).Select(i => (long)i)), (long)(length + length - 1) };
yield return new object[] { Shuffler.Shuffle(Enumerable.Range(length, length).Select(i => (long)i).ToArray()), (long)(length + length - 1) };
}
}

[Theory]
@@ -100,6 +114,8 @@ public void Max_Long_EmptySource_ThrowsInvalidOpertionException()
{
Assert.Throws<InvalidOperationException>(() => Enumerable.Empty<long>().Max());
Assert.Throws<InvalidOperationException>(() => Enumerable.Empty<long>().Max(x => x));
Assert.Throws<InvalidOperationException>(() => Array.Empty<long>().Max());
Assert.Throws<InvalidOperationException>(() => new List<long>().Max());
}

public static IEnumerable<object[]> Max_Float_TestData()