Fix Histogram synchronization issue #3534
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3534 +/- ##
==========================================
+ Coverage 87.26% 87.28% +0.01%
==========================================
Files 277 277
Lines 9931 9938 +7
==========================================
+ Hits 8666 8674 +8
+ Misses 1265 1264 -1
while (true)
{
-    if (Interlocked.Exchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1) == 0)
+    if (Interlocked.CompareExchange(ref this.histogramBuckets.IsCriticalSectionOccupied, 1, 0) == 0)
Just curious if there were more learnings based on the questions raised here #3458 (comment).
Does CompareExchange help alleviate the concerns? Or did you learn the concern was unfounded?
I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With the Interlocked.Exchange we would be setting that variable to 1 even when it's already set to 1, and that seems wasteful.
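A minimal sketch of that difference, using a standalone flag variable rather than the actual HistogramBuckets field (the variable name here is just illustrative): when the flag is already 1, Exchange still performs a store, while CompareExchange only observes the value.

```csharp
using System.Threading;

// Illustrative stand-in for HistogramBuckets.IsCriticalSectionOccupied.
int isCriticalSectionOccupied = 1; // "lock" already held by another thread

// Exchange writes 1 unconditionally, even though the value is already 1,
// so a failed acquire attempt still performs a store.
bool acquiredViaExchange =
    Interlocked.Exchange(ref isCriticalSectionOccupied, 1) == 0;

// CompareExchange writes 1 only if the current value is 0; a failed
// acquire attempt just reads the value without storing anything.
bool acquiredViaCompareExchange =
    Interlocked.CompareExchange(ref isCriticalSectionOccupied, 1, 0) == 0;

// Neither attempt acquires the "lock" here, since it was already held;
// the difference is only in whether a redundant store happened.
```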
Regarding the learnings for the questions asked in #3458:
1.
We can use Interlocked operations to devise a "lock" that synchronizes access to the histogram-related fields among multiple threads. However, the part where we release this "lock", i.e., where we set IsCriticalSectionOccupied = 0, has to be done using an Interlocked statement as well. This is because Interlocked creates a memory fence around that variable. This prevents certain reorderings of instructions that might set IsCriticalSectionOccupied to 0 before the original thread is done processing the histogram fields, as that might allow another thread to enter the critical section while the original thread has not exited it.
2.
Any other thread that might access these histogram fields must use the same synchronization mechanism. That's why I have updated the TakeSnapshot method to use the same Interlocked operations on IsCriticalSectionOccupied to determine whether it's safe to read these fields.
3.
The instructions within this "lock" can be reordered, in this case, the code under unchecked. That's okay because we don't care what order they get executed in.
4.
Regarding the freshness of the values: as long as all the threads that access the same set of fields use the same synchronization mechanism (could be the C# lock statement or our Interlocked-based "lock"), any thread that enters the critical section will see the most up-to-date values of these fields.
5.
The only thing to be wary of with this Interlocked-based "lock" is exceptions. If an exception is thrown while a thread is in the critical section, IsCriticalSectionOccupied would not be set back to 0, and other threads waiting to enter the critical section would never be able to enter it. In our case, the instructions inside the "lock" would not throw an exception, so we don't have to worry about that.
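The points above can be condensed into a small self-contained sketch. This is not the actual MetricPoint/HistogramBuckets code; the class and field names here are illustrative only:

```csharp
using System.Threading;

public class HistogramState
{
    private int isCriticalSectionOccupied; // 0 = free, 1 = occupied
    public long RunningCount;
    public double RunningSum;

    public void Update(double value)
    {
        // Acquire: spin until this thread transitions the flag from 0 to 1.
        while (Interlocked.CompareExchange(ref this.isCriticalSectionOccupied, 1, 0) != 0)
        {
        }

        // The instructions inside the "lock" may be reordered relative to
        // each other; that is fine because their order doesn't matter here.
        unchecked
        {
            this.RunningCount++;
            this.RunningSum += value;
        }

        // Release with an Interlocked write: the fence prevents the flag
        // from being observed as 0 before the updates above are visible.
        // Skipping try/finally is safe only because this body cannot throw.
        Interlocked.Exchange(ref this.isCriticalSectionOccupied, 0);
    }
}
```

Any reader (for example, a snapshot path) must take the same flag before reading RunningCount and RunningSum; reading them without it could observe a torn or stale pair of values.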
I introduced CompareExchange in this PR because I think it's much better to try to update IsCriticalSectionOccupied to 1 only when it's actually 0. With the Interlocked.Exchange we would be setting that variable to 1 even when it's already set to 1, and that seems wasteful.
^ might be good to see if perf is impacted with this.
Excellent summary! Thank you. Number 5 in particular had not occurred to me, but absolutely makes sense.
I'd be interested in perf too. Given that the snapshot and update threads were previously not synchronizing with one another, I'm curious if perf is impacted now that they are correctly synchronized. Maybe not by much, since the snapshot thread only runs seldomly...
I'm OK with this as-is but we should probably do some perf testing using lock, SpinLock, and this new Interlocked code. I think this will be the fastest way but I don't want to blindly trust the old test which proved "Interlocked is faster than lock" because it was bugged when tested 😄
PS: Special shout-out to @noahfalk for helping us on this 👏
For correctness, we need to modify these tests: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/test/OpenTelemetry.Tests/Metrics/MetricAPITest.cs#L1175
They validate the multiple-threads-writing scenario.
We can also add a reader thread, and ultimately everything should add up correctly. (Ideally we want to modify the stress tests to do this, so you can run them for hours and at the end they'll confirm that the metrics are correct.)
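As a rough sketch of the shape such a test could take (hypothetical code, not the actual MetricAPITest or stress-test code; a plain counter stands in for histogram recording): several writer threads record a known number of measurements while a reader concurrently takes snapshots, and at the end the totals must add up exactly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

const int WriterCount = 4;
const int WritesPerWriter = 100_000;
long total = 0;   // stands in for the histogram's aggregated state
bool done = false;

// Writer threads: each records a fixed number of measurements.
var writers = new Task[WriterCount];
for (int i = 0; i < WriterCount; i++)
{
    writers[i] = Task.Run(() =>
    {
        for (int j = 0; j < WritesPerWriter; j++)
        {
            Interlocked.Increment(ref total);
        }
    });
}

// Reader thread: keeps taking snapshots while the writers run; every
// intermediate snapshot must be a plausible value, never a torn one.
var reader = Task.Run(() =>
{
    while (!Volatile.Read(ref done))
    {
        long snapshot = Interlocked.Read(ref total);
        if (snapshot < 0 || snapshot > (long)WriterCount * WritesPerWriter)
        {
            throw new InvalidOperationException("impossible snapshot value");
        }
    }
});

Task.WaitAll(writers);
Volatile.Write(ref done, true);
reader.Wait();

// After all writers finish, the final total must add up exactly.
Console.WriteLine(total == (long)WriterCount * WritesPerWriter
    ? "everything adds up"
    : "lost updates!");
```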
I ran Benchmark and Stress tests for the following code:

using System.Threading;

using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}
I have updated the code to go back to using
Beware that single threaded, zero contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical section memory has to be moved between the caches connected to different cores. When the cores are continuously renegotiating which one has exclusive access to the memory, that creates large additional latencies on those Interlocked instructions.
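This effect can be made concrete with a rough sketch (hypothetical code, not BenchmarkDotNet-quality measurement): the same number of Interlocked increments is executed either against one shared counter (cores ping-pong its cache line) or against a counter private to each worker.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Times N workers each doing 500k Interlocked increments, either all on one
// shared variable (contended) or each on its own local (uncontended).
static long TimeIncrements(int workers, bool shareOneCounter)
{
    long shared = 0;
    var sw = Stopwatch.StartNew();
    Parallel.For(0, workers, _ =>
    {
        long mine = 0;
        for (int i = 0; i < 500_000; i++)
        {
            if (shareOneCounter)
            {
                // All cores fight over the cache line holding 'shared'.
                Interlocked.Increment(ref shared);
            }
            else
            {
                // Same instruction, but each worker keeps its own line.
                Interlocked.Increment(ref mine);
            }
        }
    });
    return sw.ElapsedMilliseconds;
}

Console.WriteLine($"contended:   {TimeIncrements(Environment.ProcessorCount, true)} ms");
Console.WriteLine($"uncontended: {TimeIncrements(Environment.ProcessorCount, false)} ms");
```

On a multi-core machine the contended run is typically much slower, even though both runs execute the identical instruction the identical number of times, which is exactly why a single-threaded benchmark understates the cost.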
Check the PRs for the other two synchronization techniques.

Performance comparison of the different synchronization techniques

High contention scenario:

Low contention scenario:
The Interlocked-based "lock" outperforms the classic lock and SpinLock in the high contention scenario, where multiple threads try to update the same MetricPoint in parallel: it had ~35% more loops/sec than the other techniques, and significantly fewer CPU cycles per loop (~83% fewer than the classic lock and ~23% fewer than SpinLock). In the low contention scenario there is still some perf gain for the Interlocked-based "lock", but it's not as high: its benchmark results are comparable to the classic lock, while the numbers for SpinLock are lower than both.
Changes

- Update and TakeSnapshot methods
- HistogramBuckets.LockObject

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s
Multiple threads updating the same Histogram MetricPoint
Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s
Multiple threads updating random Histogram MetricPoints
Benchmark Results
// * Summary *
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
[Host] : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT