Conversation

@snakefoot (Contributor)

See also #3410

@codecov-io commented May 21, 2019

Codecov Report

Merging #3412 into release/4.6.4 will increase coverage by <1%.
The diff coverage is 94%.

@@              Coverage Diff               @@
##           release/4.6.4   #3412    +/-   ##
==============================================
+ Coverage             80%     80%   +<1%     
==============================================
  Files                358     358            
  Lines              28373   28384    +11     
  Branches            3785    3789     +4     
==============================================
+ Hits               22750   22770    +20     
+ Misses              4537    4523    -14     
- Partials            1086    1091     +5

@rusanu commented May 21, 2019

Thanks!

@virgilp (Contributor) commented May 22, 2019

It's not just "when SlidingTimeout=true" - this happens a lot when the target is slow (say, temporarily overwhelmed by writes) and overflowAction=Flush.

If I understand correctly, the situation is this: say FlushTimeout is 1000 ms, it takes 5000 ms to actually write a buffer, the buffer holds 1001 items, and we write one log event per ms (see the configuration sketch after the timeline). What happens both before and after this PR is:

  • T0: logging starts
  • T0+1000: buffer almost full, is flushed on the timer thread;
  • T0+1001: event written, timer re-armed.
  • T0+2001: buffer almost full again, flush triggered on the timer thread, but can't get lock
  • T0+2002: buffer full, flush triggered on the logging thread, can't get lock.
  • T0+2003: buffer full, 1 log event overwritten, flush triggered on the logging thread, can't get lock
  • T0+2004: buffer full, 2 log events overwritten, flush triggered on the logging thread, can't get lock
    [....]
  • T0+3000: oh my.
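
For concreteness, a setup along these lines would reproduce that scenario - a sketch only, using the hypothetical numbers from the timeline above; the FileTarget just stands in for whatever slow target is being overwhelmed:

using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Targets.Wrappers;

// Sketch of a configuration matching the timeline above. The buffer size,
// flush timeout and wrapped target are the hypothetical example values,
// not a recommendation.
var slowTarget = new FileTarget("slow") { FileName = "app.log" };
var buffered = new BufferingTargetWrapper("buffered", slowTarget)
{
    BufferSize = 1001,     // the 1001-item buffer from the example
    FlushTimeout = 1000,   // flush after 1000 ms of buffering
    OverflowAction = BufferingTargetWrapperOverflowAction.Flush,
};

var config = new LoggingConfiguration();
config.AddRuleForAllLevels(buffered);
LogManager.Configuration = config;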

@snakefoot (Contributor, Author) commented May 22, 2019 via email

@snakefoot (Contributor, Author) commented May 22, 2019 via email

@304NotModified added this to the 4.6.4 milestone on May 22, 2019
@304NotModified added the "bug" label on May 22, 2019
@virgilp (Contributor) commented May 23, 2019

What I was trying to say is that I don't think this solves the problem of a slow (or intermittently slow) target in any meaningful way.

So, before this PR, with SlidingTimeout=true, what happened was this, right? (Assume a constant stream of log events, say one every 1 ms, with a FlushTimeout of 1000 ms.)
A. Fast target (buffer successfully flushed within FlushTimeout): the timer is continuously postponed, and the buffer is only flushed when full. If the buffer is cleared fast enough by WriteEventsInBuffer, all is well; but if two events are written concurrently just as the buffer fills up, one of the old events may be lost (not sure this is desired behaviour). Either way, this PR changes nothing here.
B. Slow target (buffer takes very long to flush): initially it works just the same - due to the sliding timeout the timer is continuously reset, and FlushCallback is never invoked. Then we get to WriteEventsInBuffer (synchronous flushing on buffer full), which successfully resets the buffer to 0. We postpone the timer again until the buffer fills up and we do the synchronous flush again; but now the thread stays blocked on the lock, and the timer is not postponed on each event write, because each write overwrites one entry in the buffer and re-attempts the flush. So at roughly T+2000 the timer fires again (nobody reset it). It can't acquire the lock, so it blocks (in the initial implementation). In this new implementation, at T+2500 it gives up and sets a new timer for T+3500. But does that really improve things?

I guess what I'm saying is that I don't think this PR changes or improves anything. As long as there's a single BufferingTargetWrapper in the application, there will be at most 2 timers triggered, SlidingTimeout or no SlidingTimeout. The problems of slow targets are real - but they are not related to the timer. Am I missing/misunderstanding something?

@snakefoot (Contributor, Author) commented May 23, 2019 via email

@virgilp (Contributor) commented May 23, 2019

What I'm saying is that I don't think they keep firing - do they? Can you reproduce it? I don't understand the scenario where more than two timer threads get stuck.

(I apparently suck at explaining things - in retrospect I shouldn't have buried this in the last paragraph)

@snakefoot (Contributor, Author)

@virgilp What I'm saying is that I don't think they keep firing - do they? Can you reproduce it? I don't understand

You have these two statements:

  • When having a blocking operation, then it will block the timer. And this cannot be avoided.
  • It is not possible for more than 2 timer-threads to become blocked.

I can only agree on the obvious point that a blocking operation will block the calling timer. Regarding your second statement, you can run the following code:

using System.Threading;

// The callback blocks a ThreadPool thread for 2000 ms, while the timer is
// re-armed every 10 ms - far faster than the callbacks can finish.
Timer timer = new Timer(s => Thread.Sleep(2000));
for (int i = 0; i < 1000; ++i)
{
    timer.Change(5, Timeout.Infinite);   // fire once, 5 ms from now
    Thread.Sleep(10);
}

You will see that you suddenly have more than 2 timer threads stuck in the Thread.Sleep operation: each callback runs on a ThreadPool thread, so re-arming the timer while a callback is still blocked simply queues another callback on a fresh thread.
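
To make the pileup visible, a small variation of the snippet above can print the ThreadPool thread id of each callback (a hypothetical console demo, not NLog code):

using System;
using System.Threading;

// Each blocked callback occupies its own ThreadPool thread; several distinct
// thread ids will be printed while the sleeps overlap.
var timer = new Timer(s =>
{
    Console.WriteLine($"callback on thread {Thread.CurrentThread.ManagedThreadId}");
    Thread.Sleep(2000);   // stand-in for a blocking flush
});

for (int i = 0; i < 1000; ++i)
{
    timer.Change(5, Timeout.Infinite);   // re-arm before the previous callback finishes
    Thread.Sleep(10);
}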

Imagine the following example with SlidingTimeout = true, where a LogEvent is written every second and FlushTimeout is 500 msec:

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking on lock taken in T1.1 (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking on lock taken in T1.1 (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking on lock taken in T1.1 (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking on lock taken in T1.1 (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now fighting with all the other active timers to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)

Now with this PR the following will happen:

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking, but reschedules after 500 msec.  (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking, but reschedules after 500 msec.  (3000 msec left)
- T3.2 Timer executes and is now blocking, but reschedules after 500 msec.  (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T4.2 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T4.3 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.2 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.3 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.4 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now fighting with all the other active timers to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)

So I guess the 500 msec lock-timeout is too long. Instead it should be 100 msec (or half the FlushTimeout, whichever is less):

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking, but reschedules after 100 msec.  (4500 msec left)
- T2.2 Timer executes and is now blocking, but reschedules after 100 msec.  (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking, but reschedules after 100 msec.  (3500 msec left)
- T3.2 Timer executes and is now blocking, but reschedules after 100 msec.  (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking, but reschedules after 100 msec.  (2500 msec left)
- T4.2 Timer executes and is now blocking, but reschedules after 100 msec.  (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking, but reschedules after 100 msec.  (1500 msec left)
- T5.2 Timer executes and is now blocking, but reschedules after 100 msec.  (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now blocking, but reschedules after 100 msec.  (500 msec left)
- T6.2 Timer executes and is now fighting with the other timer to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)
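
In code, the retry idea could look roughly like this - a minimal sketch only, not the actual NLog implementation (the fields and the WriteEventsInBuffer call are illustrative):

// Timer callback with a short lock-timeout: if the lock is busy, give up
// quickly and re-arm the timer, instead of keeping a ThreadPool thread
// blocked for the whole duration of a slow flush.
private void FlushCallback(object state)
{
    int retryDelay = Math.Min(100, FlushTimeout / 2);
    bool lockTaken = false;
    try
    {
        lockTaken = Monitor.TryEnter(_lockObject, retryDelay);
        if (!lockTaken)
        {
            _timer.Change(retryDelay, Timeout.Infinite);   // try again shortly
            return;
        }
        WriteEventsInBuffer();   // flush under the lock
    }
    finally
    {
        if (lockTaken)
            Monitor.Exit(_lockObject);
    }
}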

@snakefoot force-pushed the BufferingTargetWrapperTimer branch from 90521f3 to 071b128 on May 23, 2019
@304NotModified (Member)

@virgilp Nice of you to explain the issue before this PR

Is this sarcasm?

@snakefoot (Contributor, Author) commented May 23, 2019

Is this sarcasm?

I guess so. I'm trying to understand the input for this PR, but it is a shotgun of information. I like input to be concise (and maybe even proven), without needing to dig for the needle.

@304NotModified (Member) commented May 23, 2019

I guess so.

I prefer a "be nice" policy ;)

(and feel free to let me know when I violate that policy)

@304NotModified (Member)

@virgilp Do you think this PR is an improvement? Do you have any suggestions for what we should change? The latter is also a bit unclear to me. Thanks!

@virgilp (Contributor) commented May 24, 2019

@snakefoot OK, but this assumes the flush timeout is shorter than the interval between log events (i.e. I write an event every second but flush the buffer every half second). I was assuming that such a case is so unrealistic that it basically never happens (I'm pretty sure that's not what happened in #3410 - there's a constant stream of log events in that system).

@304NotModified I think it's not a significant improvement, i.e. it doesn't improve things in real-life situations; in particular, I was looking to see whether this PR fixes the issue we encountered, and I don't think it does. I was not criticizing the PR so much as trying to validate my understanding that it will not fix the referenced issue. As for suggestions... I do have some, but it's not a small change, and it's best explained by opening a PR when I have time to do it :).
WRT the sarcasm, I'm not upset; it was well deserved, and I should've been clearer from the start. It was actually very kind of @snakefoot to jump right on the issue and offer a fix. It's just that I didn't understand how it fixed things, because from my reading of the code, it seemed it doesn't. This was me trying to understand - not me trying to block the PR.

@snakefoot (Contributor, Author) commented May 24, 2019 via email

@virgilp (Contributor) commented May 24, 2019

Ok.

@304NotModified (Member)

@snakefoot should we merge this PR or abandon it?

@snakefoot (Contributor, Author) commented May 24, 2019

@304NotModified I think it should be merged. It implements what @rusanu requested in the initial issue.

It also solves the issue of timer threads building up in various situations - though maybe not the situation @virgilp is seeing.

@304NotModified merged commit 7bb0473 into NLog:release/4.6.4 on May 24, 2019
@snakefoot (Contributor, Author)

@virgilp NLog 4.6.4 has been released: https://www.nuget.org/packages/NLog/4.6.4

Please try the new version. If you still see timer threads building up, please create a new issue.
