Conversation

@snakefoot (Contributor)

See also #3410

@codecov-io commented May 21, 2019

Codecov Report

Merging #3412 into release/4.6.4 will increase coverage by <1%.
The diff coverage is 94%.

@@              Coverage Diff               @@
##           release/4.6.4   #3412    +/-   ##
==============================================
+ Coverage             80%     80%   +<1%     
==============================================
  Files                358     358            
  Lines              28373   28384    +11     
  Branches            3785    3789     +4     
==============================================
+ Hits               22750   22770    +20     
+ Misses              4537    4523    -14     
- Partials            1086    1091     +5

@rusanu commented May 21, 2019

Thanks!

@virgilp (Contributor) commented May 22, 2019

It's not just "when SlidingTimeout=true" - this happens a lot when the target is slow (say, temporarily overwhelmed by writes) and overflowAction=Flush.

If I understand correctly, the situation is this: say FlushTimeout is 1000 ms, it takes 5000 ms to actually write a buffer, the buffer holds 1001 items, and we write one log event per ms (see the configuration sketch after the timeline). What happens both before and after this PR is:

  • T0: logging starts
  • T0+1000: buffer almost full, is flushed on the timer thread;
  • T0+1001: event written, timer re-armed.
  • T0+2001: buffer almost full again, flush triggered on the timer thread, but can't get lock
  • T0+2002: buffer full, flush triggered on the logging thread, can't get lock.
  • T0+2003: buffer full, 1 log event overwritten, flush triggered on the logging thread, can't get lock
  • T0+2004: buffer full, 2 log events overwritten, flush triggered on the logging thread, can't get lock
    [....]
  • T0+3000: oh my.
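
For concreteness, a setup along these lines would reproduce that scenario - a sketch only, using the hypothetical numbers from the timeline above; the FileTarget just stands in for whatever slow target is being overwhelmed:

using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Targets.Wrappers;

// Sketch of a configuration matching the timeline above. The buffer size,
// flush timeout and wrapped target are the hypothetical example values,
// not a recommendation.
var slowTarget = new FileTarget("slow") { FileName = "app.log" };
var buffered = new BufferingTargetWrapper("buffered", slowTarget)
{
    BufferSize = 1001,     // the 1001-item buffer from the example
    FlushTimeout = 1000,   // flush after 1000 ms of buffering
    OverflowAction = BufferingTargetWrapperOverflowAction.Flush,
};

var config = new LoggingConfiguration();
config.AddRuleForAllLevels(buffered);
LogManager.Configuration = config;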

@snakefoot (Contributor, Author) commented May 22, 2019 via email

@snakefoot (Contributor, Author) commented May 22, 2019 via email

@304NotModified added this to the 4.6.4 milestone on May 22, 2019
@304NotModified added the "bug" label on May 22, 2019
@virgilp (Contributor) commented May 23, 2019

What I was trying to say is that I don't think this solves the problem of a slow (or intermittently slow) target in any meaningful way.

So, before this PR, with SlidingTimeout=true, what happened was this, right? (Assume a constant stream of log events, say one every 1 ms, with a FlushTimeout of 1000 ms.)
A. Fast target (buffer successfully flushed within FlushTimeout): the timer is continuously postponed, and the buffer is only flushed when full. If the buffer is cleared fast enough by WriteEventsInBuffer, all is well; but if two events are written concurrently just as the buffer fills up, one of the old events may be lost (not sure this is desired behaviour). Either way, this PR changes nothing here.
B. Slow target (buffer takes very long to flush): initially it works just the same - due to the sliding timeout the timer is continuously reset, and FlushCallback is never invoked. Then we get to WriteEventsInBuffer (synchronous flushing on buffer full), which successfully resets the buffer to 0. We postpone the timer again until the buffer fills up and we do the synchronous flush again; but now the thread stays blocked on the lock, and the timer is not postponed on each event write, because each write overwrites one entry in the buffer and re-attempts the flush. So at roughly T+2000 the timer fires again (nobody reset it). It can't acquire the lock, so it blocks (in the initial implementation). In this new implementation, at T+2500 it gives up and sets a new timer for T+3500. But does that really improve things?

I guess what I'm saying is that I don't think this PR changes or improves anything. As long as there's a single BufferingTargetWrapper in the application, there will be at most 2 timers triggered, SlidingTimeout or no SlidingTimeout. The problems of slow targets are real - but they are not related to the timer. Am I missing/misunderstanding something?

@snakefoot (Contributor, Author) commented May 23, 2019 via email

@virgilp (Contributor) commented May 23, 2019

What I'm saying is that I don't think they keep firing - do they? Can you reproduce it? I don't understand the scenario where more than two timer threads get stuck.

(I apparently suck at explaining things - in retrospect I shouldn't have buried this in the last paragraph)

@snakefoot (Contributor, Author)

@virgilp What I'm saying is that I don't think they keep firing - do they? Can you reproduce it? I don't understand

You have these two statements:

  • When having a blocking operation, then it will block the timer. And this cannot be avoided.
  • It is not possible for more than 2 timer-threads to become blocked.

I can only agree on the obvious point that a blocking operation will block the calling timer. Regarding your second statement, you can run the following code:

using System.Threading;

// The callback blocks a ThreadPool thread for 2000 ms, while the timer is
// re-armed every 10 ms - far faster than the callbacks can finish.
Timer timer = new Timer(s => Thread.Sleep(2000));
for (int i = 0; i < 1000; ++i)
{
    timer.Change(5, Timeout.Infinite);   // fire once, 5 ms from now
    Thread.Sleep(10);
}

You will see that you suddenly have more than 2 timer threads stuck in the Thread.Sleep operation: each callback runs on a ThreadPool thread, so re-arming the timer while a callback is still blocked simply queues another callback on a fresh thread.
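
To make the pileup visible, a small variation of the snippet above can print the ThreadPool thread id of each callback (a hypothetical console demo, not NLog code):

using System;
using System.Threading;

// Each blocked callback occupies its own ThreadPool thread; several distinct
// thread ids will be printed while the sleeps overlap.
var timer = new Timer(s =>
{
    Console.WriteLine($"callback on thread {Thread.CurrentThread.ManagedThreadId}");
    Thread.Sleep(2000);   // stand-in for a blocking flush
});

for (int i = 0; i < 1000; ++i)
{
    timer.Change(5, Timeout.Infinite);   // re-arm before the previous callback finishes
    Thread.Sleep(10);
}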

Imagine the following example with SlidingTimeout = true, where a LogEvent is written every second and FlushTimeout is 500 msec:

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking on lock taken in T1.1 (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking on lock taken in T1.1 (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking on lock taken in T1.1 (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking on lock taken in T1.1 (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now fighting with all the other active timers to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)

Now with this PR the following will happen:

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking, but reschedules after 500 msec.  (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking, but reschedules after 500 msec.  (3000 msec left)
- T3.2 Timer executes and is now blocking, but reschedules after 500 msec.  (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T4.2 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T4.3 Timer executes and is now blocking, but reschedules after 500 msec.  (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.2 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.3 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T5.4 Timer executes and is now blocking, but reschedules after 500 msec.  (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now fighting with all the other active timers to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)

So I guess the 500 msec lock-timeout is too long. Instead it should be 100 msec (or half the FlushTimeout, whichever is less):

- T0 Logging Starts
- T1 LogEvent is written and timer is scheduled to fire in 500 msec.
- T1.1 Timer executes and is stuck on blocking operation for 5000 msec.
- T2 LogEvent is written and timer is scheduled to fire in 500 msec
- T2.1 Timer executes and is now blocking, but reschedules after 100 msec.  (4500 msec left)
- T2.2 Timer executes and is now blocking, but reschedules after 100 msec.  (4000 msec left)
- T3 LogEvent is written and timer is scheduled to fire in 500 msec
- T3.1 Timer executes and is now blocking, but reschedules after 100 msec.  (3500 msec left)
- T3.2 Timer executes and is now blocking, but reschedules after 100 msec.  (3000 msec left)
- T4 LogEvent is written and timer is scheduled to fire in 500 msec
- T4.1 Timer executes and is now blocking, but reschedules after 100 msec.  (2500 msec left)
- T4.2 Timer executes and is now blocking, but reschedules after 100 msec.  (2000 msec left)
- T5 LogEvent is written and timer is scheduled to fire in 500 msec
- T5.1 Timer executes and is now blocking, but reschedules after 100 msec.  (1500 msec left)
- T5.2 Timer executes and is now blocking, but reschedules after 100 msec.  (1000 msec left)
- T6 LogEvent is written and timer is scheduled to fire in 500 msec
- T6.1 Timer executes and is now blocking, but reschedules after 100 msec.  (500 msec left)
- T6.2 Timer executes and is now fighting with the other timer to acquire the lock. (Only one of the timers will succeed; the rest will now wait 5000 msec again)
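
In code, the retry idea could look roughly like this - a minimal sketch only, not the actual NLog implementation (the fields and the WriteEventsInBuffer call are illustrative):

// Timer callback with a short lock-timeout: if the lock is busy, give up
// quickly and re-arm the timer, instead of keeping a ThreadPool thread
// blocked for the whole duration of a slow flush.
private void FlushCallback(object state)
{
    int retryDelay = Math.Min(100, FlushTimeout / 2);
    bool lockTaken = false;
    try
    {
        lockTaken = Monitor.TryEnter(_lockObject, retryDelay);
        if (!lockTaken)
        {
            _timer.Change(retryDelay, Timeout.Infinite);   // try again shortly
            return;
        }
        WriteEventsInBuffer();   // flush under the lock
    }
    finally
    {
        if (lockTaken)
            Monitor.Exit(_lockObject);
    }
}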

@snakefoot force-pushed the BufferingTargetWrapperTimer branch from 90521f3 to 071b128 on May 23, 2019
@304NotModified (Member)

@virgilp Nice of you to explain the issue before this PR

Is this sarcasm?

@snakefoot (Contributor, Author) commented May 23, 2019

Is this sarcasm?

I guess so. I'm trying to understand the input for this PR, but it is a shotgun of information. I like input to be concise (and maybe even proven), without needing to dig for the needle.

@304NotModified (Member) commented May 23, 2019

I guess so.

I prefer a "be nice" policy ;)

(and feel free to let me know when I violate that policy)

@304NotModified (Member)

@virgilp Do you think this PR is an improvement? Do you have any suggestions for what we should change? The latter is also a bit unclear to me. Thanks!

@virgilp (Contributor) commented May 24, 2019

@snakefoot OK, but this assumes the flush timeout is shorter than the interval between log events (i.e. I write an event every second but flush the buffer every half second). I was assuming that such a case is so unrealistic that it basically never happens (I'm pretty sure that's not what happened in #3410 - there's a constant stream of log events in that system).

@304NotModified I think it's not a significant improvement, i.e. it doesn't improve things in real-life situations; in particular, I was looking to see whether this PR fixes the issue we encountered, and I don't think it does. I was not criticizing the PR so much as trying to validate my understanding that it will not fix the referenced issue. As for suggestions... I do have some, but it's not a small change, and it's best explained by opening a PR when I have time to do it :).
WRT the sarcasm, I'm not upset; it was well deserved, and I should've been clearer from the start. It was actually very kind of @snakefoot to jump right on the issue and offer a fix. It's just that I didn't understand how it fixed things, because from my reading of the code, it seemed it doesn't. This was me trying to understand - not me trying to block the PR.

@snakefoot (Contributor, Author) commented May 24, 2019 via email

@virgilp (Contributor) commented May 24, 2019

Ok.

@304NotModified (Member)

@snakefoot should we merge this PR or abandon it?

@snakefoot (Contributor, Author) commented May 24, 2019

@304NotModified I think it should be merged. It implements what @rusanu requested in the initial issue.

It also solves the issue of timer threads building up in various situations - though maybe not the situation @virgilp is seeing.

@304NotModified merged commit 7bb0473 into NLog:release/4.6.4 on May 24, 2019
@snakefoot (Contributor, Author)

@virgilp NLog 4.6.4 has been released: https://www.nuget.org/packages/NLog/4.6.4

Please try the new version. If you still see timer threads building up, please create a new issue.
