Skip to content

receive: Track replications#2852

Merged
brancz merged 3 commits intothanos-io:masterfrom
kakkoyun:track_replications
Jul 10, 2020
Merged

receive: Track replications#2852
brancz merged 3 commits intothanos-io:masterfrom
kakkoyun:track_replications

Conversation

@kakkoyun
Copy link
Member

@kakkoyun kakkoyun commented Jul 7, 2020

This PR adds a new metric to track replication behaviour. Previously, thanos_receive_forward_requests_total was used. However, that metric is now misleading because we have recently changed how replication behaves (#2621).

  • I added CHANGELOG entry for this change.

Changes

  • Add a new metric to track replicated request thanos_receive_replications_total fd04971
  • Add alerts and dashboards for the new metric 0aa495d
  • Modify alerts for the thanos_forward_requests_total 0aa495d

Verification

  • make test-local
  • make examples
  • make example-rules-lint

cc @krasi-georgiev

@kakkoyun kakkoyun requested review from brancz, bwplotka and squat July 7, 2020 15:27
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is a bit different here.

When I think about it I think what is happening is that we don't cancel the context here, however, context is still chained(!).

I think what would fix this problem would be actually starting new context for forwarding and then cancel it only if timeout OR ((not yet success or failure) && parent context is cancelled). WDYT?

This would have major benefit of knowing best effort forwards status properly and replication status is visible by write error / success no? cc @squat

@bwplotka
Copy link
Member

bwplotka commented Jul 7, 2020

In other words what I mean is that http_requests_total{handler="receive"} is really the same as our replication metric you add here no?

@kakkoyun
Copy link
Member Author

kakkoyun commented Jul 7, 2020

@bwplotka Yes, eventually that's the result. We can use that metric instead of adding a new one. I just wanted to be more explicit about the metric.

I think what would fix this problem would be actually starting new context for forwarding and then cancel it only if timeout OR ((not yet success or failure) && parent context is cancelled). WDYT?

For me, this is just a misleading metric issue. Rather than detaching the forward requests from the incoming request's context, I would suggest to catch this case and handle it different than an error.

@bwplotka
Copy link
Member

bwplotka commented Jul 7, 2020

You cannot catch cancellation in any other way than having separate context ):

@bwplotka
Copy link
Member

bwplotka commented Jul 7, 2020

It's not error handling, it's premature context cancel from parent IMO

@kakkoyun
Copy link
Member Author

kakkoyun commented Jul 7, 2020

@bwplotka @squat Closing this one.

We have talked with @bwplotka offline and decided to fix it in a simpler manner. I'll send another PR.

@kakkoyun kakkoyun closed this Jul 7, 2020
@kakkoyun kakkoyun reopened this Jul 8, 2020
@kakkoyun
Copy link
Member Author

kakkoyun commented Jul 8, 2020

@squat has pointed out offline that we have multiple replication batch operations per request so http_requests_total{handler="receive"} is not enough to do the trick. Thus opened this again.

@kakkoyun kakkoyun marked this pull request as draft July 8, 2020 13:42
@kakkoyun kakkoyun force-pushed the track_replications branch 2 times, most recently from f4a98ff to 30e4c7e Compare July 9, 2020 16:10
@kakkoyun kakkoyun marked this pull request as ready for review July 9, 2020 16:50
@kakkoyun kakkoyun requested a review from bwplotka July 9, 2020 16:50
Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really awesome! Nothing to complain :D

Will leave for one more review though as this is a rather large change.

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just some small nits 🤗

Thanks!

kakkoyun added 3 commits July 10, 2020 09:33
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
@kakkoyun kakkoyun force-pushed the track_replications branch from b800021 to cb03c7b Compare July 10, 2020 07:34
@brancz brancz merged commit dc5897a into thanos-io:master Jul 10, 2020
@kakkoyun kakkoyun deleted the track_replications branch July 10, 2020 08:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants