cmd: compact: clean partial / marked blocks concurrently #3115
GiedriusS merged 8 commits into thanos-io:master
Clean partially uploaded blocks and blocks marked for deletion concurrently with the whole compaction/downsampling process. One iteration can take a few days, so it is useful to periodically clean unneeded blocks in the background. Without this, there are huge spikes in block storage usage, and the size of a spike depends on how long one iteration takes to complete.

The implementation is simple: the deletion part is factored out into a separate function. It is called at the end of an iteration and, if `--wait` has been specified, also concurrently. A mutex protects against concurrent runs. Deleted blocks are removed from the deletion mark map so that we don't try to delete the same block twice or more.

Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
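As a rough illustration of the description above, here is a minimal sketch of a mutex-guarded cleaner that forgets each block once it is deleted. It is not the code from this PR: `blockID`, `markedBlocksCleaner`, the `deleteBlock` callback and `runConcurrentCleanups` are hypothetical names, and an in-memory map stands in for the deletion mark map.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// blockID stands in for a block's ULID (hypothetical type).
type blockID string

// markedBlocksCleaner is an illustrative cleaner, not the actual Thanos type.
type markedBlocksCleaner struct {
	mtx         sync.Mutex
	marked      map[blockID]struct{} // blocks that carry a deletion mark
	deleteBlock func(ctx context.Context, id blockID) error
}

// DeleteMarkedBlocks deletes every marked block and reports how many it removed.
// The mutex serializes overlapping runs; forgetting each deleted block means a
// later run will not try to delete it again.
func (c *markedBlocksCleaner) DeleteMarkedBlocks(ctx context.Context) (int, error) {
	c.mtx.Lock()
	defer c.mtx.Unlock()

	deleted := 0
	for id := range c.marked {
		if err := c.deleteBlock(ctx, id); err != nil {
			return deleted, fmt.Errorf("delete block %s: %w", id, err)
		}
		delete(c.marked, id) // safe in Go while ranging over the same map
		deleted++
	}
	return deleted, nil
}

// runConcurrentCleanups starts two overlapping cleanups over three marked
// blocks and returns the number of delete calls and the blocks still marked.
func runConcurrentCleanups() (calls, remaining int) {
	c := &markedBlocksCleaner{
		marked: map[blockID]struct{}{"a": {}, "b": {}, "c": {}},
	}
	// calls is only touched while c.mtx is held, so it needs no extra lock.
	c.deleteBlock = func(context.Context, blockID) error {
		calls++
		return nil
	}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := c.DeleteMarkedBlocks(context.Background()); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	return calls, len(c.marked)
}

func main() {
	calls, remaining := runConcurrentCleanups()
	fmt.Printf("delete calls: %d, still marked: %d\n", calls, remaining)
}
```

In this sketch the lock is held for the whole run, so a periodic cleanup and an end-of-iteration cleanup never interleave; whichever run goes second finds an empty map and does nothing.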
53072a4 to 1fd6a5f
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
1fd6a5f to 09d60b5
yeya24
left a comment
Overall it looks very good! Just one small nit, thanks!
cmd/thanos/compact.go
Outdated
// No need to resync before partial uploads and delete marked blocks. Last sync should be valid.
compact.BestEffortCleanAbortedPartialUploads(ctx, logger, sy.Partial(), bkt, partialUploadDeleteAttempts, blocksCleaned, blockCleanupFailures)
if err := blocksCleaner.DeleteMarkedBlocks(ctx); err != nil {
	return errors.Wrap(err, "error cleaning marked blocks")
One nit: would "cleaning marked blocks" be better? The "error" prefix seems redundant.
Yes, let's do this. Still valid comment 👍
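To see why the prefix reads as redundant, here is a small sketch of how a wrapped message ends up in the final error string. It uses the standard library's `fmt.Errorf` with `%w` as a stand-in for `errors.Wrap`, which produces the same "message: cause" layout; `errBucket`, `deleteMarkedBlocks` and `cleanMarked` are hypothetical names for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// errBucket is a made-up root cause for the demo.
var errBucket = errors.New("bucket unavailable")

// deleteMarkedBlocks always fails here so the wrapping can be observed.
func deleteMarkedBlocks() error { return errBucket }

// cleanMarked wraps the failure with msg, in the "msg: cause" layout.
func cleanMarked(msg string) error {
	if err := deleteMarkedBlocks(); err != nil {
		return fmt.Errorf("%s: %w", msg, err)
	}
	return nil
}

func main() {
	// The value is already known to be an error, so the word adds nothing.
	fmt.Println(cleanMarked("error cleaning marked blocks"))
	// Describing the action being attempted reads better in a chain.
	fmt.Println(cleanMarked("cleaning marked blocks"))
	// The root cause stays reachable either way.
	fmt.Println(errors.Is(cleanMarked("cleaning marked blocks"), errBucket))
}
```

Each layer of wrapping adds one more prefix, so messages that name the action stay readable when several of them are chained.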
bwplotka
left a comment
Awesome, thanks!
Couple of comments but overall LGTM 💪
// since one iteration potentially could take a long time.
if conf.cleanupBlocksInterval > 0 {
	g.Add(func() error {
		// Wait the whole period at the beginning because we've executed this on boot.
So... why not just remove this, and remove the boot-time execution too? (: Same thing, right?
EDIT: actually gave it a second thought. We need to explicitly run it at boot time to make sure that we don't have flaky tests because we depend there on a failure happening and a cleanup. It's not guaranteed to happen if we do everything concurrently.
It sounds wrong that we write more complex code only because we don't want to change the tests 🤔
The tests would be much more complex and probably out of the scope of this PR. Actually, it's not just that: I think it's nice that we do this at least once. Imagine a case where someone doesn't use `--wait` and the whole Thanos Compact process ends before the clean-up has happened. Space usage in the remote object storage would never go down even though it could, and the user could be charged more as a result.
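The behaviour discussed in this thread (one guaranteed run at boot, then periodic runs only when an interval is configured) could be sketched roughly as below. This is an illustration under assumptions, not the PR's code: `runCleanup` and `countRuns` are hypothetical names, and a plain `time.Ticker` stands in for whatever scheduling helper the compactor actually uses.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runCleanup calls cleanup once right away (the boot-time run) and then, if
// interval is positive, once per interval until ctx is cancelled.
func runCleanup(ctx context.Context, interval time.Duration, cleanup func(context.Context) error) error {
	if err := cleanup(ctx); err != nil {
		return err
	}
	if interval <= 0 {
		return nil // one-shot mode: the boot-time run is the only one
	}

	// The first tick arrives a full interval after start, so the periodic
	// loop does not immediately repeat the boot-time run.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			if ctx.Err() != nil {
				return nil // cancelled while a tick was pending
			}
			if err := cleanup(ctx); err != nil {
				return err
			}
		}
	}
}

// countRuns reports how many times cleanup ran before the context was
// cancelled after stopAfter runs (hypothetical helper for the demo).
func countRuns(intervalMillis, stopAfter int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	runs := 0
	cleanup := func(context.Context) error {
		runs++
		if runs >= stopAfter {
			cancel()
		}
		return nil
	}
	if err := runCleanup(ctx, time.Duration(intervalMillis)*time.Millisecond, cleanup); err != nil {
		panic(err)
	}
	return runs
}

func main() {
	fmt.Println("one-shot runs:", countRuns(0, 1))
	fmt.Println("periodic runs:", countRuns(5, 3))
}
```

With a zero interval the sketch still cleans up exactly once, which is the case raised above where the process exits before any background clean-up would have fired.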
Where are we with this? I want to cut 0.16.0-rc.0 tomorrow 🤗
…d_periodically Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Remove "error" from the `error` and just directly call the function. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
bwplotka
left a comment
LGTM for now, but I think we could improve a bit in future (: But not a blocker, LGTM!
Thanks 👍
…d_periodically Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Forgot to remove this part while solving conflicts. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
I guess since the approvals are there and I have cleaned up the CHANGELOG.md, I'll merge this. I also ran this for a bit locally with
* cmd: compact: clean partial / marked blocks concurrently
Clean partially uploaded and blocks marked for deletion concurrently with the whole compaction/downsampling process. One iteration could potentially take a few days so it should be nice to periodically clean unneeded blocks in the background. Without this, there are huge spikes in block storage usage. The spike's size depends on how long it takes to complete one iteration.
The implementation of this is simple - factored out the deletion part into a separate function. It is called at the end of an iteration + concurrently if `--wait` has been specified. Add a mutex to protect from concurrent runs. Delete blocks from the deletion mark map so that we wouldn't try to delete same blocks twice or more.
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
* *: update changelog, e2e tests
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
* cmd: compact: fix according to comments
Remove "error" from the `error` and just directly call the function.
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
* CHANGELOG: cleanups
Forgot to remove this part while solving conflicts.
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
* CHANGELOG: update
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
* CHANGELOG: clean whitespace
Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Signed-off-by: Chans321 <tsschand@gmail.com>
Changes
Clean partially uploaded blocks and blocks marked for deletion concurrently
with the whole compaction/downsampling process. One iteration can take a few
days, so it is useful to periodically clean unneeded blocks in the
background. Without this, there are huge spikes in block storage usage, and
the size of a spike depends on how long one iteration takes to complete.
The implementation is simple: the deletion part is factored out into a
separate function. It is called at the end of an iteration and, if `--wait`
has been specified, also concurrently. A mutex protects against concurrent
runs.
Verification
Updated e2e tests.