feat store: added readiness and liveness prober #1460
Hey @FUSAKLA, while working on https://github.com/metalmatze/kube-thanos/pull/42 I realized that we are missing liveness and healthiness checks for various components, and then I saw your work in #656, #1395 and #1297. We need probes in place for all the other components as well. I'd like to push this effort forward as fast as possible, and I'm happy to help. I could create PRs for the components you haven't touched yet, if that is OK with you. I'd be happy to hear your opinion on those PRs as well. What do you think?
Hi @kakkoyun! Yeah, I didn't want to open more PRs at once, since even this way it takes time to get them reviewed. Every one of them raises a big discussion about when to report as ready and when as healthy. That's also the reason the work eventually got split up, but help is always welcome, so feel free to add those PRs! I'd love to see this finally done :D
bwplotka
left a comment
LGTM, some minor nits only (:
Thanks @bwplotka! Should be resolved and rebased on master.
    defer func() {
    if err != nil {
    return errors.Wrap(err, "create bucket client")
    runutil.CloseWithLogOnErr(logger, bkt, "bucket client")
I think we should remove the `if err != nil` check. Because it runs in a defer, `err` will have been overwritten by unrelated code that runs after this line.
I'd say it was meant to close the client only if runStore ended with an error.
Otherwise it should not close the client, since others are still using it.
Maybe that's why the `if err != nil` is there, and `err` being overwritten by the following code was intentional?
I did not actually touch this code, though; I just removed the unneeded code block wrapping it.
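To make the point under discussion concrete, here is a minimal Go sketch of how a deferred closure sees `err`. It is not the Thanos code: `resource`, `run` and `step` are hypothetical stand-ins, and the sketch assumes a named result `err`, which may differ from how runStore declares it. The deferred check sees whatever value the outer `err` holds when the function returns, while an `err` declared with `:=` in an inner scope is a separate variable.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a hypothetical stand-in for the bucket client.
type resource struct{ closed bool }

func (r *resource) Close() { r.closed = true }

// step is a hypothetical operation that can fail on demand.
func step(fail bool) (int, error) {
	if fail {
		return 0, errors.New("step failed")
	}
	return 1, nil
}

// run closes r only when it returns a non-nil error. The deferred closure
// reads the named result err at return time, not at the time of the defer.
func run(r *resource, fail bool) (err error) {
	defer func() {
		if err != nil {
			r.Close()
		}
	}()

	// This err is shadowed: it lives only inside the if statement and does
	// not overwrite the named result unless it is explicitly returned.
	if _, err := step(false); err != nil {
		return err
	}

	// This assignment does overwrite the named result seen by the defer.
	_, err = step(fail)
	return err
}

func main() {
	ok, bad := &resource{}, &resource{}
	fmt.Println(run(ok, false), ok.closed)  // <nil> false
	fmt.Println(run(bad, true), bad.closed) // step failed true
}
```

With this shape the resource stays open on success, so other users can keep it, and is closed on any error path, which matches the intent described in the reply above.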
    cancel()
    ctx, cancel := context.WithCancel(context.Background())
    g.Add(func() error {
    defer runutil.CloseWithLogOnErr(logger, bkt, "bucket client")
This is a bit confusing: we were already closing bkt earlier, so why are we closing it here again?
Signed-off-by: Martin Chodur <m.chodur@seznam.cz>
@kakkoyun PTAL I removed the |
Signed-off-by: Martin Chodur <m.chodur@seznam.cz> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Some updates to compact docs Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* some formatting Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Update docs/components/compact.md accept PR suggestions Co-Authored-By: Bartlomiej Plotka <bwplotka@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Add metalmatze to list of maintainers (#1547) Signed-off-by: Matthias Loibl <mail@matthiasloibl.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* resolve comments Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* resolve last comment Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* receive: Add liveness and readiness probe (#1537)
  * Add prober to receive Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Add changelog entries Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Update README Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Remove default Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Wait hashring to be ready Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* downsample: Add liveness and readiness probe (#1540)
  * Add readiness and liveness probes for downsampler Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Add changelog entry Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Remove default Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Set ready Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Update CHANGELOG Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com>
  * Clean CHANGELOG Signed-off-by: Kemal Akkoyun <kakkoyun@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Document the dnssrvnoa option (#1551) Signed-off-by: Antonio Santos <antonio@santosvelasco.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* feat store: added readiness and livenes prober (#1460) Signed-off-by: Martin Chodur <m.chodur@seznam.cz> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Add Hotstar to adopters. (#1553) It's the largest streaming service in India that does cricket and GoT for India. They have insane scale and are using Thanos to scale their Prometheus. Spoke to them offline about adding the logo and will get a signoff here too. Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Fix hotstar logo in the adoptor's list (#1558) Signed-off-by: Karthik Vijayaraju <karthik@hotstar.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Fix typos, including 'fomrat' -> 'format' in tracing.config-file help text. (#1552) Signed-off-by: Callum Styan <callumstyan@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Compactor: Fix for #844 - Ignore object if it is the current directory (#1544)
  * Ignore object if it is the current directory Signed-off-by: Jamie Poole <jimbobby5@yahoo.com>
  * Add full-stop Signed-off-by: Jamie Poole <jimbobby5@yahoo.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Adding doc explaining the importance of groups for compactor (#1555) Signed-off-by: Leo Meira Vital <leo.vital@nubank.com.br> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Add blank line for list (#1566) The format of these files is wrong in the web. Signed-off-by: dongwenjuan <dong.wenjuan@zte.com.cn> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* Refactor compactor constants, fix bucket column (#1561)
  * compact: unify different time constants Use downsample.* constants where possible. Move the downsampling time ranges into constants and use them as well. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
  * bucket: refactor column calculation into compact Fix the column's name and name it UNTIL-DOWN because that is what it actually shows - time until the next downsampling. Move out the calculation into a separate function into the compact package. Ideally we could use the retention policies in this calculation as well but the `bucket` subcommand knows nothing about them :-( Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
  * compact: fix issues with naming Reorder the constants and fix mistakes. Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com> Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
* remove duplicate Signed-off-by: Ivan Kiselev <kiselev_ivan@pm.me>
Signed-off-by: Martin Chodur <m.chodur@seznam.cz> Signed-off-by: Giedrius Statkevičius <giedriuswork@gmail.com>
Added the prober to yet another component. Store this time.
The store does the bucket initialization and sync synchronously, so the default HTTP server starts only once that has finished. At that point the store becomes healthy (live), and once the gRPC server starts up it also becomes ready.
Personally, I'd much rather have the liveness probe answer right from the start, during the bucket init, since the current way leaves the bucket sync duration to be covered by some initial back-off of an orchestrator.
Changes
Verification
Started it and verified it gets ready and healthy as expected.