Added logging for dns resolution error in the Query component as well#2525
Added logging for dns resolution error in the Query component as well#2525bwplotka merged 9 commits intothanos-io:masterfrom yashrsharma44:query-component-issue
Conversation
There was a problem hiding this comment.
Could we please refactor this to use errors.MultiError instead of an array of error? AFAIK we generally try to avoid using []error as you can see in the current code-base. That other type provides a nicer and unified format IMHO. The code would simpler as well since the for-loop would be contained inside of MultiError.Error() + we'd have one, simple message about all failures instead of spam :P
Thanks for the review! I guess that sounds elegant, as the error log certainly seems a bit clogged. Making the changes right away! 😄 |
bwplotka
left a comment
There was a problem hiding this comment.
Thanks! Some comments and bug challenge =D
CHANGELOG.md
Outdated
| ### Fixed | ||
|
|
||
| - [#2501](https://github.com/thanos-io/thanos/pull/2501) Query: gracefully handle additional fields in `SeriesResponse` protobuf message that may be added in the future. | ||
| - [#2525](https://github.com/thanos-io/thanos/pull/2525) Query: Added logging for dns resolution error in the `Query` component. |
There was a problem hiding this comment.
So fixed or added? (; You put this in Fixed section , but we said "Added" here... ^^
I think it's fixed as we fix issue
There was a problem hiding this comment.
Sure, some sloppy grammar from my side :P
cmd/thanos/query.go
Outdated
| stores.Update(ctxUpdate) | ||
| dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)) | ||
| if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil { | ||
| level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error()) |
There was a problem hiding this comment.
| level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err.Error()) | |
| level.Error(logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) |
cmd/thanos/query.go
Outdated
| fileSDCache.Update(update) | ||
| stores.Update(ctxUpdate) | ||
| dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)) | ||
| if err := dnsProvider.Resolve(ctxUpdate, append(fileSDCache.Addresses(), storeAddrs...)); err.Err() != nil { |
There was a problem hiding this comment.
There is quite big bug here, but CI should help you to find what.
I will not say exactly what's wrong for educational purpose, please try to get from CI what's wrong you should be able to quickly spot this! 🤗 Let me know if you are blocked.
There was a problem hiding this comment.
Challenge accepted!
cmd/thanos/rule.go
Outdated
| return runutil.Repeat(interval, ctx.Done(), func() error { | ||
| c.Resolve(ctx) | ||
| return nil | ||
| err := c.Resolve(ctx) |
There was a problem hiding this comment.
Nice! No need for separate variable though.
Please see https://thanos.io/contributing/coding-style-guide.md/#avoid-defining-variables-used-only-once
pkg/cacheutil/memcached_client.go
Outdated
| // Fail in case no server address is resolved. | ||
| servers := c.dnsProvider.Addresses() | ||
| if len(servers) == 0 { | ||
| if len(servers) == 0 || err != nil { |
There was a problem hiding this comment.
I think we want to always check error, no?
There was a problem hiding this comment.
Sure, made a note.
pkg/discovery/dns/provider.go
Outdated
| func (p *Provider) Resolve(ctx context.Context, addrs []string) { | ||
| func (p *Provider) Resolve(ctx context.Context, addrs []string) tsdberrors.MultiError { | ||
| resolvedAddrs := map[string][]string{} | ||
| errorList := tsdberrors.MultiError{} |
There was a problem hiding this comment.
We use to use just errs name for those
pkg/discovery/dns/provider.go
Outdated
| errorList.Add(err) | ||
| // The DNS resolution failed. Continue without modifying the old records. | ||
| p.resolverFailuresCount.Inc() | ||
| level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err) |
There was a problem hiding this comment.
Ok this wrong.
The problem is that we handle errors TWICE. One time here, then we pass errs somewhere. This will mean that we probably will log errors couple of times and just add confusion... Same with metric which is even more risky. Let's handle error just once. Either handle it all (just log), or pass further... not both! (:
What do you think?
There was a problem hiding this comment.
I think that we should not log error in the dns/provider.go, rather the error should be logged by the function requesting it, so that we can clearly log where the error is effecting which function.
So provider.go would return the failed list of errors and the requesting function would log all the errors.
Does it sound fair?
There was a problem hiding this comment.
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
There was a problem hiding this comment.
Okay, so I forgot to mention that, I am not logging the error now in provider.go, rather I am returning it(in my current change that I will push), and the receiver would handle that error(by logging it).
There was a problem hiding this comment.
Does the caller ("requesting function") needs to act in specific way for those errors?
If we can log in the scope of the requesting function, that would be helpful in understanding which function request resulted in the error. Other than that, there is no specific behaviour that I am looking to intend.
There was a problem hiding this comment.
I understand this, but still what would be the solid answer to the question:
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
(: ?
There was a problem hiding this comment.
If we log, and we do that only for this error, do we really need to return this error? Does the caller ("requesting function") needs to act in specific way for those errors?
As I have understood this question, I think we should return the error and the requesting function should log the error. If you have found any function which is doing both, then please highlight that, I might have missed that 😅
pkg/discovery/dns/provider_test.go
Outdated
| testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+a"))) | ||
| testutil.Equals(t, float64(2), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+b"))) | ||
| testutil.Equals(t, float64(1), promtestutil.ToFloat64(prv.resolverAddrs.WithLabelValues("any+c"))) | ||
| if err := prv.Resolve(ctx, []string{"any+x"}); err != nil { |
There was a problem hiding this comment.
Why this pattern?
With this if there is error on ALL resolves tests will be still green.
Why not using just testutil.Ok(t, prv.Resolve...)?
There was a problem hiding this comment.
Great! I didn't had a look at the testutil package, the construct is quite elegant.
pkg/http/http.go
Outdated
|
|
||
| type AddressProvider interface { | ||
| Resolve(context.Context, []string) | ||
| Resolve(context.Context, []string) tsdberrors.MultiError |
There was a problem hiding this comment.
You should never return concrete type of error in Go.
| Resolve(context.Context, []string) tsdberrors.MultiError | |
| Resolve(context.Context, []string) errors |
There was a problem hiding this comment.
Sure, reverted it.
pkg/http/http.go
Outdated
| func (c *Client) Resolve(ctx context.Context) { | ||
| c.provider.Resolve(ctx, append(c.fileSDCache.Addresses(), c.staticAddresses...)) | ||
| func (c *Client) Resolve(ctx context.Context) tsdberrors.MultiError { | ||
| // Skip the error as it would be logged in the corresponding component. |
There was a problem hiding this comment.
Also what do you mean skip? We are not skipping , we are passing it further so handling properly
There was a problem hiding this comment.
I have changed it.
|
Thanks for the suggestion, I will make the changes 😄 |
bwplotka
left a comment
There was a problem hiding this comment.
Amazing! Last suggestions and LGTM (:
pkg/cacheutil/memcached_client.go
Outdated
|
|
||
| // If some of the dns resolution fails, log the error. | ||
| if err := c.dnsProvider.Resolve(ctx, c.config.Addresses); err != nil { | ||
| level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) |
There was a problem hiding this comment.
Maybe worth mentioning addresses?
| level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "err", err) | |
| level.Error(c.logger).Log("msg", "failed to resolve addresses for storeAPIs", "addresses", strings.Join(c.config.Addresses,","), "err", err) |
There was a problem hiding this comment.
Sure, that looks neat!
pkg/discovery/dns/provider.go
Outdated
| p.resolverAddrs.Submit() | ||
|
|
||
| p.resolved = resolvedAddrs | ||
| if len(errs) > 0 { |
There was a problem hiding this comment.
Power tip: we can do just returns. errs.Err() (: Without check! (check will be done on err)
There was a problem hiding this comment.
Ohh great, let me add it up!
Signed-off-by: Yash <yashrsharma44@gmail.com> Nitpicking Signed-off-by: Yash <yashrsharma44@gmail.com> More nitpicking Signed-off-by: Yash <yashrsharma44@gmail.com> Removed the errorlist from lock scope for faster access Signed-off-by: Yash <yashrsharma44@gmail.com> Added tsdb.MultiError for error handling Signed-off-by: Yash <yashrsharma44@gmail.com> Fixed some import issues Signed-off-by: Yash <yashrsharma44@gmail.com> Fixed linting Signed-off-by: Yash <yashrsharma44@gmail.com> Changelog Signed-off-by: Yash <yashrsharma44@gmail.com> Linting issues Signed-off-by: Yash <yashrsharma44@gmail.com> More linting issues Signed-off-by: Yash <yashrsharma44@gmail.com> fixed import error Signed-off-by: Yash <yashrsharma44@gmail.com> Reverted some changes Signed-off-by: Yash <yashrsharma44@gmail.com> Added some more error checks Signed-off-by: Yash <yashrsharma44@gmail.com> Fixed some failing tests Signed-off-by: Yash <yashrsharma44@gmail.com> Linting Signed-off-by: Yash <yashrsharma44@gmail.com> Removed extra error param for logging Signed-off-by: Yash <yashrsharma44@gmail.com> Changelog modification Signed-off-by: Yash <yashrsharma44@gmail.com> Fixed unused variable Signed-off-by: Yash <yashrsharma44@gmail.com> Logged failed dns resolution in the memcached_client Signed-off-by: Yash <yashrsharma44@gmail.com> Changed the errorList to errs Signed-off-by: Yash <yashrsharma44@gmail.com> Changed concrete error type to primitive one Signed-off-by: Yash <yashrsharma44@gmail.com> Added logging of dns resolution failure in `http.go` Changed the error message in the `memcached_client.go` Signed-off-by: Yash <yashrsharma44@gmail.com> Removed logging of errors in . Now it would return error list, which is to be handled by the receiver Signed-off-by: Yash <yashrsharma44@gmail.com> Removed Err() method Signed-off-by: Yash <yashrsharma44@gmail.com> Fixed import errors Signed-off-by: Yash <yashrsharma44@gmail.com> Removed extra variables Signed-off-by: Yash <yashrsharma44@gmail.com> Linting Signed-off-by: Yash <yashrsharma44@gmail.com> Check to ensure that return error if list of error is non-empty Signed-off-by: Yash <yashrsharma44@gmail.com> Modified tests of provider_test.go Signed-off-by: Yash <yashrsharma44@gmail.com> more checks Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
…rror Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
Signed-off-by: Yash <yashrsharma44@gmail.com>
|
Thanks! |
Signed-off-by: Yash yashrsharma44@gmail.com
Changes
This PR addresses the issue #2404
Verification
I had run the changes, and the Query component also logs the failure of the dns resolution. Earlier, only the
provider.gologged the issue, now both are logging the error.Comments
I would like to discuss this approach, and see if this approach matches the requirements, as this is my first technical PR to Thanos! 🤓