[Ruler][DNS] Don't propagate no such host error if using default resolver#3257
bwplotka merged 11 commits into thanos-io:master
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Should there be a debug log entry when this happens? 🤔
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
bwplotka
left a comment
Thanks!
It's not only the ruler that uses this; the Querier and other components and projects, including Cortex, use it too.
I wonder if this makes sense, maybe putting it behind some option would do 🤔
"No host found" is what you get when you cannot reach the DNS host, and that is definitely a configuration, server, or plain access failure. Imagine you are rolling out to a new cluster and somehow a pod cannot reach the DNS server. By masking this error we would never know about it, right?
What exactly do you want to achieve, @OGKevin? What's the use case? (:
AFAIK, "host not found" is also returned when the DNS server is reachable. Have a look at this article for example. The DNS server is working fine, the DNS name just does not exist, hence the NXDOMAIN, which in Go results in the "no such host" error this PR is about. The issue linked in #3186 has more context on the use case that we want to avoid. Though, this use case is ruler-specific 🤔 and I'm not sure how other components handle this.
bwplotka
left a comment
OK, I checked a bit and it looks like IsNotFound is indeed set when EAI_NONAME is returned by the Linux resolver (getaddrinfo), so you might be right.
Still, I would consider this an error, potentially a configuration error (imagine you made a typo when crafting a DNS target for Alertmanager).
I think the main question regarding your issue is really whether a ruler should just report the error and continue running or not. This decision might be orthogonal to this change and in fact depends on what it resolves for (key functionality or not) 🤔
Some more opinions are welcome, but I would rather stop crashing the ruler on a failure like this at startup, while still logging and instrumenting it as a failure overall.
The problem might not only be a config issue. If the Alertmanagers were up and running fine and the DNS name becomes invalid because all pods are down, ruler will crashloop without any config change having been made to ruler. So IMO ruler should definitely not crash because the Alertmanagers can't be resolved, since ruler is not only responsible for evaluating alerting rules. I also agree that there should be a log entry saying that DNS resolution failed. As for the other issues, e.g. how this impacts the other components, I can't make a clear assessment. What we could do is add a flag and set it to "true" by default for ruler and "false" for the other components, so that the behaviour only changes for ruler.
Hm, a crashloop will actually apply all config changes, right?
Let me try to rephrase: ruler will crashloop, but not because of a config change made by a human to ruler. Ruler goes into a crash loop because a dependency, Alertmanager, is down: the k8s service FQDN resolution returns NXDOMAIN (or equivalent) and ruler dies. Subsequently, ruler won't start up because of the DNS resolution failure and eventually ends up in a crash loop.
Line 418 in e4941a5
Lines 792 to 798 in e4941a5
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
87462b9 to e30a981
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
The link that is being reported as dead is not actually dead. I don't quite understand why e2e failed 🤔
pstibrany
left a comment
Change to dns package looks good to me.
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
bwplotka
left a comment
Awesome, thanks for offline chat, I think this makes sense, but let's make it a normal behaviour.
Some tests would be nice as well (:
pkg/discovery/dns/resolver.go
Outdated
    if dnsErr, ok := err.(*net.DNSError); !ok || !dnsErr.IsNotFound || s.returnErrOnNotFound {
        return nil, errors.Wrapf(err, "lookup IP addresses %q", host)
    }
    level.Error(s.logger).Log("msg", "failed to lookup IP addresses", "host", host, "err", err)
Let's delete this, or if we keep it at all, let's have one for both the miekg and Go DNS resolvers.
Hmm, I don't quite understand this comment 🤔. Can you elaborate a little bit more?
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
bwplotka
left a comment
I think I am fine with this consistency check, thanks!
Can you rebase? 🤗
a9489a3 to 19ab4a3
@bwplotka I could not rebase, as I had merged master into this branch earlier. Rebasing now would cause a headache. Do you want me to squash this into one commit?
Nah, no need, GH squashes this. Thanks, and sorry for the long discussion (:
Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
Fixes #3186
Changes
Verification