@rzikm
Member

@rzikm rzikm commented Oct 4, 2024

Description

This PR brings in a C# implementation of a DNS resolver that is able to signal the TTL information together with the query results.

Main features

  • Async network I/O, fully cancellable
  • Mockable
  • Resolves IP Addresses (A/AAAA records) and Service records (SRV + related A/AAAA)
  • Transparent fallback to TCP
  • Autodetection of OS settings (i.e. reads nameservers from /etc/resolv.conf file)
  • Thread-safe
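
As a rough illustration of the OS-settings bullet above, the sketch below pulls `nameserver` entries out of resolv.conf-style text. `ResolvConfSketch` and `ParseNameServers` are hypothetical names chosen for this example and are not taken from the PR; the actual implementation may handle more directives and edge cases.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;

static class ResolvConfSketch
{
    // Hypothetical helper (not the PR's code): collects the addresses listed on
    // "nameserver <address>" lines and ignores everything else, including
    // comment lines that start with '#' or ';'.
    public static List<IPAddress> ParseNameServers(TextReader reader)
    {
        var servers = new List<IPAddress>();
        string? line;
        while ((line = reader.ReadLine()) is not null)
        {
            string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length >= 2
                && tokens[0] == "nameserver"
                && IPAddress.TryParse(tokens[1], out IPAddress? address))
            {
                servers.Add(address);
            }
        }

        return servers;
    }
}
```

Taking a `TextReader` rather than a file path keeps the parser easy to exercise with in-memory text, for example `ParseNameServers(new StringReader(text))`.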

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?
    • Yes
      • Link to aspire-docs issue:
    • No
Microsoft Reviewers: Open in CodeFlow

@dotnet-policy-service dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Oct 4, 2024
@davidfowl davidfowl requested a review from ReubenBond October 29, 2024 21:16
Member

@BrennanConroy BrennanConroy left a comment

Only reviewed a subset of files. It's a large PR and has a lot of RFC reading attached to it. Will review more later.


}

internal static bool TryReadQName(ReadOnlySpan<byte> messageBuffer, int offset, [NotNullWhen(true)] out string? name, out int bytesRead)
Member

messageBuffer doesn't have any endianness normalization when it's passed in. Is that problematic?

Member Author

The only multibyte information is the offset in pointers, and we read the higher and lower bytes separately and combine them correctly. Otherwise we only ever read individual bytes or (ASCII) strings, so endianness should not matter.
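
To make the byte-combining argument concrete, here is a minimal sketch of decoding an RFC 1035 compression pointer. `DnsPointerSketch` and `TryReadPointer` are hypothetical names for illustration only; this is not the code under review.

```csharp
using System;

static class DnsPointerSketch
{
    // Hypothetical helper (not the PR's code): decodes an RFC 1035 compression
    // pointer, i.e. two bytes whose top two bits are 11, into a 14-bit offset
    // from the start of the message.
    public static bool TryReadPointer(ReadOnlySpan<byte> message, int offset, out int target)
    {
        target = 0;
        if (offset < 0 || offset > message.Length - 2)
        {
            return false; // not enough bytes left for a two-byte pointer
        }

        byte high = message[offset];
        byte low = message[offset + 1];
        if ((high & 0xC0) != 0xC0)
        {
            return false; // not a pointer (for example, an ordinary label length)
        }

        // The two bytes are combined explicitly with shifts, so the result is
        // the same regardless of the host's endianness.
        target = ((high & 0x3F) << 8) | low;
        return true;
    }
}
```

Because the value is assembled from individual bytes rather than reinterpreted from memory as a 16-bit integer, no byte-order normalization of the buffer is needed.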


if (responseLength > buffer.Length)
{
var largerBuffer = ArrayPool<byte>.Shared.Rent(responseLength);
Member

This is kind of dangerous: user-controlled pre-allocation. Since we're limited to 65k here you could argue it's not too risky, but we should probably explicitly call it out in a comment and make sure we're okay with the risk.

Technically, the array pool (ConfigurableArrayPool) could return a buffer of 65k * 2 * 2 bytes (or more), since the implementation is free to search multiple bucket sizes to find a pooled byte[].

Member Author

I am curious, if we were to decide it's too risky, what would you suggest as a mitigation? We need the entire message in memory for parsing purposes (domain name compression via pointers with offsets), so I don't see us reading the contents of the message from a stream.
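
For readers following this thread, the 65k limit comes from the two-byte length prefix that DNS over TCP puts in front of each message. The sketch below shows one way to make that bound explicit before renting. `TcpResponseBufferSketch` and its members are hypothetical names; this illustrates the concern and is not the mitigation the reviewers settled on.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

static class TcpResponseBufferSketch
{
    // The TCP length prefix is a 16-bit field, so 65,535 bytes is the hard ceiling
    // for what a peer can make us allocate per message.
    public const int MaxTcpMessageLength = ushort.MaxValue;

    // Hypothetical helper: parses the two-byte, big-endian length prefix.
    public static int ReadLengthPrefix(ReadOnlySpan<byte> prefix)
        => BinaryPrimitives.ReadUInt16BigEndian(prefix);

    // Hypothetical helper: rents a buffer for a peer-declared message length,
    // rejecting values outside the range the wire format allows.
    public static byte[] RentForResponse(int responseLength)
    {
        if (responseLength <= 0 || responseLength > MaxTcpMessageLength)
        {
            throw new ArgumentOutOfRangeException(nameof(responseLength));
        }

        // Rent may hand back an array larger than requested, so callers should
        // slice to responseLength instead of relying on the array's Length.
        return ArrayPool<byte>.Shared.Rent(responseLength);
    }
}
```

A caller would typically pair `RentForResponse` with `ArrayPool<byte>.Shared.Return` in a `finally` block so the buffer goes back to the pool even if parsing fails.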

@danmoseley
Member

@rzikm could you set expectations on this one ?

@danmoseley danmoseley added the area-service-discovery and needs-author-action (An issue or pull request that requires more info or actions from the author.) labels Feb 2, 2025
@rzikm
Member Author

rzikm commented Feb 3, 2025

@rzikm could you set expectations on this one ?

In parallel to this, I am also finishing the Threat Model Review and will need to set up fuzzing. If there are no major distractions, I hope to finish the work this month.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action label (An issue or pull request that requires more info or actions from the author.) Feb 3, 2025
@davidfowl
Member

@rzikm appreciate your work on this, but we don't have any plans to go in this direction in the short term. I'll close the PR, but feel free to continue evolving this.

@davidfowl davidfowl closed this Jun 14, 2025
@davidfowl davidfowl reopened this Jun 16, 2025
@davidfowl
Member

Reopening as I forgot we do want this change as a replacement for the DNS client in the service discovery package!

@Dona278
Contributor

Dona278 commented Jun 16, 2025

Thank you @davidfowl. I had just started writing a comment about the issues I've encountered with the current DNS client, sincerely hoping this PR would land!

@rzikm
Member Author

rzikm commented Jun 18, 2025

CI is green. I expect to merge this within a week, once I confirm there are no outstanding action items from the Threat Model perspective.

@rzikm
Member Author

rzikm commented Jun 25, 2025

The last open items from the Threat Model Review discussions have been closed. This is good to merge after I merge the latest main.

@rzikm rzikm merged commit 416e077 into dotnet:main Jun 25, 2025
252 checks passed
@rzikm rzikm changed the title from "[WIP] Managed implementation of DNS resolver" to "Managed implementation of DNS resolver" Jun 25, 2025
@davidfowl
Member

@rzikm need you on standby in case something breaks 😄. This code has not changed in a LONG time and we just made this massive change in a minor update with no fallback. I think we should put back the old implementation and have it be possible to opt in.

@davidfowl davidfowl added the breaking-change label (Issue or PR that represents a breaking API or functional change over a prerelease.) Jun 25, 2025
@rzikm
Member Author

rzikm commented Jun 25, 2025

@davidfowl not that I expect stuff to break, but I expect to be on standby in case something does break.

as for opt-in vs opt-out, we can easily toggle between Managed and System.Net.Dns for A/AAAA records. For SRV records, we would have to put back the DnsClient dependency. Let me know which variation of this (and the opt-in/out mechanism) you prefer and I can put up a follow-up PR.

@davidfowl
Member

Add the DnsClient implementation back and we will keep it behind a feature flag for 1 or 2 minor versions then delete it later.

@Dona278
Contributor

Dona278 commented Jun 25, 2025

I can try it and leave feedback, maybe tomorrow. We already used the previous version and hit a lot of issues with missing resolution in kubernetes with gRPC headless services.

@rzikm
Member Author

rzikm commented Jun 25, 2025

missing resolution in kubernetes

Speaking of kubernetes, I noticed #9913 during local testing. I am very curious whether it is really a me-problem or whether something can be done to widen the pit of success for aspir8 users.

@Dona278
Contributor

Dona278 commented Jun 25, 2025

We don't use aspir8 to generate the manifest, and the URL used for service discovery is composed from the named port and service name during setup, so I can't test it.
But our next step will be switching to aspire publish to generate the manifest, so I will take a look.
Regards the "old" version issue (encountered many times in a day) was "DNS name: '_portName._tcp.service-name.default.svc.cluster.local'): Non-Existent Domain" even with pods ready and with service with only 1 pod without errors or restart. I hope this refactoring will fix this.

@MichaCo

MichaCo commented Jul 19, 2025

Reopening as I forgot we do want this change as a replacement for the DNS client in the service discovery package!

Hi @davidfowl, @Dona278 and @rzikm
As the author of the DNS client library you used here so far, I'm really curious to hear about what the reasons are to start re-implementing your own version of a DNS resolver.

If there are technical reasons and issues I would really like to know for selfish reasons, to improve my library ofc!
I did not find any reported issues or discussions from any of you guys in my repo though.

If there are other motivations, that's fine, I don't really care much if you use my library or not.
But I'm open to collaborate with you guys if you want to.

I actually remember having a discussion with one of your teammates from Prague last year; I think that was more about your plans to add a configurable DNS resolver to the .NET framework.

Anyways, if there is something I can do to help, let me know.


Regarding

Regards the "old" version issue (encountered many times in a day) was "DNS name: '_portName._tcp.service-name.default.svc.cluster.local'): Non-Existent Domain" even with pods ready and with service with only 1 pod without errors or restart. I hope this refactoring will fix this.

Non-Existent Domain is a response error. That means, the DNS server you asked didn't know the answer and responded with that error code. It is very unlikely that a different resolver would change that outcome because it would receive the exact same response.
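
To illustrate why the resolver implementation is unlikely to matter here: the error is carried in the response header itself. Per RFC 1035, the header is 12 bytes and the low four bits of its fourth byte hold the RCODE, where 3 means Name Error (NXDOMAIN). A minimal sketch, with `DnsHeaderSketch` and `TryGetResponseCode` as hypothetical names that do not come from either library:

```csharp
using System;

static class DnsHeaderSketch
{
    public const int HeaderLength = 12;   // fixed DNS header size (RFC 1035)
    public const int RcodeNameError = 3;  // "Non-Existent Domain" (NXDOMAIN)

    // Hypothetical helper: extracts the 4-bit response code from a raw DNS message.
    public static bool TryGetResponseCode(ReadOnlySpan<byte> message, out int rcode)
    {
        rcode = 0;
        if (message.Length < HeaderLength)
        {
            return false; // too short to contain a full header
        }

        // Byte 3 holds RA (1 bit), Z (3 bits) and RCODE (low 4 bits).
        rcode = message[3] & 0x0F;
        return true;
    }
}
```

Any client that receives the same bytes from the server reads the same RCODE, so the difference between resolvers would have to come from which server is asked, what name is sent, or how retries are handled.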

Digging a bit on Google about issues with CoreDNS (the DNS system used by Kubernetes), I found a bunch of reasons why you might get those kinds of errors.
Do you have logs from CoreDNS when you are getting errors?

Maybe we just need some retry logic here in the app specific code to account for service changes? After all, this is a distributed system which might have some latency.
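
A rough sketch of what such app-level retry logic could look like, assuming a lookup delegate that returns null when the name does not resolve. `LookupRetrySketch`, `RetryAsync`, and the delegate shape are hypothetical and are not part of DnsClient or the service discovery package.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

static class LookupRetrySketch
{
    // Hypothetical helper: retries a lookup that yields null on failure,
    // doubling the delay between attempts.
    public static async Task<T?> RetryAsync<T>(
        Func<CancellationToken, Task<T?>> lookup,
        int maxAttempts,
        TimeSpan initialDelay,
        CancellationToken cancellationToken = default) where T : class
    {
        TimeSpan delay = initialDelay;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T? result = await lookup(cancellationToken).ConfigureAwait(false);
            if (result is not null || attempt == maxAttempts)
            {
                return result; // success, or the final failed attempt
            }

            await Task.Delay(delay, cancellationToken).ConfigureAwait(false);
            delay += delay; // exponential backoff: 1x, 2x, 4x, ...
        }

        return null; // only reached when maxAttempts is zero or negative
    }
}
```

Whether retrying is appropriate depends on the cause: it can paper over short propagation delays after a service change, but it will not help if the wrong DNS server is being queried.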

Maybe the resolver uses the wrong DNS server because of some bad system configuration?

That being said, I do not have much experience with Kubernetes because I'm not using it at work, so what do I know ;)
But if you can tell me how to reproduce those kinds of issues, I can take a look.

Thanks
Michael

@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2025

Labels

area-service-discovery
breaking-change (Issue or PR that represents a breaking API or functional change over a prerelease.)
community-contribution (Indicates that the PR has been added by a community member)

6 participants