refactor: Improvements to machine proxy calls and response metadata#247
refactor: Improvements to machine proxy calls and response metadata#247jabr wants to merge 34 commits into
Conversation
|
Thank you for taking a look at this which I agree needs more love and clean up. The direction looks good but I think we can do even better by hiding all the details in the grpc-proxy layer: https://github.com/psviderski/uncloud/tree/79dc05cb665d6541ea2ff1ca16af4119ae6b47d4/internal/machine/api/proxy I left this TODO to explore that path: Line 86 in 79dc05c The idea is to move the machine resolution logic from the client to the server which should simplify the API and reduce the number of requests. At the moment uncloud/internal/machine/api/proxy/director.go Lines 41 to 85 in da3634b It's simple and agnostic from knowing anything about machine IDs/names and how to map them to network addresses. The obvious downside is that now every client needs to handle this. Another downside is when logging errors we again need to map network addresses to meaningful machine names. Now, when the proxy layer works pretty well, we can iterate on it and make it smarter. How about we add a mapper to the The client will be able to set Something similar needs to be done for responses as well. We can map ipv6 addresses back to machine IDs/names and set them in the response uncloud/internal/machine/api/pb/common.proto Lines 10 to 19 in b236994 This way the client will only work with and see machine IDs/names everywhere. Moreover, we can even implement a shortcut for broadcasting to all machines, e.g. set something like I also remember we have non-trivial error handling logic when broadcasting requests to multiple machines. I hope it can also be simplified by unifying the proxy layer. What do you think about this approach? |
|
Yeah, that makes sense. It was a little unwieldy to wrap the current implementation, and that probably should have been a sign that I'd missed a better refactor. 😆 I'll take a look at moving it into the |
|
Awesome! Let me know if you need any guidance or want to discuss something |
|
Curious about this bit in I'm guessing that is where the special case for "one machine" Just wondering if that optimization is worth keeping... how much overhead does the |
|
I can't confidently say off the top of my head. Maybe you can't simply use uncloud/internal/machine/api/pb/docker.proto Lines 82 to 85 in 79dc05c I was mainly replicating the usage in Talos Linux: https://github.com/siderolabs/talos/blob/e48c6d7ab9c8a2e28ebe2115ac09f1557bbcca33/internal/app/apid/pkg/director/director.go#L50-L94 and didn't try to optimise it since then. |
|
Okay, I had the AI take a pass at the server-side machine id/name resolution. It's got some rough spots still, but the basic idea is there now. edit: I think there's a simpler solution to the comments below with a fairly simple refactor to the "machine directory" and how the proxy director/backends use it. See #247 (comment) One deeper issue I noticed: the directory maintains a cache of machine backends, which now have the name (along with ID as well and the management IP addr they originally had), but that isn't updated if a machine is renamed. (Can the IP or ID also change?) Thinking the proxy backends might need to interact with the new Perhaps it should work like the Cluster code does with a machine list and then subscribe for updates to maintain its peer list. And if so, maybe those two pieces should be sharing a common, dynamic corrosion machine list... |
No, the machine ID is generated and stored once when the machine is initialised as well as the IP (which is derived from the first 14 bytes of the machine public key): uncloud/internal/machine/network/ip.go Lines 16 to 22 in 35d0a90 The public key cannot be changed as well.
Yeah, I think abstracting these It can expose just one method: "get a backend by machine ID/name". And it can cache them internally as We can start passing the machine ID/name to the remote backend so that it includes it in the metadata as is: . So if a user passed an ID in the requests, the same ID will come in the response. Same for the name if we want to support both.Note one more things is that the grpc-proxy establishes permanent connections to backends and doesn't close them (well, maybe there is a long deadline but I don't know the details). If a client communicated with a machine and it gets renamed later, the proxy will still hold a connection to it. I don't know how it identifies if it needs to establish a new grpc connection or not, maybe just based on the |
Ah, I see. Some of these ProxyMachines uses are always just calling with one machine and their proto doesn't have the metadata field... hmm. Three options coming to mind:
Personally, I like option 3 (feels like it'll be the smaller change between 2 and 3), and I'd really like to get rid of the current option 1, nil check cases. |
|
100% agree that metadata nil check is annoying. I'm not sure I understand what you meant in option 3. Can you please elaborate? So if the metadata is missing only for a one2one request, can we explicitly set it for such a request to behave the same as one2many? Not sure what is actually responsible for that, This is what I guess I was alluding to here: Lines 76 to 81 in 1e6aaf4 Then one2one won't be an edge case and could be handled in the same way as one2many in the client (always expecting the |
|
Ah, we probably can't do this for a generic case because the proxy layer doesn't know the type of a response whether it includes a |
Yeah, my option three was to basically have the cli caller tell the server if it wants metadata or not. And since the “no metadata” cases seem to always be “proxy to a single machine”, I was thinking it might just be a different key used in the call:
|
|
Right, thanks for clarifying. Hm, that sounds clever. Ah, isn't this the reason why Talos also supports both |
| localID string | ||
| localName string |
There was a problem hiding this comment.
Remove. No longer used.
|
Still needs a little more cleanup, but I added the machine/machines grpc call modes (for the no metadata cases) and refactored the Director/Mapper machine lookup (and how backends work with that and returning current machine name). The overall approach is looking much better now, I think.
Yeah, I think so. That Talos code looks vaguely similar to the new Director implementation, and it definitely seems to be dealing with the same sorts of issues (one2one vs one2many proxying for different cases). |
f90575a to
3336ee2
Compare
… some misc cleanup
…tored to use helper functions due to machine status filtering logic
…chine names/ids that are not found
092d464 to
4940525
Compare
…d be an error and a wildcard with no machines in the cluster (somehow)
…case of data race on access
…already available)
| MinClientVersion = "0.0.0" | ||
| MinServerVersion = "0.0.0" | ||
| MinClientVersion = "0.20.0" | ||
| MinServerVersion = "0.20.0" |
There was a problem hiding this comment.
Note: we'll want to bump these versions to whatever release this ends up going out in.
Update: This is now complete, I believe, and ready for review. I have my cluster running on it now with no problems so far.
Moves the machine id/name resolution to the server and returns metadata with machine addr, id, and name for responses (where appropriate). This streamlines machine name mapping and error and edge case handling of
ProxyMachinesContextto look like this: