
refactor: Improvements to machine proxy calls and response metadata #247

Open
jabr wants to merge 34 commits into psviderski:main from jabr:improved-proxy-machines-context


@jabr
Contributor

@jabr jabr commented Jan 26, 2026

Update: This is now complete, I believe, and ready for review. I have my cluster running on it now with no problems so far.

Moves the machine ID/name resolution to the server and returns metadata with the machine address, ID, and name in responses (where appropriate). This streamlines machine name mapping, error handling, and edge-case handling around ProxyMachinesContext, so callers look like this:

mctx := cli.ProxyMachinesContext(ctx, nil)
machineContainers, err := cli.Docker.ListServiceContainers(mctx, ...)
for _, msc := range machineContainers {
  machineName := msc.Metadata.MachineName
  ...
}

@jabr jabr marked this pull request as draft January 26, 2026 00:44
@jabr jabr mentioned this pull request Jan 26, 2026
@psviderski
Owner

Thank you for taking a look at this, which I agree needs more love and cleanup. The direction looks good, but I think we can do even better by hiding all the details in the grpc-proxy layer: https://github.com/psviderski/uncloud/tree/79dc05cb665d6541ea2ff1ca16af4119ae6b47d4/internal/machine/api/proxy

I left this TODO to explore that path:

// TODO: move the machine IP resolution to the proxy router to allow setting machine names and IDs in the metadata.

The idea is to move the machine resolution logic from the client to the server which should simplify the API and reduce the number of requests.

At the moment, ProxyMachinesContext sets the machines metadata field on the request to a list of internal IPv6 machine addresses. It was done that way to keep the proxy layer easy to implement: it was new to me, so I didn't want to overcomplicate it from the beginning. The Director simply extracts the addresses from the metadata and routes requests there:

// Director implements proxy.StreamDirector for grpc-proxy, routing requests to local or remote backends based
// on gRPC metadata in the context. Each machine metadata is injected into the response messages by the proxy
// if the request is proxied to multiple backends.
func (d *Director) Director(ctx context.Context, fullMethodName string) (proxy.Mode, []proxy.Backend, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return proxy.One2One, []proxy.Backend{d.localBackend}, nil
	}
	// If the request is already proxied, send it to the local backend.
	if _, ok = md["proxy-authority"]; ok {
		return proxy.One2One, []proxy.Backend{d.localBackend}, nil
	}
	// If the request metadata doesn't contain machines to proxy to, send it to the local backend.
	machines, ok := md["machines"]
	if !ok {
		return proxy.One2One, []proxy.Backend{d.localBackend}, nil
	}
	if len(machines) == 0 {
		return proxy.One2One, nil, status.Error(codes.InvalidArgument, "no machines specified")
	}

	d.mu.RLock()
	localAddress := d.localAddress
	localBackend := d.localBackend
	d.mu.RUnlock()

	backends := make([]proxy.Backend, len(machines))
	for i, addr := range machines {
		if addr == localAddress {
			backends[i] = localBackend
			continue
		}
		backend, err := d.remoteBackend(addr)
		if err != nil {
			return proxy.One2One, nil, status.Error(codes.Internal, err.Error())
		}
		backends[i] = backend
	}
	if len(backends) == 1 {
		return proxy.One2One, backends, nil
	}
	return proxy.One2Many, backends, nil
}

It's simple and knows nothing about machine IDs/names or how to map them to network addresses. The obvious downside is that every client now needs to handle this mapping. Another downside is that when logging errors we again need to map network addresses back to meaningful machine names.

Now that the proxy layer works pretty well, we can iterate on it and make it smarter. How about we add a mapper to the Director directly? We can still abstract it behind a mapper interface so that the proxy won't know about the local corrosion db and the subscription logic that watches for machine changes.

The client will be able to set machines to a list of machine IDs or names (we can discuss later whether supporting both is needed/desired). Then the director will use the mapper to translate these names to the addresses it needs to create remote backends. The mapper will watch corrosion to keep its mapping up to date.
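To make the mapper idea concrete, here is a minimal sketch in Go. Everything in it (the `Machine` record, the `Mapper` interface, `memoryMapper`, and its `Update` method) is hypothetical and not taken from the Uncloud code base; it only illustrates resolving a name or ID to an address behind an interface, with the list refreshed by whatever watches the machine store.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
)

// Machine is a hypothetical record of what the mapper knows about a cluster member.
type Machine struct {
	ID   string
	Name string
	Addr netip.Addr
}

// Mapper is a hypothetical interface the Director could depend on. It hides
// where the machine list comes from (for example, a Corrosion subscription).
type Mapper interface {
	Resolve(nameOrID string) (Machine, error)
}

// memoryMapper is an in-memory Mapper used for illustration only.
type memoryMapper struct {
	mu       sync.RWMutex
	machines []Machine
}

// Update replaces the known machine list. A watcher on the machine store
// would call it whenever machines are added, removed, or renamed.
func (m *memoryMapper) Update(machines []Machine) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.machines = append([]Machine(nil), machines...)
}

// Resolve returns the machine whose ID or name equals nameOrID.
func (m *memoryMapper) Resolve(nameOrID string) (Machine, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	for _, mach := range m.machines {
		if mach.ID == nameOrID || mach.Name == nameOrID {
			return mach, nil
		}
	}
	return Machine{}, fmt.Errorf("machine not found: %s", nameOrID)
}

func main() {
	m := &memoryMapper{}
	m.Update([]Machine{
		{ID: "id-1", Name: "machine-1", Addr: netip.MustParseAddr("fdcc::1")},
		{ID: "id-2", Name: "machine-2", Addr: netip.MustParseAddr("fdcc::2")},
	})

	var mapper Mapper = m
	for _, target := range []string{"machine-1", "id-2", "missing"} {
		mach, err := mapper.Resolve(target)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", target, mach.Addr)
	}
}
```

A linear scan is enough for a sketch; a real mapper would probably index by ID and name, and would need to decide what happens when a name collides with another machine's ID.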

Something similar needs to be done for responses as well. We can map IPv6 addresses back to machine IDs/names and set them in the response Metadata:

// Common metadata message nested in all reply message types, injected by the gRPC proxy to provide information
// about the machine that responded to the request.
message Metadata {
  // Address of the machine the response came from.
  string machine = 1;
  // error is set if the request to upstream failed. The rest of the response is undefined.
  string error = 2;
  // error as a gRPC Status message.
  google.rpc.Status status = 3;
}

This way the client will only work with and see machine IDs/names everywhere.

Moreover, we can even implement a shortcut for broadcasting to all machines, e.g. set something like machines: * in the request metadata. Then the client won't even need to call ListMachines to target all machines (the mapper knows the full list). But this is more of a nice-to-have, so it could be done as a follow-up if we really want it.
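A rough sketch of how the wildcard could be expanded on the server side. The `expandTargets` helper and the rule that a single `*` value means "all known machines" are assumptions for illustration, not something the current code implements.

```go
package main

import (
	"errors"
	"fmt"
)

// expandTargets is a hypothetical helper that turns the values of the
// "machines" request metadata into a concrete target list. A single "*"
// value is treated as "every machine the server currently knows about".
func expandTargets(requested, known []string) ([]string, error) {
	if len(requested) == 0 {
		return nil, errors.New("no machines specified")
	}
	if len(requested) == 1 && requested[0] == "*" {
		// Copy so callers cannot modify the server's own list.
		return append([]string(nil), known...), nil
	}
	return requested, nil
}

func main() {
	known := []string{"machine-1", "machine-2", "machine-3"}

	all, _ := expandTargets([]string{"*"}, known)
	fmt.Println(all)

	some, _ := expandTargets([]string{"machine-2"}, known)
	fmt.Println(some)

	_, err := expandTargets(nil, known)
	fmt.Println(err)
}
```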

I also remember we have non-trivial error handling logic when broadcasting requests to multiple machines. I hope it can also be simplified by unifying the proxy layer.

What do you think about this approach?

@jabr
Contributor Author

jabr commented Jan 26, 2026

Yeah, that makes sense. It was a little unwieldy to wrap the current implementation, and that probably should have been a sign that I'd missed a better refactor. 😆

I'll take a look at moving it into the Director layer next. I should have some more time this week.

@psviderski
Owner

Awesome! Let me know if you need any guidance or want to discuss something.

@jabr
Contributor Author

jabr commented Jan 27, 2026

Curious about this bit in Director where there is "machines" metadata:

if len(backends) == 1 {
	return proxy.One2One, backends, nil
}
return proxy.One2Many, backends, nil

I'm guessing that is where the special case for "one machine" ProxyMachinesContext usage comes from (and the need for the if r.Metadata == nil check in callers).

Just wondering if that optimization is worth keeping... how much overhead does the One2Many proxy add there? From a quick scan, the proxy code for the two modes looks pretty similar, so I would be surprised if the overhead isn't negligible compared to the network overhead.

@psviderski
Owner

I can't confidently say off the top of my head. Maybe you can't simply use One2Many for rpc calls that are not designed to support broadcast. To support broadcasting requests, the response should be designed upfront, e.g.

message ListContainersResponse {
  // Must contain only one repeated messages field to allow broadcasting ListContainers requests to multiple machines.
  repeated MachineContainers messages = 1;
}

I was mainly replicating the usage in Talos Linux: https://github.com/siderolabs/talos/blob/e48c6d7ab9c8a2e28ebe2115ac09f1557bbcca33/internal/app/apid/pkg/director/director.go#L50-L94 and haven't tried to optimise it since then.

@jabr
Contributor Author

jabr commented Jan 27, 2026

Okay, I had the AI take a pass at the server-side machine id/name resolution. It's got some rough spots still, but the basic idea is there now.

edit: I think there's a simpler solution to the comments below with a fairly simple refactor to the "machine directory" and how the proxy director/backends use it. See #247 (comment)


One deeper issue I noticed: the directory maintains a cache of machine backends, which now hold the name (along with the ID and the management IP address they originally had), but that cache isn't updated if a machine is renamed. (Can the IP or ID also change?)

Thinking the proxy backends might need to interact with the new MachineDirectory to get the current name for the response rather than storing it itself. But that also means we might want to cache the machine list in the directory, since it would need the names both at the start to determine which backends to call and at the end to attach metadata.

Perhaps it should work like the Cluster code does with a machine list and then subscribe for updates to maintain its peer list. And if so, maybe those two pieces should be sharing a common, dynamic corrosion machine list...

Comment thread cmd/uncloud/machine/rm.go Outdated
Comment thread internal/machine/api/proxy/directory.go Outdated
Comment thread internal/machine/api/proxy/directory.go Outdated
Comment thread internal/machine/api/proxy/directory.go Outdated
Comment thread internal/machine/api/proxy/director.go Outdated
Comment thread pkg/client/volume.go Outdated
Comment thread pkg/client/volume.go Outdated
@psviderski
Owner

psviderski commented Jan 27, 2026

Can the IP or ID also change?

No, the machine ID is generated and stored once when the machine is initialised, and so is the IP (which is derived from the first 14 bytes of the machine's public key):

// ManagementIP returns the IPv6 address of a peer derived from the first 14 bytes of its public key.
// This address always starts with fdcc: and is intended for cluster management traffic.
func ManagementIP(publicKey secret.Secret) netip.Addr {
	bytes := [16]byte{0xfd, 0xcc}
	copy(bytes[2:], publicKey[:14])
	return netip.AddrFrom16(bytes)
}

The public key cannot be changed either.
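As a standalone illustration of why the address is stable, the same derivation can be run against a plain byte slice. The `managementIP` name and the use of `[]byte` instead of the project's `secret.Secret` type are assumptions made so the sketch compiles on its own; the key below is dummy data.

```go
package main

import (
	"fmt"
	"net/netip"
)

// managementIP mirrors the derivation quoted above, but takes a plain byte
// slice instead of the project's secret.Secret type (an assumption for this
// sketch). It panics if publicKey has fewer than 14 bytes.
func managementIP(publicKey []byte) netip.Addr {
	bytes := [16]byte{0xfd, 0xcc}
	copy(bytes[2:], publicKey[:14])
	return netip.AddrFrom16(bytes)
}

func main() {
	// Dummy 32-byte key; real keys come from the machine's key pair.
	key := make([]byte, 32)
	for i := range key {
		key[i] = byte(i + 1)
	}

	first := managementIP(key)
	second := managementIP(key)

	// The same key always yields the same address, so the address can only
	// change if the key does.
	fmt.Println(first, first == second)
}
```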

Perhaps it should work like the Cluster code does with a machine list and then subscribe for updates to maintain its peer list. And if so, maybe those two pieces should be sharing a common, dynamic corrosion machine list...

Yeah, I think abstracting these remoteBackends into something I called a mapper earlier is how we can consolidate the current caching and Corrosion watching logic in one place:

remoteBackends sync.Map

It can expose just one method: "get a backend by machine ID/name". It can cache backends internally, as the remoteBackends map does now, and also invalidate the cache and keep the mapping in sync with the machines in Corrosion. The directory itself won't care about all this logic; it will continue calling .remoteBackend(machineNameOrID) (give me the backend for this machine).
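One way such a cache could behave, sketched with made-up types. The `backend` struct, `backendCache`, and its `Sync` and `Get` methods are hypothetical; the sketch keys entries by machine ID (which does not change) and refreshes names on every sync, so a rename does not leave a stale name behind.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a stand-in for a proxy backend. A real one would also hold the
// gRPC connection to the machine.
type backend struct {
	ID   string
	Name string
	Addr string
}

// backendCache caches one backend per machine, keyed by the immutable machine
// ID, with a secondary index from the (mutable) name to the ID.
type backendCache struct {
	mu     sync.Mutex
	byID   map[string]*backend
	byName map[string]string // machine name -> machine ID
}

func newBackendCache() *backendCache {
	return &backendCache{byID: map[string]*backend{}, byName: map[string]string{}}
}

// Sync applies the current machine list: existing backends are kept and only
// their names refreshed, new machines get a backend, removed ones are dropped.
func (c *backendCache) Sync(machines []backend) {
	c.mu.Lock()
	defer c.mu.Unlock()
	byID := make(map[string]*backend, len(machines))
	byName := make(map[string]string, len(machines))
	for _, m := range machines {
		b, ok := c.byID[m.ID]
		if !ok {
			b = &backend{ID: m.ID, Addr: m.Addr}
		}
		b.Name = m.Name
		byID[m.ID] = b
		byName[m.Name] = m.ID
	}
	c.byID, c.byName = byID, byName
}

// Get returns a snapshot of the backend for a machine name or ID.
func (c *backendCache) Get(nameOrID string) (backend, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if id, ok := c.byName[nameOrID]; ok {
		nameOrID = id
	}
	b, ok := c.byID[nameOrID]
	if !ok {
		return backend{}, false
	}
	return *b, true
}

func main() {
	cache := newBackendCache()
	cache.Sync([]backend{{ID: "id-1", Name: "old-name", Addr: "fdcc::1"}})

	// The machine is renamed; the watcher delivers the new list.
	cache.Sync([]backend{{ID: "id-1", Name: "new-name", Addr: "fdcc::1"}})

	_, ok := cache.Get("old-name")
	fmt.Println("old name still resolves:", ok)

	b, _ := cache.Get("id-1")
	fmt.Println("current name:", b.Name)
}
```

Returning a copy from `Get` keeps the sketch free of data races. A real backend that holds a connection would more likely look up the current name when it builds the response metadata.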

We can start passing the machine ID/name to the remote backend so that it includes it in the metadata as is. So if a user passed an ID in the request, the same ID will come back in the response. Same for the name if we want to support both.

One more thing to note is that grpc-proxy establishes permanent connections to backends and doesn't close them (well, maybe there is a long deadline, but I don't know the details). If a client communicated with a machine and that machine gets renamed later, the proxy will still hold a connection to it. I don't know how it decides whether it needs to establish a new gRPC connection; maybe just based on the target address.
I expect it to correctly handle the case where the director starts getting a backend with the same target address but a changed machine (new name), but this should be properly tested.

@jabr
Contributor Author

jabr commented Jan 27, 2026

Maybe you can't simply use One2Many for rpc calls that are not designed to support broadcast. To support broadcasting requests, the response should be designed upfront

Ah, I see. Some of these ProxyMachines uses are always just calling with one machine and their proto doesn't have the metadata field... hmm.

Three options coming to mind:

  1. Keep the current behavior with a metadata-less one2one invocation for a single machine, but that means we need to handle the nil metadata case again in the client commands.
  2. Update those other protos to add a metadata field.
  3. Specify the target machine with a different call metadata property (or add a new flag option) to get the current metadata-less response behavior only when we want it (and not in the cases where we happen to target a single machine but still want metadata).

Personally, I like option 3 (it feels like the smaller change of 2 and 3), and I'd really like to get rid of the nil-check cases that come with the current option 1.

@psviderski
Owner

psviderski commented Jan 27, 2026

100% agree that metadata nil check is annoying. I'm not sure I understand what you meant in option 3. Can you please elaborate?

So if the metadata is missing only for a one2one request, can we explicitly set it for such a request to behave the same as one2many? Not sure what is actually responsible for that, LocalBackend?

This is what I guess I was alluding to here:

// TODO: handle this in the grpc-proxy router and always provide Metadata if possible.
if msg.Metadata == nil {
	// Metadata can be nil if the request was broadcasted to only one machine.
	machineImages[i].Metadata = &pb.Metadata{
		Machine: machines[0].Machine.Id,
	}

Then one2one won't be an edge case and could be handled in the same way as one2many in the client (always expecting the Metadata to be set). Each machine knows what ID or name it is, so we should be able to inject this information in a response at the proxy layer.

@psviderski
Owner

Ah, we probably can't do this for the generic case because the proxy layer doesn't know the type of a response, or whether it includes a Metadata field. Need to think more about it.

@jabr
Contributor Author

jabr commented Jan 27, 2026

Ah, we probably can't do this for the generic case because the proxy layer doesn't know the type of a response, or whether it includes a Metadata field.

Yeah, my option three was to basically have the cli caller tell the server if it wants metadata or not.

And since the “no metadata” cases seem to always be “proxy to a single machine”, I was thinking it might just be a different key used in the call:

  • “machines” is 1 or more id/names and should return metadata
  • “machine” is just 1 id/name and returns no metadata
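A minimal sketch of how a director could pick the proxy mode from these two keys. The `route` function, the `mode` type, and the choice to always use one-to-many for `machines` (so that metadata is present even for a single target) are one possible reading of the proposal, not the PR's actual implementation. The `md` map mimics the shape of gRPC metadata.

```go
package main

import (
	"errors"
	"fmt"
)

// mode is a stand-in for the proxy's routing mode.
type mode int

const (
	one2One mode = iota
	one2Many
)

// route is a hypothetical decision function. md mimics gRPC metadata, which
// maps each key to a list of values.
//   - "machine":  exactly one target, proxied one-to-one, no response metadata.
//   - "machines": one or more targets, proxied one-to-many, metadata injected.
//   - neither:    handled locally (no targets returned).
func route(md map[string][]string) (m mode, targets []string, withMetadata bool, err error) {
	if v, ok := md["machine"]; ok {
		if len(v) != 1 {
			return one2One, nil, false, errors.New(`"machine" must contain exactly one value`)
		}
		return one2One, v, false, nil
	}
	if v, ok := md["machines"]; ok {
		if len(v) == 0 {
			return one2One, nil, false, errors.New("no machines specified")
		}
		// Always one-to-many, even for a single target, so callers can rely
		// on metadata being present in every response.
		return one2Many, v, true, nil
	}
	return one2One, nil, false, nil
}

func main() {
	requests := []map[string][]string{
		{"machine": {"machine-1"}},
		{"machines": {"machine-1"}},
		{"machines": {"machine-1", "machine-2"}},
		{},
	}
	for _, md := range requests {
		m, targets, withMetadata, err := route(md)
		fmt.Println(m == one2Many, targets, withMetadata, err)
	}
}
```

With this split, the client never needs a nil check on metadata: it either asked for it (`machines`) or explicitly opted out (`machine`).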

@psviderski
Owner

psviderski commented Jan 27, 2026

Right, thanks for clarifying. Hm, that sounds clever. Ah, isn't this the reason why Talos also supports both node and nodes? https://github.com/siderolabs/talos/blob/e48c6d7ab9c8a2e28ebe2115ac09f1557bbcca33/internal/app/apid/pkg/director/director.go#L60-L61. I didn't get it when I tried to understand the logic.
I like your approach of having the client instruct the server how to handle the requests 👍

Comment thread cmd/uncloud/machine/rm.go
Comment thread internal/machine/api/proxy/backend.go
Comment thread internal/machine/api/proxy/director.go Outdated
Comment thread internal/machine/api/proxy/director.go Outdated
Comment on lines +22 to +23
localID string
localName string
Contributor Author


Remove. No longer used.

Comment thread internal/machine/machine.go Outdated
Comment thread pkg/client/volume.go
@jabr
Contributor Author

jabr commented Jan 28, 2026

Still needs a little more cleanup, but I added the machine/machines gRPC call modes (for the no-metadata cases) and refactored the Director/Mapper machine lookup (including how backends use it to return the current machine name).

The overall approach is looking much better now, I think.

(re: machine/machines separation) isn't this the reason why Talos also supports both node and nodes

Yeah, I think so. That Talos code looks vaguely similar to the new Director implementation, and it definitely seems to be dealing with the same sorts of issues (one2one vs one2many proxying for different cases).

@jabr jabr force-pushed the improved-proxy-machines-context branch 2 times, most recently from f90575a to 3336ee2 Compare February 9, 2026 01:25
@jabr jabr force-pushed the improved-proxy-machines-context branch from 092d464 to 4940525 Compare April 16, 2026 07:21
@jabr jabr marked this pull request as ready for review May 16, 2026 06:20
Comment on lines -34 to +35
- MinClientVersion = "0.0.0"
- MinServerVersion = "0.0.0"
+ MinClientVersion = "0.20.0"
+ MinServerVersion = "0.20.0"
Contributor Author


Note: we'll want to bump these versions to whatever release this ends up going out in.


2 participants