-
Notifications
You must be signed in to change notification settings - Fork 753
Fix: Canonicalize PodGangSet to prevent spurious updates #2469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughIntroduces PodSpec and PodGangSet canonicalization utilities, switches several graph-generation functions to return PodSpec pointers, sets common PodSpec defaults (restart policy, termination grace), updates controller to use value PodSpec, augments resource syncing with spec diff logging, adjusts tests accordingly, and adds testify as a dependency. Changes
Sequence Diagram(s)sequenceDiagram
participant C as Controller/Planner
participant G as Graph Gen
participant CC as controller_common
participant K as Kubernetes API
C->>G: GenerateBasePodSpec(...)
G-->>C: *PodSpec (pointer)
C->>CC: CanonicalizePodSpec(*PodSpec)
CC-->>C: *PodSpec (sorted)
C->>K: Create/Update PodTemplate/Resources
sequenceDiagram
participant Reconciler
participant CC as controller_common.SyncResource
participant CMP as go-cmp
participant API as K8s API
Reconciler->>CC: SyncResource(old, desired)
CC->>CC: Hash(old.spec) vs Hash(desired.spec)
alt spec changed
CC->>CMP: cmp.Diff(old.spec, desired.spec)
CMP-->>CC: diff string
CC->>CC: Log diff
CC->>API: Update resource
else no change
CC-->>Reconciler: No-op
end
sequenceDiagram
participant Ctrl as DCD Controller
participant B as basePodSpec
participant T as Template Builder
Ctrl->>B: Obtain basePodSpec (value)
Ctrl->>T: Mutate local copy (containers, SA, flags)
T-->>Ctrl: PodTemplateSpec with mutated spec
note over B: Original basePodSpec remains unmodified
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (17)
deploy/cloud/operator/go.mod (2)
16-16: Test-only dependency addition looks fine; consider keeping it scoped to tests and tidying the module.
github.com/stretchr/testify v1.10.0aligns with our test changes. Minor nit: ensure no production code imports testify and rungo mod tidyto prune unused constraints.
61-61: Indirect pin likely pulled by test deps; optional to omit explicit listing.
github.com/pmezard/go-difflibis an indirect brought in by test tooling. It's okay to keep, but consider lettinggo mod tidymanage indirects to avoid stale pins.deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1220-1220: Avoid pointer aliasing: deep-copy the base PodSpec before mutation.As written,
podSpec := basePodSpecaliases the same PodSpec; subsequent mutations affect the original. If the intent is to isolate mutations per template, deep-copy instead.Apply this diff:
- podSpec := basePodSpec + podSpec := basePodSpec.DeepCopy()deploy/cloud/operator/internal/dynamo/component_common.go (1)
57-61: Sensible defaults; consider promoting 60s to a const for visibility.RestartPolicyAlways is correct for Deployments/LWS. Optional: extract
TerminationGracePeriodSecondsinto a named const (or config) for easier auditing.+const defaultTerminationGraceSeconds int64 = 60 ... - TerminationGracePeriodSeconds: ptr.To(int64(60)), + TerminationGracePeriodSeconds: ptr.To(defaultTerminationGraceSeconds),deploy/cloud/operator/internal/controller_common/resource.go (2)
160-166: Consider redacting sensitive fields and lowering log verbosity for spec diffs.Diffs may include user-specified env values or other sensitive content embedded in specs. Recommend:
- Redact common secret-like keys (e.g., fields/keys containing password, token, secret).
- Log at a higher verbosity (e.g., V(1)) or gate behind a feature flag.
Here’s a minimal pattern to redact values before logging:
- diff, diffErr := generateSpecDiff(oldResource, resource) + diff, diffErr := generateSpecDiff(oldResource, resource) + // Optional: cheap redaction pass + if diffErr == nil && diff != "" { + redactions := []string{"password", "secret", "token", "apikey"} + for _, r := range redactions { + diff = strings.ReplaceAll(diff, r, "***") + } + }Note: you’ll need to import "strings" if you implement this quick pass. A structured redactor (field-aware) would be ideal longer term.
264-284: Good addition: pre-update spec diff improves observability.The helper isolates spec extraction and uses cmp.Diff cleanly. Returning empty diff when equal is straightforward and keeps logs clean.
If you experience noisy diffs from non-semantic map ordering, consider cmpopts (e.g., cmpopts.SortSlices for certain paths) or diffing canonical JSON (post SortKeys) to align with your hashing representation. Not urgent given ongoing canonicalization work.
deploy/cloud/operator/internal/controller_common/pod.go (5)
14-22: Tie-break EnvFrom using Optional to avoid non-determinism.Two EnvFrom entries with same kind/name/prefix but different Optional could shuffle because the key omits Optional.
Apply:
- envFromKey := func(e corev1.EnvFromSource) string { - if e.ConfigMapRef != nil { - return "cm:" + e.ConfigMapRef.Name + ":" + e.Prefix - } - if e.SecretRef != nil { - return "sec:" + e.SecretRef.Name + ":" + e.Prefix - } - return "other:" + e.Prefix - } + envFromKey := func(e corev1.EnvFromSource) string { + if e.ConfigMapRef != nil { + opt := "0" + if e.ConfigMapRef.Optional != nil && *e.ConfigMapRef.Optional { + opt = "1" + } + return "cm:" + e.ConfigMapRef.Name + ":" + e.Prefix + ":" + opt + } + if e.SecretRef != nil { + opt := "0" + if e.SecretRef.Optional != nil && *e.SecretRef.Optional { + opt = "1" + } + return "sec:" + e.SecretRef.Name + ":" + e.Prefix + ":" + opt + } + return "other:" + e.Prefix + }
27-29: Add a deterministic tie-breaker for Env duplicates with same Name.While duplicate Env names are unusual, relying only on Name with an unstable sort can still shuffle ordering. Tie-break by Value then ValueFrom presence for deterministic output.
- sort.Slice(env, func(i, j int) bool { return env[i].Name < env[j].Name }) + sort.Slice(env, func(i, j int) bool { + if env[i].Name == env[j].Name { + if env[i].Value != env[j].Value { + return env[i].Value < env[j].Value + } + // nil ValueFrom sorts before non-nil to keep deterministic order + if (env[i].ValueFrom == nil) != (env[j].ValueFrom == nil) { + return env[i].ValueFrom == nil + } + return false + } + return env[i].Name < env[j].Name + })
40-46: Ports: add protocol tie-breaker for complete determinism.If Name and ContainerPort match (rare, but possible in misconfigs), tie-break on Protocol to avoid unstable order.
- if ports[i].Name == ports[j].Name { - return ports[i].ContainerPort < ports[j].ContainerPort - } + if ports[i].Name == ports[j].Name { + if ports[i].ContainerPort == ports[j].ContainerPort { + return ports[i].Protocol < ports[j].Protocol + } + return ports[i].ContainerPort < ports[j].ContainerPort + }
220-233: TopologySpreadConstraints: consider canonicalizing label selectors too.You sort constraints themselves, but not the embedded LabelSelector.MatchExpressions/MatchLabels. Unordered selector internals can still trigger spurious diffs.
I suggest a follow-up that:
- Sorts MatchLabels by key.
- Sorts MatchExpressions by Key, then Operator, then Values (sorted).
- Applies similar handling to pod affinity/anti-affinity LabelSelectors.
248-277: Broaden canonicalization to Affinity and ReadinessGates for full stability.Remaining array-like fields that can drift:
- Affinity: Preferred/Required NodeAffinity terms; Pod(Anti)Affinity terms; each’s LabelSelectors and MatchExpressions.
- ReadinessGates: []PodReadinessGate.
Not blocking this PR, but adding these will further reduce non-semantic diffs.
If helpful, I can sketch a minimal, safe ordering for Affinity (NodeSelectorTerms and PodAffinityTerms) and selectors in a follow-up PR.
deploy/cloud/operator/internal/controller_common/podgangset.go (1)
9-19: Handle nils and sort CliqueNames within scaling groups for complete determinism.
- gangSet could be nil.
- Cliques slice holds pointers; guard against nil elements.
- CliqueNames inside each PodCliqueScalingGroupConfig should also be sorted to avoid internal drift.
Apply:
func CanonicalizePodGangSet(gangSet *grovev1alpha1.PodGangSet) *grovev1alpha1.PodGangSet { - // sort cliques by name - sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool { - return gangSet.Spec.Template.Cliques[i].Name < gangSet.Spec.Template.Cliques[j].Name - }) - // sort scaling groups by name - sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool { - return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name - }) - return gangSet + if gangSet == nil { + return gangSet + } + // sort clique names within each scaling group + for i := range gangSet.Spec.Template.PodCliqueScalingGroupConfigs { + sort.Strings(gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].CliqueNames) + } + // sort cliques by name (nil-safe) + sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool { + ci, cj := gangSet.Spec.Template.Cliques[i], gangSet.Spec.Template.Cliques[j] + if ci == nil || cj == nil { + return cj != nil // non-nil after nil + } + return ci.Name < cj.Name + }) + // sort scaling groups by name + sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool { + return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name + }) + return gangSet }deploy/cloud/operator/internal/dynamo/graph_test.go (1)
3004-3021: Now that canonicalization is in place, consider dropping manual sorting in tests.You sort Cliques and Env variables in tests prior to cmp.Diff. If the generator calls CanonicalizePodGangSet and CanonicalizePodSpec, these extra sorts may be redundant.
You could rely on the canonical forms and remove:
- The sort.Slice(...) on got/tt.want cliques.
- The per-container Env sorting loops.
This will make tests stricter and surface regressions in canonicalization.
deploy/cloud/operator/internal/controller_common/pod_test.go (2)
90-118: Add a case for EnvFrom entries that differ only by Optional.To lock-in deterministic behavior, add a test where kind/name/prefix match but Optional differs. This pairs with the suggested tie-breaker in CanonicalizePodSpec.
I can add the unit test covering Optional-based tie-break if you’d like.
482-521: Consider adding tests for Affinity and selector canonicalization.Remaining sources of drift:
- Affinity (NodeAffinity/Pod(Anti)Affinity) preferred/required terms.
- LabelSelector MatchLabels/MatchExpressions ordering.
Adding tests now will guard a future enhancement to CanonicalizePodSpec.
deploy/cloud/operator/internal/dynamo/graph.go (2)
941-945: Make CliqueNames ordering stable to avoid spurious diffsCliqueNames inherits loop order; it may already be stable, but canonicalization doesn’t currently sort this inner slice. To guarantee stability across reconciliations, sort it deterministically.
Two options:
- Sort here before constructing the scaling group (minimal change).
- Or extend CanonicalizePodGangSet to sort CliqueNames for each scaling group (preferred, centralizes ordering).
If you prefer local sorting here, minimal inline change:
- if isMultinode { - scalingGroups = append(scalingGroups, grovev1alpha1.PodCliqueScalingGroupConfig{ + if isMultinode { + // Keep clique names order stable + sort.Strings(cliqueNames) + scalingGroups = append(scalingGroups, grovev1alpha1.PodCliqueScalingGroupConfig{ Name: strings.ToLower(serviceName), CliqueNames: cliqueNames, Replicas: component.Replicas, MinAvailable: ptr.To(int32(1)), }) }Alternatively, update canonicalization (see next comment).
952-953: Returning CanonicalizePodGangSet is the right move; consider extending it to sort nested arraysThis ensures the top-level arrays (Cliques and PodCliqueScalingGroupConfigs) are stable. To fully eliminate ordering noise, also sort:
- Each scaling group’s CliqueNames
- Each clique’s Spec.StartsAfter (if/when used)
Here’s a small enhancement for deploy/cloud/operator/internal/controller_common/podgangset.go:
func CanonicalizePodGangSet(gangSet *grovev1alpha1.PodGangSet) *grovev1alpha1.PodGangSet { // Existing sorts... sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool { return gangSet.Spec.Template.Cliques[i].Name < gangSet.Spec.Template.Cliques[j].Name }) sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool { return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name }) // New: sort nested arrays to maximize stability for i := range gangSet.Spec.Template.PodCliqueScalingGroupConfigs { cfg := &gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i] if len(cfg.CliqueNames) > 1 { sort.Strings(cfg.CliqueNames) } } for i := range gangSet.Spec.Template.Cliques { cl := gangSet.Spec.Template.Cliques[i] if len(cl.Spec.StartsAfter) > 1 { sort.Strings(cl.Spec.StartsAfter) } } return gangSet }I can open a follow-up PR to add this canonicalization if helpful.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
deploy/cloud/operator/go.mod(2 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go(1 hunks)deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go(2 hunks)deploy/cloud/operator/internal/controller_common/pod.go(1 hunks)deploy/cloud/operator/internal/controller_common/pod_test.go(1 hunks)deploy/cloud/operator/internal/controller_common/podgangset.go(1 hunks)deploy/cloud/operator/internal/controller_common/resource.go(3 hunks)deploy/cloud/operator/internal/dynamo/component_common.go(2 hunks)deploy/cloud/operator/internal/dynamo/component_planner.go(1 hunks)deploy/cloud/operator/internal/dynamo/graph.go(12 hunks)deploy/cloud/operator/internal/dynamo/graph_test.go(21 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (3)
deploy/cloud/operator/internal/dynamo/component_planner.go (1)
deploy/cloud/operator/internal/consts/consts.go (1)
PlannerServiceAccountName(44-44)
deploy/cloud/operator/internal/controller_common/pod_test.go (1)
deploy/cloud/operator/internal/controller_common/pod.go (1)
CanonicalizePodSpec(12-277)
deploy/cloud/operator/internal/dynamo/graph.go (4)
deploy/cloud/operator/internal/dynamo/component_common.go (1)
ComponentDefaultsFactory(26-37)deploy/cloud/operator/api/dynamo/common/common.go (1)
ExtraPodSpec(58-61)deploy/cloud/operator/internal/controller_common/pod.go (1)
CanonicalizePodSpec(12-277)deploy/cloud/operator/internal/controller_common/podgangset.go (1)
CanonicalizePodGangSet(9-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (16)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1220-1220: Summary mismatch: this line still aliases a pointer, not a value copy.AI summary notes a value copy; the code retains pointer semantics. If pointer aliasing is intentional, ignore; otherwise, adopt a deep copy as suggested above.
Would you confirm whether
GenerateBasePodSpecForControllerreturns a pointer and whether we want to mutate it in place? If not, the DeepCopy change will align behavior with the stated intent.deploy/cloud/operator/internal/dynamo/component_common.go (2)
11-11: Good call importing ptr for pointer-safe defaults.Using
k8s.io/utils/ptrimproves clarity and avoids the&[]T{...}[0]pattern.
54-55: Centralizing PodSpec defaults improves consistency.Returning
getCommonPodSpec()here reduces duplication and keeps defaulting uniform across components.deploy/cloud/operator/internal/dynamo/component_planner.go (1)
42-44: Planner pod defaults now inherit common behavior—LGTM.Leveraging
getCommonPodSpec()and then setting the Planner-specific service account keeps defaults consistent while preserving Planner semantics.deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2)
799-800: Expectation updated for RestartPolicyAlways—correct per new defaults.Matches the new common PodSpec default. Good coverage.
899-900: Worker template RestartPolicyAlways expectation—LGTM.Consistent with leader template and shared defaults.
deploy/cloud/operator/internal/controller_common/resource.go (1)
30-30: No changes needed: go-cmp is already listed in go.modThe operator module’s go.mod (line 11) already includes
github.com/google/go-cmp v0.7.0. CI/build failures related to this import should not occur.deploy/cloud/operator/internal/controller_common/pod.go (1)
12-67: Solid, in-place canonicalization with clear helpers. LGTM.Deterministic ordering across container-like fields, volumes, tolerations, topology constraints, host aliases, and DNS config is well covered. The design keeps mutations local and returns the same pointer.
deploy/cloud/operator/internal/dynamo/graph_test.go (3)
1208-1208: Tests correctly reflect StartupType AnyOrder semantics. LGTM.The shift to CliqueStartupTypeAnyOrder is consistent with the new canonicalization and startup behavior. Good coverage across scenarios.
Also applies to: 1704-1704, 2462-2462
1228-1231: MinAvailable expectations are sensible and improve resiliency.Adding MinAvailable = 1 across cliques (and PCSG configs) is a pragmatic default for test fixtures and aligns with HA expectations.
Also applies to: 1360-1362, 1871-1872, 1979-1981, 2111-2112, 2741-2743, 2871-2874
1249-1249: RestartPolicyAlways and explicit grace periods: tests aligned with lifecycle tightening.Asserting RestartPolicyAlways and TerminationGracePeriodSeconds solidifies expected lifecycle semantics for generated pod specs.
Also applies to: 1383-1384, 1999-1999, 2114-2115, 2510-2511, 2648-2648, 2875-2875
deploy/cloud/operator/internal/controller_common/pod_test.go (1)
10-681: Excellent breadth: canonicalization behavior is thoroughly exercised.Covers containers (incl. ephemeral), Env/EnvFrom, ports, mounts, security caps, imagePullSecrets, volumes (and nested projections), tolerations, topology spreads, host aliases, DNS config, nil handling, and idempotence. This is the right level for stability guarantees.
deploy/cloud/operator/internal/dynamo/graph.go (4)
34-34: Appropriate use of ptr helper for pointer fieldsUsing k8s.io/utils/ptr for StartupType and MinAvailable is the right call and keeps literals clean.
875-875: Defaulting StartupType to AnyOrder is fine; confirm desired behavior while dependencies are deactivatedGiven applyCliqueStartupDependencies is currently deactivated, StartupType will remain AnyOrder. If you plan to re-enable dependencies, note they’ll set Explicit (overriding this default). Confirm this is the intended interim behavior.
850-859: Pointer return propagation looks correctGeneratePodSpecForComponent now returns (*corev1.PodSpec, error) and cleanly propagates the pointer/nil error contract from GenerateBasePodSpec. LGTM.
691-814: Downstream nil-handling verified; no further changes neededI’ve swept all call sites of GenerateBasePodSpec, GeneratePodSpecForComponent, and GenerateBasePodSpecForController. Every caller checks
err != nilbefore dereferencing the returned*corev1.PodSpec, and all tests validate both success and error paths. There are no unsafe usages or missing nil checks.
|
Should we ask for an improvement in Grove to make it not as picky? We can still take this PR of course, but seems like possibly a good improvement for Grove itself. |
yes @itay , I also opened an issue on the grove side |
mohammedabdulwahhab
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just questions, but otherwise the change looks good.
Signed-off-by: Hannah Zhang <[email protected]>
Overview:
Canonicalize PodGangSet to prevent spurious updates
Details:
Grove validation webhook is very picky when checking for diff between old and updated CR.
This aim to make sure the dynamo operator always produces PodGangSet CR with arrays in the same order so that Grove only check the real diff.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores