Fix: Canonicalize PodGangSet to prevent spurious updates #2469

julienmancuso · 2025-08-16T00:04:03Z

Overview:

Canonicalize PodGangSet to prevent spurious updates

Details:

Grove validation webhook is very picky when checking for diff between old and updated CR.
This aim to make sure the dynamo operator always produces PodGangSet CR with arrays in the same order so that Grove only check the real diff.

Summary by CodeRabbit

New Features
- Operator logs now include human-readable diffs of resource spec changes.
- PodGangSet adds configurable startup order (AnyOrder) and per-clique MinAvailable.
Bug Fixes
- Deterministic PodSpec ordering reduces unnecessary updates and flapping.
- Default RestartPolicy set to Always and TerminationGracePeriod to 60s for more reliable restarts.
Tests
- Extensive tests added for PodSpec/PodGangSet canonicalization and updated behaviors.
Chores
- Dependency updates to support new testing and diff capabilities.

coderabbitai · 2025-08-16T00:07:45Z

Walkthrough

Introduces PodSpec and PodGangSet canonicalization utilities, switches several graph-generation functions to return PodSpec pointers, sets common PodSpec defaults (restart policy, termination grace), updates controller to use value PodSpec, augments resource syncing with spec diff logging, adjusts tests accordingly, and adds testify as a dependency.

Changes

Cohort / File(s)	Summary
Dependencies `deploy/cloud/operator/go.mod`	Added testify (direct) and go-difflib (indirect).
Controller: PodTemplate generation `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go`	Use value copy of basePodSpec instead of pointer in generatePodTemplateSpec.
Controller tests `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go`	Updated expectations to include RestartPolicyAlways for leader/worker PodSpecs.
PodSpec canonicalization `deploy/cloud/operator/internal/controller_common/pod.go`, `.../pod_test.go`	Added CanonicalizePodSpec with deterministic sorting across PodSpec fields; comprehensive tests.
PodGangSet canonicalization `deploy/cloud/operator/internal/controller_common/podgangset.go`	Added CanonicalizePodGangSet to sort cliques and scaling group configs by name.
Resource syncing diffs `deploy/cloud/operator/internal/controller_common/resource.go`	Added go-cmp; generate and log spec diffs when specs change.
Component defaults `deploy/cloud/operator/internal/dynamo/component_common.go`, `.../component_planner.go`	Introduced shared getCommonPodSpec: RestartPolicyAlways and 60s termination grace; planner sets ServiceAccountName atop shared defaults.
Graph generation (API changes) `deploy/cloud/operator/internal/dynamo/graph.go`, `.../graph_test.go`	Switched PodSpec returns to pointers; applied CanonicalizePodSpec/CanonicalizePodGangSet; added StartupType AnyOrder and MinAvailable=1 in gang sets; updated tests for new fields and lifecycle defaults.

Sequence Diagram(s)

sequenceDiagram
  participant C as Controller/Planner
  participant G as Graph Gen
  participant CC as controller_common
  participant K as Kubernetes API

  C->>G: GenerateBasePodSpec(...)
  G-->>C: *PodSpec (pointer)
  C->>CC: CanonicalizePodSpec(*PodSpec)
  CC-->>C: *PodSpec (sorted)
  C->>K: Create/Update PodTemplate/Resources

sequenceDiagram
  participant Reconciler
  participant CC as controller_common.SyncResource
  participant CMP as go-cmp
  participant API as K8s API

  Reconciler->>CC: SyncResource(old, desired)
  CC->>CC: Hash(old.spec) vs Hash(desired.spec)
  alt spec changed
    CC->>CMP: cmp.Diff(old.spec, desired.spec)
    CMP-->>CC: diff string
    CC->>CC: Log diff
    CC->>API: Update resource
  else no change
    CC-->>Reconciler: No-op
  end

sequenceDiagram
  participant Ctrl as DCD Controller
  participant B as basePodSpec
  participant T as Template Builder

  Ctrl->>B: Obtain basePodSpec (value)
  Ctrl->>T: Mutate local copy (containers, SA, flags)
  T-->>Ctrl: PodTemplateSpec with mutated spec
  note over B: Original basePodSpec remains unmodified

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat: add grove multinode support #2269: Overlaps in graph generation and PodSpec pointer returns, canonicalization, and gang set fields.
feat: add initial Grove support #2012: Similar additions of CanonicalizePodSpec/PodGangSet and related controller_common and graph changes.
feat: kubernetes overrides for the entrypoint and cmd #1396: Touches the same generatePodTemplateSpec function, altering container command/args alongside this PR’s pointer/value handling.

Poem

A rabbit sorts the pods by name,
Aligns the ports, ensures the same.
Cliques in order, start any way—
Diffing specs before they sway.
With pointers poised and restarts true,
Hop, hop—canonical and new. 🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (17)

deploy/cloud/operator/go.mod (2)

16-16: Test-only dependency addition looks fine; consider keeping it scoped to tests and tidying the module.

github.com/stretchr/testify v1.10.0 aligns with our test changes. Minor nit: ensure no production code imports testify and run go mod tidy to prune unused constraints.

61-61: Indirect pin likely pulled by test deps; optional to omit explicit listing.

github.com/pmezard/go-difflib is an indirect brought in by test tooling. It's okay to keep, but consider letting go mod tidy manage indirects to avoid stale pins.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
1220-1220: Avoid pointer aliasing: deep-copy the base PodSpec before mutation.

As written, podSpec := basePodSpec aliases the same PodSpec; subsequent mutations affect the original. If the intent is to isolate mutations per template, deep-copy instead.

Apply this diff:
-	podSpec := basePodSpec
+	podSpec := basePodSpec.DeepCopy()
deploy/cloud/operator/internal/dynamo/component_common.go (1)
57-61: Sensible defaults; consider promoting 60s to a const for visibility.

RestartPolicyAlways is correct for Deployments/LWS. Optional: extract TerminationGracePeriodSeconds into a named const (or config) for easier auditing.
+const defaultTerminationGraceSeconds int64 = 60
 ...
-    TerminationGracePeriodSeconds: ptr.To(int64(60)),
+    TerminationGracePeriodSeconds: ptr.To(defaultTerminationGraceSeconds),
deploy/cloud/operator/internal/controller_common/resource.go (2)
160-166: Consider redacting sensitive fields and lowering log verbosity for spec diffs.

Diffs may include user-specified env values or other sensitive content embedded in specs. Recommend:

Redact common secret-like keys (e.g., fields/keys containing password, token, secret).

Log at a higher verbosity (e.g., V(1)) or gate behind a feature flag.

Here’s a minimal pattern to redact values before logging:
- diff, diffErr := generateSpecDiff(oldResource, resource)
+ diff, diffErr := generateSpecDiff(oldResource, resource)
+ // Optional: cheap redaction pass
+ if diffErr == nil && diff != "" {
+   redactions := []string{"password", "secret", "token", "apikey"}
+   for _, r := range redactions {
+     diff = strings.ReplaceAll(diff, r, "***")
+   }
+ }
Note: you’ll need to import "strings" if you implement this quick pass. A structured redactor (field-aware) would be ideal longer term.

264-284: Good addition: pre-update spec diff improves observability.

The helper isolates spec extraction and uses cmp.Diff cleanly. Returning empty diff when equal is straightforward and keeps logs clean.

If you experience noisy diffs from non-semantic map ordering, consider cmpopts (e.g., cmpopts.SortSlices for certain paths) or diffing canonical JSON (post SortKeys) to align with your hashing representation. Not urgent given ongoing canonicalization work.
deploy/cloud/operator/internal/controller_common/pod.go (5)
14-22: Tie-break EnvFrom using Optional to avoid non-determinism.

Two EnvFrom entries with same kind/name/prefix but different Optional could shuffle because the key omits Optional.

Apply:
- envFromKey := func(e corev1.EnvFromSource) string {
-   if e.ConfigMapRef != nil {
-     return "cm:" + e.ConfigMapRef.Name + ":" + e.Prefix
-   }
-   if e.SecretRef != nil {
-     return "sec:" + e.SecretRef.Name + ":" + e.Prefix
-   }
-   return "other:" + e.Prefix
- }
+ envFromKey := func(e corev1.EnvFromSource) string {
+   if e.ConfigMapRef != nil {
+     opt := "0"
+     if e.ConfigMapRef.Optional != nil && *e.ConfigMapRef.Optional {
+       opt = "1"
+     }
+     return "cm:" + e.ConfigMapRef.Name + ":" + e.Prefix + ":" + opt
+   }
+   if e.SecretRef != nil {
+     opt := "0"
+     if e.SecretRef.Optional != nil && *e.SecretRef.Optional {
+       opt = "1"
+     }
+     return "sec:" + e.SecretRef.Name + ":" + e.Prefix + ":" + opt
+   }
+   return "other:" + e.Prefix
+ }
27-29: Add a deterministic tie-breaker for Env duplicates with same Name.

While duplicate Env names are unusual, relying only on Name with an unstable sort can still shuffle ordering. Tie-break by Value then ValueFrom presence for deterministic output.
- sort.Slice(env, func(i, j int) bool { return env[i].Name < env[j].Name })
+ sort.Slice(env, func(i, j int) bool {
+   if env[i].Name == env[j].Name {
+     if env[i].Value != env[j].Value {
+       return env[i].Value < env[j].Value
+     }
+     // nil ValueFrom sorts before non-nil to keep deterministic order
+     if (env[i].ValueFrom == nil) != (env[j].ValueFrom == nil) {
+       return env[i].ValueFrom == nil
+     }
+     return false
+   }
+   return env[i].Name < env[j].Name
+ })
40-46: Ports: add protocol tie-breaker for complete determinism.

If Name and ContainerPort match (rare, but possible in misconfigs), tie-break on Protocol to avoid unstable order.
- if ports[i].Name == ports[j].Name {
-   return ports[i].ContainerPort < ports[j].ContainerPort
- }
+ if ports[i].Name == ports[j].Name {
+   if ports[i].ContainerPort == ports[j].ContainerPort {
+     return ports[i].Protocol < ports[j].Protocol
+   }
+   return ports[i].ContainerPort < ports[j].ContainerPort
+ }
220-233: TopologySpreadConstraints: consider canonicalizing label selectors too.

You sort constraints themselves, but not the embedded LabelSelector.MatchExpressions/MatchLabels. Unordered selector internals can still trigger spurious diffs.

I suggest a follow-up that:

Sorts MatchLabels by key.

Sorts MatchExpressions by Key, then Operator, then Values (sorted).

Applies similar handling to pod affinity/anti-affinity LabelSelectors.

248-277: Broaden canonicalization to Affinity and ReadinessGates for full stability.

Remaining array-like fields that can drift:

Affinity: Preferred/Required NodeAffinity terms; Pod(Anti)Affinity terms; each’s LabelSelectors and MatchExpressions.

ReadinessGates: []PodReadinessGate.

Not blocking this PR, but adding these will further reduce non-semantic diffs.

If helpful, I can sketch a minimal, safe ordering for Affinity (NodeSelectorTerms and PodAffinityTerms) and selectors in a follow-up PR.
deploy/cloud/operator/internal/controller_common/podgangset.go (1)
9-19: Handle nils and sort CliqueNames within scaling groups for complete determinism.

gangSet could be nil.

Cliques slice holds pointers; guard against nil elements.

CliqueNames inside each PodCliqueScalingGroupConfig should also be sorted to avoid internal drift.

Apply:
 func CanonicalizePodGangSet(gangSet *grovev1alpha1.PodGangSet) *grovev1alpha1.PodGangSet {
-  // sort cliques by name
-  sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool {
-    return gangSet.Spec.Template.Cliques[i].Name < gangSet.Spec.Template.Cliques[j].Name
-  })
-  // sort scaling groups by name
-  sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool {
-    return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name
-  })
-  return gangSet
+  if gangSet == nil {
+    return gangSet
+  }
+  // sort clique names within each scaling group
+  for i := range gangSet.Spec.Template.PodCliqueScalingGroupConfigs {
+    sort.Strings(gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].CliqueNames)
+  }
+  // sort cliques by name (nil-safe)
+  sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool {
+    ci, cj := gangSet.Spec.Template.Cliques[i], gangSet.Spec.Template.Cliques[j]
+    if ci == nil || cj == nil {
+      return cj != nil // non-nil after nil
+    }
+    return ci.Name < cj.Name
+  })
+  // sort scaling groups by name
+  sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool {
+    return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name
+  })
+  return gangSet
 }
deploy/cloud/operator/internal/dynamo/graph_test.go (1)

3004-3021: Now that canonicalization is in place, consider dropping manual sorting in tests.

You sort Cliques and Env variables in tests prior to cmp.Diff. If the generator calls CanonicalizePodGangSet and CanonicalizePodSpec, these extra sorts may be redundant.

You could rely on the canonical forms and remove:

The sort.Slice(...) on got/tt.want cliques.

The per-container Env sorting loops.

This will make tests stricter and surface regressions in canonicalization.

deploy/cloud/operator/internal/controller_common/pod_test.go (2)

90-118: Add a case for EnvFrom entries that differ only by Optional.

To lock-in deterministic behavior, add a test where kind/name/prefix match but Optional differs. This pairs with the suggested tie-breaker in CanonicalizePodSpec.

I can add the unit test covering Optional-based tie-break if you’d like.

482-521: Consider adding tests for Affinity and selector canonicalization.

Remaining sources of drift:

Affinity (NodeAffinity/Pod(Anti)Affinity) preferred/required terms.

LabelSelector MatchLabels/MatchExpressions ordering.

Adding tests now will guard a future enhancement to CanonicalizePodSpec.
deploy/cloud/operator/internal/dynamo/graph.go (2)
941-945: Make CliqueNames ordering stable to avoid spurious diffs

CliqueNames inherits loop order; it may already be stable, but canonicalization doesn’t currently sort this inner slice. To guarantee stability across reconciliations, sort it deterministically.

Two options:

Sort here before constructing the scaling group (minimal change).

Or extend CanonicalizePodGangSet to sort CliqueNames for each scaling group (preferred, centralizes ordering).

If you prefer local sorting here, minimal inline change:
-        if isMultinode {
-            scalingGroups = append(scalingGroups, grovev1alpha1.PodCliqueScalingGroupConfig{
+        if isMultinode {
+            // Keep clique names order stable
+            sort.Strings(cliqueNames)
+            scalingGroups = append(scalingGroups, grovev1alpha1.PodCliqueScalingGroupConfig{
                 Name:         strings.ToLower(serviceName),
                 CliqueNames:  cliqueNames,
                 Replicas:     component.Replicas,
                 MinAvailable: ptr.To(int32(1)),
             })
         }
Alternatively, update canonicalization (see next comment).

952-953: Returning CanonicalizePodGangSet is the right move; consider extending it to sort nested arrays

This ensures the top-level arrays (Cliques and PodCliqueScalingGroupConfigs) are stable. To fully eliminate ordering noise, also sort:

Each scaling group’s CliqueNames

Each clique’s Spec.StartsAfter (if/when used)

Here’s a small enhancement for deploy/cloud/operator/internal/controller_common/podgangset.go:
func CanonicalizePodGangSet(gangSet *grovev1alpha1.PodGangSet) *grovev1alpha1.PodGangSet {
    // Existing sorts...
    sort.Slice(gangSet.Spec.Template.Cliques, func(i, j int) bool {
        return gangSet.Spec.Template.Cliques[i].Name < gangSet.Spec.Template.Cliques[j].Name
    })
    sort.Slice(gangSet.Spec.Template.PodCliqueScalingGroupConfigs, func(i, j int) bool {
        return gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i].Name < gangSet.Spec.Template.PodCliqueScalingGroupConfigs[j].Name
    })

    // New: sort nested arrays to maximize stability
    for i := range gangSet.Spec.Template.PodCliqueScalingGroupConfigs {
        cfg := &gangSet.Spec.Template.PodCliqueScalingGroupConfigs[i]
        if len(cfg.CliqueNames) > 1 {
            sort.Strings(cfg.CliqueNames)
        }
    }
    for i := range gangSet.Spec.Template.Cliques {
        cl := gangSet.Spec.Template.Cliques[i]
        if len(cl.Spec.StartsAfter) > 1 {
            sort.Strings(cl.Spec.StartsAfter)
        }
    }
    return gangSet
}
I can open a follow-up PR to add this canonicalization if helpful.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1168c and 3d9e979.

📒 Files selected for processing (11)

deploy/cloud/operator/go.mod (2 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1 hunks)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2 hunks)
deploy/cloud/operator/internal/controller_common/pod.go (1 hunks)
deploy/cloud/operator/internal/controller_common/pod_test.go (1 hunks)
deploy/cloud/operator/internal/controller_common/podgangset.go (1 hunks)
deploy/cloud/operator/internal/controller_common/resource.go (3 hunks)
deploy/cloud/operator/internal/dynamo/component_common.go (2 hunks)
deploy/cloud/operator/internal/dynamo/component_planner.go (1 hunks)
deploy/cloud/operator/internal/dynamo/graph.go (12 hunks)
deploy/cloud/operator/internal/dynamo/graph_test.go (21 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.

🧬 Code Graph Analysis (3)

deploy/cloud/operator/internal/dynamo/component_planner.go (1)

deploy/cloud/operator/internal/consts/consts.go (1)

PlannerServiceAccountName (44-44)

deploy/cloud/operator/internal/controller_common/pod_test.go (1)

deploy/cloud/operator/internal/controller_common/pod.go (1)

CanonicalizePodSpec (12-277)

deploy/cloud/operator/internal/dynamo/graph.go (4)

deploy/cloud/operator/internal/dynamo/component_common.go (1)

ComponentDefaultsFactory (26-37)

deploy/cloud/operator/api/dynamo/common/common.go (1)

ExtraPodSpec (58-61)

deploy/cloud/operator/internal/controller_common/pod.go (1)

CanonicalizePodSpec (12-277)

deploy/cloud/operator/internal/controller_common/podgangset.go (1)

CanonicalizePodGangSet (9-19)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Mirror Repository to GitLab
GitHub Check: Build and Test - dynamo

🔇 Additional comments (16)

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

1220-1220: Summary mismatch: this line still aliases a pointer, not a value copy.

AI summary notes a value copy; the code retains pointer semantics. If pointer aliasing is intentional, ignore; otherwise, adopt a deep copy as suggested above.

Would you confirm whether GenerateBasePodSpecForController returns a pointer and whether we want to mutate it in place? If not, the DeepCopy change will align behavior with the stated intent.

deploy/cloud/operator/internal/dynamo/component_common.go (2)

11-11: Good call importing ptr for pointer-safe defaults.

Using k8s.io/utils/ptr improves clarity and avoids the &[]T{...}[0] pattern.

54-55: Centralizing PodSpec defaults improves consistency.

Returning getCommonPodSpec() here reduces duplication and keeps defaulting uniform across components.

deploy/cloud/operator/internal/dynamo/component_planner.go (1)

42-44: Planner pod defaults now inherit common behavior—LGTM.

Leveraging getCommonPodSpec() and then setting the Planner-specific service account keeps defaults consistent while preserving Planner semantics.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2)

799-800: Expectation updated for RestartPolicyAlways—correct per new defaults.

Matches the new common PodSpec default. Good coverage.

899-900: Worker template RestartPolicyAlways expectation—LGTM.

Consistent with leader template and shared defaults.

deploy/cloud/operator/internal/controller_common/resource.go (1)

30-30: No changes needed: go-cmp is already listed in go.mod

The operator module’s go.mod (line 11) already includes
github.com/google/go-cmp v0.7.0. CI/build failures related to this import should not occur.

deploy/cloud/operator/internal/controller_common/pod.go (1)

12-67: Solid, in-place canonicalization with clear helpers. LGTM.

Deterministic ordering across container-like fields, volumes, tolerations, topology constraints, host aliases, and DNS config is well covered. The design keeps mutations local and returns the same pointer.

deploy/cloud/operator/internal/dynamo/graph_test.go (3)

1208-1208: Tests correctly reflect StartupType AnyOrder semantics. LGTM.

The shift to CliqueStartupTypeAnyOrder is consistent with the new canonicalization and startup behavior. Good coverage across scenarios.

Also applies to: 1704-1704, 2462-2462

1228-1231: MinAvailable expectations are sensible and improve resiliency.

Adding MinAvailable = 1 across cliques (and PCSG configs) is a pragmatic default for test fixtures and aligns with HA expectations.

Also applies to: 1360-1362, 1871-1872, 1979-1981, 2111-2112, 2741-2743, 2871-2874

1249-1249: RestartPolicyAlways and explicit grace periods: tests aligned with lifecycle tightening.

Asserting RestartPolicyAlways and TerminationGracePeriodSeconds solidifies expected lifecycle semantics for generated pod specs.

Also applies to: 1383-1384, 1999-1999, 2114-2115, 2510-2511, 2648-2648, 2875-2875

deploy/cloud/operator/internal/controller_common/pod_test.go (1)

10-681: Excellent breadth: canonicalization behavior is thoroughly exercised.

Covers containers (incl. ephemeral), Env/EnvFrom, ports, mounts, security caps, imagePullSecrets, volumes (and nested projections), tolerations, topology spreads, host aliases, DNS config, nil handling, and idempotence. This is the right level for stability guarantees.

deploy/cloud/operator/internal/dynamo/graph.go (4)

34-34: Appropriate use of ptr helper for pointer fields

Using k8s.io/utils/ptr for StartupType and MinAvailable is the right call and keeps literals clean.

875-875: Defaulting StartupType to AnyOrder is fine; confirm desired behavior while dependencies are deactivated

Given applyCliqueStartupDependencies is currently deactivated, StartupType will remain AnyOrder. If you plan to re-enable dependencies, note they’ll set Explicit (overriding this default). Confirm this is the intended interim behavior.

850-859: Pointer return propagation looks correct

GeneratePodSpecForComponent now returns (*corev1.PodSpec, error) and cleanly propagates the pointer/nil error contract from GenerateBasePodSpec. LGTM.

691-814: Downstream nil-handling verified; no further changes needed

I’ve swept all call sites of GenerateBasePodSpec, GeneratePodSpecForComponent, and GenerateBasePodSpecForController. Every caller checks err != nil before dereferencing the returned *corev1.PodSpec, and all tests validate both success and error paths. There are no unsafe usages or missing nil checks.

deploy/cloud/operator/internal/dynamo/graph.go

itay · 2025-08-18T15:23:05Z

Should we ask for an improvement in Grove to make it not as picky? We can still take this PR of course, but seems like possibly a good improvement for Grove itself.

julienmancuso · 2025-08-18T15:26:40Z

Should we ask for an improvement in Grove to make it not as picky? We can still take this PR of course, but seems like possibly a good improvement for Grove itself.

yes @itay , I also opened an issue on the grove side

mohammedabdulwahhab

Just questions, but otherwise the change looks good.

deploy/cloud/operator/internal/controller_common/pod.go

deploy/cloud/operator/internal/controller_common/resource.go

Signed-off-by: Hannah Zhang <[email protected]>

julienmancuso added 7 commits August 15, 2025 15:49

fix: generate predictable PodSpec

8b81baf

fix: generate predictable PodSpec

1fedf80

fix: generate predictable PodSpec

feef55a

fix: generate predictable PodSpec

8e4c9a8

fix: generate predictable PodSpec

4ca0e0f

fix: generate predictable PodSpec

102f10b

fix: generate predictable PodSpec

3cbb106

julienmancuso requested review from biswapanda, hhzhang16, hutm, ishandhanani, mohammedabdulwahhab and nnshah1 as code owners August 16, 2025 00:04

pull-request-size bot added the size/XXL label Aug 16, 2025

copy-pr-bot bot temporarily deployed to GITLAB August 16, 2025 00:04 Inactive

fix: generate predictable PodSpec

3d9e979

copy-pr-bot bot temporarily deployed to GITLAB August 16, 2025 00:06 Inactive

julienmancuso changed the title ~~Canonicalize PodGangSet~~ Canonicalize PodGangSet to prevent spurious updates Aug 16, 2025

copy-pr-bot bot temporarily deployed to GITLAB August 16, 2025 00:09 Inactive

coderabbitai bot reviewed Aug 16, 2025

View reviewed changes

deploy/cloud/operator/internal/dynamo/graph.go Show resolved Hide resolved

julienmancuso changed the title ~~Canonicalize PodGangSet to prevent spurious updates~~ Fix: Canonicalize PodGangSet to prevent spurious updates Aug 16, 2025

mohammedabdulwahhab reviewed Aug 18, 2025

View reviewed changes

deploy/cloud/operator/internal/controller_common/pod.go Show resolved Hide resolved

deploy/cloud/operator/internal/controller_common/resource.go Show resolved Hide resolved

mohammedabdulwahhab approved these changes Aug 18, 2025

View reviewed changes

julienmancuso merged commit 94876e5 into main Aug 18, 2025
8 of 15 checks passed

julienmancuso deleted the jsm/fixScaling branch August 18, 2025 20:27

coderabbitai bot mentioned this pull request Aug 22, 2025

fix: fix env vars override #2640

Merged

coderabbitai bot mentioned this pull request Aug 27, 2025

feat: Auto-inject kai-scheduler annotations and label #2748

Merged

hhzhang16 pushed a commit that referenced this pull request Aug 27, 2025

Fix: Canonicalize PodGangSet to prevent spurious updates (#2469)

5d32261

Signed-off-by: Hannah Zhang <[email protected]>

This was referenced Dec 10, 2025

[BUG]: PodTemplateSpec canonicalization issue #4852

Closed

fix: remove pod normalization logic #4853

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Canonicalize PodGangSet to prevent spurious updates #2469

Fix: Canonicalize PodGangSet to prevent spurious updates #2469

Uh oh!

julienmancuso commented Aug 16, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Aug 16, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

itay commented Aug 18, 2025

Uh oh!

julienmancuso commented Aug 18, 2025

Uh oh!

mohammedabdulwahhab left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix: Canonicalize PodGangSet to prevent spurious updates #2469

Fix: Canonicalize PodGangSet to prevent spurious updates #2469

Uh oh!

Conversation

julienmancuso commented Aug 16, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

itay commented Aug 18, 2025

Uh oh!

julienmancuso commented Aug 18, 2025

Uh oh!

mohammedabdulwahhab left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

julienmancuso commented Aug 16, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 16, 2025 •

edited

Loading