fix: revisit grove and LWS selection #2564
Walkthrough

Removes the LWS flag/config from Helm and the deploy script, bumps chart versions to 0.5.0, introduces runtime LWS API discovery and a nested LWS config, updates controllers to use the autodetected LWS and the new Grove selection logic, adds multinode service detection, and adjusts operator templates/RBAC to remove LWS gating.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Operator as Operator (main)
  participant K8s as Kubernetes API
  Note over Operator: Startup
  Operator->>K8s: DetectGroveAvailability()
  K8s-->>Operator: grove.io group present?
  Operator->>K8s: DetectLWSAvailability()
  K8s-->>Operator: leaderworkerset.x-k8s.io group present?
  Note right of Operator: Set Config.GroveEnabled, Config.LWS.Enabled
```
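The startup detection shown above boils down to an API-group presence check against the cluster's discovery endpoint. Below is a minimal, self-contained sketch of that pattern; the helper name and error handling are illustrative and not the operator's exact DetectGroveAvailability/DetectLWSAvailability code:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// apiGroupAvailable reports whether the cluster serves the given API group,
// e.g. "grove.io" or "leaderworkerset.x-k8s.io".
func apiGroupAvailable(cfg *rest.Config, groupName string) (bool, error) {
	discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	groups, err := discoveryClient.ServerGroups()
	if err != nil {
		return false, err
	}
	for _, g := range groups.Groups {
		if g.Name == groupName {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := config.GetConfig() // kubeconfig or in-cluster config
	if err != nil {
		panic(err)
	}
	lws, err := apiGroupAvailable(cfg, "leaderworkerset.x-k8s.io")
	if err != nil {
		panic(err)
	}
	fmt.Println("LWS available:", lws)
}
```

In the operator, the results of two such checks feed Config.Grove.Enabled and Config.LWS.Enabled at startup, as the diagram shows.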
```mermaid
sequenceDiagram
  autonumber
  participant Reconciler as Graph Reconciler
  participant Obj as DynamoGraphDeployment
  participant Grove as Grove Reconcile
  participant Dyn as Component Reconcile
  Note over Reconciler,Obj: On reconcile
  Reconciler->>Obj: Read annotation nvidia.com/enable-grove
  Reconciler->>Obj: hasMultinode = HasAnyMultinodeService()
  alt enableGrove && config.GroveEnabled
    Reconciler->>Grove: Reconcile Grove resources
  else hasMultinode && !config.LWS.Enabled
    Reconciler-->>Obj: Set FailedState: NoOrchestratorAvailable
  else
    Reconciler->>Dyn: Reconcile Dynamo components
  end
```
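The second diagram reduces to a small decision table. A self-contained sketch of that table follows; the names are illustrative stand-ins for the reconciler's annotation and config reads described above:

```go
package main

import "fmt"

// orchestration mirrors the three branches in the diagram above.
type orchestration string

const (
	orchestrateGrove      orchestration = "grove"
	orchestrateComponents orchestration = "components"
	failNoOrchestrator    orchestration = "failed: NoOrchestratorAvailable"
)

// selectOrchestrator encodes the selection order: Grove wins when both the
// nvidia.com/enable-grove annotation and cluster support allow it; multinode
// without any orchestrator is a hard failure; everything else is reconciled
// as plain Dynamo components.
func selectOrchestrator(enableGroveAnnotation, groveAvailable, lwsAvailable, hasMultinode bool) orchestration {
	switch {
	case enableGroveAnnotation && groveAvailable:
		return orchestrateGrove
	case hasMultinode && !lwsAvailable:
		return failNoOrchestrator
	default:
		return orchestrateComponents
	}
}

func main() {
	// Multinode graph on a cluster with neither Grove nor LWS installed.
	fmt.Println(selectOrchestrator(true, false, false, true)) // failed: NoOrchestratorAvailable
}
```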
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (6)
38-42: Duplicate imports of the same package

The file imports internal/consts twice (once as consts and once as commonconsts) and internal/controller_common twice (as controller_common and commonController). The duplicate imports are redundant and confusing and will be flagged by linters; keep a single alias for each package.
Apply this minimal diff to keep a single alias for each package and adjust call sites accordingly:
@@ - "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts" - commonconsts "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts" - "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/controller_common" - commonController "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/controller_common" + commonconsts "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/consts" + commonController "github.com/ai-dynamo/dynamo/deploy/cloud/operator/internal/controller_common"And update the two usages:
```diff
@@
-	withEventFilter(controller_common.EphemeralDeploymentEventFilter(r.Config))
+	withEventFilter(commonController.EphemeralDeploymentEventFilter(r.Config))
@@
-	basePodSpec, err := dynamo.GenerateBasePodSpecForController(opt.dynamoComponentDeployment, r.DockerSecretRetriever, r.Config, role, consts.MultinodeDeploymentTypeLWS)
+	basePodSpec, err := dynamo.GenerateBasePodSpecForController(opt.dynamoComponentDeployment, r.DockerSecretRetriever, r.Config, role, commonconsts.MultinodeDeploymentTypeLWS)
```
210-219: Loop over replicas uses range on int and takes the address of the loop variable

Two issues:
- for i := range int(desiredReplicas) requires Go 1.22; it’s safer to use a classic for loop for broader toolchain compatibility.
- Passing &i to closures captures the same variable across iterations; all closures see the last value.
Apply this diff:
```diff
-	for i := range int(desiredReplicas) {
+	for i := 0; i < int(desiredReplicas); i++ {
+		idx := i
@@
-		return r.generateVolcanoPodGroup(ctx, generateResourceOption{
+		return r.generateVolcanoPodGroup(ctx, generateResourceOption{
 			dynamoComponentDeployment:               dynamoComponentDeployment,
 			isStealingTrafficDebugModeEnabled:       false,
 			containsStealingTrafficDebugModeEnabled: false,
-			instanceID:                              &i,
+			instanceID:                              &idx,
 		})
```
229-236: Same loop-variable address issue for LeaderWorkerSet generation

Mirror the fix above for the second SyncResource call.
```diff
-	return r.generateLeaderWorkerSet(ctx, generateResourceOption{
+	return r.generateLeaderWorkerSet(ctx, generateResourceOption{
 		dynamoComponentDeployment:               dynamoComponentDeployment,
 		isStealingTrafficDebugModeEnabled:       false,
 		containsStealingTrafficDebugModeEnabled: false,
-		instanceID:                              &i,
+		instanceID:                              &idx,
 	})
```
357-361: Readiness computation picks the LWS path even when LWS wasn't used

When LWS is disabled but the CR is multinode, you create a Deployment, but this block still calls computeAvailableStatusConditionForLeaderWorkerSets (leaderWorkerSets is empty), likely setting Available=true incorrectly.
Apply this diff:
```diff
-	if dynamoComponentDeployment.IsMultinode() {
+	if r.Config.LWS.Enabled && dynamoComponentDeployment.IsMultinode() {
 		err = r.computeAvailableStatusConditionForLeaderWorkerSets(ctx, req, leaderWorkerSets)
 	} else {
 		err = r.computeAvailableStatusCondition(ctx, req, deployment)
 	}
```

Optionally also guard computeAvailableStatusConditionForLeaderWorkerSets against an empty slice to avoid false positives.
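A minimal sketch of that guard; the slice element type is simplified to a readiness flag, whereas the real method inspects LeaderWorkerSet status:

```go
package main

import "fmt"

// availableForLeaderWorkerSets treats "no LeaderWorkerSets found" as not
// available instead of vacuously true, then requires every instance to be
// ready. lwsReady is a stand-in for the per-object readiness the real
// computeAvailableStatusConditionForLeaderWorkerSets derives from status.
func availableForLeaderWorkerSets(lwsReady []bool) (bool, string) {
	if len(lwsReady) == 0 {
		return false, "no LeaderWorkerSet resources found"
	}
	for i, ready := range lwsReady {
		if !ready {
			return false, fmt.Sprintf("LeaderWorkerSet %d is not ready", i)
		}
	}
	return true, ""
}

func main() {
	ok, reason := availableForLeaderWorkerSets(nil)
	fmt.Println(ok, reason) // false "no LeaderWorkerSet resources found"
}
```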
1291-1295: Service selector should target leaders only when LWS is actually enabled

With LWS disabled, adding selector["role"]="leader" makes the Service select nothing.
```diff
-	if opt.dynamoComponentDeployment.IsMultinode() {
+	if r.Config.LWS.Enabled && opt.dynamoComponentDeployment.IsMultinode() {
 		selector["role"] = "leader"
 	}
```
722-739: Retry loop uses range over int; also off-by-one vs maxRetries

for range maxRetries - 1 loops fewer times than intended and ties you to Go 1.22. Use a classic counter loop.
```diff
-	for range maxRetries - 1 {
+	for i := 0; i < maxRetries; i++ {
@@
-			break
+			break
 		} else {
-			break
+			break
 		}
 	}
```
🧹 Nitpick comments (8)
deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1)
114-122: Safe multinode detection helper — clear and nil-safe.

Iteration over a possibly nil map and guarding nil entries is correct. The name and semantics (more than one node means multinode) are unambiguous.
Consider adding a brief unit test matrix (nil map, empty map, one node, multiple nodes, mixed nil entries) to lock behavior.
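A sketch of such a matrix as a table-driven test; the types here (Services map, NumberOfNodes field) are stand-ins inferred from the review's description, not the actual CRD structs:

```go
package multinode

import "testing"

// serviceSpec and graphSpec are stand-ins inferred from the review's
// description of HasAnyMultinodeService; the real CRD field names may differ.
type serviceSpec struct {
	NumberOfNodes int32
}

type graphSpec struct {
	Services map[string]*serviceSpec
}

// hasAnyMultinodeService mirrors the described semantics: a nil map and nil
// entries are tolerated, and "multinode" means more than one node.
func (g graphSpec) hasAnyMultinodeService() bool {
	for _, s := range g.Services {
		if s != nil && s.NumberOfNodes > 1 {
			return true
		}
	}
	return false
}

func TestHasAnyMultinodeService(t *testing.T) {
	cases := map[string]struct {
		services map[string]*serviceSpec
		want     bool
	}{
		"nil map":             {services: nil, want: false},
		"empty map":           {services: map[string]*serviceSpec{}, want: false},
		"single node":         {services: map[string]*serviceSpec{"a": {NumberOfNodes: 1}}, want: false},
		"multiple nodes":      {services: map[string]*serviceSpec{"a": {NumberOfNodes: 3}}, want: true},
		"nil entry tolerated": {services: map[string]*serviceSpec{"a": nil, "b": {NumberOfNodes: 2}}, want: true},
	}
	for name, tc := range cases {
		if got := (graphSpec{Services: tc.services}).hasAnyMultinodeService(); got != tc.want {
			t.Errorf("%s: got %v, want %v", name, got, tc.want)
		}
	}
}
```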
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
168-175: Annotation parsing is overly strict; accept standard boolean forms.

Currently only the literal "false" disables Grove. Consider parsing common boolean strings (true/false/1/0/yes/no/on/off) to reduce operator surprises.
Apply within this hunk:
```diff
-	enableGrove := true
-	if dynamoDeployment.Annotations != nil && strings.ToLower(dynamoDeployment.Annotations[consts.KubeAnnotationEnableGrove]) == consts.KubeLabelValueFalse {
-		enableGrove = false
-	}
+	enableGrove := true
+	if dynamoDeployment.Annotations != nil {
+		raw := strings.TrimSpace(strings.ToLower(dynamoDeployment.Annotations[consts.KubeAnnotationEnableGrove]))
+		switch raw {
+		case "false", "0", "no", "off":
+			enableGrove = false
+		case "true", "1", "yes", "on", "":
+			// keep default true
+		default:
+			// keep default true to be backward-compatible; optionally log at V(1)
+		}
+	}
```
179-181: Optional: add explicit selection logs.

A concise info log stating which orchestrator path was chosen (Grove vs components, LWS on/off, multinode presence) will simplify ops triage.
- if enableGrove && r.Config.Grove.Enabled { + if enableGrove && r.Config.Grove.Enabled { + log.FromContext(ctx).Info("Selecting Grove orchestration", "enableGrove", enableGrove, "groveEnabled", r.Config.Grove.Enabled) return r.reconcileGroveResources(ctx, dynamoDeployment) } - if hasMultinode && !r.Config.LWS.Enabled { + if hasMultinode && !r.Config.LWS.Enabled { + log.FromContext(ctx).Info("No multinode orchestrator available", "enableGrove", enableGrove, "groveEnabled", r.Config.Grove.Enabled, "lwsEnabled", r.Config.LWS.Enabled, "hasMultinode", hasMultinode) ... } + log.FromContext(ctx).Info("Selecting component orchestration", "enableGrove", enableGrove, "groveEnabled", r.Config.Grove.Enabled, "lwsEnabled", r.Config.LWS.Enabled, "hasMultinode", hasMultinode)deploy/cloud/operator/internal/controller_common/predicate.go (2)
78-92: Bound discovery latency and improve error message context

Two small improvements:
- Use a short per-call timeout so startup isn’t stalled by a slow/partitioned apiserver.
- The “no discovery client available” message is actually “no rest.Config available”.
Apply this diff:
```diff
@@
 import (
 	"context"
 	"strings"
 	"time"

 	"k8s.io/apimachinery/pkg/api/meta"
 	"k8s.io/client-go/discovery"
+	"k8s.io/client-go/rest"
@@
-	cfg := mgr.GetConfig()
+	cfg := mgr.GetConfig()
 	if cfg == nil {
-		logger.Info("detection failed, no discovery client available", "group", groupName)
+		logger.Info("detection failed, no REST config available", "group", groupName)
 		return false
 	}
-	discoveryClient, err := discovery.NewDiscoveryClientForConfig(cfg)
+	// Avoid long hangs during startup if the apiserver is slow/unreachable
+	dcfg := rest.CopyConfig(cfg)
+	dcfg.Timeout = 3 * time.Second
+	discoveryClient, err := discovery.NewDiscoveryClientForConfig(dcfg)
 	if err != nil {
 		logger.Error(err, "detection failed, could not create discovery client", "group", groupName)
 		return false
 	}
```
100-109: Nit: consider lowering log level for "not available" to avoid noisy INFO in normal clusters

Clusters without Grove/LWS are a normal configuration; using V-level (or Debug) would reduce noise. No change required if your logging policy prefers INFO.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)
202-207: Gate multinode LWS path correctly; add user feedback when LWS is required but unavailable

The condition r.Config.LWS.Enabled && IsMultinode() is correct for the workload path; however, users requesting multinode on a cluster without LWS will silently fall back to a single Deployment. Emit a warning Event and/or a status Condition in that case so the misconfiguration is obvious.
Example (inside the else branch when IsMultinode() && !LWS.Enabled):
```diff
@@
-} else {
+} else {
+	if dynamoComponentDeployment.IsMultinode() && !r.Config.LWS.Enabled {
+		logs.Info("Multinode requested but LWS is not available; falling back to single Deployment")
+		r.Recorder.Event(dynamoComponentDeployment, corev1.EventTypeWarning, "LWSNotAvailable", "Multinode requested but LWS API not detected; falling back to single Deployment")
+	}
```
249-291: Cleanup loop can stop early if there's a gap; consider listing and filtering instead of probing by index

The current "probe-until-miss" pattern breaks if some indices are missing, leaving residual LWS/PodGroups. Prefer listing by label selector and deleting those with instance-id >= desiredReplicas.
If you want, I can draft a follow-up patch using a List with labelSelector instance-id and parsing integers.
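A sketch of that list-and-filter approach using controller-runtime's client with unstructured objects; the label keys ("app", "instance-id") and the v1 API version are assumptions, not the operator's actual labels:

```go
package main

import (
	"context"
	"strconv"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupStaleLeaderWorkerSets lists all LeaderWorkerSets owned by a
// deployment via a label selector and deletes any whose instance index is
// >= desiredReplicas, so gaps in the index sequence cannot hide stale objects.
func cleanupStaleLeaderWorkerSets(ctx context.Context, c client.Client, namespace, name string, desiredReplicas int) error {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "leaderworkerset.x-k8s.io",
		Version: "v1", // assumed version
		Kind:    "LeaderWorkerSetList",
	})
	if err := c.List(ctx, list,
		client.InNamespace(namespace),
		client.MatchingLabels{"app": name}, // assumed ownership label
	); err != nil {
		return err
	}
	for i := range list.Items {
		item := &list.Items[i]
		idx, err := strconv.Atoi(item.GetLabels()["instance-id"]) // assumed label key
		if err != nil {
			continue // skip objects without a parseable index
		}
		if idx >= desiredReplicas {
			if err := c.Delete(ctx, item); client.IgnoreNotFound(err) != nil {
				return err
			}
		}
	}
	return nil
}
```

The same pattern would apply to the Volcano PodGroups, swapping in their GroupVersionKind.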
deploy/cloud/operator/cmd/main.go (1)
243-250: Log the detection outcome to aid debugging

You already log the start of detection; also log the boolean result for both Grove and LWS so operators can confirm what the controller decided at startup.
setupLog.Info("Detecting Grove availability...") groveEnabled := commonController.DetectGroveAvailability(mainCtx, mgr) ctrlConfig.Grove.Enabled = groveEnabled + setupLog.Info("Grove availability detected", "enabled", groveEnabled) setupLog.Info("Detecting LWS availability...") lwsEnabled := commonController.DetectLWSAvailability(mainCtx, mgr) ctrlConfig.LWS.Enabled = lwsEnabled + setupLog.Info("LWS availability detected", "enabled", lwsEnabled)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (13)
- deploy/cloud/helm/deploy.sh (1 hunks)
- deploy/cloud/helm/dynamo-platform-values.yaml (0 hunks)
- deploy/cloud/helm/platform/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/Chart.yaml (1 hunks)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml (0 hunks)
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (0 hunks)
- deploy/cloud/helm/platform/components/operator/values.yaml (0 hunks)
- deploy/cloud/helm/platform/values.yaml (0 hunks)
- deploy/cloud/operator/api/v1alpha1/dynamographdeployment_types.go (1 hunks)
- deploy/cloud/operator/cmd/main.go (3 hunks)
- deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2 hunks)
- deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
- deploy/cloud/operator/internal/controller_common/predicate.go (2 hunks)
💤 Files with no reviewable changes (5)
- deploy/cloud/helm/platform/components/operator/templates/deployment.yaml
- deploy/cloud/helm/dynamo-platform-values.yaml
- deploy/cloud/helm/platform/values.yaml
- deploy/cloud/helm/platform/components/operator/values.yaml
- deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code Graph Analysis (3)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (3)
- LWSConfig (40-43)
- DetectGroveAvailability (68-70)
- DetectLWSAvailability (74-76)
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (2)
deploy/cloud/operator/internal/consts/consts.go (2)
- KubeAnnotationEnableGrove (24-24)
- KubeLabelValueFalse (31-31)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
- Config (45-53)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
- Config (45-53)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
deploy/cloud/helm/platform/components/operator/Chart.yaml (1)
30-35: Version bump to 0.5.0 looks good and consistent.

Version/appVersion changes are straightforward and align with the broader PR intent to ship LWS auto-discovery and updated Grove selection.
If you haven’t already, please verify the operator image tagged 0.5.0 is available and referenced in the platform chart and deploy script.
deploy/cloud/helm/platform/Chart.yaml (1)
22-26: Platform chart dependency bump verified; lockfile missing

I've confirmed that in deploy/cloud/helm/platform/Chart.yaml the dynamo-operator dependency is pinned to version 0.5.0, and deploy/cloud/helm/platform/components/operator/Chart.yaml has version: 0.5.0. However, there is no Chart.lock in deploy/cloud/helm/platform to lock in that dependency version. Please run the following locally and commit the resulting Chart.lock so the Helm dependencies are reproducibly locked:

```sh
cd deploy/cloud/helm/platform
helm dep update
git add Chart.lock
```

deploy/cloud/helm/deploy.sh (1)
169-169: No stray ENABLE_LWS placeholders found – removal is safe

Ran a project-wide search for any occurrences of "enableLWS", "ENABLE_LWS", or "--enable-lws" in Helm charts, templates, operator args, and the generated values file. No matches were found, confirming there are no leftover placeholders after removing ENABLE_LWS from the envsubst list.
deploy/cloud/operator/internal/controller_common/predicate.go (3)
40-44: Good abstraction for LWS feature flagging

Introducing LWSConfig mirrors GroveConfig and will keep related knobs scoped and future-proof.

66-71: Grove detection now correctly reuses the common helper

Delegating to detectAPIGroupAvailability reduces duplication and keeps logs consistent.

72-77: LWS discovery helper is clear and aligned with Grove detection

Using the API group leaderworkerset.x-k8s.io is the right contract for presence checks.
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)
1356-1371: Gating watches on LWS availability looks good

Watches for LeaderWorkerSet and PodGroup are only registered when LWS is detected; this matches the runtime enablement strategy.
1164-1170: Verify the multinode type passed to GenerateBasePodSpecForController for non-LWS paths

You always pass MultinodeDeploymentTypeLWS, including when falling back to a plain Deployment. If GenerateBasePodSpecForController uses this to configure scheduling or init, passing the LWS value in non-LWS mode could be wrong.
Do we have a distinct const for “standard/kubernetes” multinode type? If yes, we should branch on r.Config.LWS.Enabled here. I can patch this once you confirm the intended values.
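If such a constant exists, the branch is simple; the sketch below shows the idea, but the fallback constant name and both string values are hypothetical and would need to match whatever GenerateBasePodSpecForController actually accepts:

```go
package main

import "fmt"

// Hypothetical multinode deployment type values. Only the LWS one is known to
// exist in this PR (commonconsts.MultinodeDeploymentTypeLWS); the fallback
// name and both string values are assumptions for illustration.
const (
	multinodeDeploymentTypeLWS      = "lws"
	multinodeDeploymentTypeStandard = "standard"
)

// multinodeDeploymentType picks the LWS type only when the operator detected
// the LWS API at startup; otherwise it falls back to the type used for the
// plain-Deployment path.
func multinodeDeploymentType(lwsEnabled bool) string {
	if lwsEnabled {
		return multinodeDeploymentTypeLWS
	}
	return multinodeDeploymentTypeStandard
}

func main() {
	fmt.Println(multinodeDeploymentType(false)) // "standard"
}
```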
deploy/cloud/operator/cmd/main.go (4)
158-158: Removing the enable-lws flag is consistent with runtime discovery

This simplifies configuration and avoids drift between flags and cluster reality.

167-176: Nested LWS config initialization aligns with the new Config contract

Setting LWS.Enabled to false and updating after discovery is clear and maintainable.

100-115: Keep LWS scheme registration regardless of availability

Registering schemes for optional CRDs at process start is safe and avoids nil GVK issues when LWS is present later. Looks good.
1-392: Sanity Checks Passed: No LWS Flag Remnants & Go 1.24.0 Verified
- rg -nP --glob '!**/vendor/**' '(EnableLWS|enable-lws|\.EnableLWS\b)' returned no matches, confirming all LWS-related flags and config references have been removed.
- deploy/cloud/operator/go.mod declares go 1.24.0, which supports the "range on int" syntax introduced in Go 1.22.

No further action required.
mohammedabdulwahhab
left a comment
Small comment but otherwise looks good.
Signed-off-by: Julien Mancuso <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Michael Shin <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Krishnan Prashanth <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: nnshah1 <[email protected]>
Overview:
Revisit Grove and LWS selection.
Summary by CodeRabbit
New Features
Refactor
Chores