-
Notifications
You must be signed in to change notification settings - Fork 764
feat: Auto-inject kai-scheduler annotations and label #2748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds Kai-scheduler support: RBAC rules/bindings for scheduling.run.ai queues; detection of Kai-scheduler API at startup; constants; queue resolution/validation; and injection of schedulerName/queue label into cliques when enabled. Updates docs for orchestrator selection and multinode examples. Includes unit tests for queue resolution, validation, and injection. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Operator as Operator main
participant Manager as Controller Manager
participant API as K8s API Server
participant Disc as Discovery Client
Operator->>Manager: Start()
Operator->>Disc: Init with mgr.GetConfig()
Disc->>API: List ServerGroups
API-->>Disc: API groups
Disc-->>Operator: Contains scheduling.run.ai?
Operator->>Operator: ctrlConfig.KaiScheduler.Enabled = true/false
note over Operator: Grove detection remains unchanged
sequenceDiagram
autonumber
participant Reconciler as DynamoGraphDeployment Reconciler
participant Dyn as Dynamo graph builder
participant K8s as K8s (dynamic client)
participant Clique as PodCliqueTemplateSpec
Reconciler->>Dyn: Build graph (cfg: Grove/KaiScheduler flags)
alt Grove && KaiScheduler enabled
Dyn->>Dyn: Resolve queue name (annotation or default)
Dyn->>K8s: Verify Queue exists (scheduling.run.ai/v2)
K8s-->>Dyn: Exists / Error
alt Queue exists
Dyn->>Clique: Inject schedulerName="kai-scheduler"\n+ label kai.scheduler/queue=<name>
else Error
Dyn-->>Reconciler: Return error
end
else Not enabled
Dyn->>Clique: No Kai-scheduler injection
end
Reconciler-->>Reconciler: Proceed with remaining flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.2.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Nitpick comments (12)
docs/guides/dynamo_deploy/multinode-deployment.md (2)
72-72: Remove trailing colons from headings (MD026)Minor markdownlint cleanup.
-#### When Both Grove and LWS are Available: +#### When Both Grove and LWS are Available -#### When Only One Orchestrator is Available: +#### When Only One Orchestrator is Available -#### Scheduler Integration: +#### Scheduler Integration -#### Configuration Examples: +#### Configuration ExamplesAlso applies to: 76-76, 79-79, 86-86
53-55: Clarify KAI prerequisite about Queue creationExplicitly state the operator reads queues; users must ensure the Queue exists.
-- [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with default queue name `dynamo` created. You can use a different queue name by setting the `nvidia.com/kai-scheduler-queue` annotation on the DGD resource. +- [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster. Ensure a `Queue` resource exists (defaults to `dynamo`, or set a different one via the `nvidia.com/kai-scheduler-queue` annotation on the DGD).deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)
141-147: Gate queue rule when Role-scoped to avoid ineffective permissionsWhen namespaceRestriction.enabled=true, this block becomes a Role and won’t grant access to cluster-scoped Queues. Suggest adding a conditional so it’s only included for ClusterRole.
- - apiGroups: - - scheduling.run.ai - resources: - - queues - verbs: - - get - - list + {{- if not .Values.namespaceRestriction.enabled }} + - apiGroups: + - scheduling.run.ai + resources: + - queues + verbs: + - get + - list + {{- end }}
491-525: Make queue-reader ClusterRole/Binding configurableThese objects are always created; consider gating behind a values flag to avoid extra RBAC on clusters not using KAI.
---- -# ClusterRole for kai-scheduler queue access +{{- if .Values.kaiScheduler.enabled }} +--- +# ClusterRole for kai-scheduler queue access @@ ---- -# ClusterRoleBinding for kai-scheduler queue access +--- +# ClusterRoleBinding for kai-scheduler queue access @@ - name: {{ include "dynamo-operator.fullname" . }}-queue-reader + name: {{ include "dynamo-operator.fullname" . }}-queue-reader @@ - namespace: '{{ .Release.Namespace }}' + namespace: '{{ .Release.Namespace }}' +{{- end }}If you prefer auto-detection at runtime, keeping RBAC unconditional is acceptable; just confirm this aligns with your threat model.
Also applies to: 511-525
deploy/cloud/operator/internal/dynamo/graph.go (1)
885-895: Queue pre-validation is good; consider graceful fallback pathCurrent behavior fails Graph generation if the queue is absent/misconfigured. If you want “auto-inject when possible” semantics, consider downgrading to a warning and skipping injection rather than returning an error here.
deploy/cloud/operator/cmd/main.go (1)
251-255: Log the detection result to aid troubleshootingAfter detection, log the boolean result (similar to Grove) so operators can see if Kai-scheduler was enabled at runtime.
Example:
setupLog.Info("Kai-scheduler detection completed", "enabled", kaiSchedulerEnabled)deploy/cloud/operator/internal/dynamo/grove.go (3)
58-80: Hard-coded GVR and scope—make version/scope discovery-driven
queues.scheduling.run.aimay vary by version or scope. Prefer resolving via RESTMapper/discovery and support both v1/v2; confirm whether Queue is cluster-scoped (no Namespace()) in your target clusters.
85-105: Avoid per-call dynamic client creation; improve testabilityConstructing a dynamic client inside this function hampers unit testing and adds overhead. Accept a
dynamic.Interface(or a factory) as a parameter, defaulting to in-cluster config when nil.
114-140: Guard against empty queue labels and confirm label propagationEven though upstream validation should set a non-empty queue, add a defensive check to skip labeling on empty values, and verify labels reach Pods (else Kai won’t see the queue).
Possible tweak:
- queueName := validatedQueueName + queueName := validatedQueueName + if queueName == "" { + return + }deploy/cloud/operator/internal/controller_common/predicate.go (1)
104-140: Deduplicate discovery logic and optionally verify served versionsThis largely duplicates
DetectGroveAvailability. Consider a genericDetectAPIGroupAvailability(ctx, mgr, group string)helper, and optionally verify served versions/resources (e.g., thatqueuesexists).deploy/cloud/operator/internal/dynamo/grove_test.go (2)
294-309: Remove misleading comment about mocking the dynamic client
ensureQueueExistsalready accepts adynamic.Interface; the fake client here is sufficient, so the “can’t properly mock” comment is confusing.Apply:
- // This test is limited because we can't easily mock the dynamic client - // In a real test environment, you would set up a proper test cluster or use envtest + // Using a fake dynamic client to simulate presence/absence of the Queue resource
315-357: Improve testability of DetermineKaiSchedulerQueueBecause it creates its own dynamic client, this test must expect failure. Refactor the function to accept a client/factory so you can inject
dynamicfakehere and assert success paths too.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml(2 hunks)deploy/cloud/operator/cmd/main.go(2 hunks)deploy/cloud/operator/config/rbac/role.yaml(1 hunks)deploy/cloud/operator/internal/consts/consts.go(1 hunks)deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go(1 hunks)deploy/cloud/operator/internal/controller_common/predicate.go(2 hunks)deploy/cloud/operator/internal/dynamo/graph.go(2 hunks)deploy/cloud/operator/internal/dynamo/grove.go(2 hunks)deploy/cloud/operator/internal/dynamo/grove_test.go(1 hunks)docs/guides/dynamo_deploy/multinode-deployment.md(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code graph analysis (4)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/dynamo/grove.go (1)
DetermineKaiSchedulerQueue(84-105)
deploy/cloud/operator/internal/dynamo/grove.go (2)
deploy/cloud/operator/internal/consts/consts.go (4)
DefaultKaiSchedulerQueue(62-62)KubeAnnotationKaiSchedulerQueue(59-59)KaiSchedulerName(61-61)KubeLabelKaiSchedulerQueue(60-60)deploy/cloud/operator/internal/controller_common/predicate.go (1)
Config(45-54)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (2)
KaiSchedulerConfig(40-43)DetectKaiSchedulerAvailability(106-139)
deploy/cloud/operator/internal/dynamo/grove_test.go (3)
deploy/cloud/operator/internal/consts/consts.go (4)
DefaultKaiSchedulerQueue(62-62)KubeAnnotationKaiSchedulerQueue(59-59)KaiSchedulerName(61-61)KubeLabelKaiSchedulerQueue(60-60)deploy/cloud/operator/internal/dynamo/grove.go (2)
ResolveKaiSchedulerQueue(109-111)DetermineKaiSchedulerQueue(84-105)deploy/cloud/operator/internal/controller_common/predicate.go (3)
Config(45-54)GroveConfig(33-38)KaiSchedulerConfig(40-43)
🪛 LanguageTool
docs/guides/dynamo_deploy/multinode-deployment.md
[grammar] ~27-~27: There might be a mistake here.
Context: ...duling and auto-scaling for AI workloads - **[KAI-Scheduler](https://github.com/NVIDIA...
(QB_NEW_EN)
[grammar] ~32-~32: There might be a mistake here.
Context: ...e nodes. Features Enabled with Grove: - Declarative composition of AI workloads ...
(QB_NEW_EN)
[grammar] ~33-~33: There might be a mistake here.
Context: ... Declarative composition of AI workloads - Multi-level horizontal auto-scaling - Cu...
(QB_NEW_EN)
[grammar] ~34-~34: There might be a mistake here.
Context: ...ds - Multi-level horizontal auto-scaling - Custom startup ordering for components -...
(QB_NEW_EN)
[grammar] ~35-~35: There might be a mistake here.
Context: ...- Custom startup ordering for components - Resource-aware rolling updates [KAI-Sc...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ... Features Enabled with KAI-Scheduler: - Gang scheduling - Network topology-aware...
(QB_NEW_EN)
[grammar] ~42-~42: There might be a mistake here.
Context: ... with KAI-Scheduler:** - Gang scheduling - Network topology-aware pod placement - A...
(QB_NEW_EN)
[grammar] ~43-~43: There might be a mistake here.
Context: ...g - Network topology-aware pod placement - AI workload-optimized scheduling algorit...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...workload-optimized scheduling algorithms - GPU resource awareness and allocation - ...
(QB_NEW_EN)
[grammar] ~45-~45: There might be a mistake here.
Context: ... - GPU resource awareness and allocation - Support for complex scheduling constrain...
(QB_NEW_EN)
[grammar] ~46-~46: There might be a mistake here.
Context: ...pport for complex scheduling constraints - Integration with Grove for enhanced capa...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...ion with Grove for enhanced capabilities - Performance optimizations for large-scal...
(QB_NEW_EN)
[grammar] ~76-~76: There might be a mistake here.
Context: ...When Only One Orchestrator is Available: - The installed orchestrator (Grove or LWS...
(QB_NEW_EN)
[grammar] ~79-~79: There might be a mistake here.
Context: ...ly selected #### Scheduler Integration: - With Grove: Automatically integrates w...
(QB_NEW_EN)
[grammar] ~80-~80: There might be a mistake here.
Context: ...AI-Scheduler) when available, providing: - Advanced queue management via `nvidia.co...
(QB_NEW_EN)
[grammar] ~81-~81: There might be a mistake here.
Context: ...idia.com/kai-scheduler-queue` annotation - AI-optimized scheduling policies - Res...
(QB_NEW_EN)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ion - AI-optimized scheduling policies - Resource-aware workload placement - **Wi...
(QB_NEW_EN)
[grammar] ~83-~83: There might be a mistake here.
Context: ...es - Resource-aware workload placement - With LWS: Uses Volcano scheduler for g...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/multinode-deployment.md
72-72: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
76-76: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
79-79: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
86-86: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/operator/config/rbac/role.yaml (1)
188-194: RBAC rule for scheduling.run.ai/queues is present in the generated manifestI’ve confirmed that
deploy/cloud/operator/config/rbac/role.yamlincludes the following entry at lines 188–194:apiGroups: - scheduling.run.ai resources: - queues verbs: - get - listSince the rule is already in the checked-in YAML, no further changes are required.
deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)
91-91: Kubebuilder RBAC marker for queues verifiedThe RBAC annotation for
queuesis present at the expected location—no further code changes are required. Please remember to re-runcontroller-genso the generated ClusterRole stays in sync.• File: deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go
• Line: 91deploy/cloud/operator/internal/consts/consts.go (1)
58-63: I’ve initiated searches to locate any usage or reference to those constants across the codebase to confirm the canonical label key.deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)
487-490: Added controller-manager subject: LGTMBinding the controller-manager SA explicitly is fine and matches the rest of the template.
deploy/cloud/operator/internal/dynamo/graph.go (1)
950-952: Verify that clique labels propagate to Pod labelsKai requires the queue label on Pods. Please confirm Grove propagates
clique.Labelsto Pod labels; otherwise, inject the label at the Pod level via the appropriate Pod template metadata field.deploy/cloud/operator/cmd/main.go (1)
176-179: Config wiring for KaiScheduler looks correctThe new field initialization is consistent with Grove’s pattern.
deploy/cloud/operator/internal/dynamo/grove.go (1)
110-112: Resolve-only helper is appropriateThis helper cleanly separates resolution from validation. No issues.
deploy/cloud/operator/internal/controller_common/predicate.go (2)
40-44: KaiSchedulerConfig addition—LGTMMatches existing Grove config pattern.
45-54: Config.KaiScheduler field—LGTMPublic API extension is straightforward and consistent.
deploy/cloud/operator/internal/dynamo/grove_test.go (1)
160-174: Injection path happy-case tests—LGTMGood coverage of schedulerName and label injection when both features are enabled.
3b155b2 to
24c51ea
Compare
Signed-off-by: Julien Mancuso <[email protected]>
24c51ea to
cbdbe2e
Compare
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Michael Shin <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: Krishnan Prashanth <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]> Signed-off-by: nnshah1 <[email protected]>
Overview:
Auto-inject kai-scheduler annotations and label
Summary by CodeRabbit