@julienmancuso julienmancuso commented Aug 27, 2025

Overview:

Auto-inject kai-scheduler annotations and label

Summary by CodeRabbit

  • New Features
    • Added Kai-scheduler integration with auto-detection, queue resolution/validation, and per-workload injection when enabled.
    • Supports setting scheduler name and queue label; honors manual scheduler overrides.
  • Documentation
    • Revamped multinode deployment guide with orchestrator selection logic, Grove + Kai-Scheduler guidance, prerequisites, GPU distribution notes, and full YAML examples.
  • Tests
    • Added unit tests covering queue resolution, validation, and Kai-scheduler injection scenarios.
  • Chores
    • Updated RBAC/permissions to allow reading Kai-scheduler queues at namespace and cluster scope.

coderabbitai bot commented Aug 27, 2025

Walkthrough

Adds Kai-scheduler support: RBAC rules/bindings for scheduling.run.ai queues; detection of Kai-scheduler API at startup; constants; queue resolution/validation; and injection of schedulerName/queue label into cliques when enabled. Updates docs for orchestrator selection and multinode examples. Includes unit tests for queue resolution, validation, and injection.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| RBAC additions (manager, cluster, kubebuilder): deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml, deploy/cloud/operator/config/rbac/role.yaml, deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go | Grants get/list on scheduling.run.ai queues; adds ClusterRole/ClusterRoleBinding for queue-reader; updates RoleBinding subjects; adds kubebuilder RBAC marker for queues. |
| Kai-scheduler detection and config wiring: deploy/cloud/operator/cmd/main.go, deploy/cloud/operator/internal/controller_common/predicate.go | Introduces KaiSchedulerConfig in Config; detects Kai-scheduler API presence via discovery; sets enable flag at startup; logs detection. |
| Grove/Dynamo integration for Kai-scheduler: deploy/cloud/operator/internal/dynamo/graph.go, deploy/cloud/operator/internal/dynamo/grove.go | Resolves and validates queue (dynamic client); determines queue name; injects schedulerName and queue label into cliques when Grove and KaiScheduler are enabled; respects manual scheduler override. |
| Unit tests for Kai-scheduler in Grove: deploy/cloud/operator/internal/dynamo/grove_test.go | Adds tests for queue name resolution, injection behavior, queue existence checks with fake client, and determination flow. |
| Constants: deploy/cloud/operator/internal/consts/consts.go | Adds Kai-scheduler constants: annotation key, label key, scheduler name, default queue. |
| Docs update (multinode and orchestrators): docs/guides/dynamo_deploy/multinode-deployment.md | Reworks guide: prerequisites (Grove/KAI), orchestrator selection algorithm, multinode manifest examples, GPU math clarification, and config snippets. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Operator as Operator main
  participant Manager as Controller Manager
  participant API as K8s API Server
  participant Disc as Discovery Client

  Operator->>Manager: Start()
  Operator->>Disc: Init with mgr.GetConfig()
  Disc->>API: List ServerGroups
  API-->>Disc: API groups
  Disc-->>Operator: Contains scheduling.run.ai?
  Operator->>Operator: ctrlConfig.KaiScheduler.Enabled = true/false
  note over Operator: Grove detection remains unchanged
sequenceDiagram
  autonumber
  participant Reconciler as DynamoGraphDeployment Reconciler
  participant Dyn as Dynamo graph builder
  participant K8s as K8s (dynamic client)
  participant Clique as PodCliqueTemplateSpec

  Reconciler->>Dyn: Build graph (cfg: Grove/KaiScheduler flags)
  alt Grove && KaiScheduler enabled
    Dyn->>Dyn: Resolve queue name (annotation or default)
    Dyn->>K8s: Verify Queue exists (scheduling.run.ai/v2)
    K8s-->>Dyn: Exists / Error
    alt Queue exists
      Dyn->>Clique: Inject schedulerName="kai-scheduler"\n+ label kai.scheduler/queue=<name>
    else Error
      Dyn-->>Reconciler: Return error
    end
  else Not enabled
    Dyn->>Clique: No Kai-scheduler injection
  end
  Reconciler-->>Reconciler: Proceed with remaining flow

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through Groves where cliques align,
Queues in the breeze, Kai keeps time.
Scheduler set, labels true,
Multinode dreams in YAML hue.
With tiny paws I bind and sing—
list, get, and reconcile—spring! 🐇🗂️⏱️

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/product/migration-guide for migration instructions

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (12)
docs/guides/dynamo_deploy/multinode-deployment.md (2)

72-72: Remove trailing colons from headings (MD026)

Minor markdownlint cleanup.

-#### When Both Grove and LWS are Available:
+#### When Both Grove and LWS are Available
-#### When Only One Orchestrator is Available:
+#### When Only One Orchestrator is Available
-#### Scheduler Integration:
+#### Scheduler Integration
-#### Configuration Examples:
+#### Configuration Examples

Also applies to: 76-76, 79-79, 86-86


53-55: Clarify KAI prerequisite about Queue creation

Explicitly state the operator reads queues; users must ensure the Queue exists.

-- [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster with default queue name `dynamo` created. You can use a different queue name by setting the `nvidia.com/kai-scheduler-queue` annotation on the DGD resource.
+- [KAI-Scheduler](https://github.com/NVIDIA/KAI-Scheduler) installed on the cluster. Ensure a `Queue` resource exists (defaults to `dynamo`, or set a different one via the `nvidia.com/kai-scheduler-queue` annotation on the DGD).
deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2)

141-147: Gate queue rule when Role-scoped to avoid ineffective permissions

When namespaceRestriction.enabled=true, this block becomes a Role and won’t grant access to cluster-scoped Queues. Suggest adding a conditional so it’s only included for ClusterRole.

- - apiGroups:
-   - scheduling.run.ai
-   resources:
-   - queues
-   verbs:
-   - get
-   - list
+ {{- if not .Values.namespaceRestriction.enabled }}
+ - apiGroups:
+   - scheduling.run.ai
+   resources:
+   - queues
+   verbs:
+   - get
+   - list
+ {{- end }}

491-525: Make queue-reader ClusterRole/Binding configurable

These objects are always created; consider gating behind a values flag to avoid extra RBAC on clusters not using KAI.

----
-# ClusterRole for kai-scheduler queue access
+{{- if .Values.kaiScheduler.enabled }}
+---
+# ClusterRole for kai-scheduler queue access
@@
----
-# ClusterRoleBinding for kai-scheduler queue access
+---
+# ClusterRoleBinding for kai-scheduler queue access
@@
-  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
+  name: {{ include "dynamo-operator.fullname" . }}-queue-reader
@@
-  namespace: '{{ .Release.Namespace }}'
+  namespace: '{{ .Release.Namespace }}'
+{{- end }}

If you prefer auto-detection at runtime, keeping RBAC unconditional is acceptable; just confirm this aligns with your threat model.

Also applies to: 511-525

deploy/cloud/operator/internal/dynamo/graph.go (1)

885-895: Queue pre-validation is good; consider graceful fallback path

Current behavior fails Graph generation if the queue is absent/misconfigured. If you want “auto-inject when possible” semantics, consider downgrading to a warning and skipping injection rather than returning an error here.
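The two behaviors discussed here can be compared side by side. The following is a minimal sketch, not the operator's code: `queueExists`, `decideInjection`, and the `strict` flag are hypothetical names, and the lookup is reduced to a plain function so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// queueExists is a hypothetical stand-in for the real queue lookup.
type queueExists func(name string) (bool, error)

var errQueueNotFound = errors.New("kai-scheduler queue not found")

// decideInjection reports whether scheduler settings should be injected.
// strict=true mirrors the behavior described in this comment: a missing or
// unreadable queue is returned as an error. strict=false is the suggested
// fallback: injection is skipped and a warning message is returned instead.
func decideInjection(name string, exists queueExists, strict bool) (inject bool, warning string, err error) {
	ok, lookupErr := exists(name)
	if lookupErr == nil && ok {
		return true, "", nil
	}
	cause := lookupErr
	if cause == nil {
		cause = errQueueNotFound
	}
	if strict {
		return false, "", fmt.Errorf("queue %q: %w", name, cause)
	}
	return false, fmt.Sprintf("skipping kai-scheduler injection for queue %q: %v", name, cause), nil
}

func main() {
	missing := func(string) (bool, error) { return false, nil }
	inject, warning, err := decideInjection("dynamo", missing, false)
	fmt.Println(inject, warning, err)
}
```

With the lenient mode the caller would log the warning and build the graph without Kai-scheduler settings; whether that is preferable to failing fast depends on how strongly the deployment relies on queue-based scheduling.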

deploy/cloud/operator/cmd/main.go (1)

251-255: Log the detection result to aid troubleshooting

After detection, log the boolean result (similar to Grove) so operators can see if Kai-scheduler was enabled at runtime.

Example:

setupLog.Info("Kai-scheduler detection completed", "enabled", kaiSchedulerEnabled)
deploy/cloud/operator/internal/dynamo/grove.go (3)

58-80: Hard-coded GVR and scope—make version/scope discovery-driven

queues.scheduling.run.ai may vary by version or scope. Prefer resolving via RESTMapper/discovery and support both v1/v2; confirm whether Queue is cluster-scoped (no Namespace()) in your target clusters.
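One way to avoid pinning a single version is to choose from whatever the API server reports as served. The sketch below assumes the served versions have already been obtained from discovery and only shows the selection step; `pickServedVersion` is a hypothetical helper, and whether the group serves `v1` in addition to `v2` on a given cluster is an assumption to verify.

```go
package main

import "fmt"

// pickServedVersion returns the first entry of preferred that also appears in
// served. Both slices hold bare version strings such as "v2".
func pickServedVersion(served, preferred []string) (string, bool) {
	available := make(map[string]struct{}, len(served))
	for _, v := range served {
		available[v] = struct{}{}
	}
	for _, v := range preferred {
		if _, ok := available[v]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	// Prefer v2, fall back to v1 if that is all the cluster serves.
	v, ok := pickServedVersion([]string{"v1"}, []string{"v2", "v1"})
	fmt.Println(v, ok)
}
```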


85-105: Avoid per-call dynamic client creation; improve testability

Constructing a dynamic client inside this function hampers unit testing and adds overhead. Accept a dynamic.Interface (or a factory) as a parameter, defaulting to in-cluster config when nil.
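The shape of that refactor might look like the following sketch. It does not use client-go: `queueGetter` is a hypothetical interface narrowed to the single lookup needed, and `newDefault` stands in for whatever builds the in-cluster client, so the example stays self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// queueGetter is a hypothetical, narrowed view of the dynamic client.
type queueGetter interface {
	GetQueue(name string) error
}

var errNoClient = errors.New("no queue client configured")

// ensureQueue checks that the named queue can be fetched. newDefault is only
// called when client is nil, so tests that pass a fake never need cluster
// credentials.
func ensureQueue(client queueGetter, newDefault func() (queueGetter, error), name string) error {
	if client == nil {
		if newDefault == nil {
			return errNoClient
		}
		c, err := newDefault()
		if err != nil {
			return fmt.Errorf("build default client: %w", err)
		}
		client = c
	}
	if err := client.GetQueue(name); err != nil {
		return fmt.Errorf("queue %q: %w", name, err)
	}
	return nil
}

// fakeQueues is an in-memory test double keyed by queue name.
type fakeQueues map[string]bool

func (f fakeQueues) GetQueue(name string) error {
	if !f[name] {
		return errors.New("not found")
	}
	return nil
}

func main() {
	fake := fakeQueues{"dynamo": true}
	fmt.Println(ensureQueue(fake, nil, "dynamo"))
	fmt.Println(ensureQueue(fake, nil, "missing"))
}
```

Because the client is a parameter, a test can assert the success path as well as the failure path, which is the limitation raised for the DetermineKaiSchedulerQueue test.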


114-140: Guard against empty queue labels and confirm label propagation

Even though upstream validation should set a non-empty queue, add a defensive check to skip labeling on empty values, and verify labels reach Pods (else Kai won’t see the queue).

Possible tweak:

- queueName := validatedQueueName
+ queueName := validatedQueueName
+ if queueName == "" {
+   return
+ }
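For context, a fuller version of the guarded injection could look like the sketch below. The scheduler name `kai-scheduler` and the label key `kai.scheduler/queue` are the values shown in the sequence diagram; `cliqueTemplate` and `injectKaiScheduler` are hypothetical stand-ins, and the override rule (any other non-empty scheduler name is left alone) is an assumption about what "respects manual scheduler override" means.

```go
package main

import "fmt"

const (
	kaiSchedulerName = "kai-scheduler"
	queueLabelKey    = "kai.scheduler/queue"
)

// cliqueTemplate is a hypothetical, trimmed-down stand-in for the clique spec.
type cliqueTemplate struct {
	Labels        map[string]string
	SchedulerName string
}

// injectKaiScheduler sets the scheduler name and queue label and reports
// whether the template was modified.
func injectKaiScheduler(c *cliqueTemplate, queue string) bool {
	// Defensive guard: never write an empty queue label.
	if c == nil || queue == "" {
		return false
	}
	// Assumed override rule: a scheduler chosen by the user is left untouched.
	if c.SchedulerName != "" && c.SchedulerName != kaiSchedulerName {
		return false
	}
	c.SchedulerName = kaiSchedulerName
	if c.Labels == nil {
		c.Labels = make(map[string]string)
	}
	c.Labels[queueLabelKey] = queue
	return true
}

func main() {
	c := &cliqueTemplate{}
	fmt.Println(injectKaiScheduler(c, "dynamo"), c.SchedulerName, c.Labels)
}
```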
deploy/cloud/operator/internal/controller_common/predicate.go (1)

104-140: Deduplicate discovery logic and optionally verify served versions

This largely duplicates DetectGroveAvailability. Consider a generic DetectAPIGroupAvailability(ctx, mgr, group string) helper, and optionally verify served versions/resources (e.g., that queues exists).
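A generic helper along those lines could be as small as the sketch below. It is written against a hypothetical `groupLister` interface rather than the real discovery client so that it compiles with the standard library alone; the names and the choice to surface discovery errors to the caller are assumptions, not the operator's current behavior.

```go
package main

import "fmt"

// groupLister is a hypothetical stand-in for the discovery client. It returns
// the names of the API groups the server advertises.
type groupLister interface {
	ListGroups() ([]string, error)
}

// detectAPIGroup reports whether the given API group is advertised.
// Discovery failures are returned so the caller decides how to treat them.
func detectAPIGroup(d groupLister, group string) (bool, error) {
	groups, err := d.ListGroups()
	if err != nil {
		return false, fmt.Errorf("list API groups: %w", err)
	}
	for _, g := range groups {
		if g == group {
			return true, nil
		}
	}
	return false, nil
}

// staticGroups is a fixed group list used for the demo and for tests.
type staticGroups []string

func (s staticGroups) ListGroups() ([]string, error) { return s, nil }

func main() {
	cluster := staticGroups{"apps", "scheduling.run.ai"}
	for _, group := range []string{"scheduling.run.ai", "missing.example.com"} {
		ok, err := detectAPIGroup(cluster, group)
		fmt.Println(group, ok, err)
	}
}
```

Both the Grove and Kai-scheduler checks could then call the same function with different group names. Verifying that a specific resource such as queues is served would need a second, resource-level lookup that this sketch does not cover.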

deploy/cloud/operator/internal/dynamo/grove_test.go (2)

294-309: Remove misleading comment about mocking the dynamic client

ensureQueueExists already accepts a dynamic.Interface; the fake client here is sufficient, so the “can’t properly mock” comment is confusing.

Apply:

-			// This test is limited because we can't easily mock the dynamic client
-			// In a real test environment, you would set up a proper test cluster or use envtest
+			// Using a fake dynamic client to simulate presence/absence of the Queue resource

315-357: Improve testability of DetermineKaiSchedulerQueue

Because it creates its own dynamic client, this test must expect failure. Refactor the function to accept a client/factory so you can inject dynamicfake here and assert success paths too.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bad5d12 and a437daa.

📒 Files selected for processing (10)
  • deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (2 hunks)
  • deploy/cloud/operator/cmd/main.go (2 hunks)
  • deploy/cloud/operator/config/rbac/role.yaml (1 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
  • deploy/cloud/operator/internal/controller_common/predicate.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/grove.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/grove_test.go (1 hunks)
  • docs/guides/dynamo_deploy/multinode-deployment.md (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🧬 Code graph analysis (4)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/internal/dynamo/grove.go (1)
  • DetermineKaiSchedulerQueue (84-105)
deploy/cloud/operator/internal/dynamo/grove.go (2)
deploy/cloud/operator/internal/consts/consts.go (4)
  • DefaultKaiSchedulerQueue (62-62)
  • KubeAnnotationKaiSchedulerQueue (59-59)
  • KaiSchedulerName (61-61)
  • KubeLabelKaiSchedulerQueue (60-60)
deploy/cloud/operator/internal/controller_common/predicate.go (1)
  • Config (45-54)
deploy/cloud/operator/cmd/main.go (1)
deploy/cloud/operator/internal/controller_common/predicate.go (2)
  • KaiSchedulerConfig (40-43)
  • DetectKaiSchedulerAvailability (106-139)
deploy/cloud/operator/internal/dynamo/grove_test.go (3)
deploy/cloud/operator/internal/consts/consts.go (4)
  • DefaultKaiSchedulerQueue (62-62)
  • KubeAnnotationKaiSchedulerQueue (59-59)
  • KaiSchedulerName (61-61)
  • KubeLabelKaiSchedulerQueue (60-60)
deploy/cloud/operator/internal/dynamo/grove.go (2)
  • ResolveKaiSchedulerQueue (109-111)
  • DetermineKaiSchedulerQueue (84-105)
deploy/cloud/operator/internal/controller_common/predicate.go (3)
  • Config (45-54)
  • GroveConfig (33-38)
  • KaiSchedulerConfig (40-43)
🪛 LanguageTool
docs/guides/dynamo_deploy/multinode-deployment.md

[grammar] ~27-~27: There might be a mistake here.
Context: ...duling and auto-scaling for AI workloads - **[KAI-Scheduler](https://github.com/NVIDIA...

(QB_NEW_EN)


[grammar] ~32-~32: There might be a mistake here.
Context: ...e nodes. Features Enabled with Grove: - Declarative composition of AI workloads ...

(QB_NEW_EN)


[grammar] ~33-~33: There might be a mistake here.
Context: ... Declarative composition of AI workloads - Multi-level horizontal auto-scaling - Cu...

(QB_NEW_EN)


[grammar] ~34-~34: There might be a mistake here.
Context: ...ds - Multi-level horizontal auto-scaling - Custom startup ordering for components -...

(QB_NEW_EN)


[grammar] ~35-~35: There might be a mistake here.
Context: ...- Custom startup ordering for components - Resource-aware rolling updates [KAI-Sc...

(QB_NEW_EN)


[grammar] ~41-~41: There might be a mistake here.
Context: ... Features Enabled with KAI-Scheduler: - Gang scheduling - Network topology-aware...

(QB_NEW_EN)


[grammar] ~42-~42: There might be a mistake here.
Context: ... with KAI-Scheduler:** - Gang scheduling - Network topology-aware pod placement - A...

(QB_NEW_EN)


[grammar] ~43-~43: There might be a mistake here.
Context: ...g - Network topology-aware pod placement - AI workload-optimized scheduling algorit...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...workload-optimized scheduling algorithms - GPU resource awareness and allocation - ...

(QB_NEW_EN)


[grammar] ~45-~45: There might be a mistake here.
Context: ... - GPU resource awareness and allocation - Support for complex scheduling constrain...

(QB_NEW_EN)


[grammar] ~46-~46: There might be a mistake here.
Context: ...pport for complex scheduling constraints - Integration with Grove for enhanced capa...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...ion with Grove for enhanced capabilities - Performance optimizations for large-scal...

(QB_NEW_EN)


[grammar] ~76-~76: There might be a mistake here.
Context: ...When Only One Orchestrator is Available: - The installed orchestrator (Grove or LWS...

(QB_NEW_EN)


[grammar] ~79-~79: There might be a mistake here.
Context: ...ly selected #### Scheduler Integration: - With Grove: Automatically integrates w...

(QB_NEW_EN)


[grammar] ~80-~80: There might be a mistake here.
Context: ...AI-Scheduler) when available, providing: - Advanced queue management via `nvidia.co...

(QB_NEW_EN)


[grammar] ~81-~81: There might be a mistake here.
Context: ...idia.com/kai-scheduler-queue` annotation - AI-optimized scheduling policies - Res...

(QB_NEW_EN)


[grammar] ~82-~82: There might be a mistake here.
Context: ...ion - AI-optimized scheduling policies - Resource-aware workload placement - **Wi...

(QB_NEW_EN)


[grammar] ~83-~83: There might be a mistake here.
Context: ...es - Resource-aware workload placement - With LWS: Uses Volcano scheduler for g...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/guides/dynamo_deploy/multinode-deployment.md

72-72: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


76-76: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


79-79: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


86-86: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (10)
deploy/cloud/operator/config/rbac/role.yaml (1)

188-194: RBAC rule for scheduling.run.ai/queues is present in the generated manifest

I’ve confirmed that deploy/cloud/operator/config/rbac/role.yaml includes the following entry at lines 188–194:

apiGroups:
- scheduling.run.ai
resources:
- queues
verbs:
- get
- list

Since the rule is already in the checked-in YAML, no further changes are required.

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

91-91: Kubebuilder RBAC marker for queues verified

The RBAC annotation for queues is present at the expected location—no further code changes are required. Please remember to re-run controller-gen so the generated ClusterRole stays in sync.

• File: deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go
• Line: 91

deploy/cloud/operator/internal/consts/consts.go (1)

58-63: I’ve initiated searches to locate any usage or reference to those constants across the codebase to confirm the canonical label key.

deploy/cloud/helm/platform/components/operator/templates/manager-rbac.yaml (1)

487-490: Added controller-manager subject: LGTM

Binding the controller-manager SA explicitly is fine and matches the rest of the template.

deploy/cloud/operator/internal/dynamo/graph.go (1)

950-952: Verify that clique labels propagate to Pod labels

Kai requires the queue label on Pods. Please confirm Grove propagates clique.Labels to Pod labels; otherwise, inject the label at the Pod level via the appropriate Pod template metadata field.

deploy/cloud/operator/cmd/main.go (1)

176-179: Config wiring for KaiScheduler looks correct

The new field initialization is consistent with Grove’s pattern.

deploy/cloud/operator/internal/dynamo/grove.go (1)

110-112: Resolve-only helper is appropriate

This helper cleanly separates resolution from validation. No issues.
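The resolution step on its own is small enough to illustrate. This is a sketch under stated assumptions: the annotation key `nvidia.com/kai-scheduler-queue` and the default `dynamo` come from the docs change in this PR, while the identifiers, the signature, and the whitespace trimming are hypothetical and may differ from ResolveKaiSchedulerQueue.

```go
package main

import (
	"fmt"
	"strings"
)

// Values quoted in this PR's docs; the Go identifiers are hypothetical
// stand-ins for the constants added in consts.go.
const (
	queueAnnotationKey = "nvidia.com/kai-scheduler-queue"
	defaultQueueName   = "dynamo"
)

// resolveQueueName returns the queue named by the annotation, or the default
// when the annotation is missing or blank. It performs no cluster lookup, so
// validation stays a separate step.
func resolveQueueName(annotations map[string]string) string {
	if name := strings.TrimSpace(annotations[queueAnnotationKey]); name != "" {
		return name
	}
	return defaultQueueName
}

func main() {
	fmt.Println(resolveQueueName(nil))
	fmt.Println(resolveQueueName(map[string]string{queueAnnotationKey: "team-a"}))
}
```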

deploy/cloud/operator/internal/controller_common/predicate.go (2)

40-44: KaiSchedulerConfig addition—LGTM

Matches existing Grove config pattern.


45-54: Config.KaiScheduler field—LGTM

Public API extension is straightforward and consistent.

deploy/cloud/operator/internal/dynamo/grove_test.go (1)

160-174: Injection path happy-case tests—LGTM

Good coverage of schedulerName and label injection when both features are enabled.

@julienmancuso julienmancuso enabled auto-merge (squash) August 27, 2025 23:40
@julienmancuso julienmancuso merged commit 9e6972a into main Aug 28, 2025
12 of 14 checks passed
@julienmancuso julienmancuso deleted the jsm/dep-335 branch August 28, 2025 00:09
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
michaelshin pushed a commit that referenced this pull request Sep 2, 2025
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
Signed-off-by: Julien Mancuso <[email protected]>
Signed-off-by: Krishnan Prashanth <[email protected]>
nnshah1 pushed a commit that referenced this pull request Sep 8, 2025

3 participants