
Conversation


@nvrohanv nvrohanv commented Oct 6, 2025

Overview:

Details:

This PR fixes the following issues:

  1. Add --allow-run-as-root to mpirun commands when running TRT-LLM.
  2. Stop activating the virtual env before running trtllm-llmapi-launch.
  3. Allow the user to specify claims in the resources section and have them propagate to the PodSpec. Currently, a spec like this
      resources:
        requests:
          cpu: "130"
          memory: "800Gi"
        limits:
          cpu: "130"
          memory: "800Gi"
          gpu: "4"
        claims:
          - name: compute-domain-channel

would throw an error, but claims are required for DRA/ComputeDomain to work properly when enabling MNNVL on Blackwell (a rough sketch of the expected result on the generated pod follows below).
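For illustration, here is a minimal sketch (using upstream corev1 types, not the operator's actual generation code) of how a claim named compute-domain-channel would be expected to surface on the generated pod once claims propagate:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Container level: the claim name is referenced from the container's
	// resources, next to the usual CPU/memory/GPU limits and requests.
	container := corev1.Container{
		Name: "main",
		Resources: corev1.ResourceRequirements{
			Claims: []corev1.ResourceClaim{{Name: "compute-domain-channel"}},
		},
	}

	// Pod level: the same name must also appear in spec.resourceClaims so the
	// kubelet knows which ResourceClaim/ResourceClaimTemplate backs it
	// (the claim source fields are elided in this sketch).
	podSpec := corev1.PodSpec{
		Containers:     []corev1.Container{container},
		ResourceClaims: []corev1.PodResourceClaim{{Name: "compute-domain-channel"}},
	}

	fmt.Println(podSpec.ResourceClaims[0].Name, podSpec.Containers[0].Resources.Claims[0].Name)
}
```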

Summary by CodeRabbit

  • New Features

    • Added ResourceClaims support via a new claims field in resource configurations for component and graph deployments; claims are propagated to pods/containers alongside existing limits/requests (unchanged).
    • Multinode TRT-LLM runs now include the --allow-run-as-root flag; startup command streamlined.
  • Tests

    • Introduced tests validating ResourceClaims propagation, pod-level claim mappings, and related volume setup for single and multiple-claim scenarios.

@nvrohanv nvrohanv requested a review from a team as a code owner October 6, 2025 22:29

coderabbitai bot commented Oct 6, 2025

Walkthrough

Adds ResourceClaims support across CRDs, API types, controller assembly, and pod spec generation for Dynamo components/graphs. Introduces a claims array in resource schemas, wires it through Resources structs, deepcopy, and merge logic. Updates TRT-LLM backend to add an mpirun flag and remove venv activation, with corresponding test adjustments and new claim-focused tests.

Changes

• Helm CRD templates (claims field): deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
  Add a spec.resources.claims array to the CRD schemas (items: object with required name, plus a request string).
• Operator API types + deepcopy: deploy/cloud/operator/api/dynamo/common/common.go, deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go
  Add Resources.Claims []corev1.ResourceClaim and deepcopy handling for the slice.
• Operator CRD bases (claims field): deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml, deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  Extend the CRD base schemas with a spec.resources.claims array (name required; request present).
• Controller resource aggregation: deploy/cloud/operator/internal/controller_common/resource.go
  Update GetResourcesConfig to accumulate non-nil Claims into currentResources.Claims.
• PodSpec generation and tests: deploy/cloud/operator/internal/dynamo/graph.go, deploy/cloud/operator/internal/dynamo/graph_test.go
  Merge override Resources.Claims into container.Resources.Claims; add tests validating ResourceClaims and PodResourceClaims propagation, volumes, and presence/absence scenarios.
• TRT-LLM backend and tests: deploy/cloud/operator/internal/dynamo/backend_trtllm.go, deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go
  Remove venv activation from the leader mpirun path; add the --allow-run-as-root flag; update expected command strings in tests.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User (CRD)
  participant O as Operator Controller
  participant RC as Resource Config Merge
  participant PG as PodSpec Generator
  participant K as Kubernetes API

  U->>O: Submit Dynamo{Component,Graph}Deployment (resources.claims)
  O->>RC: Build Resources from spec and overrides
  RC->>RC: Accumulate Requests/Limits
  RC->>RC: Append Claims (new)
  RC-->>O: currentResources (incl. Claims)
  O->>PG: Generate PodSpec with Resources
  PG->>PG: container.Resources.Claims += override Claims (new)
  PG->>PG: PodSpec.ResourceClaims from spec.claims (new)
  PG-->>O: Completed PodSpec
  O->>K: Create/Update Pod(s)
  Note over RC,PG: New: propagate ResourceClaims into container and PodSpec

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I nibble specs where claims now sprout,
Little carrots of “name” and “request” about.
I hop through pods, stitching seams,
From CRDs to containers—resource dreams!
mpirun waves a rooty flag—how cute!
Tests thump approval with a happy foot.
Boop the build—deploy is en route! 🥕🐇

Pre-merge checks

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 14.29%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check ⚠️ Warning: The pull request description includes the Overview and Details sections required by the repository template but is missing the "Where should the reviewer start?" and "Related Issues" sections, so reviewers lack guidance on which files to focus on and which issues are being resolved. Resolution: add a "Where should the reviewer start?" section highlighting the key files or areas for review, and a "Related Issues" section using the appropriate action keywords to reference any associated GitHub issues.
✅ Passed checks (1 passed)
  • Title Check ✅ Passed: The title succinctly summarizes the two primary changes in this PR, noting the MPI flow fix and the addition of resource claim support, without extraneous detail.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)

10184-10203: Preserve map semantics for claims list

All other ResourceClaim lists in this CRD mark the list as a map keyed by name, which keeps server-side apply merges stable and enforces key uniqueness. This new block is missing those markers, so it falls back to an atomic list and will replace the entire array on each patch (and allows duplicate names). Please align it with the existing schema.

                       type: array
+                      x-kubernetes-list-map-keys:
+                        - name
+                      x-kubernetes-list-type: map
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10315-10334: Preserve map semantics for claims.

Line 10315: Please add x-kubernetes-list-type: map and x-kubernetes-list-map-keys: ["name"] so the CRD matches corev1.ResourceClaim’s map semantics. Without these keys the list is treated as atomic, breaking server-side apply/merge behaviour for individual claims.

deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10315-10334: Add SSA list markers for claims.

To match corev1.ResourceClaim semantics (merge by name) and avoid server-side apply conflicts, please mark this list as a map keyed on name.

                           claims:
+                            x-kubernetes-list-type: map
+                            x-kubernetes-list-map-keys:
+                              - name
                             items:
deploy/cloud/operator/internal/dynamo/graph_test.go (1)

4922-5185: LGTM! Comprehensive test coverage for ResourceClaims.

The test thoroughly validates ResourceClaims propagation through GenerateBasePodSpec, covering:

  • Single and multiple resource claims
  • Pod-level and container-level claims
  • Volume handling with claims
  • Coexistence of claims with standard resources (CPU/Memory/GPU)

Optional: Consider additional edge-case tests.

The current test cases cover the main scenarios well. For even more robust coverage, you might consider adding test cases for:

  1. Component with only ExtraPodSpec.PodSpec.ResourceClaims but no Resources.Claims (to verify they work independently)
  2. Component with only Resources.Claims but no ExtraPodSpec.PodSpec.ResourceClaims (to verify container-level claims work standalone)

These would help document the independent behavior of container-level vs pod-level claims, but the current coverage is already sufficient for the feature.

Optional: More descriptive assertion messages.

At lines 5158-5179, the error messages could be more informative. For example:

-if cpu, exists := container.Resources.Requests[corev1.ResourceCPU]; !exists || cpu.IsZero() {
-    t.Errorf("GenerateBasePodSpec() expected CPU request to be set")
-}
+if cpu, exists := container.Resources.Requests[corev1.ResourceCPU]; !exists || cpu.IsZero() {
+    t.Errorf("GenerateBasePodSpec() expected CPU request to be set, got: exists=%v, value=%v", exists, cpu)
+}

This would make debugging easier if these assertions fail, but the current messages are acceptable for passing tests.

deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)

10184-10203: Align claims list semantics with PodSpec claims

To preserve unique keys and strategic merge behavior (like PodSpec’s resourceClaims), please add the list metadata so the API server treats items as a map keyed on name.

                     type: array
+                    x-kubernetes-list-map-keys:
+                      - name
+                    x-kubernetes-list-type: map
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d48fe6 and 93c46f9.

📒 Files selected for processing (11)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/operator/api/dynamo/common/common.go (1 hunks)
  • deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
  • deploy/cloud/operator/internal/controller_common/resource.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_trtllm.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (10 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/cloud/operator/internal/dynamo/graph.go (1)
deploy/cloud/operator/api/dynamo/common/common.go (1)
  • Resources (34-38)
deploy/cloud/operator/internal/dynamo/graph_test.go (5)
deploy/cloud/operator/internal/dynamo/graph.go (3)
  • Config (64-74)
  • Resources (82-87)
  • GenerateBasePodSpec (693-860)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
  • DynamoComponentDeploymentOverridesSpec (56-58)
  • DynamoComponentDeploymentSharedSpec (60-118)
deploy/cloud/operator/internal/consts/consts.go (4)
  • ComponentTypeWorker (53-53)
  • DefaultSharedMemorySize (66-66)
  • ComponentTypeFrontend (52-52)
  • MultinodeDeploymentTypeGrove (87-87)
deploy/cloud/operator/api/dynamo/common/common.go (3)
  • Resources (34-38)
  • ResourceItem (25-32)
  • ExtraPodSpec (59-62)
deploy/cloud/operator/api/v1alpha1/common.go (1)
  • VolumeMount (42-54)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: changed-files
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
deploy/cloud/operator/internal/controller_common/resource.go (1)

471-476: LGTM! ResourceClaims support correctly implemented.

The Claims handling follows the same pattern as Limits and Requests: check for nil, initialize if needed, then populate. The implementation correctly appends the resource claims to the pod specification, enabling Dynamic Resource Allocation (DRA) for features like MNNVL on Blackwell hardware.
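As a hedged illustration of the nil-check/initialize/populate pattern described here (not the operator's actual GetResourcesConfig; the upstream ResourceRequirements type stands in for the operator's own Resources struct):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// mergeClaims appends override claims, when present, to whatever has already
// been accumulated, initializing the slice on first use.
func mergeClaims(current *corev1.ResourceRequirements, override []corev1.ResourceClaim) {
	if override == nil {
		return
	}
	if current.Claims == nil {
		current.Claims = []corev1.ResourceClaim{}
	}
	current.Claims = append(current.Claims, override...)
}

func main() {
	res := corev1.ResourceRequirements{}
	mergeClaims(&res, []corev1.ResourceClaim{{Name: "compute-domain-channel"}})
	fmt.Println(res.Claims) // the single accumulated claim
}
```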

deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1)

193-197: LGTM! Auto-generated deepcopy implementation is correct.

The deepcopy logic for the Claims field correctly allocates a new slice and copies ResourceClaim elements. Since ResourceClaim contains only a string field (immutable in Go), the shallow copy via copy() is sufficient and follows the established pattern for similar fields.
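For reference, a minimal sketch of the allocate-and-copy pattern that generated deepcopy code typically uses for such a slice (not the generated file itself):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// deepCopyClaims mirrors the generated pattern: allocate a fresh slice and copy
// elements. Because ResourceClaim only holds string fields, an element-wise
// copy is already a full copy.
func deepCopyClaims(in []corev1.ResourceClaim) []corev1.ResourceClaim {
	if in == nil {
		return nil
	}
	out := make([]corev1.ResourceClaim, len(in))
	copy(out, in)
	return out
}

func main() {
	orig := []corev1.ResourceClaim{{Name: "compute-domain-channel"}}
	cp := deepCopyClaims(orig)
	cp[0].Name = "changed"
	fmt.Println(orig[0].Name, cp[0].Name) // compute-domain-channel changed
}
```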

deploy/cloud/operator/api/dynamo/common/common.go (1)

35-37: LGTM! Claims field properly added to support ResourceClaims.

The new Claims field is correctly defined as a slice of corev1.ResourceClaim with appropriate JSON tags. Using a slice directly (rather than a pointer-to-slice) is idiomatic Go, and the field aligns with the PR objective to enable DRA/ComputeDomain functionality for MNNVL on Blackwell.
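Roughly, the shape being described looks like this (a sketch only: the Claims field and its JSON tag are taken from the review, while the Requests/Limits types shown here are simplified stand-ins for the operator's actual fields):

```go
package common

import corev1 "k8s.io/api/core/v1"

// Resources sketches the struct described above; the real type in
// api/dynamo/common/common.go carries additional fields.
type Resources struct {
	Requests map[string]string      `json:"requests,omitempty"`
	Limits   map[string]string      `json:"limits,omitempty"`
	Claims   []corev1.ResourceClaim `json:"claims,omitempty"`
}
```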

deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (1)

67-67: LGTM! Test expectations correctly updated for --allow-run-as-root flag.

All test expectations have been consistently updated to include the --allow-run-as-root flag in mpirun commands, properly positioned after mpirun and before --oversubscribe. The changes align with the PR objective to enable running TRT-LLM with mpirun as root.

Also applies to: 123-123, 574-574, 584-584, 602-602, 620-620, 638-638, 656-656, 674-674, 692-692

deploy/cloud/operator/internal/dynamo/backend_trtllm.go (2)

151-151: LGTM! MPI root execution flag added as intended.

The --allow-run-as-root flag enables mpirun to execute as the root user, which is typically required in containerized environments. This aligns with the PR objectives for TRT-LLM support.


146-146: Confirm venv bin on PATH or restore activation. Ensure /opt/dynamo/venv/bin/trtllm-llmapi-launch is accessible at runtime without the explicit source /opt/dynamo/venv/bin/activate; otherwise the wrapped command may fail.
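One hedged way to sanity-check this concern (a sketch, not part of the PR): resolve the launcher from PATH at startup and fail fast if it is missing.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// With the explicit venv activation removed, trtllm-llmapi-launch must
	// already be resolvable from PATH (for example because /opt/dynamo/venv/bin
	// is baked into the image's PATH).
	path, err := exec.LookPath("trtllm-llmapi-launch")
	if err != nil {
		log.Fatalf("trtllm-llmapi-launch not found on PATH: %v", err)
	}
	log.Printf("launcher resolved to %s", path)
}
```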

deploy/cloud/operator/internal/dynamo/graph.go (1)

762-768: No changes needed for ResourceClaims handling

The graph.go code mirrors controller_common/resource.go by appending new ResourceClaim entries, so this behavior is intentional and consistent with existing patterns.

@nvrohanv nvrohanv changed the title from "fix mpi flow and add resourceClaim" to "fix: mpi flow and add resourceClaim" on Oct 6, 2025
@github-actions github-actions bot added the fix label Oct 6, 2025
@nvrohanv nvrohanv merged commit 5d90e53 into main Oct 23, 2025
22 of 23 checks passed
@nvrohanv nvrohanv deleted the nvrohanv/fix-mpi-flow branch October 23, 2025 05:13
nvrohanv added a commit that referenced this pull request Oct 23, 2025
saturley-hall pushed a commit that referenced this pull request Oct 23, 2025