fix: mpi flow and add resourceClaim #3446
Conversation
Signed-off-by: Rohan Varma <[email protected]>
Walkthrough

Adds ResourceClaims support across CRDs, API types, controller assembly, and pod spec generation for Dynamo components/graphs. Introduces a claims array in resource schemas, wires it through Resources structs, deepcopy, and merge logic. Updates the TRT-LLM backend to add an mpirun flag and remove venv activation, with corresponding test adjustments and new claim-focused tests.

Changes
Sequence Diagram(s)

sequenceDiagram
autonumber
participant U as User (CRD)
participant O as Operator Controller
participant RC as Resource Config Merge
participant PG as PodSpec Generator
participant K as Kubernetes API
U->>O: Submit Dynamo{Component,Graph}Deployment (resources.claims)
O->>RC: Build Resources from spec and overrides
RC->>RC: Accumulate Requests/Limits
RC->>RC: Append Claims (new)
RC-->>O: currentResources (incl. Claims)
O->>PG: Generate PodSpec with Resources
PG->>PG: container.Resources.Claims += override Claims (new)
PG->>PG: PodSpec.ResourceClaims from spec.claims (new)
PG-->>O: Completed PodSpec
O->>K: Create/Update Pod(s)
Note over RC,PG: New: propagate ResourceClaims into container and PodSpec
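To make the flow above concrete, here is a minimal sketch of how a user might declare claims. The kind, apiVersion, and field casing are inferred from the CRD file names and the walkthrough (resources.claims and ExtraPodSpec.PodSpec.ResourceClaims); the claim and template names are purely illustrative.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoComponentDeployment
metadata:
  name: example-worker            # hypothetical name
spec:
  resources:
    claims:                       # container-level claims, merged into container.Resources.Claims
      - name: compute-domain
  extraPodSpec:
    podSpec:
      resourceClaims:             # pod-level claims, copied into PodSpec.ResourceClaims
        - name: compute-domain
          resourceClaimTemplateName: compute-domain-template   # assumed template name
```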
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (5)
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)
10184-10203: Preserve map semantics for the `claims` list.

All other ResourceClaim lists in this CRD mark the list as a map keyed by `name`, which keeps server-side apply merges stable and enforces key uniqueness. This new block is missing those markers, so it falls back to an atomic list and will replace the entire array on each patch (and allows duplicate names). Please align it with the existing schema:

```diff
   type: array
+  x-kubernetes-list-map-keys:
+    - name
+  x-kubernetes-list-type: map
```
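For reference, a sketch of what the claims property could look like with the suggested markers in place (item properties abbreviated; the exact schema is whatever controller-gen emits):

```yaml
claims:
  items:
    properties:
      name:
        type: string
    required:
      - name
    type: object
  type: array
  x-kubernetes-list-map-keys:
    - name
  x-kubernetes-list-type: map
```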
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10315-10334: Preserve map semantics for `claims`.

Line 10315: Please add `x-kubernetes-list-type: map` and `x-kubernetes-list-map-keys: ["name"]` so the CRD matches `corev1.ResourceClaim`'s map semantics. Without these keys the list is treated as atomic, breaking server-side apply/merge behaviour for individual claims.
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1)

10315-10334: Add SSA list markers for `claims`.

To match `corev1.ResourceClaim` semantics (merge by name) and avoid server-side apply conflicts, please mark this list as a map keyed on `name`:

```diff
   claims:
+    x-kubernetes-list-type: map
+    x-kubernetes-list-map-keys:
+      - name
     items:
```
deploy/cloud/operator/internal/dynamo/graph_test.go (1)

4922-5185: LGTM! Comprehensive test coverage for ResourceClaims.

The test thoroughly validates ResourceClaims propagation through GenerateBasePodSpec, covering:
- Single and multiple resource claims
- Pod-level and container-level claims
- Volume handling with claims
- Coexistence of claims with standard resources (CPU/Memory/GPU)
Optional: Consider additional edge-case tests.
The current test cases cover the main scenarios well. For even more robust coverage, you might consider adding test cases for:
- Component with only ExtraPodSpec.PodSpec.ResourceClaims but no Resources.Claims (to verify they work independently)
- Component with only Resources.Claims but no ExtraPodSpec.PodSpec.ResourceClaims (to verify container-level claims work standalone)
These would help document the independent behavior of container-level vs pod-level claims, but the current coverage is already sufficient for the feature.
Optional: More descriptive assertion messages.
At lines 5158-5179, the error messages could be more informative. For example:
```diff
-if cpu, exists := container.Resources.Requests[corev1.ResourceCPU]; !exists || cpu.IsZero() {
-	t.Errorf("GenerateBasePodSpec() expected CPU request to be set")
-}
+if cpu, exists := container.Resources.Requests[corev1.ResourceCPU]; !exists || cpu.IsZero() {
+	t.Errorf("GenerateBasePodSpec() expected CPU request to be set, got: exists=%v, value=%v", exists, cpu)
+}
```

This would make debugging easier if these assertions fail, but the current messages are acceptable for passing tests.
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1)
10184-10203: Align `claims` list semantics with PodSpec claims.

To preserve unique keys and strategic merge behavior (like PodSpec's `resourceClaims`), please add the list metadata so the API server treats items as a map keyed on `name`:

```diff
   type: array
+  x-kubernetes-list-map-keys:
+    - name
+  x-kubernetes-list-type: map
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
- deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (1 hunks)
- deploy/cloud/operator/api/dynamo/common/common.go (1 hunks)
- deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1 hunks)
- deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1 hunks)
- deploy/cloud/operator/internal/controller_common/resource.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/backend_trtllm.go (1 hunks)
- deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (10 hunks)
- deploy/cloud/operator/internal/dynamo/graph.go (2 hunks)
- deploy/cloud/operator/internal/dynamo/graph_test.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
deploy/cloud/operator/internal/dynamo/graph.go (1)
- deploy/cloud/operator/api/dynamo/common/common.go (1)
  - Resources (34-38)

deploy/cloud/operator/internal/dynamo/graph_test.go (5)
- deploy/cloud/operator/internal/dynamo/graph.go (3)
  - Config (64-74), Resources (82-87), GenerateBasePodSpec (693-860)
- deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (2)
  - DynamoComponentDeploymentOverridesSpec (56-58), DynamoComponentDeploymentSharedSpec (60-118)
- deploy/cloud/operator/internal/consts/consts.go (4)
  - ComponentTypeWorker (53-53), DefaultSharedMemorySize (66-66), ComponentTypeFrontend (52-52), MultinodeDeploymentTypeGrove (87-87)
- deploy/cloud/operator/api/dynamo/common/common.go (3)
  - Resources (34-38), ResourceItem (25-32), ExtraPodSpec (59-62)
- deploy/cloud/operator/api/v1alpha1/common.go (1)
  - VolumeMount (42-54)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: changed-files
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (7)
deploy/cloud/operator/internal/controller_common/resource.go (1)
471-476: LGTM! ResourceClaims support correctly implemented.

The Claims handling follows the same pattern as Limits and Requests: check for nil, initialize if needed, then populate. The implementation correctly appends the resource claims to the pod specification, enabling Dynamic Resource Allocation (DRA) for features like MNNVL on Blackwell hardware.
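As a rough illustration of that nil-check/initialize/append pattern (function and variable names here are invented, not the operator's actual identifiers):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// mergeClaims sketches the pattern described above: leave the target untouched
// when there is nothing to add, initialize the slice if needed, then append.
func mergeClaims(current *corev1.ResourceRequirements, overrides []corev1.ResourceClaim) {
	if len(overrides) == 0 {
		return
	}
	if current.Claims == nil {
		current.Claims = []corev1.ResourceClaim{}
	}
	current.Claims = append(current.Claims, overrides...)
}

func main() {
	res := corev1.ResourceRequirements{}
	mergeClaims(&res, []corev1.ResourceClaim{{Name: "compute-domain"}})
	fmt.Println(res.Claims) // the appended claim is now part of the container resources
}
```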
deploy/cloud/operator/api/dynamo/common/zz_generated.deepcopy.go (1)
193-197: LGTM! Auto-generated deepcopy implementation is correct.

The deepcopy logic for the Claims field correctly allocates a new slice and copies ResourceClaim elements. Since ResourceClaim contains only a string field (immutable in Go), the shallow copy via `copy()` is sufficient and follows the established pattern for similar fields.
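A self-contained sketch of that generated pattern, using corev1.ResourceRequirements as a stand-in for the operator's Resources type:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// copyClaims mirrors the controller-gen output described above: allocate a
// fresh slice, then shallow-copy the ResourceClaim values, which is safe
// because ResourceClaim only carries string fields.
func copyClaims(in, out *corev1.ResourceRequirements) {
	if in.Claims != nil {
		in, out := &in.Claims, &out.Claims
		*out = make([]corev1.ResourceClaim, len(*in))
		copy(*out, *in)
	}
}

func main() {
	src := corev1.ResourceRequirements{Claims: []corev1.ResourceClaim{{Name: "compute-domain"}}}
	var dst corev1.ResourceRequirements
	copyClaims(&src, &dst)
	dst.Claims[0].Name = "mutated"
	fmt.Println(src.Claims[0].Name) // unchanged: the copy does not alias the source slice
}
```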
deploy/cloud/operator/api/dynamo/common/common.go (1)

35-37: LGTM! Claims field properly added to support ResourceClaims.

The new Claims field is correctly defined as a slice of `corev1.ResourceClaim` with appropriate JSON tags. Using a slice directly (rather than a pointer-to-slice) is idiomatic Go, and the field aligns with the PR objective to enable DRA/ComputeDomain functionality for MNNVL on Blackwell.
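A minimal sketch of the field's shape; the neighbouring fields are elided and the JSON tag is an assumption based on the `claims` key in the CRDs:

```go
package common

import corev1 "k8s.io/api/core/v1"

// Resources as sketched here shows only the new field; the real struct in
// common.go (lines 34-38) also carries the existing request/limit fields.
type Resources struct {
	// ... existing fields elided ...
	Claims []corev1.ResourceClaim `json:"claims,omitempty"`
}
```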
deploy/cloud/operator/internal/dynamo/backend_trtllm_test.go (1)

67-67: LGTM! Test expectations correctly updated for --allow-run-as-root flag.

All test expectations have been consistently updated to include the `--allow-run-as-root` flag in mpirun commands, properly positioned after `mpirun` and before `--oversubscribe`. The changes align with the PR objective to enable running TRT-LLM with mpirun as root.

Also applies to: 123-123, 574-574, 584-584, 602-602, 620-620, 638-638, 656-656, 674-674, 692-692
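For orientation, the wrapped command now has roughly this shape; everything besides the two flags and the launcher path (referenced again in the comment further below) is elided or assumed:

```sh
# Illustrative shape only; the real command line is assembled in backend_trtllm.go.
mpirun --allow-run-as-root --oversubscribe \
  ... \
  /opt/dynamo/venv/bin/trtllm-llmapi-launch ...
```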
deploy/cloud/operator/internal/dynamo/backend_trtllm.go (2)
151-151: LGTM! MPI root execution flag added as intended.

The `--allow-run-as-root` flag enables mpirun to execute as the root user, which is typically required in containerized environments. This aligns with the PR objectives for TRT-LLM support.
146-146: Confirm venv bin on PATH or restore activation.

Ensure `/opt/dynamo/venv/bin/trtllm-llmapi-launch` is accessible at runtime without the explicit `source /opt/dynamo/venv/bin/activate`; otherwise the wrapped command may fail.

deploy/cloud/operator/internal/dynamo/graph.go (1)
762-768: No changes needed for ResourceClaims handling
The graph.go code mirrors controller_common/resource.go by appending new ResourceClaim entries, so this behavior is intentional and consistent with existing patterns.
Signed-off-by: Rohan Varma <[email protected]>
Signed-off-by: Rohan Varma <[email protected]>
Overview:
Details:
This PR fixes the following issues:
would throw an error, but this is required for DRA/ComputeDomain to work properly in enabling MNNVL on Blackwell.
Summary by CodeRabbit
New Features
Tests