
Conversation

@biswapanda (Contributor) commented May 30, 2025

Overview:

Cherry-pick: #1255

Linear ticket for the issue : https://linear.app/nvidia-dynamo/issue/DEP-130/dynamo-build-populate-default-image-name

This PR cherry-picks the following fixes:

  • Populate the base image name from the env variable DYNAMO_IMAGE if it is not specified in the decorator (see the sketch after this list)
  • Add the operator "environment=kubernetes" setting for the Planner component only
  • Fix the resources service args: match the resource spec (cpu, gpu) with the operator
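
A minimal sketch of the intended image fallback, assuming a service object that exposes `config.image` (exact names in `deploy/sdk/src/dynamo/sdk/cli/build.py` may differ):

```python
import os

def resolve_image(service) -> str:
    # Prefer the image given in the decorator; otherwise fall back to
    # the DYNAMO_IMAGE environment variable.
    image = getattr(service.config, "image", None) or os.environ.get("DYNAMO_IMAGE")
    assert image, "No image specified in the decorator and DYNAMO_IMAGE is not set"
    return image
```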

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Resource configuration now supports string-based CPU, GPU, and memory values with updated defaults for improved flexibility.
    • Service deployments can now conditionally set environment arguments based on component type.
  • Bug Fixes
    • Corrected resource field naming in service configuration to ensure proper resource handling.
  • Documentation
    • Updated support matrix and example configuration files to reflect new version numbers and resource specifications.
  • Chores
    • Bumped version numbers across multiple components and dependencies to 0.3.0.
    • Updated Docker image and build script versions for compatibility with latest releases.
  • Tests
    • Enhanced resource configuration tests to verify new default values and field types.
  • Style
    • Removed unnecessary debug and logging statements for cleaner output.

copy-pr-bot bot commented May 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the fix label May 30, 2025
@biswapanda biswapanda changed the base branch from main to release/0.3.0 May 30, 2025 18:21
@biswapanda biswapanda self-assigned this May 30, 2025
coderabbitai bot (Contributor) commented May 30, 2025

Walkthrough

This update primarily increases version numbers across multiple Rust, Python, and Docker-related files, adjusts dependency versions, and introduces resource configuration changes in both code and example YAMLs. It also modifies logic for resource field handling, adds validation for GPU resource types, and restricts specific Kubernetes argument injection to a targeted component type.

Changes

| File(s) | Change Summary |
|---------|----------------|
| Cargo.toml, lib/bindings/python/Cargo.toml, lib/runtime/examples/Cargo.toml, pyproject.toml | Bump project and dependency versions from 0.2.1 to 0.3.0 |
| lib/llm/Cargo.toml | Update nixl-sys dependency version from 0.2.1-rc.3 to 0.3.0-rc.2 |
| container/Dockerfile.vllm, docs/support_matrix.md | Update vllm patched package version from 0.8.4.post1 to 0.8.4.post2 |
| container/build.sh | Update NIXL_COMMIT hash to a new commit |
| deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go | Add ComponentTypePlanner constant; restrict environment arg injection to Planner components |
| deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go | Adjust test expectations for leader container arguments to match new conditional logic |
| deploy/sdk/src/dynamo/sdk/cli/build.py | Improve Docker image resolution logic, correct resource field, remove debug print |
| deploy/sdk/src/dynamo/sdk/core/lib.py | Remove logging statement from service decorator |
| deploy/sdk/src/dynamo/sdk/core/protocol/interface.py | Change resource config field types/defaults, add GPU string validator, rename resource to resources |
| deploy/sdk/src/dynamo/sdk/tests/test_resources.py | Update assertions for resource config, add checks for CPU/GPU/memory as strings |
| examples/llm/configs/agg_router.yaml | Add explicit CPU and memory resource fields to VllmWorker service |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ServiceInterface
    participant ServiceInfo
    participant ServiceConfig

    User->>ServiceInterface: Define service (with/without image/resources)
    ServiceInterface->>ServiceInfo: from_service(service)
    ServiceInfo->>ServiceInfo: Resolve image (service.config.image or DYNAMO_IMAGE)
    ServiceInfo->>ServiceConfig: Use resources field (with validated cpu/gpu/memory)
    ServiceInfo->>User: Return ServiceInfo instance

Possibly related PRs

  • ai-dynamo/dynamo#1240: Updates the same version strings and dependencies, indicating a directly related version bump and dependency update.
  • ai-dynamo/dynamo#1255: Modifies the from_service method in the same file to handle Docker image logic and resource field corrections.

Suggested labels

size/M, fix

Suggested reviewers

  • ryanolson
  • paulhendricks
  • kkranen
  • tanmayv25
  • nnshah1
  • alec-flowers
  • GuanLuo
  • ptarasiewiczNV
  • oandreeva-nv
  • rmccorm4

Poem

Hopping through fields of versioned delight,
The rabbit updates numbers, making things right.
CPUs and GPUs, resources anew,
Docker and YAML get a fresh view.
With tests and configs all in a row,
This patch brings a harmonious flow!
🐇✨


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/support_matrix.md (1)

71-71: Minor formatting: Add blank line around table.

The static analysis tool suggests adding a blank line around the table for better markdown formatting.

| **Build Dependency** | **Version** |
|----------------------|-------------|
+
| **Base Container**   |    [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags)    |
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

71-71: Tables should be surrounded by blank lines
null

(MD058, blanks-around-tables)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ea0830 and 5b980db.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
  • lib/runtime/examples/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • Cargo.toml (2 hunks)
  • container/Dockerfile.vllm (1 hunks)
  • container/build.sh (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1 hunks)
  • deploy/sdk/src/dynamo/sdk/cli/build.py (1 hunks)
  • deploy/sdk/src/dynamo/sdk/core/lib.py (0 hunks)
  • deploy/sdk/src/dynamo/sdk/core/protocol/interface.py (2 hunks)
  • deploy/sdk/src/dynamo/sdk/tests/test_resources.py (2 hunks)
  • docs/support_matrix.md (1 hunks)
  • examples/llm/configs/agg_router.yaml (1 hunks)
  • lib/bindings/python/Cargo.toml (2 hunks)
  • lib/llm/Cargo.toml (1 hunks)
  • lib/runtime/examples/Cargo.toml (1 hunks)
  • pyproject.toml (2 hunks)
💤 Files with no reviewable changes (1)
  • deploy/sdk/src/dynamo/sdk/core/lib.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
deploy/sdk/src/dynamo/sdk/tests/test_resources.py (1)
deploy/sdk/src/dynamo/sdk/core/protocol/interface.py (2)
  • ServiceInterface (105-198)
  • config (116-118)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)
deploy/cloud/operator/internal/consts/consts.go (1)
  • KubeLabelDynamoComponent (47-47)
deploy/cloud/operator/internal/dynamo/graph.go (1)
  • ComponentTypePlanner (48-48)
🪛 markdownlint-cli2 (0.17.2)
docs/support_matrix.md

71-71: Tables should be surrounded by blank lines
null

(MD058, blanks-around-tables)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
🔇 Additional comments (22)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

85-85:

Details

❌ Incorrect review comment

Verify the component type casing consistency.

There's a potential inconsistency with the component type constant. The new constant ComponentTypePlanner = "Planner" uses capital "P", but there's already a similar constant ComponentTypePlanner = "planner" (lowercase) in deploy/cloud/operator/internal/dynamo/graph.go line 48.

Please verify which casing is correct for the component type identification:


🏁 Script executed:

#!/bin/bash
# Search for component type usages and labels to determine the correct casing
echo "=== Searching for Planner component type references ==="
rg -i "planner" --type go -A 2 -B 2

echo -e "\n=== Searching for component type label usage ==="
rg "KubeLabelDynamoComponent.*[Pp]lanner" --type go -A 2 -B 2

echo -e "\n=== Searching for component type constants ==="
ast-grep --pattern $'ComponentTypePlanner = $_'

Length of output: 13159


Ignore the casing discrepancy – it’s intentional

The two ComponentTypePlanner constants serve different contexts and must remain as-is:

  • In deploy/cloud/operator/internal/dynamo/graph.go (and its tests), ComponentTypePlanner = "planner" (lowercase) is used for Kubernetes label values (commonconsts.KubeLabelDynamoComponent).
  • In deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (and related JSON‐based DYN_DEPLOYMENT_CONFIG), ComponentTypePlanner = "Planner" (capitalized) maps to configuration keys and command-line flags.

No changes required.

Likely an incorrect or invalid review comment.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)

937-937: Correct test update reflecting the conditional environment argument logic.

The test correctly removes the --test-lws-deploy-service.environment=kubernetes argument from the expected command since the test deployment doesn't have the Planner component type label. This properly validates the new conditional behavior implemented in the controller.

deploy/sdk/src/dynamo/sdk/core/protocol/interface.py (3)

66-70: LGTM! Good validation for GPU field.

The validator ensures consistent string handling for GPU values, converting integers to strings when needed. This prevents type inconsistencies.
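
A minimal sketch of such a validator, assuming Pydantic v2; the field names and defaults follow the summary in this review, but the actual class in `interface.py` may differ:

```python
from typing import Optional

from pydantic import BaseModel, field_validator

class ResourceConfig(BaseModel):
    # String-typed resource fields with the updated defaults.
    cpu: str = "1"
    memory: str = "500Mi"
    gpu: Optional[str] = None  # default is an assumption

    @field_validator("gpu", mode="before")
    @classmethod
    def gpu_as_string(cls, value):
        # Accept an int for convenience but normalize it to a string so the
        # value matches the operator's resource spec format.
        return str(value) if isinstance(value, int) else value

# Integers passed for gpu are coerced to strings: gpu=1 -> "1"
assert ResourceConfig(gpu=1).gpu == "1"
```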


77-77:

Details

✅ Verification successful

Verify field rename from 'resource' to 'resources' across codebase.

The field has been renamed from resource to resources in ServiceConfig. This is a breaking change that requires updates in all code that accesses this field.


🏁 Script executed:

#!/bin/bash
# Description: Check for usage of the old 'resource' field name that needs updating

# Search for any code accessing the old 'resource' field
rg "\.resource\b" --type py -A 2 -B 2

# Search for any remaining references to old field in tests or other files
ast-grep --pattern $'$_.resource'

Length of output: 393


No outdated .resource references found
A search for .resource returned only an import of the resource module, with no attribute accesses using the old field name. The rename to resources has been applied consistently across the codebase.


62-64:

Details

✅ Verification successful

Verify resource type changes don't break existing code.

The resource fields have been changed from mixed types to all strings with updated defaults:

  • cpu: Changed from int (default 1) to str (default "1")
  • memory: Default increased from "100Mi" to "500Mi"
  • gpu: Now enforced as string type with validator

These type changes could break existing code that expects integer CPU values or relies on the old memory default.


🏁 Script executed:

#!/bin/bash
# Description: Check for any code that might be affected by the resource type changes

# Search for direct access to cpu field that might expect integer
rg -A 3 -B 3 "\.cpu\s*=" --type py

# Search for any hardcoded references to the old memory default
rg "100Mi" --type py

# Search for potential integer assignments to cpu field
ast-grep --pattern $'$_.cpu = $INT'

Length of output: 1405


ResourceConfig string-based fields verified

  • Tests in deploy/sdk/src/dynamo/sdk/tests/test_resources.py already assert string values ("2", "1", "4Gi"), confirming consumer code expects strings.
  • No occurrences of the old "100Mi" default were found in the codebase.
  • Deployment logic in deploy/sdk/src/dynamo/sdk/core/protocol/deployment.py strips and validates string CPU formats.
  • The gpu validator aligns with existing usage of string values.
  • The ServiceConfig.resource → resources rename is consistently applied (tests reference .resources).

No breaking changes detected.

examples/llm/configs/agg_router.yaml (1)

43-44: LGTM! Resource specifications align with new schema.

The explicit CPU and memory resource specifications align well with the updated ResourceConfig schema. The allocations (10 CPUs, 20Gi memory, 1 GPU) are reasonable for a VLLM worker service.
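
For illustration only, a hedged sketch of how these string values could map onto a Kubernetes-style requests/limits block; the helper name and layout are assumptions, not part of the SDK:

```python
def to_k8s_resources(cpu: str, memory: str, gpu: str = "0") -> dict:
    """Map string resource values to a Kubernetes-style resource spec."""
    requests = {"cpu": cpu, "memory": memory}
    limits = dict(requests)
    if gpu not in ("", "0"):
        limits["nvidia.com/gpu"] = gpu
    return {"requests": requests, "limits": limits}

# The VllmWorker allocation mentioned above: 10 CPUs, 20Gi memory, 1 GPU.
print(to_k8s_resources(cpu="10", memory="20Gi", gpu="1"))
```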

pyproject.toml (2)

18-18: Version bump to 0.3.0 looks appropriate.

The minor version bump from 0.2.1 to 0.3.0 is consistent with the scope of changes including resource configuration updates and operator fixes.


32-32: Runtime dependency version correctly updated.

The ai-dynamo-runtime dependency version is properly updated to match the project version, ensuring compatibility.

lib/runtime/examples/Cargo.toml (1)

24-24: Version consistency maintained across workspace.

The version bump to 0.3.0 maintains consistency with the Python package and overall project versioning.

lib/llm/Cargo.toml (1)

84-84: LGTM - Dependency version update aligns with workspace version bump.

The nixl-sys dependency version has been updated from 0.2.1-rc.3 to 0.3.0-rc.2, which aligns with the broader workspace version bump to 0.3.0. The change from rc.3 to rc.2 is expected as this represents a new major/minor version series with its own release candidate numbering.

lib/bindings/python/Cargo.toml (2)

22-22: LGTM - Version bump aligns with workspace updates.

The version update from 0.2.1 to 0.3.0 is consistent with the coordinated workspace version bump across all related packages.


78-78: Good addition of trailing newline.

Adding the trailing newline improves file formatting consistency.

Cargo.toml (2)

31-31: LGTM - Workspace version bump to 0.3.0.

The workspace version has been properly updated from 0.2.1 to 0.3.0, coordinating the version bump across the entire project.


42-44: LGTM - Local dependency versions updated consistently.

All local workspace dependencies (dynamo-runtime, dynamo-llm, dynamo-tokens) have been updated consistently to version 0.3.0, ensuring proper version alignment across the workspace.

deploy/sdk/src/dynamo/sdk/tests/test_resources.py (2)

19-19: Good addition of ServiceInterface import.

Adding the ServiceInterface import enables proper type annotation and interface-based testing, which improves type safety and code clarity.


44-48: LGTM - Improved test assertions with proper typing.

The updates improve the test in several ways:

  • Proper type annotation using ServiceInterface provides better type safety
  • Updated assertions using dyn_svc.config correctly test the interface contract
  • Resource value assertions verify that the decorator properly configures CPU ("2"), GPU ("1"), and memory ("4Gi") as string values
  • The change to string-based resource values aligns with the updated resource configuration model (a sketch of such an assertion follows below)
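
A hedged sketch of what such an assertion might look like; the decorator-configured service fixture (`dyn_svc`) and attribute names are assumptions based on this summary, not the repository's exact test code:

```python
def test_resource_values_are_strings(dyn_svc):
    # Values configured via the service decorator should come back as strings.
    resources = dyn_svc.config.resources
    assert resources.cpu == "2"
    assert resources.gpu == "1"
    assert resources.memory == "4Gi"
```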
container/Dockerfile.vllm (1)

172-172: LGTM! Version bump is coordinated with other package updates.

The version update from "0.8.4.post1" to "0.8.4.post2" is consistent with the overall project version bump and aligns with documentation updates.

container/build.sh (1)

112-112: LGTM! NIXL commit hash update aligns with dependency version bump.

The commit hash update corresponds to the nixl-sys dependency upgrade and follows the coordinated version bump strategy.

docs/support_matrix.md (1)

65-69: LGTM! Coordinated version updates are consistent across all packages.

All version updates follow the coordinated release pattern and align with changes in other files:

  • ai-dynamo and related packages: 0.2.1 → 0.3.0
  • ai-dynamo-vllm: 0.8.4.post1 → 0.8.4.post2
  • NIXL: 0.2.1 → 0.3.0

Also applies to: 74-74, 76-76, 79-79

deploy/sdk/src/dynamo/sdk/cli/build.py (3)

132-136: LGTM! Improved image resolution with proper fallback logic.

The new image resolution logic provides flexibility by checking service.config.image first, then falling back to the global DYNAMO_IMAGE constant. The assertion ensures that an image is always specified, improving robustness.


143-143: LGTM! Using resolved image variable improves code clarity.

Using the image variable instead of directly accessing service.config.image improves code clarity and ensures the fallback logic is properly applied.


141-141:

Details

✅ Verification successful

Verify the field name change is consistent across the codebase.

The field name changed from resource to resources. This appears to be part of a schema update, but please ensure this change is consistent across all files that reference this field.


🏁 Script executed:

#!/bin/bash
# Description: Verify all references to the resource/resources field are updated consistently
# Expected: All references should use 'resources' not 'resource'

# Search for any remaining references to the old field name
rg -A 2 -B 2 "\.resource\b" --type py

# Search for the new field name usage
rg -A 2 -B 2 "\.resources\b" --type py

Length of output: 1265


Field rename consistency verified

All references to the config field now use .resources. The only remaining resource occurrences are module imports in allocator.py (from dynamo.sdk.lib.resource import …), not attribute accesses. No further changes needed.

Comment on lines +1458 to +1460
if componentType, exists := opt.dynamoComponentDeployment.Labels[commonconsts.KubeLabelDynamoComponent]; exists && componentType == ComponentTypePlanner {
args = append(args, fmt.Sprintf("--%s.environment=%s", opt.dynamoComponentDeployment.Spec.ServiceName, KubernetesDeploymentStrategy))
}

💡 Verification agent

🧩 Analysis chain

Good implementation of conditional environment argument injection.

The logic correctly restricts the --environment=kubernetes argument to only Planner components as specified in the PR objectives. The implementation properly checks the component type before adding the argument.

Please verify that KubeLabelDynamoComponent contains the component type (not just the component name) for proper matching:


🏁 Script executed:

#!/bin/bash
# Verify the label contains component type information
echo "=== Checking KubeLabelDynamoComponent usage ==="
rg "KubeLabelDynamoComponent" --type go -A 3 -B 3

echo -e "\n=== Checking how component labels are set ==="
ast-grep --pattern $'labels[commonconsts.KubeLabelDynamoComponent] = $_'

Length of output: 22753


Use the correct label key for component-type checks

The code is currently using nvidia.com/dynamo-component (which holds the component name) to decide when to inject the Planner environment flag. It should instead read nvidia.com/dynamo-component-type (which holds the component type). Please update the conditional as follows:

--- a/deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
+++ b/deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
@@ -1457,7 +1457,7 @@ func (r *DynamoComponentDeploymentReconciler) reconcile(...) {
     }
-    if componentType, exists := opt.dynamoComponentDeployment.Labels[commonconsts.KubeLabelDynamoComponent]; exists && componentType == ComponentTypePlanner {
+    if componentType, exists := opt.dynamoComponentDeployment.Labels[commonconsts.KubeLabelDynamoComponentType]; exists && componentType == ComponentTypePlanner {
         args = append(args, fmt.Sprintf("--%s.environment=%s", opt.dynamoComponentDeployment.Spec.ServiceName, KubernetesDeploymentStrategy))
     }
 }

• File: deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
• Lines: 1458–1460

This change ensures the Planner-only --environment=kubernetes flag is tied to the type label, not the name label.

🤖 Prompt for AI Agents
In
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go
around lines 1458 to 1460, the code incorrectly uses the label key for component
name (commonconsts.KubeLabelDynamoComponent) to check the component type. Update
the conditional to use the correct label key
commonconsts.KubeLabelDynamoComponentType that holds the component type. This
ensures the environment argument is only injected for Planner components based
on their type, not their name.

@nv-anants nv-anants merged commit 799748b into release/0.3.0 May 30, 2025
15 checks passed
@nv-anants nv-anants deleted the bis/dep-130-cp-fixes branch May 30, 2025 19:15