
Conversation


@tmonty12 tmonty12 commented Sep 24, 2025

Overview:

This PR adds a subComponentType field to the DGD service API so the planner component can determine which prefill and decode services (components) to scale.

Details:

  • benchmarks/profiler: updates the profiler to use subComponentType when modifying configs for prefill/decode profiling. Retains backward compatibility with the hardcoded component names per backend framework.
  • For all disagg_planner.yaml files:
    • Remove the Prometheus service definitions in favor of the kube-prometheus-stack Kubernetes installation.
    • Add the subComponentType field.
  • components/planner/src/dynamo/planner:
    • updates kubernetes_connector.py to use the subComponentType field, with fallback logic for the hardcoded component names (a sketch of this resolution idea follows this list).
    • updates planner_core.py to verify that the DGD has both a prefill and a decode component defined (via subComponentType or the hardcoded component names) before starting the planner loop, so we fail fast when the DGD is invalid from the planner's perspective.
  • deploy/cloud: defines subComponentType as an optional field in the API and adds a sub-component-type label to downstream k8s resources.
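
A minimal sketch of the resolution idea with a fallback to hardcoded names. resolve_service and FALLBACK_NAMES are hypothetical stand-ins for illustration, not the actual kubernetes_connector.py API:

from enum import Enum


class SubComponentType(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


# Hypothetical fallback map of the hardcoded component names per backend framework.
FALLBACK_NAMES = {
    SubComponentType.PREFILL: "PrefillWorker",
    SubComponentType.DECODE: "DecodeWorker",
}


def resolve_service(services: dict, target: SubComponentType) -> str:
    """Return the service whose subComponentType matches, else fall back to the hardcoded name."""
    matches = [
        name
        for name, svc in services.items()
        if svc.get("subComponentType") == target.value
    ]
    if len(matches) > 1:
        raise ValueError(f"multiple services declare subComponentType={target.value}")
    if matches:
        return matches[0]
    fallback = FALLBACK_NAMES[target]
    if fallback not in services:
        raise KeyError(f"no service found for {target.value}")
    return fallback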

Edited:

  • Dockerfile*: removes the prometheus dependency, since components/planner/src/dynamo/planner/prometheus.py has been removed. That module started a Prometheus server as a DGD component; it has been deprecated in favor of the cluster-wide Prometheus Helm installation.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Introduced explicit sub-component types (prefill/decode) across deployments and planner for more precise scaling and targeting.
    • Planner now supports configurable adjustment intervals and readiness verification with clearer error messages.
    • Added environment override for Prometheus endpoint.
  • Documentation
    • Updated SLA planner guide: interval-based behavior, kube-prometheus-stack prerequisite, troubleshooting for unknown subComponentType.
  • Tests
    • Expanded coverage for planner, Kubernetes connector, and configuration modifiers.
  • Chores
    • Deployment manifests and CRDs updated to include subComponentType and main container; streamlined Prometheus blocks and added pod labels.


copy-pr-bot bot commented Sep 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Sep 24, 2025
@tmonty12 tmonty12 marked this pull request as ready for review September 24, 2025 17:46
@tmonty12 tmonty12 requested review from a team as code owners September 24, 2025 17:46

coderabbitai bot commented Sep 24, 2025

Walkthrough

Introduces subComponentType across CRDs, operator, deployments, and planner code. Refactors planner and Kubernetes connector to resolve services by sub-component type (prefill/decode), updates Kubernetes API methods, adds new exceptions, and adjusts profiler utilities and tests to be target-aware. Removes embedded Prometheus blocks and adds optional env override for Prometheus endpoint.

Changes

Cohort / File(s) Summary of changes
Profiler entrypoint changes
benchmarks/profiler/profile_sla.py
Updated calls to config modifiers: pass target mode ("prefill"/"decode") and full context to KV-cache log parsing.
Profiler config utilities
benchmarks/profiler/utils/config.py
Added subComponentType to Service model; introduced target-aware APIs; added helpers for log path and service resolution; refactored worker resolution, model/port retrieval, and KV-cache parsing.
Profiler tests
benchmarks/profiler/tests/config_test.py
New comprehensive tests and fixtures for VLLM/SGLang/TRTLLM covering model name, conversions, TP sizing, arg translation, and service naming with/without subComponentType.
Backend deployments (YAML)
components/backends/vllm/deploy/disagg_planner.yaml, components/backends/sglang/deploy/disagg_planner.yaml, components/backends/trtllm/deploy/disagg_planner.yaml
Added subComponentType to prefill/decode workers; removed Prometheus blocks/envs from planner manifests.
Planner exports and defaults
components/planner/src/dynamo/planner/__init__.py, .../defaults.py
Exported TargetReplica/SubComponentType; added PROMETHEUS_ENDPOINT env override.
Kubernetes API adapter
components/planner/src/dynamo/planner/kube.py
Reworked API: sync get_graph_deployment(name), update_graph_replicas(...), is_deployment_ready(deployment); adjusted wait loop; integrated logging and not-found exception.
Planner exceptions
components/planner/src/dynamo/planner/utils/exceptions.py
New exception hierarchy for deployment and sub-component resolution (not found, duplicates, validation, empty targets).
Kubernetes connector (planner)
components/planner/src/dynamo/planner/kubernetes_connector.py
Introduced SubComponentType, Service, TargetReplica; methods now operate on sub-component types; added readiness wait, verification, and service resolution by sub-component/name.
Planner core and virtual connector
components/planner/src/dynamo/planner/utils/planner_core.py, .../virtual_connector.py
Planner uses TargetReplica and verifies components/readiness; virtual connector updated to target-based APIs and adds readiness/verification helpers.
Planner tests
components/planner/test/kube.py, components/planner/test/kubernetes_connector.py
Updated to new APIs/types and exceptions; added coverage for not-found cases, duplicates, readiness, and target-replica operations.
Operator CRDs (Helm/base)
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml, .../nvidia.com_dynamographdeployments.yaml, deploy/cloud/operator/config/crd/bases/*
Added mainContainer (component CRD) and subComponentType fields; updated descriptions; propagated schema updates to bases.
Operator API/types and consts
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go, deploy/cloud/operator/internal/consts/consts.go
Added SubComponentType to shared spec; added KubeLabelDynamoSubComponentType constant.
Operator controllers and tests
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go, .../dynamocomponentdeployment_controller_test.go, .../dynamographdeployment_controller.go
Labeled pods with sub-component type when set; updated tests to expect label; improved status logging; adjusted reconcile flow.
Operator graph builder and tests
deploy/cloud/operator/internal/dynamo/graph.go, .../graph_test.go
Applied sub-component label to resources when present; updated tests to set/verify labels and spec field.
Docs
docs/kubernetes/sla_planner_deployment.md
Documented interval-driven planner loop, Prometheus prerequisites/commands, and CRD subComponentType troubleshooting.
Planner test configs (YAML)
tests/planner/perf_test_configs/disagg_8b_planner.yaml, tests/planner/scaling/disagg_planner.yaml
Removed Prometheus sections; added subComponentType to VLLM workers.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Profiler as Profiler
  participant Config as Config Modifiers
  participant FS as Logs FS
  note over Profiler,Config: Target-aware profiling (prefill/decode)
  Profiler->>Config: set_config_tp_size(config, tp, target)
  Profiler->>Config: get_kv_cache_size_from_dynamo_log(config, work_dir, name, target)
  Config->>Config: Resolve service by subComponentType (target)
  Config->>FS: get_dynamo_log_path(work_dir, name, resolved_service)
  FS-->>Config: log path
  Config-->>Profiler: KV cache size
sequenceDiagram
  autonumber
  participant Planner as SLA Planner
  participant K8sConn as KubernetesConnector
  participant KubeAPI as KubernetesAPI
  note over Planner,K8sConn: Sub-component-based scaling
  Planner->>K8sConn: verify_prefill_and_decode_components_exist(...)
  K8sConn->>KubeAPI: get_graph_deployment(name)
  KubeAPI-->>K8sConn: deployment(spec.services[].subComponentType)
  K8sConn-->>Planner: OK or raise
  Planner->>K8sConn: set_component_replicas([TargetReplica(PREFILL), TargetReplica(DECODE)])
  K8sConn->>KubeAPI: update_graph_replicas(name, component_name, replicas)
  KubeAPI-->>K8sConn: patched
  K8sConn->>KubeAPI: get_graph_deployment(name)
  K8sConn->>KubeAPI: is_deployment_ready(deployment)
  KubeAPI-->>K8sConn: ready?
  K8sConn-->>Planner: scaling complete

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

I hop between prefill and decode lanes,
Tagging pods with tidy names.
Sub-components shine, neatly aligned,
Planner taps the beats in time.
Logs in burrows, metrics in sight—
Scale with a twitch, then goodnight.
(\_/) ✅

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.81% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check ⚠️ Warning The pull request description follows the template by including Overview, Details, and Related Issues sections, but the “Where should the reviewer start?” section is left at the placeholder without listing any specific files or directories, so reviewers lack guidance on which parts of the changes to inspect first. Please update the “Where should the reviewer start?” section to include the key files and directories to review first, for example the updated profiler utilities in benchmarks/profiler, the planner connector and core changes under components/planner, and the disagg_planner.yaml updates in each backend’s deploy folder.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly and concisely summarizes the primary change by stating the addition of the subComponentType field in the DGD API and its integration (“uptake”) in the planner, matching the core objective of the pull request. It avoids extraneous details and uses direct, actionable language.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
docs/kubernetes/sla_planner_deployment.md (1)

52-60: Remove Prometheus from expected pods; it now runs under kube-prometheus-stack in monitoring namespace

Docs say Prometheus blocks were removed in favor of kube-prometheus-stack. The expected pods list still shows a Prometheus pod in the app namespace, which is misleading.

Apply:

 vllm-disagg-planner-frontend-*            1/1 Running
-vllm-disagg-planner-prometheus-*          1/1 Running
 vllm-disagg-planner-planner-*             1/1 Running
 vllm-disagg-planner-backend-*             1/1 Running
 vllm-disagg-planner-prefill-*             1/1 Running
+# Prometheus is provided by kube-prometheus-stack in the "monitoring" namespace and is not expected here.
components/planner/src/dynamo/planner/utils/planner_core.py (1)

395-420: Potential UnboundLocalError: next_num_p/next_num_d may be undefined when predictions are None.

If predict_load() returns any None, the compute block is skipped and we later reference undefined variables in target_replicas.

Apply:

         next_num_req, next_isl, next_osl = self.predict_load()
 
-        if next_num_req is not None and next_isl is not None and next_osl is not None:
+        if next_num_req is None or next_isl is None or next_osl is None:
+            logger.info("Missing load predictions; skipping adjustment")
+            return
+        else:
             try:
                 next_num_p, next_num_d = self._compute_replica_requirements(
                     next_num_req, next_isl, next_osl
                 )
             except Exception as e:
                 logger.error(f"Failed to compute number of replicas: {e}")
                 return
 
         if not self.args.no_operation:
             # Ensure change for virtual connector does not break things
             target_replicas = [
                 TargetReplica(
                     sub_component_type=SubComponentType.PREFILL, 
                     component_name=self.prefill_component_name, 
                     desired_replicas=next_num_p
                 ),
                 TargetReplica(
                     sub_component_type=SubComponentType.DECODE, 
                     component_name=self.decode_component_name, 
                     desired_replicas=next_num_d
                 ),
             ]
components/planner/src/dynamo/planner/kubernetes_connector.py (1)

221-234: Fix CLI: map --component to SubComponentType and correct default.

add_component/remove_component expect SubComponentType, but the CLI passes a string and defaults to "planner" (invalid). This will raise AttributeError at runtime.

Apply this diff:

 parser = argparse.ArgumentParser()
 parser.add_argument("--dynamo_namespace", type=str, default="dynamo")
 parser.add_argument("--k8s_namespace", type=str, default="default")
 parser.add_argument("--action", type=str, choices=["add", "remove"])
-parser.add_argument("--component", type=str, default="planner")
+parser.add_argument(
+    "--component",
+    type=str,
+    choices=[t.value for t in SubComponentType],
+    default=SubComponentType.PREFILL.value,
+    help="Target sub-component to scale",
+)
 parser.add_argument("--blocking", action="store_true")
 args = parser.parse_args()
 connector = KubernetesConnector(args.dynamo_namespace, args.k8s_namespace)

 if args.action == "add":
-    task = connector.add_component(args.component, args.blocking)
+    task = connector.add_component(SubComponentType(args.component), args.blocking)
 elif args.action == "remove":
-    task = connector.remove_component(args.component, args.blocking)
+    task = connector.remove_component(SubComponentType(args.component), args.blocking)
 asyncio.run(task)
components/planner/test/kube.py (3)

141-146: Patching async method with a non‑awaitable will fail

get_graph_deployment is async; patching it to return a dict makes await blow up. Use AsyncMock.

Apply this diff:

-    with patch.object(
-        k8s_api, "get_graph_deployment", return_value=mock_deployment
-    ):
+    with patch.object(k8s_api, "get_graph_deployment", new_callable=AsyncMock) as mock_get:
+        mock_get.return_value = mock_deployment
         # Test with minimal attempts and delay for faster testing
         await k8s_api.wait_for_graph_deployment_ready(
             "test-deployment", max_attempts=2, delay_seconds=0.1
         )

Also add AsyncMock to imports:

# at top of file
from unittest.mock import MagicMock, patch, AsyncMock

167-173: Same AsyncMock issue in timeout test

Patch the async method with AsyncMock.

Apply this diff:

-    with patch.object(
-        k8s_api, "get_graph_deployment", return_value=mock_deployment
-    ):
+    with patch.object(k8s_api, "get_graph_deployment", new_callable=AsyncMock) as mock_get:
+        mock_get.return_value = mock_deployment
         # Test with minimal attempts and delay for faster testing
         with pytest.raises(TimeoutError) as exc_info:
             await k8s_api.wait_for_graph_deployment_ready(
                 "test-deployment", max_attempts=2, delay_seconds=0.1
             )

202-208: Async patching needed here as well

Use AsyncMock to patch get_graph_deployment.

Apply this diff:

-    with patch.object(
-        k8s_api, "get_graph_deployment", return_value=mock_deployment
-    ):
+    with patch.object(k8s_api, "get_graph_deployment", new_callable=AsyncMock) as mock_get:
+        mock_get.return_value = mock_deployment
         # Test with minimal attempts and delay for faster testing
         with pytest.raises(TimeoutError) as exc_info:
             await k8s_api.wait_for_graph_deployment_ready(
                 "test-deployment", max_attempts=2, delay_seconds=0.1
             )
components/planner/src/dynamo/planner/virtual_connector.py (1)

306-312: Raise the standardized EmptyTargetReplicasError

Aligns with the KubernetesConnector and public exception API.

-        if not target_replicas:
-            raise ValueError("target_replicas cannot be empty")
+        if not target_replicas:
+            raise EmptyTargetReplicasError()

Also add the import:

-from dynamo.planner.utils.exceptions import DeploymentValidationError
+from dynamo.planner.utils.exceptions import DeploymentValidationError, EmptyTargetReplicasError
benchmarks/profiler/utils/config.py (3)

243-244: Guard optional flag removal.

.remove raises if the flag is absent.

Apply this diff:

-            args.remove("--is-prefill-worker")
+            if "--is-prefill-worker" in args:
+                args.remove("--is-prefill-worker")

400-427: Possible UnboundLocalError in except; brittle parsing for concurrency line.

line may be undefined if file open fails; avoid referencing it in the except. Also avoid slicing [:-1] to strip trailing chars.

Apply this diff:

-        try:
+        try:
             with open(dynamo_log_fn, "r") as f:
                 for line in f:
                     if "Maximum concurrency for" in line:
                         line = line.strip().split("Maximum concurrency for ")[1]
                         token_count = int(
                             line.split(" tokens per request: ")[0].replace(",", "")
                         )
-                        concurrency = float(line.split(" tokens per request: ")[1][:-1])
+                        # Strip any trailing punctuation/percent symbols/spaces
+                        concurrency_str = line.split(" tokens per request: ")[1].strip().rstrip("%). ")
+                        concurrency = float(concurrency_str)
@@
-        except Exception as e:
-            logger.warning(
-                f"Failed to parse KV cache size from line: {line}. Error: {e}"
-            )
+        except Exception as e:
+            logger.warning(f"Failed to parse KV cache size from file {dynamo_log_fn}. Error: {e}")

687-725: Ensure TRTLLMWorker exists after conversion; fail fast if neither worker present.

When neither prefill/decode is found, accessing cfg.spec.services["TRTLLMWorker"] will crash.

Apply this diff:

-            worker_service = cfg.spec.services["TRTLLMWorker"]
+            if "TRTLLMWorker" not in cfg.spec.services:
+                raise ValueError("Missing TRT-LLM worker service after conversion; ensure prefill or decode worker exists")
+            worker_service = cfg.spec.services["TRTLLMWorker"]
🧹 Nitpick comments (46)
components/planner/src/dynamo/planner/defaults.py (1)

57-62: Clarify expected PROMETHEUS_ENDPOINT format and ensure consistency

Confirm whether PROMETHEUS_ENDPOINT must include a scheme (http/https). Current logic returns a full URL when the env var is set, but the fallback path may return a bare hostname (without scheme), which could be inconsistent at call sites.

Would you like a follow-up patch to normalize the return value to a full URL in all branches, or to validate/auto-prepend a default scheme (e.g., http://) when missing?
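
A minimal sketch of one possible normalization. The helper name normalize_prometheus_endpoint and the default host are assumptions, not the current defaults.py API:

import os


def normalize_prometheus_endpoint(default_host: str = "prometheus-server") -> str:
    # Hypothetical helper: prefer the env override, then fall back to a bare hostname.
    endpoint = os.environ.get("PROMETHEUS_ENDPOINT", default_host)
    # Auto-prepend a default scheme when missing so call sites always get a full URL.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = f"http://{endpoint}"
    return endpoint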

docs/kubernetes/sla_planner_deployment.md (5)

12-13: Clarify scrape vs. adjustment interval relationship to avoid empty query windows

State that the planner’s adjustment interval should be >= Prometheus scrape interval (with small slack). Otherwise, planner queries may hit gaps and produce unstable scaling.

You can insert a short NOTE right after the paragraph below Line 17:

 The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found [here](/components/planner/src/dynamo/planner/defaults.py#L31).
 
+> Note
+> Ensure the planner adjustment interval is greater than or equal to the Prometheus scrape interval (plus a few seconds of slack) to avoid querying empty/partial windows and to stabilize scaling decisions.

16-17: Use descriptive link text and avoid fragile line anchors

Replace “here” with descriptive text and drop the line anchor to prevent link rot when the file changes.

Apply:

-The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found [here](/components/planner/src/dynamo/planner/defaults.py#L31).
+The adjustment interval can be defined in the planner manifest as an argument. See the default value in the [planner defaults (defaults.py)](/components/planner/src/dynamo/planner/defaults.py).

This also addresses markdownlint MD059.


16-17: Standardize terminology: decode vs backend

Elsewhere (Lines 92–98) logs say “decode workers,” while the diagram and Components list reference “backend.” Standardize on one term (ideally “decode”) or explicitly note that “backend” equals “decode” for vLLM to avoid confusion.


109-111: Parameterize Prometheus service name; it depends on Helm release

Service naming for kube-prometheus-stack varies by release (e.g., -kube-prometheus-prometheus). Provide discovery + port-forward steps.

Apply:

-# Verify Prometheus is accessible
-kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
-curl "http://localhost:9090/api/v1/query?query=up"
+# Verify Prometheus is accessible
+# Discover the Prometheus service name (replace monitoring if you use a different namespace)
+kubectl -n monitoring get svc | grep prometheus
+# Port-forward (replace RELEASE with your Helm release name)
+kubectl -n monitoring port-forward svc/RELEASE-kube-prometheus-prometheus 9090:9090
+curl "http://localhost:9090/api/v1/query?query=up"

126-133: subComponentType troubleshooting: verify minimum CRD version, fix filename, and improve link text

  • The filename in the error (“disagg.yaml”) should match earlier instructions (“disagg_planner.yaml”).
  • “here” link text violates MD059; use descriptive text.
  • “> 0.5.0” may be inaccurate; please confirm the exact minimum CRD version that introduces subComponentType and update the docs.

Apply:

-Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
+Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg_planner.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
-This is because the `subComponentType` field has only been added in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). You can upgrade the CRD version by following the instructions [here](/docs/kubernetes/installation_guide.md).
+This occurs because the `subComponentType` field exists only in newer versions of the DynamoGraphDeployment CRD (confirm the minimum version). Upgrade the CRD by following the [Kubernetes installation guide](/docs/kubernetes/installation_guide.md).

Please verify and update the exact minimum version in the sentence above. Optionally add a quick check:

# Check if subComponentType is present in your CRD schema
kubectl explain dynamographdeployment.spec.services.DecodeWorker | grep -i subComponentType || true
kubectl explain dynamographdeployment.spec.services.PrefillWorker | grep -i subComponentType || true
deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (2)

788-788: Grammar tweak: “allows overriding”

Tighten the wording for clarity and consistency.

-                        description: Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive)
+                        description: Dynamo namespace of the service (allows overriding the Dynamo namespace defined in annotations inside the Dynamo archive for this service)

918-920: Doc polish: grammar + expand “k8s” to “Kubernetes”; verify casing of mainContainer

Use formal terminology and fix “allows to override”. Also, ensure the field is actually named “mainContainer” (lower camelCase) to avoid confusion.

-                          ExtraPodSpec allows to override the main pod spec configuration.
-                          It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
-                          that allows overriding the main container configuration.
+                          ExtraPodSpec allows overriding the main pod spec configuration.
+                          It is a standard Kubernetes PodSpec. It also contains a mainContainer (standard Kubernetes Container) field
+                          that allows overriding the main container configuration.

Please confirm the actual field name is “mainContainer”. If it’s “MainContainer” in the schema, keep the casing here.

components/planner/src/dynamo/planner/kubernetes_connector.py (1)

186-193: Remove unused variable ‘available_types’.

It’s computed but never used.

Apply this diff:

-        # Collect all available subComponentTypes for better error messages
-        available_types = []
         matching_services = []
         
         for curr_name, curr_service in services.items():
             service_sub_type = curr_service.get("subComponentType", "")
-            if service_sub_type:
-                available_types.append(service_sub_type)
             
             if service_sub_type == sub_component_type.value:
                 matching_services.append((curr_name, curr_service))
deploy/cloud/operator/internal/dynamo/graph.go (1)

995-997: Propagate sub-component label — consider also setting it on the parent DCD labels

Setting KubeLabelDynamoSubComponentType here is good. For consistency/queryability, also add this label when initially constructing the DynamoComponentDeployment labels so the DCD itself is labeled (not just pods/cliques). Example insertion in GenerateDynamoComponentsDeployments where other labels are set:

if component.SubComponentType != "" {
    labels[commonconsts.KubeLabelDynamoSubComponentType] = component.SubComponentType
}
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1)

1157-1160: Avoid mutating CRD metadata labels in-place

podLabels points to the CRD’s metadata labels map via getKubeLabels; adding the sub-component label mutates that base map. To reduce side effects, clone before augmenting pod labels:

// after: podLabels := r.getKubeLabels(opt.dynamoComponentDeployment)
podLabels = maps.Clone(podLabels)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)

76-78: Add validation for allowed sub-component values

Constrain values at the CRD level to prevent drift and ease client interoperability.

-    // SubComponentType indicates the sub-role of this component (for example, "prefill").
-    SubComponentType string `json:"subComponentType,omitempty"`
+    // SubComponentType indicates the sub-role of this component (for example, "prefill").
+    // +kubebuilder:validation:Enum=prefill;decode
+    SubComponentType string `json:"subComponentType,omitempty"`
deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (4)

686-686: Wording nit: "allows overriding"

Minor grammar fix to improve clarity.

-                  description: Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive)
+                  description: Dynamo namespace of the service (allows overriding the Dynamo namespace defined in annotations inside the Dynamo archive)

819-822: Clarify and align casing for mainContainer

Use the exact field name in the description to avoid confusion and make intent explicit.

-                    ExtraPodSpec allows to override the main pod spec configuration.
-                    It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
-                    that allows overriding the main container configuration.
+                    ExtraPodSpec allows overriding the main PodSpec configuration.
+                    It is a standard Kubernetes PodSpec with an additional "mainContainer" (standard k8s Container)
+                    field that allows overriding the main container configuration.

10243-10243: Align description with field name "serviceName"

Current text is ambiguous; clarify it refers to the Kubernetes Service name for this component.

-                  description: The name of the component
+                  description: The name of the Kubernetes Service for this component

10257-10259: Constrain subComponentType to known values (and add discoverability)

Add enum validation to prevent typos and consider exposing it as a printer column.

Apply enum to the schema:

                 subComponentType:
                   description: SubComponentType indicates the sub-role of this component (for example, "prefill").
                   type: string
+                  enum:
+                    - prefill
+                    - decode

Additionally, add a printer column (outside this hunk) for quick visibility when listing CRs:

@@
     - additionalPrinterColumns:
       - description: Dynamo component
         jsonPath: .spec.dynamoComponent
         name: DynamoComponent
         type: string
       - description: Available
         jsonPath: .status.conditions[?(@.type=='Available')].status
         name: Available
         type: string
+      - description: Sub-component role
+        jsonPath: .spec.subComponentType
+        name: SubComponent
+        type: string

Please confirm planner/operator currently accept only "prefill"/"decode"; if more values are expected imminently, we can switch to a regex pattern or defer the enum.

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1)

701-705: Use domain-like SubComponentType and add nil-case test

-                               SubComponentType: "test-sub-component",
+                               SubComponentType: "prefill",
  • Add a companion test case with SubComponentType unset to assert the sub-component label is omitted (backward compatibility).
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (3)

819-821: Guard against conflicting ExtraPodSpec overrides

ExtraPodSpec now documents a MainContainer override; add a CEL validation to prevent setting both mainContainer and containers simultaneously to avoid ambiguous precedence.

Apply this diff:

   extraPodSpec:
     description: |-
       ExtraPodSpec allows to override the main pod spec configuration.
       It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
       that allows overriding the main container configuration.
+    x-kubernetes-validations:
+      - rule: "!(has(self.mainContainer) && has(self.containers) && size(self.containers) > 0)"
+        message: "Set either mainContainer or containers, not both."

10243-10244: Clarify serviceName description

Avoid confusion with “component” vs K8s Service. Recommend clarifying the description.

-                serviceName:
-                  description: The name of the component
+                serviceName:
+                  description: The Kubernetes Service name for this component

10257-10260: Strengthen subComponentType schema (enum + clearer doc)

To prevent typos and align with planner expectations, restrict allowed values and clarify examples.

-                subComponentType:
-                  description: SubComponentType indicates the sub-role of this component (for example, "prefill").
-                  type: string
+                subComponentType:
+                  description: SubComponentType indicates the sub-role of this component (e.g., "prefill" or "decode").
+                  type: string
+                  enum:
+                    - prefill
+                    - decode

Optional: expose this in kubectl get output by adding an additionalPrinterColumn (outside this hunk):

  • Path: .spec.subComponentType
  • Name: SubComponentType
  • Type: string

Would you like me to propose the exact additionalPrinterColumns snippet in the versions block?

benchmarks/profiler/profile_sla.py (3)

155-156: Avoid cumulative mutation of config across TP loop

set_config_tp_size returns a new dict but you feed the previous iteration’s mutated config back in. It’s safer to start from the base per iteration to prevent argument duplication or drift across backends.

Consider keeping a base prefill_config = convert_config(config, "prefill") and, inside the loop, calling set_config_tp_size on that base to get an iteration‑local config.
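
Illustration of the iteration-local pattern, assuming the modifier appends flags to a shared config; the stand-in set_tp_size below is hypothetical, not the profiler's real set_config_tp_size:

import copy


def set_tp_size(cfg: dict, tp: int) -> dict:
    # Stand-in for config_modifier.set_config_tp_size: mutates and returns cfg.
    cfg["tensor_parallel_size"] = tp
    cfg.setdefault("extra_args", []).append(f"--tp={tp}")
    return cfg


base_config = {"model": "example"}  # built once, e.g. from convert_config(config, "prefill")
for tp in (1, 2, 4, 8):
    # Copy the base each iteration so appended args never accumulate across TP sizes.
    iteration_config = set_tp_size(copy.deepcopy(base_config), tp)
    print(iteration_config["extra_args"])  # always exactly one --tp flag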


281-282: Same concern for decode TP loop

Reuse the base decode_config from convert_config(config, "decode") each iteration to avoid unwanted accumulation.


610-611: Repeat the KV‑tokens fallback for decode interpolation

Mirror the earlier safeguard to avoid empty sweeps when log parsing yields 0.

Apply this diff:

-            max_kv_tokens = config_modifier.get_kv_cache_size_from_dynamo_log(
-                decode_config, work_dir, client.deployment_name, "decode"
-            )
+            max_kv_tokens = config_modifier.get_kv_cache_size_from_dynamo_log(
+                decode_config, work_dir, client.deployment_name, "decode"
+            ) or (args.isl + args.osl)
components/planner/test/kube.py (3)

250-250: Remove unused fixture argument

mock_custom_api isn’t used in this test.

Apply this diff:

-async def test_get_graph_deployment(k8s_api, mock_custom_api):
+async def test_get_graph_deployment(k8s_api):

264-264: Remove unused fixture argument

mock_custom_api isn’t used here either.

Apply this diff:

-async def test_get_graph_deployment_not_found(k8s_api, mock_custom_api):
+async def test_get_graph_deployment_not_found(k8s_api):

139-146: Minor API design note: async wrapper over sync client blocks the loop

get_graph_deployment calls a synchronous Kubernetes client; awaiting it doesn’t prevent event‑loop blocking. Consider making get_graph_deployment synchronous or running the sync call in a thread executor to avoid blocking.

Proposed change in dynamo/planner/kube.py (outside this diff):

# Option A: make it sync
def get_graph_deployment(self, graph_deployment_name: str) -> dict:
    try:
        return self._get_graph_deployment_from_name(graph_deployment_name)
    except client.ApiException as e:
        if e.status == 404:
            raise DynamoGraphDeploymentNotFoundError(
                deployment_name=graph_deployment_name, namespace=self.current_namespace
            )
        raise

# and in wait_for_graph_deployment_ready: call it without await

# Option B: keep async and offload
async def get_graph_deployment(self, graph_deployment_name: str) -> dict:
    loop = asyncio.get_running_loop()
    try:
        return await loop.run_in_executor(
            None, self._get_graph_deployment_from_name, graph_deployment_name
        )
    except client.ApiException as e:
        if e.status == 404:
            raise DynamoGraphDeploymentNotFoundError(
                deployment_name=graph_deployment_name, namespace=self.current_namespace
            )
        raise
deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (3)

788-788: Tighten wording for clarity (remove “allows to override”).

Suggested edit improves grammar and keeps meaning intact.

-                        description: Dynamo namespace of the service (allows to override the Dynamo namespace of the service defined in annotations inside the Dynamo archive)
+                        description: Dynamo namespace of the service (allows overriding the namespace defined via annotations in the Dynamo archive)

918-920: Clarify “ExtraPodSpec” text and fix “MainContainer” casing.

  • Use “allows overriding” instead of “allows to override”.
  • Refer to the field as “mainContainer” (camelCase) to match schema conventions.
-                          ExtraPodSpec allows to override the main pod spec configuration.
-                          It is a k8s standard PodSpec. It also contains a MainContainer (standard k8s Container) field
-                          that allows overriding the main container configuration.
+                          ExtraPodSpec allows overriding the main PodSpec configuration.
+                          It is a standard Kubernetes PodSpec and includes a mainContainer (standard Kubernetes Container) field
+                          for overriding the main container configuration.

10342-10342: Align terminology: “serviceName” should describe a service.

Current text says “component” which can be confusing given the field is serviceName.

-                        description: The name of the component
+                        description: The name of the service (component)
components/planner/test/kubernetes_connector.py (3)

35-39: Use Mock for sync method to avoid confusion

update_graph_replicas is synchronous. Using AsyncMock here is misleading. Switch to Mock() for accuracy and to avoid accidental awaits.

-    mock_api.update_graph_replicas = AsyncMock()
+    mock_api.update_graph_replicas = Mock()

59-65: Avoid brittle assertion on exact error string

Asserting exact message text is fragile. Check for the presence of the env var token instead.

-    assert set(exception.errors) == {"DYN_PARENT_DGD_K8S_NAME environment variable is not set"}
+    assert any("DYN_PARENT_DGD_K8S_NAME" in e for e in exception.errors)

184-189: Pass SubComponentType instead of a string

For clarity and type safety, pass SubComponentType (even though the exception is thrown before it’s used).

-    with pytest.raises(DynamoGraphDeploymentNotFoundError):
-        await kubernetes_connector.add_component(component_name)
+    with pytest.raises(DynamoGraphDeploymentNotFoundError):
+        await kubernetes_connector.add_component(SubComponentType.PREFILL)
components/planner/src/dynamo/planner/utils/exceptions.py (1)

120-122: Remove stray trailing comma in constructor

Minor cleanup: __init__(self, ) is odd; make it a standard no-arg signature.

-    def __init__(self, ):
+    def __init__(self):
         message = "target_replicas cannot be empty"
         super().__init__(message)
components/planner/src/dynamo/planner/kube.py (2)

104-138: Check readiness before sleeping; don’t delay the first attempt

Currently you sleep before the first readiness check, adding unnecessary latency. Check first, then sleep between retries.

-        for attempt in range(max_attempts):
-            await asyncio.sleep(delay_seconds)
-
-            graph_deployment = await self.get_graph_deployment(
-                graph_deployment_name
-            )
+        for attempt in range(1, max_attempts + 1):
+            graph_deployment = await self.get_graph_deployment(graph_deployment_name)
@@
-            logger.info(
-                f"[Attempt {attempt + 1}/{max_attempts}] "
+            logger.info(
+                f"[Attempt {attempt}/{max_attempts}] "
                 f"(status: {ready_condition.get('status') if ready_condition else 'N/A'}, "
                 f"message: {ready_condition.get('message') if ready_condition else 'no condition found'})"
             )
+            if attempt < max_attempts:
+                await asyncio.sleep(delay_seconds)

26-28: Avoid configuring logging at import-time

Calling configure_dynamo_logging() on import can have unintended side-effects. Prefer configuring logging in the application entrypoint and using logger = logging.getLogger(__name__) here.
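
A small sketch of the suggested split; configure_dynamo_logging is the project helper referenced above, and the entrypoint placement is illustrative:

# kube.py (library module): only create a named logger; no configuration at import time.
import logging

logger = logging.getLogger(__name__)

# application entrypoint (e.g. planner main): configure logging once, before using the library.
# from dynamo.runtime.logging import configure_dynamo_logging
# configure_dynamo_logging()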

components/planner/src/dynamo/planner/virtual_connector.py (1)

282-305: Accept SubComponentType and normalize internally

These helpers should accept SubComponentType for symmetry with KubernetesConnector. Keep backward compatibility by accepting str too.

-    async def add_component(self, worker_type: str, blocking: bool = True):
+    async def add_component(self, worker_type: str | SubComponentType, blocking: bool = True):
         """Add a component by increasing its replica count by 1"""
-        if worker_type == "prefill":
+        kind = worker_type.value if isinstance(worker_type, SubComponentType) else worker_type
+        if kind == "prefill":
             await self._update_scaling_decision(
                 num_prefill=self.num_prefill_workers + 1
             )
-        elif worker_type == "decode":
+        elif kind == "decode":
             await self._update_scaling_decision(num_decode=self.num_decode_workers + 1)
@@
-    async def remove_component(self, worker_type: str, blocking: bool = True):
+    async def remove_component(self, worker_type: str | SubComponentType, blocking: bool = True):
         """Remove a component by decreasing its replica count by 1"""
-        if worker_type == "prefill":
+        kind = worker_type.value if isinstance(worker_type, SubComponentType) else worker_type
+        if kind == "prefill":
             new_count = max(0, self.num_prefill_workers - 1)
             await self._update_scaling_decision(num_prefill=new_count)
-        elif worker_type == "decode":
+        elif kind == "decode":
             new_count = max(0, self.num_decode_workers - 1)
             await self._update_scaling_decision(num_decode=new_count)
benchmarks/profiler/tests/config_test.py (1)

376-448: Consider parametrizing repeated “convert_config + validate” flows

These blocks repeat across backends/targets. Parametrization would reduce duplication and improve maintainability.
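
One possible shape for the parametrization; fixture names like vllm_config/vllm_modifier and the final assertion are illustrative, not the existing fixtures in config_test.py:

import pytest


@pytest.mark.parametrize("backend", ["vllm", "sglang", "trtllm"])
@pytest.mark.parametrize("target", ["prefill", "decode"])
def test_convert_config(backend, target, request):
    # Hypothetical fixture names: one config and one modifier fixture per backend.
    config = request.getfixturevalue(f"{backend}_config")
    modifier = request.getfixturevalue(f"{backend}_modifier")
    converted = modifier.convert_config(config, target)
    assert converted, "conversion should produce a non-empty config"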

benchmarks/profiler/utils/config.py (9)

182-194: Custom exception for multi-match; resolves Ruff TRY003 and clarifies failure mode.

Replace the inline ValueError with a dedicated exception and move the message into the class.

Apply this diff:

-    if len(matching_service_names) > 1:
-        raise ValueError(f"Multiple services with subComponentType {sub_component_type} found")
+    if len(matching_service_names) > 1:
+        raise MultipleSubComponentTypeMatchError(sub_component_type)

Add this exception near the logger declarations:

class MultipleSubComponentTypeMatchError(ValueError):
    def __init__(self, sub_component_type: str):
        super().__init__(
            f"Multiple services with subComponentType '{sub_component_type}' found; expected at most one."
        )

60-60: Constrain subComponentType to known values.

Use a Literal or Enum to validate only "prefill" or "decode" and catch config typos early.

Example:

-class Service(BaseModel):
+class Service(BaseModel):
@@
-    subComponentType: Optional[str] = None
+    subComponentType: Optional[Literal["prefill", "decode"]] = None

316-323: Error message mentions decode even for prefill.

Keep message neutral to match actual worker_name.

Apply this diff:

-            raise ValueError(
-                f"Missing extraPodSpec or mainContainer in VLLM decode worker service '{worker_name}'"
-            )
+            raise ValueError(
+                f"Missing extraPodSpec or mainContainer in VLLM worker service '{worker_name}'"
+            )

487-490: Guard deletion if prefill service is absent.

-            del cfg.spec.services[
-                prefill_worker_name
-            ]
+            if prefill_worker_name in cfg.spec.services:
+                del cfg.spec.services[prefill_worker_name]

573-579: Add prefill fallback for SGLang model-name resolution.

Avoids KeyError when only prefill exists.

-        worker_service_name = get_service_name_from_sub_component_type(cfg.spec.services, "decode")
-
-        if not worker_service_name:
-            worker_service_name = WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name
+        worker_service_name = get_service_name_from_sub_component_type(cfg.spec.services, "decode")
+        if not worker_service_name:
+            worker_service_name = get_service_name_from_sub_component_type(cfg.spec.services, "prefill")
+        if not worker_service_name:
+            worker_service_name = WORKER_COMPONENT_NAMES["sglang"].decode_worker_k8s_name
+        worker_service = cfg.spec.services.get(worker_service_name)
+        if not worker_service:
+            worker_service = cfg.spec.services.get(WORKER_COMPONENT_NAMES["sglang"].prefill_worker_k8s_name)
+        if not worker_service:
+            logger.warning(
+                f"Worker service '{worker_service_name}' not found, using default model name: {DEFAULT_MODEL_NAME}"
+            )
+            return DEFAULT_MODEL_NAME
-        worker_service = cfg.spec.services[worker_service_name]

684-686: Guard deletion of decode worker.

Avoid KeyError if decode worker isn’t in services.

-            del cfg.spec.services[decode_worker_name]
+            if decode_worker_name in cfg.spec.services:
+                del cfg.spec.services[decode_worker_name]

726-739: Same guard for decode-only conversion path.

-            worker_service = cfg.spec.services["TRTLLMWorker"]
+            if "TRTLLMWorker" not in cfg.spec.services:
+                raise ValueError("Missing TRT-LLM worker service after conversion; ensure decode or prefill worker exists")
+            worker_service = cfg.spec.services["TRTLLMWorker"]

789-792: Safer dict access for TRT-LLM worker service.

If the resolved name isn’t present, this will KeyError. Consider .get with an explicit error.

-        worker_service = cfg.spec.services[
-            worker_name
-        ]
+        worker_service = cfg.spec.services.get(worker_name)
+        if not worker_service:
+            raise ValueError(f"TRT-LLM worker service '{worker_name}' not found in config")

368-368: TODO: Frontend resolution by componentType.

Acknowledged. Consider adding a helper similar to get_service_name_from_sub_component_type for frontend when you tackle this.
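
A hedged sketch of what such a helper could look like; the function name and the "frontend" componentType value are assumptions:

def get_service_name_from_component_type(services: dict, component_type: str = "frontend"):
    # Hypothetical mirror of get_service_name_from_sub_component_type, keyed on componentType
    # (assumes pydantic-style Service objects as in benchmarks/profiler/utils/config.py).
    matches = [
        name
        for name, svc in services.items()
        if getattr(svc, "componentType", None) == component_type
    ]
    if len(matches) > 1:
        raise ValueError(f"Multiple services with componentType '{component_type}' found")
    return matches[0] if matches else None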

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70f9938 and 6622928.

📒 Files selected for processing (29)
  • benchmarks/profiler/profile_sla.py (6 hunks)
  • benchmarks/profiler/tests/config_test.py (1 hunks)
  • benchmarks/profiler/utils/config.py (21 hunks)
  • components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/trtllm/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
  • components/planner/src/dynamo/planner/__init__.py (1 hunks)
  • components/planner/src/dynamo/planner/defaults.py (1 hunks)
  • components/planner/src/dynamo/planner/kube.py (5 hunks)
  • components/planner/src/dynamo/planner/kubernetes_connector.py (2 hunks)
  • components/planner/src/dynamo/planner/utils/exceptions.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (4 hunks)
  • components/planner/src/dynamo/planner/virtual_connector.py (6 hunks)
  • components/planner/test/kube.py (9 hunks)
  • components/planner/test/kubernetes_connector.py (8 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml (4 hunks)
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml (4 hunks)
  • deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (4 hunks)
  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (4 hunks)
  • deploy/cloud/operator/internal/consts/consts.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (1 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (3 hunks)
  • deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (7 hunks)
  • docs/kubernetes/sla_planner_deployment.md (4 hunks)
  • tests/planner/perf_test_configs/disagg_8b_planner.yaml (2 hunks)
  • tests/planner/scaling/disagg_planner.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-18T16:05:05.534Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-09-04T19:03:06.643Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2872
File: examples/multimodal/deploy/agg_qwen.yaml:53-60
Timestamp: 2025-09-04T19:03:06.643Z
Learning: In the dynamo repository, Kubernetes Custom Resources use `gpu: "1"` format for GPU resource limits and requests, not the standard Kubernetes `nvidia.com/gpu: 1` format. This applies to DynamoGraphDeployment resources and other dynamo CRs.

Applied to files:

  • deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml
  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
📚 Learning: 2025-07-18T16:04:47.465Z
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml:1233-1235
Timestamp: 2025-07-18T16:04:47.465Z
Learning: The `stopSignal` field in Kubernetes CRDs like DynamoGraphDeployment and DynamoComponentDeployment is autogenerated by controller-gen when upgrading Kubernetes library versions, and represents expected upstream API changes rather than manual code that needs custom validation.

Applied to files:

  • deploy/cloud/helm/crds/templates/nvidia.com_dynamographdeployments.yaml
🧬 Code graph analysis (14)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller.go (2)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • SubComponentType (38-40)
deploy/cloud/operator/internal/consts/consts.go (1)
  • KubeLabelDynamoSubComponentType (40-40)
deploy/cloud/operator/api/v1alpha1/dynamocomponentdeployment_types.go (1)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • SubComponentType (38-40)
components/planner/test/kube.py (2)
components/planner/src/dynamo/planner/kube.py (5)
  • KubernetesAPI (40-137)
  • update_graph_replicas (80-92)
  • is_deployment_ready (94-102)
  • wait_for_graph_deployment_ready (104-137)
  • get_graph_deployment (63-78)
components/planner/src/dynamo/planner/utils/exceptions.py (1)
  • DynamoGraphDeploymentNotFoundError (35-53)
deploy/cloud/operator/internal/dynamo/graph.go (2)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • SubComponentType (38-40)
deploy/cloud/operator/internal/consts/consts.go (1)
  • KubeLabelDynamoSubComponentType (40-40)
components/planner/src/dynamo/planner/__init__.py (1)
components/planner/src/dynamo/planner/kubernetes_connector.py (2)
  • TargetReplica (50-53)
  • SubComponentType (38-40)
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2)
deploy/cloud/operator/internal/consts/consts.go (2)
  • ComponentTypeWorker (53-53)
  • KubeLabelDynamoSubComponentType (40-40)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • SubComponentType (38-40)
components/planner/src/dynamo/planner/virtual_connector.py (2)
components/planner/src/dynamo/planner/kubernetes_connector.py (6)
  • SubComponentType (38-40)
  • TargetReplica (50-53)
  • wait_for_deployment_ready (126-136)
  • verify_prefill_and_decode_components_exist (100-124)
  • add_component (67-81)
  • remove_component (83-98)
components/planner/src/dynamo/planner/utils/exceptions.py (1)
  • DeploymentValidationError (100-110)
benchmarks/profiler/profile_sla.py (1)
benchmarks/profiler/utils/config.py (8)
  • set_config_tp_size (164-165)
  • set_config_tp_size (290-335)
  • set_config_tp_size (523-567)
  • set_config_tp_size (782-830)
  • get_kv_cache_size_from_dynamo_log (176-177)
  • get_kv_cache_size_from_dynamo_log (401-427)
  • get_kv_cache_size_from_dynamo_log (632-651)
  • get_kv_cache_size_from_dynamo_log (896-930)
benchmarks/profiler/tests/config_test.py (1)
benchmarks/profiler/utils/config.py (20)
  • Config (72-75)
  • SGLangConfigModifier (430-651)
  • TrtllmConfigModifier (654-930)
  • VllmV1ConfigModifier (197-427)
  • Service (56-61)
  • get_service_name_from_sub_component_type (182-193)
  • break_arguments (78-90)
  • parse_override_engine_args (136-155)
  • get_model_name (168-169)
  • get_model_name (338-366)
  • get_model_name (570-598)
  • get_model_name (833-876)
  • convert_config (160-161)
  • convert_config (199-287)
  • convert_config (432-520)
  • convert_config (656-779)
  • set_config_tp_size (164-165)
  • set_config_tp_size (290-335)
  • set_config_tp_size (523-567)
  • set_config_tp_size (782-830)
components/planner/src/dynamo/planner/utils/planner_core.py (2)
components/planner/src/dynamo/planner/kubernetes_connector.py (6)
  • KubernetesConnector (56-214)
  • TargetReplica (50-53)
  • SubComponentType (38-40)
  • set_component_replicas (138-169)
  • verify_prefill_and_decode_components_exist (100-124)
  • wait_for_deployment_ready (126-136)
components/planner/src/dynamo/planner/virtual_connector.py (4)
  • VirtualConnector (29-328)
  • set_component_replicas (306-328)
  • verify_prefill_and_decode_components_exist (115-126)
  • wait_for_deployment_ready (95-113)
deploy/cloud/operator/internal/dynamo/graph_test.go (2)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • SubComponentType (38-40)
deploy/cloud/operator/internal/consts/consts.go (2)
  • KubeLabelDynamoSubComponentType (40-40)
  • ComponentTypeWorker (53-53)
components/planner/src/dynamo/planner/kubernetes_connector.py (3)
components/planner/src/dynamo/planner/utils/exceptions.py (5)
  • DeploymentValidationError (100-110)
  • DuplicateSubComponentError (80-97)
  • EmptyTargetReplicasError (113-122)
  • ComponentError (56-62)
  • SubComponentNotFoundError (65-77)
benchmarks/profiler/utils/config.py (1)
  • Service (56-61)
components/planner/src/dynamo/planner/kube.py (5)
  • KubernetesAPI (40-137)
  • get_graph_deployment (63-78)
  • update_graph_replicas (80-92)
  • wait_for_graph_deployment_ready (104-137)
  • is_deployment_ready (94-102)
components/planner/test/kubernetes_connector.py (3)
components/planner/src/dynamo/planner/utils/exceptions.py (5)
  • DeploymentValidationError (100-110)
  • DuplicateSubComponentError (80-97)
  • DynamoGraphDeploymentNotFoundError (35-53)
  • EmptyTargetReplicasError (113-122)
  • SubComponentNotFoundError (65-77)
components/planner/src/dynamo/planner/kubernetes_connector.py (9)
  • KubernetesConnector (56-214)
  • SubComponentType (38-40)
  • TargetReplica (50-53)
  • get_service_from_sub_component_type_or_name (173-214)
  • number_replicas (47-48)
  • add_component (67-81)
  • remove_component (83-98)
  • set_component_replicas (138-169)
  • verify_prefill_and_decode_components_exist (100-124)
components/planner/src/dynamo/planner/kube.py (4)
  • is_deployment_ready (94-102)
  • get_graph_deployment (63-78)
  • update_graph_replicas (80-92)
  • wait_for_graph_deployment_ready (104-137)
components/planner/src/dynamo/planner/kube.py (2)
components/planner/src/dynamo/planner/utils/exceptions.py (1)
  • DynamoGraphDeploymentNotFoundError (35-53)
lib/bindings/python/src/dynamo/runtime/logging.py (1)
  • configure_dynamo_logging (77-107)
🪛 Ruff (0.13.1)
components/planner/test/kube.py
  250-250: Unused function argument: mock_custom_api (ARG001)
  264-264: Unused function argument: mock_custom_api (ARG001)

components/planner/src/dynamo/planner/virtual_connector.py
  115-115: Unused method argument: prefill_component_name (ARG002)
  115-115: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
  115-115: Unused method argument: decode_component_name (ARG002)
  115-115: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)

components/planner/src/dynamo/planner/kubernetes_connector.py
  100-100: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)
  100-100: PEP 484 prohibits implicit Optional; convert to T | None (RUF013)

components/planner/src/dynamo/planner/kube.py
  77-77: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)

benchmarks/profiler/utils/config.py
  189-189: Avoid specifying long messages outside the exception class (TRY003)

🪛 markdownlint-cli2 (0.18.1)
docs/kubernetes/sla_planner_deployment.md
  16-16: Link text should be descriptive (MD059, descriptive-link-text)
  132-132: Link text should be descriptive (MD059, descriptive-link-text)

🔇 Additional comments (31)
docs/kubernetes/sla_planner_deployment.md (1)

30-30: LGTM: kube-prometheus-stack prerequisite is clear

The prerequisite aligns with removing embedded Prometheus from planner deployments.

components/planner/src/dynamo/planner/utils/planner_core.py (2)

14-14: Ensure re-exports or import directly from submodules
Importing KubernetesConnector, VirtualConnector, TargetReplica, and SubComponentType from dynamo.planner root requires them to be exposed in dynamo/planner/__init__.py. Confirm they’re listed in __all__ (or explicitly imported), or switch to importing each from its defining submodule.
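
For illustration, a minimal re-export sketch for dynamo/planner/__init__.py (module paths are taken from this PR; the exact __all__ contents are an assumption):

# components/planner/src/dynamo/planner/__init__.py (illustrative sketch)
# Re-export planner-facing types so callers can import them from the package root.
from dynamo.planner.kubernetes_connector import (
    KubernetesConnector,
    SubComponentType,
    TargetReplica,
)
from dynamo.planner.virtual_connector import VirtualConnector

__all__ = [
    "KubernetesConnector",
    "VirtualConnector",
    "SubComponentType",
    "TargetReplica",
]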


96-98: Attribute access is safe; all component classes define both prefill_worker_k8s_name and decode_worker_k8s_name.

deploy/cloud/operator/internal/dynamo/graph_test.go (1)

65-69: Sub-component labeling and propagation look correct.

Usage of SubComponentType in specs and KubeLabelDynamoSubComponentType in labels is consistent and covered across single-node and multinode cases. Good coverage.

Also applies to: 110-115, 1093-1095, 1246-1247, 1650-1652, 1774-1776, 1925-1927

components/planner/src/dynamo/planner/kubernetes_connector.py (2)

38-54: Shift to SubComponentType- and TargetReplica‑based scaling is solid.

Enum and models are clear/typed; fallback logic is handled elsewhere. Nice.
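
As a rough sketch of the shape being described here (field names and the pydantic base class are assumptions, not the PR's exact definitions):

# Illustrative sketch only; the real definitions live in kubernetes_connector.py.
from enum import Enum

from pydantic import BaseModel


class SubComponentType(str, Enum):
    PREFILL = "prefill"
    DECODE = "decode"


class TargetReplica(BaseModel):
    sub_component_type: SubComponentType
    component_name: str
    desired_replicas: int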


147-151: Confirm readiness gate before scaling won’t stall recovery.

Skipping updates when the deployment isn’t Ready may block corrective scaling during convergence. If intentional, consider documenting the rationale, or allow an override flag.

components/planner/src/dynamo/planner/__init__.py (1)

11-13: Public API exports look right.

Re-exporting SubComponentType and TargetReplica aligns callers with the new interface.

Also applies to: 17-17

deploy/cloud/operator/internal/controller/dynamographdeployment_controller.go (1)

128-129: Improved status update logging.

Including CRD, state, reason, and message will help triage status write failures.

components/backends/vllm/deploy/disagg_planner.yaml (1)

70-71: Correctly typed subComponentType for VLLM workers.

Values match planner expectations: decode/prefill.

If applicable, ensure sglang/trtllm disagg configs also set subComponentType consistently across workers.

Also applies to: 96-97

deploy/cloud/operator/internal/consts/consts.go (1)

40-40: LGTM: new label constant

The key name aligns with existing label conventions.

tests/planner/scaling/disagg_planner.yaml (1)

70-70: LGTM: subComponentType values match planner enum

decode/prefill align with the planner’s expected values.

Also applies to: 94-94

deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (2)

787-787: Leader pod now carries SubComponentType label: LGTM

Good to see the label propagated to leader pods; helps selectors/observability stay consistent. Ensure any consumers expect it on leaders too.


898-898: Worker pod SubComponentType label: LGTM

Expectation matches the new API surface and consts.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamocomponentdeployments.yaml (1)

686-688: Docs tweak LGTM

Description casing/readability improvement looks good.

components/backends/sglang/deploy/disagg_planner.yaml (1)

70-70: subComponentType annotations for workers — LGTM

The additions align with planner/kube lookup by subComponentType and the runtime args (--disaggregation-mode) are consistent.

Confirm the CRD/operator version deployed in your clusters accepts spec.services.*.subComponentType so these manifests won’t be rejected.

Also applies to: 102-102

components/backends/trtllm/deploy/disagg_planner.yaml (1)

88-88: Explicit subComponentType for TRT‑LLM workers — LGTM

Consistent with planner/profiler changes and fallback naming. Probes/args already reflect decode/prefill modes.

Double‑check WORKER_COMPONENT_NAMES["trtllm"] maps to TRTLLMDecodeWorker/TRTLLMPrefillWorker to keep fallback behavior sane.

Also applies to: 137-137

tests/planner/perf_test_configs/disagg_8b_planner.yaml (1)

104-104: VLLM workers labeled with subComponentType — LGTM

Matches planner/kube and profiler expectations.

Also applies to: 161-161

benchmarks/profiler/profile_sla.py (2)

490-491: Best prefill TP application — LGTM

Signature update with explicit "prefill" target looks correct.


562-563: Best decode TP application — LGTM

Signature update with explicit "decode" target looks correct.
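
A stand-alone stub of what a target-aware set_config_tp_size interface could look like (names and defaults are illustrative, not the profiler's actual signatures):

# Stub for illustration; the real modifiers live in benchmarks/profiler/utils/config.py.
class StubConfigModifier:
    def set_config_tp_size(self, config: dict, tp_size: int, target: str = "decode") -> dict:
        # `target` selects which sub-component ("prefill" or "decode") to modify
        config.setdefault(target, {})["tensor_parallel_size"] = tp_size
        return config


modifier = StubConfigModifier()
cfg = modifier.set_config_tp_size({}, 2, "prefill")
cfg = modifier.set_config_tp_size(cfg, 4, "decode")
# cfg == {"prefill": {"tensor_parallel_size": 2}, "decode": {"tensor_parallel_size": 4}}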

components/planner/test/kube.py (5)

79-91: update_graph_replicas test — LGTM

Validates the patch payload/path precisely.


105-106: is_deployment_ready tests — LGTM

Covers True/False paths via conditions parsing.

Also applies to: 123-124


182-194: Not‑found path coverage — LGTM

Asserts DynamoGraphDeploymentNotFoundError and validates contextual fields.


20-21: Import client — LGTM

Needed for ApiException checks.


24-24: Using public exception type in tests — LGTM

Validates new error surface.

deploy/cloud/operator/config/crd/bases/nvidia.com_dynamographdeployments.yaml (1)

10356-10359: Constrain subComponentType with enum in both CRDs
An open string field is error-prone; replace the plain string with an enum of known values (e.g. “prefill”, “decode”) in both nvidia.com_dynamographdeployments.yaml and nvidia.com_dynamocomponentdeployments.yaml. Confirm the full list of valid subComponentType values in the operator code and update the enum accordingly.

 subComponentType:
-  description: SubComponentType indicates the sub-role of this component (for example, "prefill").
-  type: string
+  description: SubComponentType indicates the sub-role of this component (e.g., "prefill" or "decode").
+  type: string
+  enum:
+    - prefill
+    - decode
components/planner/test/kubernetes_connector.py (1)

396-426: Nice coverage of duplicate/missing sub-components

Good validation of combined error aggregation and message content for duplicates and missing components.
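
For reference, the resolution rule these tests exercise could be sketched as follows (service names, the error type, and the helper name are simplified assumptions; the real logic is get_service_from_sub_component_type_or_name):

# Sketch: prefer an explicit subComponentType, reject duplicates, fall back to a hardcoded name.
def resolve_service(services: dict, sub_component_type: str, fallback_name: str) -> str:
    matches = [
        name
        for name, spec in services.items()
        if spec.get("subComponentType") == sub_component_type
    ]
    if len(matches) > 1:
        raise ValueError(f"duplicate services declare subComponentType={sub_component_type}")
    if matches:
        return matches[0]
    if fallback_name in services:
        return fallback_name
    raise ValueError(f"no service found for sub-component type {sub_component_type}")


services = {
    "PrefillWorker": {"subComponentType": "prefill"},
    "DecodeWorker": {"subComponentType": "decode"},
}
assert resolve_service(services, "prefill", "PrefillWorker") == "PrefillWorker"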

components/planner/src/dynamo/planner/kube.py (1)

94-103: LGTM: readiness check is correct

The Ready condition detection is straightforward and robust.
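
A minimal sketch of this kind of Ready-condition check, assuming a dict-shaped custom-resource status:

# Illustrative only; the actual is_deployment_ready lives in components/planner/src/dynamo/planner/kube.py.
def is_ready(deployment: dict) -> bool:
    conditions = deployment.get("status", {}).get("conditions", []) or []
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


deployment = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}
assert is_ready(deployment)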

components/planner/src/dynamo/planner/virtual_connector.py (1)

95-114: LGTM: non-K8s readiness loop is sensible

The readiness polling mirrors K8s behavior without raising; good parity.
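
A sketch of a non-raising readiness poll in this spirit (timeouts and names are illustrative):

import asyncio
import time


# Poll a readiness predicate until it passes or the deadline expires; return a bool
# instead of raising, leaving the caller to log and decide how to proceed.
async def wait_until_ready(check_ready, timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if await check_ready():
            return True
        await asyncio.sleep(poll_s)
    return False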

benchmarks/profiler/tests/config_test.py (1)

1-13: Solid, comprehensive test coverage

Good breadth across VLLM/SGLang/TRT-LLM paths with and without subComponentType, including TP updates and arg translation.

benchmarks/profiler/utils/config.py (2)

933-937: Approve code changes
All call sites pass config as the first argument to get_kv_cache_size_from_dynamo_log; updating the Protocol signature remains the only required change.
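
For illustration, the Protocol shape implied by the call sites might look like this (parameter names other than config are assumptions):

from typing import Protocol


class ConfigModifierProtocol(Protocol):
    # `config` must come first so the Protocol matches how the concrete modifiers are called.
    def get_kv_cache_size_from_dynamo_log(self, config: dict, log_path: str) -> int:
        ...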


28-28: WORKER_COMPONENT_NAMES['trtllm'] defines decode_worker_k8s_name – verified that TrtllmComponentName includes decode_worker_k8s_name (and related attributes), so the lookup is safe.
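
A hypothetical sketch of that mapping, using the worker names mentioned in these comments (the real TrtllmComponentName definition may differ):

from dataclasses import dataclass


@dataclass(frozen=True)
class TrtllmComponentName:
    prefill_worker_k8s_name: str = "TRTLLMPrefillWorker"
    decode_worker_k8s_name: str = "TRTLLMDecodeWorker"


WORKER_COMPONENT_NAMES = {"trtllm": TrtllmComponentName()}
print(WORKER_COMPONENT_NAMES["trtllm"].decode_worker_k8s_name)  # TRTLLMDecodeWorker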

tmonty12 and others added 3 commits September 26, 2025 12:29
Signed-off-by: tmontfort <[email protected]>
Signed-off-by: tmontfort <[email protected]>
Contributor

@tedzhouhk tedzhouhk left a comment


Can we update the planner doc to note that the planner now requires a cluster-wide Prometheus server to be set up, and provide instructions or point to the existing doc?

@tmonty12
Contributor Author

Can we update the planner doc to note that the planner now requires a cluster-wide Prometheus server to be set up, and provide instructions or point to the existing doc?

I updated sla_planner_deployment.md to include installing kube-prometheus-stack as a prerequisite.

Signed-off-by: hongkuanz <[email protected]>
@tmonty12 tmonty12 merged commit 1a5016b into main Sep 29, 2025
21 of 23 checks passed
@tmonty12 tmonty12 deleted the tmonty12/add-sub-component-type branch September 29, 2025 16:41