-
Notifications
You must be signed in to change notification settings - Fork 739
refactor: move engine configs out of components directory #3772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
WalkthroughThis PR refactors path resolution across the TRTLLM backend by introducing a Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Rationale: Large cohort of changes (20+ files) but predominantly homogeneous refactoring—repeated path migration pattern applied consistently across multiple file types. Each individual change involves straightforward path substitution with minimal logic complexity. Primary review focus should verify path consistency across files and validate that all recipe directory structures are correctly specified. Poem
Pre-merge checks❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/backends/trtllm/deploy/agg-with-config.yaml (1)
54-54: Critical path resolution mismatch between workingDir and ConfigMap mount.The ConfigMap is mounted at
/workspace/(line 58), but the workingDir remains/workspace/components/backends/trtllm(line 54). The relative path./recipes/qwen3/trtllm/agg.yamlwill resolve to/workspace/components/backends/trtllm/recipes/qwen3/trtllm/agg.yaml, which does not match the ConfigMap mount location at/workspace/recipes/qwen3/trtllm/agg.yaml.This will cause the deployment to fail when attempting to load the engine configuration.
Fix: Update workingDir to
/workspace/to match the ConfigMap mount, or update the extra-engine-args path to be absolute (e.g.,/workspace/recipes/qwen3/trtllm/agg.yaml).mainContainer: image: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag - workingDir: /workspace/components/backends/trtllm + workingDir: /workspace/ volumeMounts: - name: nvidia-config - mountPath: /workspace/ + mountPath: /workspace/ readOnly: true command: - python3 - -m - dynamo.trtllm args: - --model-path - Qwen/Qwen3-0.6B - --served-model-name - Qwen/Qwen3-0.6B - --extra-engine-args - ./recipes/qwen3/trtllm/agg.yamlAlso applies to: 58-58, 70-70
🧹 Nitpick comments (4)
docs/backends/trtllm/llama4_plus_eagle.md (1)
33-33: Update documentation note with full path clarification.Line 33 references the
recipes/llama4/trtllm/eaglefolder, but this is a relative path. Consider clarifying whether this path is relative to$DYNAMO_HOME, the container root, or from a specific working directory to reduce user confusion.-* Make sure the (`eagle3_one_model: true`) is set in the LLM API config inside the `recipes/llama4/trtllm/eagle` folder. +* Make sure the (`eagle3_one_model: true`) is set in the LLM API config inside the `$DYNAMO_HOME/recipes/llama4/trtllm/eagle` folder.docs/backends/trtllm/gpt-oss.md (1)
93-93: Documentation correctly reflects recipe-based paths, but consider mentioning DYNAMO_HOME environment variable.The configuration file references have been properly updated to use recipe-based paths. However, users may benefit from knowing that the paths assume
DYNAMO_HOME=/workspaceby default, and this environment variable can be customized. Consider adding a brief note in the Prerequisites or Instructions section about this.Also applies to: 100-100, 150-150, 167-167
recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml (1)
30-33: Helpful inline documentation for configuration changes.The comments referencing TensorRT-LLM upstream changes provide valuable context for maintainers. Consider adding similar documentation to other recipe files (prefill, decode variants) if they also reflect upstream API changes.
components/backends/trtllm/launch/disagg.sh (1)
6-6: Path resolution inconsistency: shell scripts vs deployment YAMLs.Shell scripts use absolute paths via
$DYNAMO_HOME/recipes/(Lines 10-11) while deployment YAMLs use relative paths./recipes/depending on workingDir. Both approaches assume/workspace/recipes/exists. While this should work, the inconsistency increases maintenance complexity. Consider standardizing on one approach (e.g., all absolute, or all relative with consistent workingDir).Also applies to: 10-11
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (32)
benchmarks/router/run_engines.sh(2 hunks)components/backends/trtllm/deploy/agg-with-config.yaml(2 hunks)components/backends/trtllm/deploy/agg.yaml(2 hunks)components/backends/trtllm/deploy/agg_router.yaml(2 hunks)components/backends/trtllm/deploy/disagg-multinode.yaml(4 hunks)components/backends/trtllm/deploy/disagg.yaml(4 hunks)components/backends/trtllm/deploy/disagg_planner.yaml(4 hunks)components/backends/trtllm/deploy/disagg_router.yaml(4 hunks)components/backends/trtllm/launch/agg.sh(1 hunks)components/backends/trtllm/launch/agg_metrics.sh(1 hunks)components/backends/trtllm/launch/agg_router.sh(1 hunks)components/backends/trtllm/launch/disagg.sh(1 hunks)components/backends/trtllm/launch/disagg_router.sh(1 hunks)components/backends/trtllm/launch/epd_disagg.sh(1 hunks)components/backends/trtllm/launch/gpt_oss_disagg.sh(1 hunks)docs/backends/trtllm/README.md(1 hunks)docs/backends/trtllm/gemma3_sliding_window_attention.md(4 hunks)docs/backends/trtllm/gpt-oss.md(3 hunks)docs/backends/trtllm/llama4_plus_eagle.md(2 hunks)docs/backends/trtllm/multimodal_support.md(2 hunks)docs/backends/trtllm/multinode/multinode-examples.md(2 hunks)docs/backends/trtllm/multinode/multinode-multimodal-example.md(2 hunks)docs/kubernetes/README.md(1 hunks)examples/basics/multinode/trtllm/srun_aggregated.sh(3 hunks)examples/basics/multinode/trtllm/srun_disaggregated.sh(4 hunks)recipes/deepseek-r1/trtllm/wide_ep/wide_ep_agg.yaml(1 hunks)recipes/deepseek-r1/trtllm/wide_ep/wide_ep_decode.yaml(1 hunks)recipes/deepseek-r1/trtllm/wide_ep/wide_ep_prefill.yaml(1 hunks)recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml(1 hunks)recipes/qwen3/trtllm/agg.yaml(1 hunks)recipes/qwen3/trtllm/decode.yaml(1 hunks)recipes/qwen3/trtllm/prefill.yaml(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-07-03T10:14:30.570Z
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Applied to files:
examples/basics/multinode/trtllm/srun_aggregated.shexamples/basics/multinode/trtllm/srun_disaggregated.sh
📚 Learning: 2025-07-02T13:20:28.800Z
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/job_script_template.j2:59-59
Timestamp: 2025-07-02T13:20:28.800Z
Learning: In the SLURM job script template at examples/sglang/slurm_jobs/job_script_template.j2, the `--total_nodes` parameter represents the total nodes per worker type (prefill or decode), not the total nodes in the entire cluster. Each worker type needs to know its own group size for distributed coordination.
Applied to files:
examples/basics/multinode/trtllm/srun_disaggregated.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang
- GitHub Check: vllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (31)
components/backends/trtllm/launch/agg_metrics.sh (2)
6-6: DYNAMO_HOME default is consistent across all launch scripts.Verified: All 7 launch scripts (
agg_metrics.sh,agg.sh,agg_router.sh,disagg.sh,disagg_router.sh,epd_disagg.sh,gpt_oss_disagg.sh) use the identical default${DYNAMO_HOME:-"/workspace"}at line 6.
9-9: No issues found—recipes directory structure is correctly defined.The referenced path
$DYNAMO_HOME/recipes/qwen3/trtllm/agg.yamlexists in the repository atrecipes/qwen3/trtllm/agg.yaml, along with the expecteddecode.yamlandprefill.yamlfiles. The directory structure follows the established pattern used across all supported models in the repository.recipes/deepseek-r1/trtllm/wide_ep/wide_ep_decode.yaml (1)
20-20: Load balancer file verified and properly configured.The
eplb.yamlfile exists atrecipes/deepseek-r1/trtllm/wide_ep/eplb.yamland is consistently referenced across all three wide_ep config files (decode, prefill, and agg). The file has been correctly migrated to the new recipes path.components/backends/trtllm/deploy/agg.yaml (2)
39-39: Manual verification required for path and file existence.The verification could not locate the file
components/backends/trtllm/deploy/agg.yaml, therecipesdirectory, or anyagg.yamlfiles in the repository structure. This could indicate the files are newly added in this PR and not yet present in the cloned state, or the paths differ from what was expected.Please manually verify that:
- The referenced engine config file exists at
./recipes/qwen3/trtllm/agg.yamlin the repository- The working directory context from which this relative path is resolved is correctly configured in your deployment setup
- The path change from the previous location is intentional and the file has been moved/created at the new location
28-28: Relative path resolution verified in repository structure.The repository contains
recipes/qwen3/trtllm/agg.yamlat the root level, confirming the relative path./recipes/qwen3/trtllm/agg.yamlin the args will resolve correctly whenworkingDiris set to/workspace/. The code change itself is valid.However, this requires confirming that the deployment container includes the recipes directory mounted or copied at
/workspace/recipes/during image build or pod initialization. Verify with your container deployment configuration and Dockerfile/build scripts to ensure recipes are present in the runtime environment.docs/backends/trtllm/llama4_plus_eagle.md (1)
57-57: File extension inconsistency confirmed in documentation—requires verification of correct extension.The documentation does contain the inconsistency: line 57 references
eagle_agg.yml(.yml) while lines 65 and 67 referenceeagle_prefill.yamlandeagle_decode.yaml(.yaml). However, these config files are deployment-time artifacts (paths reference/mnt/recipes/which is a runtime location) and do not exist in the repository.The inconsistency in the documented file extensions is real, but determining which is correct requires verifying against your actual deployment configuration or confirming the intended setup. Please verify whether
eagle_agg.ymlshould use.ymlor.yamlto match the other config files.recipes/deepseek-r1/trtllm/wide_ep/wide_ep_prefill.yaml (1)
20-20: No changes required—eplb.yaml file is correctly migrated and referenced.The load_balancer configuration file (
eplb.yaml) exists atrecipes/deepseek-r1/trtllm/wide_ep/eplb.yamland is properly referenced. The/mnt/recipes/...path is an intentional deployment-time mount point used consistently across all three related config files (wide_ep_prefill.yaml, wide_ep_agg.yaml, and wide_ep_decode.yaml). The migration from engine_configs/ to recipes is complete and verified.recipes/deepseek-r1/trtllm/wide_ep/wide_ep_agg.yaml (1)
14-14: Load balancer file exists and is correctly referenced.The
eplb.yamlfile is present atrecipes/deepseek-r1/trtllm/wide_ep/eplb.yamlwith valid content. The path reference inwide_ep_agg.yamlat line 14 correctly points to this file, and it has been properly migrated to the recipes directory structure.docs/backends/trtllm/README.md (1)
165-165: Incorrect review comment.The documented path
./recipes/deepseek-r1/trtllm/mtp/mtp_agg.yamlis an example environment variable setting for users, not a source repository file. These are external recipe configuration files that users need to obtain or create, relative to$DYNAMO_HOME/components/backends/trtllmwhere the command executes. This pattern is consistent throughout documentation (gemma3, gpt-oss, multinode examples), where recipe files are user-provided, not tracked in source control. The absence from the repository is expected and correct.Likely an incorrect or invalid review comment.
docs/backends/trtllm/gpt-oss.md (1)
131-143: Minor path inconsistency between launch script reference and manual command path.Line 131 references
/workspace/components/backends/trtllm, but line 139 shows/workspace/dynamo/components/backends/trtllm. Verify which path is correct for the updated directory structure, or clarify the context (e.g., whether mounted paths differ in containers).recipes/qwen3/trtllm/decode.yaml (1)
1-31: Recipe configuration file properly structured.The decode configuration is well-formed with appropriate defaults. The settings align with typical decode-phase optimizations (lower batch size, default cache transceiver backend).
recipes/qwen2-vl-7b-instruct/trtllm/agg.yaml (1)
1-29: Aggregated mode configuration properly structured.The configuration is complete with appropriate tensor parallelism (8) and batch sizing for non-disaggregated inference.
benchmarks/router/run_engines.sh (1)
7-7: DYNAMO_HOME and RECIPE_PATH properly implemented.The environment variable defaults are clean and the path construction is consistent. Engine config resolution now correctly uses the recipe-based structure.
Also applies to: 10-10, 89-95
examples/basics/multinode/trtllm/srun_disaggregated.sh (1)
13-13: Recipe config paths properly updated for multinode context.The engine config paths now correctly reference
/mnt/recipes/deepseek-r1/trtllm/wide_ep/with the new recipe structure. The mount point depth adjustment supports the new repository structure.Also applies to: 20-20, 24-24
components/backends/trtllm/launch/epd_disagg.sh (1)
6-6: DYNAMO_HOME pattern consistently applied.The script properly introduces DYNAMO_HOME with a sensible default and constructs the engine config paths correctly. Verify that all three recipe files (prefill, decode, encode) exist in
recipes/qwen2-vl-7b-instruct/trtllm/.Also applies to: 10-12
components/backends/trtllm/deploy/disagg_planner.yaml (2)
89-89: Worker workingDir properly updated for recipe path resolution.The TRTLLMDecodeWorker and TRTLLMPrefillWorker workingDir changes to
/workspace/correctly align with the recipe-based path structure. Relative paths like./recipes/qwen3/trtllm/decode.yamlwill now resolve correctly.Also applies to: 100-100, 118-118, 129-129
20-20: Frontend workingDir unchanged—verify this is intentional.The Frontend container workingDir remains at
/workspace/components/backends/trtllmwhile worker components have been moved to/workspace/. Confirm this is the intended behavior or update for consistency.components/backends/trtllm/deploy/disagg_router.yaml (1)
31-31: Verify recipes directory structure for both prefill and decode configurations.Similar to agg_router.yaml, the workingDir changes (Lines 31, 59) shift path resolution for engine configs (Lines 42, 70) to
/workspace/recipes/qwen3/trtllm/{prefill,decode}.yaml. Confirm these paths are accessible in deployed environments.Also applies to: 42-42, 59-59, 70-70
docs/backends/trtllm/multinode/multinode-examples.md (1)
139-139: Verify recipes directory is present at host path mounted to/mnt.The documentation examples reference
/mnt/recipes/...paths in containers, which map to the host viaMOUNTS="${PWD}/../:/mnt"(Line 110). Confirm that the recipes directory exists at the parent directory of the documentation folder and contains the referenced config files (wide_ep_agg.yaml, wide_ep_prefill.yaml, etc.).Also applies to: 168-169
components/backends/trtllm/launch/agg.sh (1)
6-6: Path resolution approach consistent with disagg.sh.This script follows the same DYNAMO_HOME-based path resolution as disagg.sh. Refer to earlier comment about path resolution inconsistency between shell scripts and deployment YAMLs.
Also applies to: 9-9
docs/backends/trtllm/multinode/multinode-multimodal-example.md (1)
37-37: Verify recipes directory structure includes model-specific subdirectories.The documentation references
/mnt/recipes/llama4/trtllm/multimodal/paths. Ensure the recipes directory hierarchy includes model-specific subdirectories (llama4, gemma3, deepseek-r1, etc.) and that the sed command on Line 37 correctly targets these new paths.Also applies to: 103-104
components/backends/trtllm/deploy/disagg.yaml (1)
29-29: Path resolution consistent with other YAML deployments, but see critical issue flagged for gemma3 documentation.This file follows the same relative path pattern as other deployment YAMLs (workingDir=/workspace/ with ./recipes/... references). However, note the critical path resolution context mismatch identified in the gemma3 documentation file regarding where recipes directory should actually exist.
Also applies to: 40-40, 57-57, 68-68
components/backends/trtllm/deploy/agg_router.yaml (1)
31-31: Verify volume mounts for recipes directory are configured elsewhere or baked into container image.The file
./recipes/qwen3/trtllm/agg.yamlexists in the repository, and the relative path will correctly resolve to/workspace/recipes/qwen3/trtllm/agg.yamlat runtime. However, theagg_router.yamlshown contains no volume mount configuration for therecipes/directory. Confirm either:
- The
recipes/directory is included in the container image at/workspace/recipes/, or- Volume mounts for the recipes directory are configured in a parent deployment resource or elsewhere in your setup.
components/backends/trtllm/launch/agg_router.sh (1)
6-6: LGTM!The DYNAMO_HOME introduction and path update are correct and consistent with the refactoring pattern across the PR.
Also applies to: 9-9
docs/backends/trtllm/multimodal_support.md (1)
28-28: LGTM!Documentation correctly reflects the new path structure and working directory conventions. The use of relative paths after
cd $DYNAMO_HOMEis appropriate and clear for users.Also applies to: 30-30, 78-78, 83-84
recipes/qwen3/trtllm/agg.yaml (1)
1-34: LGTM on configuration structure.The new YAML configuration is well-documented and includes helpful comments referencing upstream TensorRT-LLM changes. Consider verifying that this configuration works correctly with the qwen3-0.6B model during testing.
recipes/qwen3/trtllm/prefill.yaml (1)
1-30: LGTM on configuration structure.The prefill-specific configuration includes appropriate tuning for prefill-only workers, notably disabling overlap_scheduler with a clear explanation. The cache_transceiver_config suggests proper disaggregated mode setup.
components/backends/trtllm/launch/disagg_router.sh (1)
6-6: LGTM on pattern, but verify decode.yaml exists.The DYNAMO_HOME introduction and path updates follow the refactoring pattern. However, this script references
recipes/qwen3/trtllm/decode.yaml, which was not provided in the PR files.Verify that
recipes/qwen3/trtllm/decode.yamlexists in the repository or will be created in this PR.Also applies to: 10-11
components/backends/trtllm/launch/gpt_oss_disagg.sh (1)
6-6: LGTM on pattern, but verify gpt-oss-120b config files exist.The DYNAMO_HOME introduction is consistent with the PR pattern. The script references
recipes/gpt-oss-120b/trtllm/disagg/prefill.yamlanddecode.yaml, which were not provided in the PR files.Verify that the following files exist in the repository or will be created in this PR:
recipes/gpt-oss-120b/trtllm/disagg/prefill.yamlrecipes/gpt-oss-120b/trtllm/disagg/decode.yamlAlso applies to: 10-11
components/backends/trtllm/deploy/disagg-multinode.yaml (2)
128-142: Potential inconsistency: ConfigMap definitions not used by workers.The deployment YAML defines
prefill.yamlanddecode.yamlinline in the ConfigMap (lines 8-63), but the workers reference external files via./recipes/qwen3/trtllm/prefill.yamlanddecode.yaml. This suggests the inline ConfigMap definitions are unused.This could indicate:
- Incomplete refactoring where ConfigMap should be removed, or
- The args should reference the ConfigMap-mounted files instead (e.g.,
/workspace/prefill.yaml)Please clarify whether the inline ConfigMap definitions should be:
- Removed entirely (if recipes/ directory is now the source of truth), or
- Referenced correctly by updating the args to use them
Also applies to: 168-182
131-131: LGTM on path structure.The workingDir change to
/workspace/correctly enables relative path resolution for./recipes/...references. This is consistent with the documentation updates.Also applies to: 171-171
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
Signed-off-by: Anant Sharma <[email protected]>
richardhuo-nv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Overview:
move engine configs and multi node configs out of components directory. Engine configs were added to recipes folder.
ref: components dir restructing DEP.
Summary by CodeRabbit
Documentation
Chores
recipesdirectory structure.