chore: update trtllm-serve usage doc by removing backend parameter when it uses torch as backend. #6419
Conversation
📝 Walkthrough

This change removes all explicit uses of the `--backend pytorch` flag from `trtllm-serve` command examples in documentation, example scripts, and test argument lists, now that PyTorch is the default backend.

No sequence diagrams generated, as the changes are limited to documentation, scripts, and test argument lists.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
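For concreteness, a hedged before/after sketch of the kind of command change this PR makes; the model name, port, and options file are copied from the disaggregated example reviewed later on this page, and the flag placement is inferred from the review comments:

```bash
# Before: the backend was selected explicitly (now redundant).
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 \
  --backend pytorch --extra_llm_api_options ./context_extra-llm-api-config.yml

# After: PyTorch is the default backend (per #5717), so the flag is dropped.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 \
  --extra_llm_api_options ./context_extra-llm-api-config.yml
```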
Force-pushed from 24e3be9 to 2f353f3
/bot run
PR_Github #13222 [ run ] triggered by Bot
PR_Github #13222 [ run ] completed with state
Force-pushed from 2f353f3 to 0db508a
/bot run
PR_Github #13269 [ run ] triggered by Bot
PR_Github #13269 [ run ] completed with state
LinPoly left a comment
LGTM
Hi @nv-guomingz @LinPoly, this might be out of the scope of this PR, but I found that the TensorRT backend has different names across commands. Should we unify them before 1.0?

Good catch. If you don't mind, I want to update it in another PR to make the dedicated changes. @syuoni
Force-pushed from 0db508a to d3104a1
/bot reuse-pipeline
/bot run
PR_Github #13335 [ reuse-pipeline ] triggered by Bot
PR_Github #13380 [ run ] completed with state
Force-pushed from f1c4f2e to af04666
/bot run
Actionable comments posted: 0
🔭 Outside diff range comments (1)
examples/disaggregated/README.md (1)
112-116: Fix typo: `refersh_interval` → `refresh_interval`

```diff
-refersh_interval: 10.0
+refresh_interval: 10.0
```
🧹 Nitpick comments (5)
examples/disaggregated/README.md (5)
28-36: Remove residual double spaces after `--port` for cleaner CLI snippets

Extra spaces remained after the `--port` flag when the `--backend pytorch` token was deleted. They are harmless but visually noisy and propagate into copy-pasted commands.

```diff
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001  --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
-CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002  --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
-CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003  --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
+CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
+CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
```
98-103: Apply the same spacing tidy-up in dynamic-scaling examples

The commands in the ETCD section still carry the double space after `--port`. Align them with the static-launch examples for consistency.

```diff
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001  --server_role CONTEXT ...
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT ...
```

(Repeat for ports 8002 and 8003.)
124-129: Indentation off-by-one in added-server snippet

Line 126 is indented by five spaces whereas the surrounding flag lines use four, breaking alignment inside the code block.

```diff
-     --server_role GENERATION \
+    --server_role GENERATION \
```
181-183: Typo in section heading: `Know Issues` → `Known Issues`.
48-60: Optional: clarify that `backend: pytorch` is redundant

Since PyTorch is now the default backend, specifying it in `disagg_config.yaml` is optional. Consider adding a brief note (or removing the field) to prevent confusion after dropping the `--backend` CLI flag.
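For context, a minimal sketch of the kind of config this comment refers to. The key layout is an assumption modeled on the disaggregated example (only the ports and server roles appear on this page; the exact keys are not shown here):

```yaml
# Hypothetical disagg_config.yaml sketch; key names are assumptions.
hostname: localhost
port: 8000
backend: pytorch  # redundant now that PyTorch is the default backend
context_servers:
  num_instances: 2
  urls:
    - "localhost:8001"
    - "localhost:8002"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8003"
```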
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1 hunks)
- docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (1 hunks)
- docs/source/commands/trtllm-serve.rst (5 hunks)
- examples/disaggregated/README.md (3 hunks)
- examples/llm-api/llm_mgmn_trtllm_serve.sh (0 hunks)
- examples/models/core/gemma/README.md (0 hunks)
- examples/models/core/llama/README.md (0 hunks)
- examples/models/core/llama4/README.md (0 hunks)
- examples/models/core/qwen/README.md (0 hunks)
- examples/serve/deepseek_r1_reasoning_parser.sh (0 hunks)
- tests/integration/defs/disaggregated/test_disaggregated_etcd.py (2 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_json.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1 hunks)
- tests/unittest/llmapi/apps/_test_trtllm_serve_duplicated_args.py (1 hunks)
💤 Files with no reviewable changes (6)
- examples/llm-api/llm_mgmn_trtllm_serve.sh
- examples/serve/deepseek_r1_reasoning_parser.sh
- examples/models/core/gemma/README.md
- examples/models/core/llama4/README.md
- examples/models/core/llama/README.md
- examples/models/core/qwen/README.md
✅ Files skipped from review due to trivial changes (2)
- docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md
- docs/source/commands/trtllm-serve.rst
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/integration/defs/disaggregated/test_disaggregated_etcd.py
- tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py
- tests/unittest/llmapi/apps/_test_trtllm_serve_duplicated_args.py
- tests/unittest/llmapi/apps/_test_openai_chat_json.py
- tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-29T08:25:52.877Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
examples/disaggregated/README.md (2)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md
35-35: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #13399 [ run ] triggered by Bot
PR_Github #13399 [ run ] completed with state
Force-pushed from af04666 to 2e48c8e
/bot run --stage-list "H100_PCIe-TensorRT-Perf-1"
PR_Github #13474 [ run ] triggered by Bot
PR_Github #13474 [ run ] completed with state
Force-pushed from 2e48c8e to d4028cd
/bot run
PR_Github #13499 [ run ] triggered by Bot
PR_Github #13499 [ run ] completed with state
Force-pushed from d4028cd to 016242e
/bot run
PR_Github #13526 [ run ] triggered by Bot
PR_Github #13526 [ run ] completed with state
Signed-off-by: nv-guomingz <[email protected]>
Force-pushed from 016242e to b6696e8
/bot reuse-pipeline
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/disaggregated/README.md (1)
82-84: Add fenced-block language identifier to silence markdown-lint warnings
`markdownlint-cli2` reports MD040 because this fenced block lacks a language tag.

````diff
-```
+```bash
 python3 ./clients/disagg_client.py -c disagg_config.yaml -p ./clients/prompts.json -e chat
````

📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits
Reviewing files that changed from the base of the PR and between 016242e6401ffb3ec6b5119754bc06fcc648a0b2 and b6696e88f493366bbc60361819dc0d35374faa48.

📒 Files selected for processing (15)
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1 hunks)
- docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (1 hunks)
- docs/source/commands/trtllm-serve.rst (5 hunks)
- examples/disaggregated/README.md (3 hunks)
- examples/llm-api/llm_mgmn_trtllm_serve.sh (0 hunks)
- examples/models/core/gemma/README.md (0 hunks)
- examples/models/core/llama/README.md (0 hunks)
- examples/models/core/llama4/README.md (0 hunks)
- examples/models/core/qwen/README.md (0 hunks)
- examples/serve/deepseek_r1_reasoning_parser.sh (0 hunks)
- tests/integration/defs/disaggregated/test_disaggregated_etcd.py (2 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_json.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py (1 hunks)
- tests/unittest/llmapi/apps/_test_trtllm_serve_duplicated_args.py (1 hunks)

💤 Files with no reviewable changes (6)
- examples/llm-api/llm_mgmn_trtllm_serve.sh
- examples/serve/deepseek_r1_reasoning_parser.sh
- examples/models/core/llama/README.md
- examples/models/core/llama4/README.md
- examples/models/core/gemma/README.md
- examples/models/core/qwen/README.md

✅ Files skipped from review due to trivial changes (5)
- docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md
- tests/unittest/llmapi/apps/_test_openai_chat_structural_tag.py
- tests/unittest/llmapi/apps/_test_openai_chat_multimodal.py
- docs/source/commands/trtllm-serve.rst
- tests/unittest/llmapi/apps/_test_openai_chat_json.py

🚧 Files skipped from review as they are similar to previous changes (3)
- tests/unittest/llmapi/apps/_test_trtllm_serve_duplicated_args.py
- tests/integration/defs/disaggregated/test_disaggregated_etcd.py
- docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-07-30T06:11:42.362Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.

examples/disaggregated/README.md (2)
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md
101-101: Fenced code blocks should have a language specified
(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check

🔇 Additional comments (5)
examples/disaggregated/README.md (5)

28-29: Removal of `--backend pytorch` is correct and keeps the example in sync with the new default backend. No further action required for these two context-server commands.

35-35: Likewise, the generation-server example is now backend-agnostic. Looks good.

98-99: Dynamic-scaling context-server commands also updated correctly. The change aligns with the default backend behaviour.

102-102: Generation-server dynamic-scaling command updated. Nothing else to flag.

126-126: Newly added generation server example is consistent with prior updates. All good here.
PR_Github #13548 [ reuse-pipeline ] triggered by Bot
PR_Github #13548 [ reuse-pipeline ] completed with state
…en it use torch as backend. (NVIDIA#6419)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>

…en it use torch as backend. (NVIDIA#6419)
Signed-off-by: nv-guomingz <[email protected]>
Remove the explicit `--backend pytorch` from the trtllm-serve docs and examples, since PyTorch was made the default backend in #5717.

Summary by CodeRabbit

Documentation
- Removed the `--backend pytorch` option from all `trtllm-serve` command examples and related documentation, simplifying usage instructions across blogs, READMEs, and guides.

Tests
- Removed the `--backend pytorch` argument when launching servers, aligning tests with the streamlined command usage.
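Since only docs and tests change here, a quick way to sanity-check a server launched without the flag is to hit its OpenAI-compatible chat endpoint. The route below is an assumption inferred from the `_test_openai_chat_*` test files in this PR, not something this page confirms:

```bash
# Hypothetical smoke test against a server started on port 8001,
# as in the disaggregated example above.
curl -s http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
       "messages": [{"role": "user", "content": "Hello"}]}'
```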