
Conversation

@zhenhuaw-me
Member

@zhenhuaw-me zhenhuaw-me commented Aug 14, 2025

This is to address Kaiyu's offline suggestion on
#6853.

Keeping this separate from the original PR for cleanliness.

Summary by CodeRabbit

  • Documentation
    • Polished deployment guides for DeepSeek R1, Llama3.3‑70B, and Llama4 Scout.
    • Standardized formatting of commands, URLs, ports, paths, and options (inline code, code blocks, angle-bracket links).
    • Converted prerequisites and parameters to clear bullet lists with consistent defaults and value styling.
    • Expanded and clarified YAML/CLI examples and configuration sections (cache, CUDA/attention/MoE).
    • Improved troubleshooting, benchmarking, and sample-output presentation; no behavior or command changes.

@zhenhuaw-me zhenhuaw-me requested a review from a team as a code owner August 14, 2025 01:03
@coderabbitai
Contributor

coderabbitai bot commented Aug 14, 2025

📝 Walkthrough

Documentation-only refactors of three TRT-LLM quick-start guides: standardized code formatting, bulletized option descriptions, consistent code/backtick and link markup, and expanded/clarified YAML and command examples. No changes to commands, defaults, server behavior, or exported/public APIs.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Deepseek & Llama3 quick-start formatting (docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md, docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md) | Rewrote prerequisites, commands, paths, ports, URLs, and YAML/CLI examples to use consistent code/backtick or angle-bracket markup and bullet lists; normalized config option presentation and code blocks. No functional or semantic changes. |
| Llama4 Scout guide refinements (docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md) | Unified link and code formatting, converted many sections to bullet-style option descriptions, added/clarified YAML snippets (e.g., enable_attention_dp, cuda_graph_config, kv_cache_config), standardized sample outputs and benchmark references. No behavioral changes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

1.0_doc

Suggested reviewers

  • QiJune
  • kaiyux
  • litaotju


@zhenhuaw-me zhenhuaw-me self-assigned this Aug 14, 2025
@zhenhuaw-me zhenhuaw-me requested a review from kaiyux August 14, 2025 01:03
@zhenhuaw-me zhenhuaw-me enabled auto-merge (squash) August 14, 2025 01:03
@zhenhuaw-me
Member Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (23)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (8)

11-15: Tighten prerequisite wording and casing (“later”, “Python 3”); minor grammar fixes

Lowercase “later,” prefer “Python 3,” and make the optional note consistent.

-* Drivers: CUDA Driver 575 or Later  
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+* Drivers: CUDA driver 575 or later  
+* Python 3 and python3-pip (optional; for accuracy evaluation only)

44-47: Polish notes: fix mkdir, reduce duplication, add punctuation

Use mkdir -p, add a comma, remove the duplicate “from your host,” and end with periods.

-* The command mounts your user `.cache` directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist please create it using `$ mkdir ~/.cache`.
+* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents redownloading the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with: `mkdir -p ~/.cache`.
-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host  
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.

49-49: Grammar: “use the latest main branch”

Insert the definite article and simplify phrasing.

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>. 
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

72-92: Clarify FP8 vs FP4 config selection to avoid confusion

You show two config snippets that both set dtype: fp8, where the second adds moe_config “for FP8.” Readers using FP4 may be unsure which snippet to follow. Add an explicit note that FP4 users should use the first snippet (without moe_config) and keep the rest identical.

Would you like me to add one clarifying sentence after Line 72 stating “For FP4, use the same config without the ‘moe_config’ section”?
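
For reference, a minimal sketch of the FP4 variant under that suggestion, assuming it is simply the FP8 snippet with the moe_config section dropped (values copied from the config quoted later in this review, not re-verified against the guide):

```bash
# Hypothetical FP4 config sketch: identical to the FP8 snippet minus moe_config.
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8
stream_interval: 10
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
EOF
```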


236-238: Fix incorrect PyTorch CUDA memory doc URL

The current link duplicates “docs” in the hostname and the path.

-... please refer to the [PyTorch documentation on optimizing memory usage](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
+... please refer to the [PyTorch documentation on optimizing memory usage](https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).

267-272: Add language for fenced code block to satisfy markdownlint (MD040)

Use a neutral language like “text” for tabular sample output.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
 |     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
-```
+``` 

286-291: Add language for fenced code block to satisfy markdownlint (MD040)

Same issue as above for the FP4 sample results table.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
 |     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|
-```
+``` 

335-335: Remove unnecessary backslashes in GitHub URL

Backslashes aren’t needed within angle brackets and will render literally.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>. 
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8)

15-19: Tighten prerequisite wording and casing

Lowercase “later,” use “Python 3,” and keep “optional” lowercase for consistency.

-* Drivers: CUDA Driver 575 or Later  
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+* Drivers: CUDA driver 575 or later  
+* Python 3 and python3-pip (optional; for accuracy evaluation only)

47-49: Improve .cache note: fix mkdir and punctuation

Use mkdir -p and add a comma after “exist”.

-* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist please create it using  mkdir `~/.cache`.
+* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents redownloading the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with: `mkdir -p ~/.cache`.

50-53: Grammar nits: end sentence and add “the”

Add a period to the port mapping bullet and fix “use the latest main branch”.

-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

241-246: Add language for fenced code block to satisfy markdownlint

Mark sample results as plain text.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9189|±  |0.0075|
 |     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|
-```
+``` 

258-263: Add language for fenced code block to satisfy markdownlint

Same adjustment for FP4 results.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
 |     |       |strict-match    |     5|exact_match|↑  |0.8908|±  |0.0086|
-```
+``` 

267-268: Grammar: “first create a wrapper”

Correct gerund usage.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper `bench.sh` script.
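
As a rough illustration of such a wrapper, a sketch only; the module path and flag names follow benchmark_serving.py conventions but should be verified against the script in your checkout, and the model path and sweep values are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical bench.sh sketch; verify the module path and flags against
# tensorrt_llm/serve/scripts/benchmark_serving.py before relying on it.
set -e

MODEL="<model-path>"   # placeholder: the served model name or path
for concurrency in 1 4 16 64; do
  python -m tensorrt_llm.serve.scripts.benchmark_serving \
    --model "${MODEL}" \
    --host localhost \
    --port 8000 \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --max-concurrency "${concurrency}" \
    --num-prompts $((concurrency * 10))
done
```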

307-307: Remove unnecessary backslashes in GitHub URL

Avoid escaping underscores inside angle-bracket links.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.

221-229: Consistency: refer to the tool as “lm-evaluation-harness” or lm_eval, not “lm-eval”

The command uses lm_eval, while the text says “lm-eval.” Pick one label for consistency (typically “lm-evaluation-harness” in prose, lm_eval in commands).
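
For context, a hedged sketch of invoking the harness against a locally served OpenAI-compatible endpoint (flag names are lm-evaluation-harness conventions; the model path, URL, and concurrency are illustrative assumptions):

```bash
# Hypothetical lm_eval invocation against the local TRT-LLM server.
lm_eval --model local-completions \
  --model_args model=<model-path>,base_url=http://localhost:8000/v1/completions,num_concurrent=8 \
  --tasks gsm8k \
  --num_fewshot 5
```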

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (7)

50-53: Polish notes: avoid repetition, add period, fix grammar

End the port mapping bullet with a period and add “the” before “latest main branch”.

-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host  
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.
-* See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
+* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. The weekly main-branch builds have an “rcN” suffix; the monthly QA’d release has no “rcN” suffix. Use the “rc” release to get the latest model and feature support.
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>. 
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.

67-69: Trim trailing spaces in YAML key

Minor formatting: remove extra spaces after kv_cache_config:.

-kv_cache_config:     
+kv_cache_config:

234-240: Grammar and clarity in evaluation note; keep BOS warning concise

Remove the comma after “So,” and slightly tighten wording.

-* Note: The tokenizer will add BOS (beginning of sentence token) before input prompt by default which leads to accuracy regression on GSM8K task for Llama 3.3 70B instruction model. So, set `add_special_tokens=False` to avoid it.
+* Note: By default, the tokenizer prepends a BOS (beginning-of-sentence) token to the input prompt, which degrades GSM8K accuracy for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid this.

253-253: Fix capitalization: “Llama”, not “LLama”

Small typo in model name.

-* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instruction model. So set `add_special_tokens=False` to avoid it.
+* Note: The tokenizer will add BOS before the input prompt by default, which leads to accuracy regression on GSM8K for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.

263-268: Add language for fenced code block to satisfy markdownlint

Mark the tabular sample results as text.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9356|±  |0.0068|
 |     |       |strict-match    |     5|exact_match|↑  |0.8393|±  |0.0101|
-```
+``` 

272-273: Grammar: “First, create a wrapper script”

Fix the “first creating” phrasing.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper `bench.sh` script.

312-315: Remove unnecessary backslashes and improve emphasis

Unescape underscores in the GitHub link and wrap bench.sh in code ticks for consistency.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>. 
-Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
+Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb4ed18 and 1d343d9.

📒 Files selected for processing (3)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (10 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (9 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

[grammar] ~15-~15: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 o...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...ell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Do...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...ux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...

(QB_NEW_EN)

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

[grammar] ~11-~11: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 o...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...ell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Do...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...ux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...

(QB_NEW_EN)


[grammar] ~44-~44: There might be a mistake here.
Context: ...ease create it using $ mkdir ~/.cache. * You can mount additional directories and...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...

(QB_NEW_EN)


[grammar] ~257-~257: There might be a mistake here.
Context: ...l add BOS (beginning of sentence token) before input prompt by default which leads to ...

(QB_NEW_EN)


[grammar] ~257-~257: There might be a mistake here.
Context: ...ault which leads to accuracy regression on GSM8K task for DeepSeek R1 model. So, s...

(QB_NEW_EN)


[grammar] ~257-~257: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for DeepSeek R1 model. So, set `add_special...

(QB_NEW_EN)


[grammar] ~276-~276: There might be a mistake here.
Context: ...8K: * Note: The tokenizer will add BOS before input prompt by default, which leads to...

(QB_NEW_EN)


[grammar] ~276-~276: There might be a mistake here.
Context: ...ult, which leads to accuracy regression on GSM8K task for DeepSeek R1 model. So se...

(QB_NEW_EN)


[grammar] ~276-~276: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for DeepSeek R1 model. So set `add_special_...

(QB_NEW_EN)

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

[grammar] ~53-~53: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...

(QB_NEW_EN)


[grammar] ~234-~234: There might be a mistake here.
Context: ...l add BOS (beginning of sentence token) before input prompt by default which leads to ...

(QB_NEW_EN)


[grammar] ~234-~234: There might be a mistake here.
Context: ...ault which leads to accuracy regression on GSM8K task for Llama 3.3 70B instructio...

(QB_NEW_EN)


[grammar] ~234-~234: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for Llama 3.3 70B instruction model. So, se...

(QB_NEW_EN)


[grammar] ~253-~253: There might be a mistake here.
Context: ...M8K * Note: The tokenizer will add BOS before input prompt by default, which leads to...

(QB_NEW_EN)


[grammar] ~253-~253: There might be a mistake here.
Context: ...ult, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instructio...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

241-241: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


258-258: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)

155-201: Optional: sanity-check option names against TorchLlmArgs

Names like num_nextn_predict_layers and enable_attention_dp are uncommon; ensure they match TorchLlmArgs exactly for the target release.

Do you want me to run a quick cross-reference against the published TorchLlmArgs docs and open an issue if there’s a mismatch?
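
One local way to verify, sketched under the assumption of a TensorRT-LLM source checkout at the matching release tag:

```bash
# Confirm the option names exist as spelled somewhere in the source tree.
grep -rn --include='*.py' \
  -e 'num_nextn_predict_layers' \
  -e 'enable_attention_dp' \
  tensorrt_llm/
```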

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (2)

214-219: Nice additions to troubleshooting

Adding the explicit port-collision tip is helpful and actionable.
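
For readers acting on that tip, a quick check with standard Linux tools (either command works, depending on what is installed):

```bash
# See whether another process is already listening on port 8000.
ss -ltnp | grep ':8000'   # or: lsof -i :8000
```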


97-129: Overall: documentation polish achieves stated goal without semantic changes

The reformatting of CLI flags and YAML options into consistent bullets and inline code improves clarity. No semantic changes detected.

@tensorrt-cicd
Collaborator

PR_Github #15196 [ run ] triggered by Bot

@zhenhuaw-me zhenhuaw-me disabled auto-merge August 14, 2025 01:16
This is to address Kaiyu's offline suggestion on
NVIDIA#6853.

Keeping this separate from the original PR for cleanliness.

Signed-off-by: Zhenhua Wang <[email protected]>
@zhenhuaw-me zhenhuaw-me enabled auto-merge (squash) August 14, 2025 01:35
@zhenhuaw-me
Member Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15203 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15196 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (2)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)

267-273: Fix MD040: add language to fenced code blocks

markdownlint-cli2 reports MD040 for fenced code blocks in the file — add a language identifier (e.g., text) to the opening fence.

  • File: docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
    • MD040 at lines: 267, 286, 345 — replace the opening triple-backticks (```) with a language (e.g., ```text).

Suggested change (for the block at ~lines 267–273):

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
 |     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

244-250: Add language to fenced code blocks in quick-start-recipe-for-llama3.3-70b-on-trtllm.md — fixes MD040

Confirmed with markdownlint-cli2: MD040 appears for fenced code blocks in this file. Add a language (e.g., "text") for ASCII tables.

Files / locations to fix:
- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md — MD040 at lines 244, 263, 322

Apply the change (example diffs):

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9348|±  |0.0068|
 |     |       |strict-match    |     5|exact_match|↑  |0.8870|±  |0.0087|

And for the other fenced blocks flagged at lines 263 and 322, replace their opening fences similarly:

-```
+```text
🧹 Nitpick comments (15)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (5)

49-50: Grammar: “use latest main branch” → “use the latest main branch”

Tighten wording and break the run-on for clarity.

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.

72-92: Avoid overwriting the YAML when adding moe_config for FP8

As written, the second heredoc overwrites the first file. Either mention that it replaces the previous config or append only the MoE section. Appending keeps the doc concise.

Option A — clarify overwrite (minimal text-only tweak):

  • Add a sentence before the second block: “This command replaces the previous configuration.”

Option B — append only the new section:

-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: true
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 128
-kv_cache_config:
-  dtype: fp8
-stream_interval: 10
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-moe_config:
-  backend: DEEPGEMM
-  max_num_tokens: 3200
-EOF
+EXTRA_LLM_API_FILE=/tmp/config.yml
+cat << EOF >> ${EXTRA_LLM_API_FILE}
+moe_config:
+  backend: DEEPGEMM
+  max_num_tokens: 3200
+EOF

286-292: Fix markdownlint MD040: add language to second sample table block

Same lint as above.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
 |     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|

295-296: Grammar: “first creating a wrapper” → “first create a wrapper”

Small fix to avoid the gerund.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

335-336: Remove unnecessary backslashes in URL

Escaping underscores is unnecessary inside angle-bracket links and can render oddly.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (5)

15-19: Make prerequisites a bulleted list for consistency with other guides

Other quick-starts use bullets. Converting improves scannability and keeps style consistent.

-GPU: NVIDIA Blackwell or Hopper Architecture
-OS: Linux
-Drivers: CUDA Driver 575 or Later
-Docker with NVIDIA Container Toolkit installed
-Python3 and python3-pip (Optional, for accuracy evaluation only)
+* GPU: NVIDIA Blackwell or Hopper architecture
+* OS: Linux
+* Drivers: CUDA Driver 575 or later
+* Docker with NVIDIA Container Toolkit installed
+* Python3 and python3-pip (optional, for accuracy evaluation only)

53-54: Grammar: “use latest main branch” → “use the latest main branch”

Also tighten the second clause.

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.

214-218: Nit: format nvidia-smi as code and add a comma

Keeps CLI tool names consistent with code formatting used elsewhere.

-* For performance issues, check GPU utilization with nvidia-smi while the server is running.
+* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.

272-273: Grammar: “first creating a wrapper” → “first create a wrapper”

Same nit as other guides.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

312-313: Remove unnecessary backslashes in URL

Unescaped underscores are fine in angle-bracket links.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (5)

52-53: Grammar: “use latest main branch” → “use the latest main branch”

Also simplify structure.

-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.

241-246: Fix markdownlint MD040: add language to sample table code block

Add a language to the fenced block.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9189|±  |0.0075|
 |     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|

258-263: Fix markdownlint MD040: add language to second sample table block

Same lint as above.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
 |     |       |strict-match    |     5|exact_match|↑  |0.8908|±  |0.0086|

267-268: Grammar: “first creating a wrapper” → “first create a wrapper”

Align phrasing with the other guides.

-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

307-308: Remove unnecessary backslashes in URL

Underscores don’t need escaping in angle-bracket links.

-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d343d9 and 6493aab.

📒 Files selected for processing (3)
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (12 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (13 hunks)
  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
  • docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
  • docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

[grammar] ~15-~15: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or ...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ...kwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...

(QB_NEW_EN)


[grammar] ~17-~17: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...

(QB_NEW_EN)


[grammar] ~47-~47: There might be a mistake here.
Context: ...ease create it using $ mkdir ~/.cache. * You can mount additional directories and...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

241-241: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


258-258: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

241-241: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


258-258: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md

241-241: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)

42-47: Good consistency on container runtime notes

The bulleted “Note:” block improves readability and matches the styling used across related guides.

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

95-180: Solid, consistent option descriptions

The restructured “Configs and Parameters” section reads clearly and matches the style of the other quick-starts. Good alignment on defaults and terminology.

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (2)

11-12: Licensing link improvement LGTM

Using a clear anchor text for the Llama 4 Community License enhances readability.


93-180: Consistent, clear parameter docs

The standardized bullets and defaults under “Configs and Parameters” and “Extra LLM API Options” are clear and align with the other quick-starts.

@tensorrt-cicd
Collaborator

PR_Github #15203 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11483 completed with status: 'SUCCESS'

@zhenhuaw-me zhenhuaw-me merged commit 868c5d1 into NVIDIA:main Aug 14, 2025
5 checks passed
@zhenhuaw-me zhenhuaw-me deleted the fix-doc-format branch August 14, 2025 02:19