[None][chore] fix markdown format for the deployment guide #6879
Conversation
📝 Walkthrough

Documentation-only refactors of three TRT-LLM quick-start guides: standardized code formatting, bulletized option descriptions, consistent code/backtick and link markup, and expanded/clarified YAML and command examples. No changes to commands, defaults, server behavior, or exported/public APIs.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
/bot run
Actionable comments posted: 0
🧹 Nitpick comments (23)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (8)
**11-15: Tighten prerequisite wording and casing (“later”, “Python 3”); minor grammar fixes**

Lowercase “later,” prefer “Python 3,” and make the optional note consistent.

```diff
-* Drivers: CUDA Driver 575 or Later
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+* Drivers: CUDA driver 575 or later
+* Python 3 and python3-pip (optional; for accuracy evaluation only)
```
**44-47: Polish notes: fix mkdir, reduce duplication, add punctuation**

Use `mkdir -p`, add a comma, remove the duplicate “from your host,” and end with periods.

```diff
-* The command mounts your user `.cache` directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist please create it using `$ mkdir ~/.cache`.
+* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents redownloading the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with: `mkdir -p ~/.cache`.
-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.
```
**49-49: Grammar: “use the latest main branch”**

Insert the definite article and simplify phrasing.

```diff
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
```
**72-92: Clarify FP8 vs FP4 config selection to avoid confusion**

You show two config snippets that both set `dtype: fp8`, where the second adds `moe_config` “for FP8.” Readers using FP4 may be unsure which snippet to follow. Add an explicit note that FP4 users should use the first snippet (without `moe_config`) and keep the rest identical.

Would you like me to add one clarifying sentence after Line 72 stating “For FP4, use the same config without the ‘moe_config’ section”?
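For reference, a minimal sketch of what that FP4 variant would look like under this suggestion. The keys are copied from the guide's FP8 example, with only the `moe_config` section omitted; treat it as illustrative rather than authoritative:

```bash
# Sketch only: same heredoc pattern as the guide. Per the suggestion above,
# FP4 users keep every key from the first snippet and simply omit moe_config.
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
kv_cache_config:
  dtype: fp8
stream_interval: 10
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1
EOF
```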
**236-238: Fix incorrect PyTorch CUDA memory doc URL**

The current link has “docs” duplicated in the hostname path.

```diff
-... please refer to the [PyTorch documentation on optimizing memory usage](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
+... please refer to the [PyTorch documentation on optimizing memory usage](https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
```
**267-272: Add language for fenced code block to satisfy markdownlint (MD040)**

Use a neutral language like “text” for tabular sample output.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
 |     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
 ```
````
**286-291: Add language for fenced code block to satisfy markdownlint (MD040)**

Same issue as above for the FP4 sample results table.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
 |     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|
 ```
````
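To verify these MD040 fixes locally, the same linter the review ran can be invoked directly; the glob below is illustrative:

```bash
# markdownlint-cli2 accepts file globs; run from the repository root.
# Assumes Node.js/npm are available.
npm install -g markdownlint-cli2
markdownlint-cli2 "docs/source/deployment-guide/*.md"
```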
**335-335: Remove unnecessary backslashes in GitHub URL**

Backslashes aren’t needed within angle brackets and will render literally.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
```

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8)
**15-19: Tighten prerequisite wording and casing**

Lowercase “later,” use “Python 3,” and keep “optional” lowercase for consistency.

```diff
-* Drivers: CUDA Driver 575 or Later
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+* Drivers: CUDA driver 575 or later
+* Python 3 and python3-pip (optional; for accuracy evaluation only)
```
**47-49: Improve .cache note: fix mkdir and punctuation**

Use `mkdir -p` and add a comma after “exist”.

```diff
-* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist please create it using mkdir `~/.cache`.
+* The command mounts your user `.cache` directory to save the downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents redownloading the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with: `mkdir -p ~/.cache`.
```
**50-53: Grammar nits: end sentence and add “the”**

Add a period to the port mapping bullet and fix “use the latest main branch”.

```diff
-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
```
**241-246: Add language for fenced code block to satisfy markdownlint**

Mark sample results as plain text.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9189|±  |0.0075|
 |     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|
 ```
````
**258-263: Add language for fenced code block to satisfy markdownlint**

Same adjustment for FP4 results.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
 |     |       |strict-match    |     5|exact_match|↑  |0.8908|±  |0.0086|
 ```
````
**267-268: Grammar: “first create a wrapper”**

Correct gerund usage.

```diff
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper `bench.sh` script.
```
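Since several comments touch this `bench.sh` wrapper, here is a hedged sketch of what such a wrapper might look like. The script path, model id, and flag names are assumptions (the script is derived from vLLM's `benchmark_serving.py`, so flags may differ by release); check `--help` before relying on it:

```bash
#!/usr/bin/env bash
# Hypothetical bench.sh wrapper -- verify flag names against your release with:
#   python benchmark_serving.py --help
set -euo pipefail

MODEL="meta-llama/Llama-4-Scout-17B-16E-Instruct"   # assumed model id

for concurrency in 1 8 32 64; do
  python /app/tensorrt_llm/serve/scripts/benchmark_serving.py \
    --model "$MODEL" \
    --host localhost \
    --port 8000 \
    --dataset-name random \
    --num-prompts 128 \
    --max-concurrency "$concurrency"
done
```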
**307-307: Remove unnecessary backslashes in GitHub URL**

Avoid escaping underscores inside angle-bracket links.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
```
**221-229: Consistency: refer to the tool as “lm-evaluation-harness” or `lm_eval`, not “lm-eval”**

The command uses `lm_eval`, while the text says “lm-eval.” Pick one label for consistency (typically “lm-evaluation-harness” in prose, `lm_eval` in commands).

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (7)
**50-53: Polish notes: avoid repetition, add period, fix grammar**

End the port mapping bullet with a period and add “the” before “latest main branch”.

```diff
-* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
+* The command also maps port `8000` from the container to the host so you can access the LLM API endpoint.
-* See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. Use the rc release to get the latest model and feature support.
+* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. The weekly main-branch builds have an “rcN” suffix; the monthly QA’d release has no “rcN” suffix. Use the “rc” release to get the latest model and feature support.
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, build from source to install TensorRT-LLM; see <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
```
**67-69: Trim trailing spaces in YAML key**

Minor formatting: remove extra spaces after `kv_cache_config:`.

```diff
-kv_cache_config:   
+kv_cache_config:
```
**234-240: Grammar and clarity in evaluation note; keep BOS warning concise**

Remove the comma after “So,” and slightly tighten wording.

```diff
-* Note: The tokenizer will add BOS (beginning of sentence token) before input prompt by default which leads to accuracy regression on GSM8K task for Llama 3.3 70B instruction model. So, set `add_special_tokens=False` to avoid it.
+* Note: By default, the tokenizer prepends a BOS (beginning-of-sentence) token to the input prompt, which degrades GSM8K accuracy for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid this.
```
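To see the BOS behavior this note describes, a quick check with the Hugging Face tokenizer makes the difference visible (a sketch; assumes `transformers` is installed and the model id is accessible to your HF account):

```bash
python3 - <<'PY'
# Sketch: show that the tokenizer prepends a BOS id unless told otherwise.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
with_bos = tok("What is 2+2?").input_ids
without_bos = tok("What is 2+2?", add_special_tokens=False).input_ids
print(with_bos[:3])     # first id is the BOS token
print(without_bos[:3])  # no BOS token prepended
PY
```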
**253-253: Fix capitalization: “Llama”, not “LLama”**

Small typo in model name.

```diff
-* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instruction model. So set `add_special_tokens=False` to avoid it.
+* Note: The tokenizer will add BOS before the input prompt by default, which leads to accuracy regression on GSM8K for the Llama 3.3 70B Instruct model. Set `add_special_tokens=False` to avoid it.
```
**263-268: Add language for fenced code block to satisfy markdownlint**

Mark the tabular sample results as text.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9356|±  |0.0068|
 |     |       |strict-match    |     5|exact_match|↑  |0.8393|±  |0.0101|
 ```
````
**272-273: Grammar: “First, create a wrapper script”**

Fix the “first creating” phrasing.

```diff
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper `bench.sh` script.
```
**312-315: Remove unnecessary backslashes and improve emphasis**

Unescape underscores in the GitHub link and wrap `bench.sh` in code ticks for consistency.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
-Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
+Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (10 hunks)
- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (9 hunks)
- docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
[grammar] ~15-~15: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 o...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...ell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Do...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...ux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...
(QB_NEW_EN)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
[grammar] ~11-~11: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 o...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...ell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Do...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...ux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...
(QB_NEW_EN)
[grammar] ~14-~14: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...
(QB_NEW_EN)
[grammar] ~44-~44: There might be a mistake here.
Context: ...ease create it using $ mkdir ~/.cache. * You can mount additional directories and...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...
(QB_NEW_EN)
[grammar] ~257-~257: There might be a mistake here.
Context: ...l add BOS (beginning of sentence token) before input prompt by default which leads to ...
(QB_NEW_EN)
[grammar] ~257-~257: There might be a mistake here.
Context: ...ault which leads to accuracy regression on GSM8K task for DeepSeek R1 model. So, s...
(QB_NEW_EN)
[grammar] ~257-~257: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for DeepSeek R1 model. So, set `add_special...
(QB_NEW_EN)
[grammar] ~276-~276: There might be a mistake here.
Context: ...8K: * Note: The tokenizer will add BOS before input prompt by default, which leads to...
(QB_NEW_EN)
[grammar] ~276-~276: There might be a mistake here.
Context: ...ult, which leads to accuracy regression on GSM8K task for DeepSeek R1 model. So se...
(QB_NEW_EN)
[grammar] ~276-~276: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for DeepSeek R1 model. So set `add_special_...
(QB_NEW_EN)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
[grammar] ~53-~53: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...
(QB_NEW_EN)
[grammar] ~234-~234: There might be a mistake here.
Context: ...l add BOS (beginning of sentence token) before input prompt by default which leads to ...
(QB_NEW_EN)
[grammar] ~234-~234: There might be a mistake here.
Context: ...ault which leads to accuracy regression on GSM8K task for Llama 3.3 70B instructio...
(QB_NEW_EN)
[grammar] ~234-~234: There might be a mistake here.
Context: ...ds to accuracy regression on GSM8K task for Llama 3.3 70B instruction model. So, se...
(QB_NEW_EN)
[grammar] ~253-~253: There might be a mistake here.
Context: ...M8K * Note: The tokenizer will add BOS before input prompt by default, which leads to...
(QB_NEW_EN)
[grammar] ~253-~253: There might be a mistake here.
Context: ...ult, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instructio...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
241-241: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
258-258: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
**155-201: Optional: sanity-check option names against TorchLlmArgs**

Names like `num_nextn_predict_layers` and `enable_attention_dp` are uncommon; ensure they match TorchLlmArgs exactly for the target release.

Do you want me to run a quick cross-reference against the published TorchLlmArgs docs and open an issue if there’s a mismatch?
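A lightweight way to do that sanity check is to load the YAML and compare its top-level keys against the TorchLlmArgs field registry. The import path and the Pydantic `model_fields` attribute are assumptions about the installed release:

```bash
python3 - <<'PY'
# Sketch: flag YAML keys that TorchLlmArgs does not recognize.
# Assumption: TorchLlmArgs is a Pydantic v2 model at this import path.
import yaml
from tensorrt_llm.llmapi.llm_args import TorchLlmArgs

cfg = yaml.safe_load(open("/tmp/config.yml"))
known = set(TorchLlmArgs.model_fields)
unknown = sorted(k for k in cfg if k not in known)
print("unrecognized top-level keys:", unknown or "none")
PY
```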
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (2)
**214-219: Nice additions to troubleshooting**

Adding the explicit port-collision tip is helpful and actionable.
**97-129: Overall: documentation polish achieves stated goal without semantic changes**

The reformatting of CLI flags and YAML options into consistent bullets and inline code improves clarity. No semantic changes detected.
PR_Github #15196 [ run ] triggered by Bot
This is to address Kaiyu's offline suggestion to NVIDIA#6853. Keep this separate from the original PR to keep it clean. Signed-off-by: Zhenhua Wang <[email protected]>
Force-pushed from 1d343d9 to 6493aab
/bot run
PR_Github #15203 [ run ] triggered by Bot

PR_Github #15196 [ run ] completed with state
Actionable comments posted: 0
🔭 Outside diff range comments (2)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
**267-273: Fix MD040: add language to fenced code blocks**

markdownlint-cli2 reports MD040 for fenced code blocks in the file — add a language identifier (e.g., `text`) to the opening fence.

- File: docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
- MD040 at lines: 267, 286, 345 — replace the opening triple-backticks (`` ``` ``) with a language (e.g., `` ```text ``).

Suggested change (for the block at ~lines 267–273):
````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9538|±  |0.0058|
 |     |       |strict-match    |     5|exact_match|↑  |0.9500|±  |0.0060|
````

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)

**244-250: Add language to fenced code blocks in quick-start-recipe-for-llama3.3-70b-on-trtllm.md — fixes MD040**

Confirmed with markdownlint-cli2: MD040 appears for fenced code blocks in this file. Add a language (e.g., "text") for ASCII tables.

Files / locations to fix:

- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md — MD040 at lines 244, 263, 322

Apply the change (example diffs):

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9348|±  |0.0068|
 |     |       |strict-match    |     5|exact_match|↑  |0.8870|±  |0.0087|
````

And for the other fenced blocks flagged at lines 263 and 322, replace their opening fences similarly:

````diff
-```
+```text
````
🧹 Nitpick comments (15)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (5)
**49-50: Grammar: “use latest main branch” → “use the latest main branch”**

Tighten wording and break the run-on for clarity.

```diff
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.
```
**72-92: Avoid overwriting the YAML when adding moe_config for FP8**

As written, the second heredoc overwrites the first file. Either mention that it replaces the previous config or append only the MoE section. Appending keeps the doc concise.
Option A — clarify overwrite (minimal text-only tweak):
- Add a sentence before the second block: “This command replaces the previous configuration.”
Option B — append only the new section:
```diff
-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-cat << EOF > ${EXTRA_LLM_API_FILE}
-enable_attention_dp: true
-cuda_graph_config:
-  enable_padding: true
-  max_batch_size: 128
-kv_cache_config:
-  dtype: fp8
-stream_interval: 10
-speculative_config:
-  decoding_type: MTP
-  num_nextn_predict_layers: 1
-moe_config:
-  backend: DEEPGEMM
-  max_num_tokens: 3200
-EOF
+EXTRA_LLM_API_FILE=/tmp/config.yml
+cat << EOF >> ${EXTRA_LLM_API_FILE}
+moe_config:
+  backend: DEEPGEMM
+  max_num_tokens: 3200
+EOF
```
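If the append-only variant (Option B) is adopted, a quick parse check confirms the merged file is still valid YAML and that `moe_config` landed at the top level; a small sketch:

```bash
python3 - <<'PY'
# Sketch: confirm the appended heredoc produced one well-formed YAML document.
import yaml

cfg = yaml.safe_load(open("/tmp/config.yml"))
assert "moe_config" in cfg, "moe_config section missing after append"
print(sorted(cfg))  # expect the original keys plus moe_config
PY
```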
**286-292: Fix markdownlint MD040: add language to second sample table block**

Same lint as above.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
 |     |       |strict-match    |     5|exact_match|↑  |0.9447|±  |0.0063|
````

**295-296: Grammar: “first creating a wrapper” → “first create a wrapper”**

Small fix to avoid the gerund.

```diff
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
```
**335-336: Remove unnecessary backslashes in URL**

Escaping underscores is unnecessary inside angle-bracket links and can render oddly.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
```

docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (5)
**15-19: Make prerequisites a bulleted list for consistency with other guides**

Other quick-starts use bullets. Converting improves scannability and keeps style consistent.

```diff
-GPU: NVIDIA Blackwell or Hopper Architecture
-OS: Linux
-Drivers: CUDA Driver 575 or Later
-Docker with NVIDIA Container Toolkit installed
-Python3 and python3-pip (Optional, for accuracy evaluation only)
+* GPU: NVIDIA Blackwell or Hopper architecture
+* OS: Linux
+* Drivers: CUDA Driver 575 or later
+* Docker with NVIDIA Container Toolkit installed
+* Python3 and python3-pip (optional, for accuracy evaluation only)
```
**53-54: Grammar: “use latest main branch” → “use the latest main branch”**

Also tighten the second clause.

```diff
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.
```
**214-218: Nit: format nvidia-smi as code and add a comma**

Keeps CLI tool names consistent with code formatting used elsewhere.

```diff
-* For performance issues, check GPU utilization with nvidia-smi while the server is running.
+* For performance issues, check GPU utilization with `nvidia-smi` while the server is running.
```
**272-273: Grammar: “first creating a wrapper” → “first create a wrapper”**

Same nit as other guides.

```diff
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
```
**312-313: Remove unnecessary backslashes in URL**

Unescaped underscores are fine in angle-bracket links.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
```

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (5)
**52-53: Grammar: “use latest main branch” → “use the latest main branch”**

Also simplify structure.

```diff
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT-LLM from source. See <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html> for instructions.
```
**241-246: Fix markdownlint MD040: add language to sample table code block**

Add a language to the fenced block.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9189|±  |0.0075|
 |     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|
````

**258-263: Fix markdownlint MD040: add language to second sample table block**

Same lint as above.

````diff
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|
 |     |       |strict-match    |     5|exact_match|↑  |0.8908|±  |0.0086|
````

**267-268: Grammar: “first creating a wrapper” → “first create a wrapper”**

Align phrasing with the other guides.

```diff
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
```
**307-308: Remove unnecessary backslashes in URL**

Underscores don’t need escaping in angle-bracket links.

```diff
-For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt\_llm/serve/scripts/benchmark\_serving.py>.
+For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (12 hunks)
- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (13 hunks)
- docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (8 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
- docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
- docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
[grammar] ~15-~15: There might be a mistake here.
Context: ... NVIDIA Blackwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or ...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ...kwell or Hopper Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...
(QB_NEW_EN)
[grammar] ~17-~17: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...
(QB_NEW_EN)
[grammar] ~47-~47: There might be a mistake here.
Context: ...ease create it using $ mkdir ~/.cache. * You can mount additional directories and...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md
241-241: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
258-258: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
241-241: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
258-258: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md
241-241: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
**42-47: Good consistency on container runtime notes**

The bulleted “Note:” block improves readability and matches the styling used across related guides.
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)
**95-180: Solid, consistent option descriptions**

The restructured “Configs and Parameters” section reads clearly and matches the style of the other quick-starts. Good alignment on defaults and terminology.
docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (2)
**11-12: Licensing link improvement LGTM**

Using a clear anchor text for the Llama 4 Community License enhances readability.
**93-180: Consistent, clear parameter docs**

The standardized bullets and defaults under “Configs and Parameters” and “Extra LLM API Options” are clear and align with the other quick-starts.
PR_Github #15203 [ run ] completed with state
This is to address Kaiyu's offline suggestion to #6853. Keep this separate from the original PR to keep it clean.