
Conversation

@dongfengy
Collaborator

@dongfengy dongfengy commented Aug 21, 2025

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive quick-start guide for deploying GPT-OSS on TensorRT-LLM with NVIDIA GPUs.
    • Covers prerequisites, supported backends for MoE, and recommended configuration options.
    • Provides step-by-step instructions for container setup, server configuration, and launch commands.
    • Includes guidance for testing the API, troubleshooting common issues, and running benchmarks.
    • Defined key metrics (TTFT, TPOT, ITL, E2EL, throughput) and example outputs.
    • Updated documentation navigation to include the new deployment guide.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
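
For illustration, two hypothetical invocations combining the options above (the stage and GPU names are the placeholder examples from this help text, not real stage names):

```shell
# Re-run only two named test stages, without failing fast on the first error
/bot run --stage-list "A10-PyTorch-1, xxx" --disable-fail-fast

# Reuse build artifacts and passing tests from the last pipeline, restricted to two GPU types
/bot run --reuse-test --gpu-type "A30, H100_PCIe"
```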

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Signed-off-by: Dongfeng Yu <[email protected]>
Signed-off-by: Dongfeng Yu <[email protected]>
@dongfengy dongfengy requested a review from a team as a code owner August 21, 2025 23:14
@coderabbitai
Contributor

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

Walkthrough

Adds a new quick-start deployment guide for GPT-OSS on TensorRT-LLM and updates the docs index to include it. No code changes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **New deployment quick-start guide**<br>`docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md` | New doc covering prerequisites, config (TRTLLM/CUTLASS MoE backends), Docker run, server launch (trtllm-serve), parameters, testing, troubleshooting, evaluation, and benchmarking. |
| **Docs toctree update**<br>`docs/source/index.rst` | Adds the new deployment guide to the Deployment Guide toctree. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

1.0_doc

Suggested reviewers

  • QiJune
  • nv-guomingz
  • juney-nvidia



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (13)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (13)

11-16: Tighten prerequisites wording and casing.

Minor grammar/style fixes improve clarity and consistency.

-* GPU: NVIDIA Blackwell Architecture
-* OS: Linux
-* Drivers: CUDA Driver 575 or Later
-* Docker with NVIDIA Container Toolkit installed
-* Python3 and python3-pip (Optional, for accuracy evaluation only)
+* GPU: NVIDIA Blackwell architecture
+* OS: Linux
+* Drivers: CUDA driver 575 or later
+* Docker with NVIDIA Container Toolkit
+* Python 3 and pip (optional, for accuracy evaluation only)

52-56: Polish notes; minor grammar and double-space cleanup.

-* See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
+* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Weekly “main” builds use an `rcN` suffix; monthly releases with QA do not. Use the `rc` build to get the latest model and feature support.

63-76: YAML indentation and keys look fine; add a hint about when to choose TRTLLM vs CUTLASS.

Config is valid. Consider adding one line explaining the trade-off so readers don’t have to scroll up to the table.

 moe_config:
-    backend: TRTLLM
+    backend: TRTLLM  # Prefer for low latency on B200/GB200 with MXFP8 activations and MXFP4 MoE weights

80-91: Duplicate file path will overwrite; call it out explicitly.

Both snippets write to /tmp/config.yml, which is fine, but say “this overwrites the previous file” to avoid confusion for copy-paste users.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config.yml  # This overwrites the previous file if it exists

189-196: Great to include a health probe; suggest adding readiness nuance.

Consider adding a note that compilation/warmup can cause a longer first response, and that a 200 on /health indicates HTTP liveness, not that warmup has completed.
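
As a minimal sketch of such a probe (assuming the server is reachable on localhost:8000 as in the launch command):

```shell
# Returns HTTP 200 once the HTTP server is up; the first inference may still be slow while warmup completes
curl -fsS http://localhost:8000/health && echo "server is up"
```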


231-234: Fix the Markdown link and clarify ownership.

The link text/URL are reversed. Also, consider clarifying that this is an optional external eval.

-We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals](gpt-oss-eval).
+We use the GPT‑OSS evaluation toolkit to test model accuracy. For more information, see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).

240-266: Benchmark script: ensure results directory exists; integrate save options.

result_dir is defined but unused. Suggest creating it and enabling save flags so users get artifacts by default.

 cat <<'EOF' > bench.sh
 #!/usr/bin/env bash
 set -euo pipefail
 
 concurrency_list="32 64 128 256 512 1024 2048 4096"
 multi_round=5
 isl=1024
 osl=1024
-result_dir=/tmp/gpt_oss_output
+result_dir=/tmp/gpt_oss_output
+mkdir -p "${result_dir}"
 
 for concurrency in ${concurrency_list}; do
     num_prompts=$((concurrency * multi_round))
     python -m tensorrt_llm.serve.scripts.benchmark_serving \
         --model openai/gpt-oss-120b \
         --backend openai \
         --dataset-name "random" \
         --random-input-len ${isl} \
         --random-output-len ${osl} \
         --random-prefix-len 0 \
         --random-ids \
         --num-prompts ${num_prompts} \
         --max-concurrency ${concurrency} \
         --ignore-eos \
         --tokenize-on-client \
-        --percentile-metrics "ttft,tpot,itl,e2el"
+        --percentile-metrics "ttft,tpot,itl,e2el" \
+        --save-result \
+        --result-dir "${result_dir}" \
+        --result-filename "concurrency_${concurrency}.json"
 done
 EOF
 chmod +x bench.sh

286-315: Add a code fence language to satisfy markdownlint MD040.

Mark this as text to avoid syntax highlighting and lint failures.

-```
+```text
 ============ Serving Benchmark Result ============
 Successful requests:                      16
 ...
 ==================================================

---

`1-9`: **Consistency: prefer “TensorRT-LLM” or “TRT-LLM”, not a mix.**

Title and body alternate between “TensorRT-LLM”, “TRT-LLM”, and “TRTLLM”. Consider a quick pass to standardize:
- Use “TensorRT-LLM” for the product name.
- Use “TRT-LLM” as the abbreviation in prose.
- Keep `TRTLLM` only where it is a literal config value.

---

`95-112`: **Sanity-check hardware parallelism guidance.**

`--tp_size 8` and `--ep_size 8` imply 64 GPUs per instance. Consider adding a short mapping table (e.g., 8×8=64 GPUs; 16×4=64; 16×8=128), and a reminder to set NCCL/multi-node env if scaling beyond a single node.

---

`55-58`: **Build-from-source link is helpful; add a quick pointer about driver/toolkit alignment.**

Readers often miss CUDA driver/runtime compatibility. Add a one-liner pointing to the support matrix page.


```diff
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want the latest main branch, build from source: <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>. Ensure your CUDA driver/toolkit versions meet the support matrix.
```

`68-76`: **Optional: quote YAML booleans to avoid accidental truthiness in some parsers.**

If these files are ever preprocessed by tools with loose typing, quoting can help. Not critical if only the LLM API reads them.

-enable_attention_dp: false
+enable_attention_dp: "false"
 ...
-  enable_padding: true
+  enable_padding: "true"

147-151: Link out to the authoritative args reference with a local cross-ref too.

You already link to TorchLlmArgs later; consider cross-linking here for convenience.

-* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. See also the TorchLlmArgs reference for all security-related flags.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f7c597e and eb9e645.

📒 Files selected for processing (2)
  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (1 hunks)
  • docs/source/index.rst (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

[grammar] ~11-~11: There might be a mistake here.
Context: ...es * GPU: NVIDIA Blackwell Architecture * OS: Linux * Drivers: CUDA Driver 575 or ...

(QB_NEW_EN)


[grammar] ~12-~12: There might be a mistake here.
Context: ...VIDIA Blackwell Architecture * OS: Linux * Drivers: CUDA Driver 575 or Later * Dock...

(QB_NEW_EN)


[grammar] ~13-~13: There might be a mistake here.
Context: ...inux * Drivers: CUDA Driver 575 or Later * Docker with NVIDIA Container Toolkit ins...

(QB_NEW_EN)


[grammar] ~14-~14: There might be a mistake here.
Context: ... with NVIDIA Container Toolkit installed * Python3 and python3-pip (Optional, for a...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...ts Type | MoE Backend | Use Case | |------------|------------------|-------...

(QB_NEW_EN)


[grammar] ~27-~27: There might be a mistake here.
Context: ...--------|-------------|----------------| | B200/GB200 | MXFP8 | MXFP4 ...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ... | TRTLLM | Low Latency | | B200/GB200 | MXFP8 | MXFP4 ...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...ease create it using $ mkdir ~/.cache. * You can mount additional directories and...

(QB_NEW_EN)


[grammar] ~57-~57: There might be a mistake here.
Context: ...el and feature support. If you want to use latest main branch, you can choose to b...

(QB_NEW_EN)


[grammar] ~215-~215: Ensure spelling is correct
Context: ... answers the questions. TODO: Use Chat Compeletions API / Responses API as the example afte...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.17.2)
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

288-288: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (4)
docs/source/index.rst (1)

41-41: Deployment guide TOC entry verified
The file deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md exists once and is referenced exactly once in docs/source/index.rst, so there are no duplicates and the Sphinx linkcheck will remain clean.

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (3)

162-171: I've initiated a code search for the TorchLlmArgs class and its cuda_graph_config defaults to verify the actual values set in the codebase. I'll update once we have definitive information.


39-48: Update TensorRT-LLM container tag to release:1.1.0rc0 and parameterize GPU visibility

The official NGC catalog now lists release:1.1.0rc0 as the latest TensorRT-LLM container (Modified August 15, 2025). Hard-coding the tag will quickly become outdated; parameterizing it simplifies future upgrades and maintains reproducibility (catalog.ngc.nvidia.com).

- docker run --rm -it \
+ TRTLLM_TAG=release:1.1.0rc0  # latest published TensorRT-LLM release as of August 15, 2025
+ docker run --rm -it \
   --ipc=host \
   --gpus all \
+  -e NVIDIA_VISIBLE_DEVICES=all \
   -p 8000:8000 \
   -v ~/.cache:/root/.cache:rw \
   --name tensorrt_llm \
-  nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
+  nvcr.io/nvidia/tensorrt-llm/${TRTLLM_TAG} \
   /bin/bash

Additional recommendations:


24-31: Fix grammar and clarify which “default” is meant (LLM‑API default vs recipe recommendation)

Short: fix the subject–verb issue and explicitly state which “default” you mean — the LLM‑API default is CUTLASS in the codebase, while some recipe docs recommend TRTLLM for certain hardware/latency cases.

Files to check / reference:

  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (lines ~24–31) — update here.
  • Code showing the LLM‑API default / allowed backends: tensorrt_llm/_torch/model_config.py (moe_backend = 'CUTLASS') and tensorrt_llm/llmapi/llm_args.py (backend Literal includes "CUTLASS","CUTEDSL","WIDEEP","TRTLLM","DEEPGEMM","TRITON").
  • Other docs that use recipe-specific defaults (e.g., quick-start for Llama shows Default: TRTLLM) — consider adding per-recipe clarification there too.

Suggested rewrite (replace the two sentences shown):

-There are multiple MOE backends inside TRT-LLM. Here are the support matrix of the MOE backends.
+There are multiple MoE backends in TensorRT-LLM. Below is the support matrix for these backends.

...

-The default moe backend is `CUTLASS`, so for the combination which is not supported by `CUTLASS`, one must set the `moe_config.backend` explicitly to run the model.
+The default MoE backend for the LLM API is `CUTLASS` (see tensorrt_llm/_torch/model_config.py and tensorrt_llm/llmapi/llm_args.py). For combinations not supported by `CUTLASS`, set `moe_config.backend` explicitly (for example, to `TRTLLM`). Note: some model/recipe docs may recommend a different backend (e.g., `TRTLLM` for low‑latency on B200/GB200) — if so, call that out as a recipe recommendation rather than a global default.

Recognized backend names (for metadata/validation): CUTLASS, TRTLLM, TRITON, CUTEDSL, WIDEEP, DEEPGEMM (see llm_args.py).
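
For example, a recipe-style override file could pin the backend explicitly (a sketch following the guide's /tmp/config.yml pattern; pick the backend value that matches your hardware):

```shell
# Sketch: write an extra LLM API options file that selects the MoE kernel backend explicitly
cat <<'EOF' > /tmp/config.yml
moe_config:
  backend: TRTLLM   # or CUTLASS / TRITON / CUTEDSL / WIDEEP / DEEPGEMM (see llm_args.py)
EOF
```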

Comment on lines +97 to +110
```shell
trtllm-serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--max_num_tokens 16384 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 8 \
--ep_size 8 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

💡 Verification agent

🧩 Analysis chain

Batch/sequence limits are inconsistent; fix --max_num_tokens.

For a 1024/1024 test with --max_batch_size 128 and --max_seq_len 2048, the per-batch token cap should be at least 128 × 2048 = 262144. With 16384, the scheduler may underutilize the batch or reject requests.

-    --max_batch_size 128 \
-    --max_num_tokens 16384 \
-    --max_seq_len 2048 \
+    --max_batch_size 128 \
+    --max_num_tokens 262144 \
+    --max_seq_len 2048 \

Alternatively, keep --max_num_tokens 16384 and lower --max_batch_size to 8 for a 2K sequence. Please also verify that all listed flags are supported by your targeted trtllm-serve version.


🏁 Script executed:

#!/bin/bash
# Find the trtllm-serve CLI and print its argparse help to ensure flags exist.
rg -nP 'def\s+main\(' -C1 tensorrt_llm/serve | cat
rg -nP '\.add_argument\((?:-+host|-+port|-+backend|-+max_batch_size|-+max_num_tokens|-+max_seq_len|-+kv_cache_free_gpu_memory_fraction|-+tp_size|-+ep_size|-+trust_remote_code|-+extra_llm_api_options)' -C2

Length of output: 468


🏁 Script executed:

#!/bin/bash
# List all Python scripts in the serve/scripts directory
find tensorrt_llm/serve/scripts -maxdepth 1 -type f -name '*.py' -print

# Search each for the ArgumentParser instantiation
rg -nP "ArgumentParser\s*\(" -C2 tensorrt_llm/serve/scripts/*.py

# Search each for .add_argument calls to confirm supported flags
rg -nP "\.add_argument\s*\(" -C2 tensorrt_llm/serve/scripts/*.py

Length of output: 23345


🏁 Script executed:

#!/bin/bash
# Search for max_batch_size and related flags across the repository
rg -n "max_batch_size" -C2 .
rg -n "max_num_tokens" -C2 .
rg -n "max_seq_len" -C2 .

Length of output: 273033


Fix docs: --max_num_tokens must cover max_batch_size × sequence length

The quick-start example sets --max_batch_size=128 and --max_seq_len=2048 but uses --max_num_tokens=16384, which is too small (128 × 2048 = 262144). Repository code and tests treat max_num_tokens as a per-batch token-cap, so the doc should be corrected to avoid underutilization or request rejection.

Files/lines verified:

  • docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (example snippet; lines cited in review).
  • tests/integration/defs/perf/build.py — build_config.max_num_tokens = build_config.max_batch_size * max(build_config.max_input_len, build_config.max_beam_width). (≈ lines 289–291)
  • tests/integration/defs/stress_test/stress_test.py — server_cmd includes "--max_batch_size" and "--max_num_tokens". (≈ lines 552–556)
  • tests/integration/defs/triton_server/build_model.sh — example using --max_batch_size=8 and --max_num_tokens=7392 (7392 = 8 × 924), showing pattern used across repo.

Recommended doc change (apply to the snippet in the PR):

-    --max_batch_size 128 \
-    --max_num_tokens 16384 \
-    --max_seq_len 2048 \
+    --max_batch_size 128 \
+    --max_num_tokens 262144 \
+    --max_seq_len 2048 \

Alternative (keep max_num_tokens=16384): reduce batch size to 8 for a 2K sequence:

-    --max_batch_size 128 \
-    --max_num_tokens 16384 \
+    --max_batch_size 8 \
+    --max_num_tokens 16384 \

Please update the doc example and verify the flags against the targeted trtllm-serve version (grep for parser.add_argument in the serve scripts or run trtllm-serve --help).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```shell
trtllm-serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--max_num_tokens 16384 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 8 \
--ep_size 8 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
```shell
trtllm-serve openai/gpt-oss-120b \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 128 \
--max_num_tokens 262144 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.9 \
--tp_size 8 \
--ep_size 8 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
🤖 Prompt for AI Agents
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md lines
97-110: the example sets --max_batch_size=128 and --max_seq_len=2048 but
--max_num_tokens=16384 is too small (should be >= max_batch_size × max_seq_len =
262144) per repo tests; update the snippet so --max_num_tokens is increased to
at least 262144 (or alternatively reduce --max_batch_size to 8 if you want to
keep 16384), and verify the chosen values match the trtllm-serve version flags
(grep parser.add_argument or run trtllm-serve --help) before committing.
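
As a quick sanity check of the sizing rule above (a sketch, independent of the serve CLI):

```shell
# max_num_tokens should cover max_batch_size * max_seq_len for full batches of max-length sequences
max_batch_size=128
max_seq_len=2048
echo $(( max_batch_size * max_seq_len ))   # 262144 -> the minimum --max_num_tokens for these settings
```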

Comment on lines +131 to +151
#### `--backend pytorch`

* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.

#### `--max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

#### `--trust_remote_code`

* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.



⚠️ Potential issue

Clarify --backend meaning vs moe_config.backend; add a security note for --trust_remote_code.

“backend” is overloaded: the CLI --backend pytorch selects the inference backend; moe_config.backend selects the MoE kernel backend. A short note avoids confusion. Also, trusting remote code has security implications.

 #### `--backend pytorch`
 
 * **Description:** Tells TensorRT-LLM to use the **pytorch** backend.
+  Note: This is separate from `moe_config.backend` in the YAML, which selects the MoE kernel implementation (e.g., `TRTLLM` or `CUTLASS`).
+
+#### Security note: `--trust_remote_code`
+Enabling this allows execution of model repository code. Use only with trusted sources (e.g., vetted orgs), and pin exact revisions for reproducibility and safety.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
around lines 131 to 151, clarify that the CLI flag `--backend` selects the
runtime inference backend (e.g., pytorch) while `moe_config.backend` selects the
Mixture-of-Experts kernel backend, and add a short security note to the
`--trust_remote_code` entry explaining it allows execution of model/tokenizer
code downloaded from Hugging Face and may run untrusted code so only enable it
for trusted models or in isolated environments; update the prose to explicitly
disambiguate the two backend meanings and append one concise sentence warning
about the security implications and recommended precautions for
`--trust_remote_code`.
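
To illustrate the pinning advice (a sketch; the revision hash is a placeholder, and the command assumes the huggingface-cli tool from huggingface_hub is installed):

```shell
# Pre-download the model at a pinned revision so --trust_remote_code only ever runs vetted repository code
huggingface-cli download openai/gpt-oss-120b --revision <commit-sha> --local-dir ./gpt-oss-120b
```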

💡 Verification agent

🧩 Analysis chain

Verify CLI and YAML options against current code to prevent drift.

Given how fast the serve CLI evolves, a quick repo check helps ensure the guide matches the actual argparse and TorchLlmArgs fields in this PR branch.


🏁 Script executed:

#!/bin/bash
# 1) Confirm TorchLlmArgs and moe_config fields exist
rg -nP 'class\s+TorchLlmArgs\b' -C3
rg -nP '\bmoe_config\b' -n -C2
rg -nP '\bcuda_graph_config\b' -n -C2

# 2) Confirm serve CLI accepts the documented flags
rg -nP '(argparse|click)\.ArgumentParser' -n -C2 tensorrt_llm/serve
rg -nP '\.add_argument\(\s*"--(host|port|backend|max_batch_size|max_num_tokens|max_seq_len|kv_cache_free_gpu_memory_fraction|tp_size|ep_size|trust_remote_code|extra_llm_api_options)"' -n -C2 tensorrt_llm

Length of output: 185037


Doc flags don’t match the trtllm-serve CLI implementation

A quick grep of tensorrt_llm/serve/scripts/benchmark_serving.py shows the serve command only defines:

  • --host
  • --port
  • --endpoint

The flags listed in the guide—
--backend, --max_batch_size, --max_num_tokens, --max_seq_len, --trust_remote_code— aren’t actually accepted by the CLI. To resolve this drift, please:

• Remove or correct those options in
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md (lines 131–151).
• If these settings are meant to be passed via a YAML file (using --extra_llm_api_options), update the guide to show the proper YAML format and CLI flag.
• Otherwise, implement the missing CLI parameters to match the documentation.

🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
lines 131–151, the listed CLI flags (--backend, --max_batch_size,
--max_num_tokens, --max_seq_len, --trust_remote_code) do not exist on
trtllm-serve; update the document to reflect the actual CLI (--host, --port,
--endpoint) and remove the incorrect flags, and add a short example showing how
to pass those LLM-specific settings via a YAML payload with the
--extra_llm_api_options flag (showing keys backend, max_batch_size,
max_num_tokens, max_seq_len, trust_remote_code in YAML form) so the guide is
accurate without changing the serve implementation.


Here is an example response, showing that the TRT-LLM server reasons and answers the questions.

TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.

⚠️ Potential issue

Typo: “Compeletions” → “Completions”.

-TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.
+TODO: Use Chat Completions API / Responses API as the example after the PR is merged.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TODO: Use Chat Compeletions API / Responses API as the example after the PR is merged.
TODO: Use Chat Completions API / Responses API as the example after the PR is merged.
🧰 Tools
🪛 LanguageTool

[grammar] ~215-~215: Ensure spelling is correct
Context: ... answers the questions. TODO: Use Chat Compeletions API / Responses API as the example afte...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md
around line 215, fix the typo "Compeletions" to "Completions" in the TODO text
so it reads "Use Chat Completions API / Responses API as the example after the
PR is merged."; ensure capitalization and spacing are preserved.

Comment on lines +217 to +219
```json
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
```

🛠️ Refactor suggestion

Replace unrealistic sample JSON; remove internal “analysis” text and keep a concise, valid response.

The current payload looks like a captured debug dump with internal commentary. Provide a minimal, representative response to avoid confusion.

-```json
-{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania.  \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**.  \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**.  \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area).  \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.**  \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
-```
+```json
+{
+  "id": "chatcmpl-123",
+  "object": "chat.completion",
+  "created": 1755815898,
+  "model": "openai/gpt-oss-120b",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "New York can refer to New York State in the northeastern U.S., or New York City at the mouth of the Hudson River on the Atlantic coast."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 18,
+    "completion_tokens": 28,
+    "total_tokens": 46
+  }
+}
+```
🤖 Prompt for AI Agents
docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md around
lines 217-219: the example JSON contains internal debug/"analysis" text and an
overly long assistant reply; replace the block with a concise, valid JSON
chat.completion object that removes any internal reasoning fields and internal
commentary, uses a short, user-facing assistant.content ("New York can refer to
New York State in the northeastern U.S., or New York City at the mouth of the
Hudson River on the Atlantic coast."), and keeps minimal realistic metadata (id,
object, created, model, choices array with index/message/role/content,
finish_reason, and a small usage object); ensure proper JSON syntax and no stray
markdown or analysis text.
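
For context, a request that yields a response of this shape might look like the following (a sketch assuming the server launched above and its OpenAI-compatible chat completions route):

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 256
      }'
```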

@nv-guomingz nv-guomingz changed the title [None][doc] Add GPT-OSS Deployment Guide for VDR Task [None][doc] Add GPT-OSS Deployment Guide into official doc site Aug 22, 2025
@juney-nvidia juney-nvidia requested a review from litaotju August 22, 2025 07:45
@juney-nvidia juney-nvidia changed the title [None][doc] Add GPT-OSS Deployment Guide into official doc site [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site Aug 22, 2025
@juney-nvidia
Collaborator

/bot skip --comment "No need to run full CI"

@juney-nvidia juney-nvidia enabled auto-merge (squash) August 22, 2025 07:49
@tensorrt-cicd
Collaborator

PR_Github #16156 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16156 [ skip ] completed with state SUCCESS
Skipping testing for commit eb9e645

@juney-nvidia juney-nvidia disabled auto-merge August 22, 2025 08:16
@juney-nvidia juney-nvidia merged commit d94cc3f into NVIDIA:main Aug 22, 2025
7 checks passed