# Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware

## Introduction

This deployment guide provides step-by-step instructions for running the GPT-OSS model using TensorRT-LLM, optimized for NVIDIA GPUs. It covers the complete setup, from accessing model weights and preparing the software environment to configuring TensorRT-LLM parameters, launching the server, and validating inference output.

The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA's accelerated stack, serving the model with the TensorRT-LLM container from NGC.

## Prerequisites

* GPU: NVIDIA Blackwell Architecture
* OS: Linux
* Drivers: CUDA Driver 575 or Later
* Docker with NVIDIA Container Toolkit installed
* Python3 and python3-pip (Optional, for accuracy evaluation only)

## Models

* MXFP4 model: [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b)


## MoE Backend Support Matrix

There are multiple MoE backends inside TRT-LLM. Here is the support matrix for the MoE backends.

| Device | Activation Type | MoE Weights Type | MoE Backend | Use Case |
|------------|------------------|------------------|-------------|----------------|
| B200/GB200 | MXFP8 | MXFP4 | TRTLLM | Low Latency |
| B200/GB200 | MXFP8 | MXFP4 | CUTLASS | Max Throughput |

The default MoE backend is `CUTLASS`, so for combinations that `CUTLASS` does not support, you must set `moe_config.backend` explicitly to run the model.

## Deployment Steps

### Run Docker Container

Run the docker container using the TensorRT-LLM NVIDIA NGC image.

```shell
docker run --rm -it \
    --ipc=host \
    --gpus all \
    -p 8000:8000 \
    -v ~/.cache:/root/.cache:rw \
    --name tensorrt_llm \
    nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \
    /bin/bash
```

Note:

* The command mounts your user `.cache` directory to persist downloaded model checkpoints, which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to re-download the weights each time you rerun the container. If the `~/.cache` directory doesn't exist, create it with `mkdir ~/.cache`.
* You can mount additional directories and paths using the `-v <host_path>:<container_path>` flag if needed, such as mounting the downloaded weight paths.
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host.
* See <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all available containers. Containers published weekly from the main branch have an `rcN` suffix, while the monthly releases that pass QA tests have no suffix. Use an `rc` release to get the latest model and feature support.

If you want to use the latest main branch, you can instead build and install TensorRT-LLM from source; the steps are described at <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
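
Once inside the container, it can help to confirm that all expected GPUs are visible before continuing:

```shell
# List the GPUs visible inside the container; expect one line per Blackwell GPU.
nvidia-smi -L
```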

### Creating the TRT-LLM Server config

We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.

For low-latency with `TRTLLM` MOE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: false
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: TRTLLM
EOF
```

For max-throughput with `CUTLASS` MOE backend:

```shell
EXTRA_LLM_API_FILE=/tmp/config.yml

cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
moe_config:
  backend: CUTLASS
EOF
```

### Launch the TRT-LLM Server

Below is an example command to launch the TRT-LLM server with the GPT-OSS model from within the container. The command is specifically configured for the 1024/1024 Input/Output Sequence Length test. The explanation of each flag is shown in the “Configs and Parameters” section.

```shell
trtllm-serve openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_num_tokens 16384 \
    --max_seq_len 2048 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 8 \
    --ep_size 8 \
    --trust_remote_code \
    --extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
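
Flag availability can change between TensorRT-LLM releases. If `trtllm-serve` rejects one of the options above, list what your installed version actually supports:

```shell
# Print the options accepted by the trtllm-serve CLI shipped in this container.
trtllm-serve --help
```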

After the server is set up, the client can now send prompt requests to the server and receive results.

### Configs and Parameters

These options are used directly on the command line when you start the `trtllm-serve` process.

#### `--tp_size`

* **Description:** Sets the **tensor-parallel size**. This should typically match the number of GPUs you intend to use for a single model instance.

#### `--ep_size`

* **Description:** Sets the **expert-parallel size** for Mixture-of-Experts (MoE) models. Like `tp_size`, this should generally match the number of GPUs you're using. This setting has no effect on non-MoE models.

#### `--kv_cache_free_gpu_memory_fraction`

* **Description:** A value between `0.0` and `1.0` that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
* **Recommendation:** If you experience OOM errors, try reducing this value to `0.7` or lower.

#### `--backend pytorch`

* **Description:** Tells TensorRT-LLM to use the **PyTorch** inference backend. This is separate from `moe_config.backend` in the YAML configuration, which selects the MoE kernel implementation (e.g., `TRTLLM` or `CUTLASS`).

#### `--max_batch_size`

* **Description:** The maximum number of user requests that can be grouped into a single batch for processing.

#### `--max_num_tokens`

* **Description:** The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.

#### `--max_seq_len`

* **Description:** The maximum possible sequence length for a single request, including both input and generated output tokens.

#### `--trust_remote_code`

* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. Because it permits execution of code shipped with the model repository, enable it only for sources you trust.



#### Extra LLM API Options (YAML Configuration)

These options provide finer control over performance and are set within a YAML file passed to the `trtllm-serve` command via the `--extra_llm_api_options` argument.

#### `cuda_graph_config`

* **Description**: A section for configuring CUDA graphs to optimize performance.

* **Options**:

  * `enable_padding`: If `true`, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

    **Default**: `false`

  * `max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.

    **Default**: `0`

    **Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.

#### `moe_config`

* **Description**: Configuration for Mixture-of-Experts (MoE) models.

* **Options**:

  * `backend`: The backend to use for MoE operations.

    **Default**: `CUTLASS`

See the [`TorchLlmArgs` class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the `extra_llm_api_options`.

## Testing API Endpoint

### Basic Test

Start a new terminal on the host to test the TensorRT-LLM server you just launched.

You can query the health/readiness of the server using:

```shell
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
```

When the `Status: 200` code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
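
If you script the deployment, a small wait loop can block until the server reports ready. This is a minimal sketch that polls the same `/health` endpoint:

```shell
# Poll the health endpoint until it returns HTTP 200; -f makes curl treat error responses as failures.
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "Waiting for the TRT-LLM server to become ready..."
  sleep 10
done
echo "Server is ready."
```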

After the TRT-LLM server is set up and shows `Application startup complete`, you can send requests to the server.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Where is New York?"
}
],
"max_tokens": 1024,
"top_p": 1.0
}' -w "\n"
```

Here is an example response, showing that the TRT-LLM server reasons through and answers the question.

TODO: Use Chat Completions API / Responses API as the example after the PR is merged.

```json
{"id":"chatcmpl-c5bf51b5cab94e10ba5da5266d12ee59","object":"chat.completion","created":1755815898,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisThe user asks: \"Where is New York?\" Likely they want location info. Provide answer: New York State in northeastern US, New York City on the east coast, coordinates, etc. Provide context.assistantfinal**New York** can refer to two related places in the United States:\n\n| What it is | Where it is | Approx. coordinates | How to picture it |\n|------------|------------|--------------------|-------------------|\n| **New York State** | The northeastern corner of the United States, bordered by **Vermont, Massachusetts, Connecticut, New Jersey, Pennsylvania, and the Canadian provinces of Ontario and Quebec**. | 42.7° N, 75.5° W (roughly the state’s geographic centre) | A roughly rectangular state that stretches from the Atlantic Ocean in the southeast to the Adirondack Mountains and the Great Lakes region in the north. |\n| **New York City (NYC)** | The largest city in the state, located on the **southern tip of the state** where the **Hudson River meets the Atlantic Ocean**. It occupies five boroughs: Manhattan, Brooklyn, Queens, The Bronx, and Staten Island. | 40.7128° N, 74.0060° W | A dense, world‑famous metropolis that sits on a series of islands (Manhattan, Staten Island, parts of the Bronx) and the mainland (Brooklyn and Queens). |\n\n### Quick geographic context\n- **On a map of the United States:** New York State is in the **Northeast** region, just east of the Great Lakes and north of Pennsylvania. \n- **From Washington, D.C.:** Travel roughly **225 mi (360 km) northeast**. \n- **From Boston, MA:** Travel about **215 mi (350 km) southwest**. \n- **From Toronto, Canada:** Travel about **500 mi (800 km) southeast**.\n\n### Travel tips\n- **By air:** Major airports include **John F. Kennedy International (JFK)**, **LaGuardia (LGA)**, and **Newark Liberty International (EWR)** (the latter is actually in New Jersey but serves the NYC metro area). \n- **By train:** Amtrak’s **Northeast Corridor** runs from **Boston → New York City → Washington, D.C.** \n- **By car:** Interstates **I‑87** (north‑south) and **I‑90** (east‑west) are the primary highways crossing the state.\n\n### Fun fact\n- The name “**New York**” was given by the English in 1664, honoring the Duke of York (later King James II). The city’s original Dutch name was **“New Amsterdam.”**\n\nIf you need more specific directions (e.g., how to get to a particular neighborhood, landmark, or the state capital **Albany**), just let me know!","reasoning_content":null,"tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null,"mm_embedding_handle":null,"disaggregated_params":null,"avg_decoded_tokens_per_iter":1.0}],"usage":{"prompt_tokens":72,"total_tokens":705,"completion_tokens":633},"prompt_token_ids":null}
```
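
The endpoint follows the OpenAI Chat Completions convention, so you can also request a streamed response. Below is a sketch, assuming streaming is supported by the server version you are running; with `"stream": true` the server returns incremental `data:` chunks instead of a single JSON object.

```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "Where is New York?"
        }
    ],
    "max_tokens": 256,
    "stream": true
}'
```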

### Troubleshooting Tips

* If you encounter CUDA out-of-memory errors, try reducing `max_batch_size` or `max_seq_len`.
* Ensure your model checkpoints are compatible with the expected format.
* For performance issues, check GPU utilization with `nvidia-smi` while the server is running (see the example after this list).
* If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
* For connection issues, make sure the server port (`8000` in this guide) is not being used by another application.
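
For the GPU utilization check mentioned above, one lightweight option is to sample utilization and memory once per second while the server is handling requests:

```shell
# Print per-GPU utilization and memory usage every second; stop with Ctrl+C.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 1
```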

### Running Evaluations to Verify Accuracy (Optional)

We use OpenAI's official evaluation tool to test the model's accuracy. For more information, see [gpt-oss-eval](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals).

TODO(@Binghan Chen): Add instructions for running gpt-oss-eval.

## Benchmarking Performance

To benchmark the performance of your TensorRT-LLM server, you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.

```shell
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="32 64 128 256 512 1024 2048 4096"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/gpt_oss_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model openai/gpt-oss-120b \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
```

If you want to save the results to a file, add the following options to the `benchmark_serving` command inside `bench.sh`.

```shell
--save-result \
--result-dir "${result_dir}" \
--result-filename "concurrency_${concurrency}.json"
```
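
With those options, each concurrency level writes one JSON file into `${result_dir}` (here `/tmp/gpt_oss_output`). Assuming the file names follow the pattern above, a quick way to inspect a saved result is:

```shell
ls /tmp/gpt_oss_output
# Pretty-print one result file to glance at the recorded metrics.
python3 -m json.tool /tmp/gpt_oss_output/concurrency_32.json | head -n 40
```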

For more benchmarking options see <https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py>.

Run `bench.sh` to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above `bench.sh` script.

```shell
./bench.sh
```

Sample TensorRT-LLM serving benchmark output. Your results may vary due to ongoing software optimizations.

```
============ Serving Benchmark Result ============
Successful requests: 16
Benchmark duration (s): 17.66
Total input tokens: 16384
Total generated tokens: 16384
Request throughput (req/s): [result]
Output token throughput (tok/s): [result]
Total Token throughput (tok/s): [result]
User throughput (tok/s): [result]
---------------Time to First Token----------------
Mean TTFT (ms): [result]
Median TTFT (ms): [result]
P99 TTFT (ms): [result]
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): [result]
Median TPOT (ms): [result]
P99 TPOT (ms): [result]
---------------Inter-token Latency----------------
Mean ITL (ms): [result]
Median ITL (ms): [result]
P99 ITL (ms): [result]
----------------End-to-end Latency----------------
Mean E2EL (ms): [result]
Median E2EL (ms): [result]
P99 E2EL (ms): [result]
==================================================
```

### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
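
As a rough consistency check on the reported numbers, the end-to-end latency of a single request is approximately the time to first token plus one inter-token interval per remaining output token, i.e. `E2EL ≈ TTFT + TPOT × (output_tokens − 1)`; large gaps between the two sides at high concurrency usually point to request queuing.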