Commits (43). The diff below shows changes from 1 commit.
992adfb
fix: add better port logic (#2175) (#2192)
alec-flowers Jul 30, 2025
9a93f11
chore: fix install (#2191)
ishandhanani Jul 30, 2025
2a616da
chore: fix QA bugs in documentation/readmes (#2199)
athreesh Jul 30, 2025
d0de1a0
feat: Add trtllm deploy examples for k8s #2133 (#2207)
biswapanda Jul 31, 2025
edccbd5
fix(sglang): disagg yaml worker change and agg kv router fix (#2205)
ishandhanani Jul 31, 2025
54fbff3
fix: add curl and jq for health checks #2203 (#2209)
biswapanda Jul 31, 2025
a9b6b28
fix: Kprashanth/trtllm rc4 cherry pick (#2218)
KrishnanPrash Jul 31, 2025
65e89b3
chore: cleanup dead links (#2208)
nealvaidya Jul 31, 2025
c92dc98
chore: update nixl version to 0.4.1 (#2221) (#2228)
nv-anants Jul 31, 2025
eb58916
chore: Remove multimodal readme. (#2212) (#2234)
krishung5 Jul 31, 2025
e848cf5
fix: Cherry pick pr 2186 release 0.4.0 to fix docs/runtime/README.md …
keivenchang Aug 1, 2025
5e3586d
fix: drop cuda graph bs (batch size) on dsr1 h100 sgl (#2235)
ishandhanani Aug 1, 2025
4fbb4e5
fix: handle groveTerminationDelay and auto-detect grove installation …
julienmancuso Aug 1, 2025
dc13774
fix: Locked triton==3.3.1 since triton 3.4.0 breaks tensorrt-llm 1.0.…
dmitry-tokarev-nv Aug 1, 2025
e5e94ad
fix: sgl instructions point to new frontend (#2245)
ishandhanani Aug 1, 2025
92781d3
fix: Update disagg configs for trtllm 1.0.0rc4 changes (release/0.4.0…
rmccorm4 Aug 4, 2025
58ad4a2
fix: readme instruction (#2265)
ishandhanani Aug 4, 2025
039c061
fix: Update eagle_one configs with speculative_model_dir field (#2283)
rmccorm4 Aug 4, 2025
2a8e251
docs: Backport: Dyn 591 (#2247) to 0.4.0 (#2251)
atchernych Aug 4, 2025
2dc4a4b
fix: trtllm container - ENV var used before declaration (#2277)
dmitry-tokarev-nv Aug 5, 2025
85737ba
fix: Update the NIXL TRTLLM commit version to rc4 (#2285)
tanmayv25 Aug 5, 2025
27c8a97
docs: add instruction to deploy model with inference gateway #2257 (#…
biswapanda Aug 5, 2025
641e49d
fix: fix nil pointer deref in dynamo controller (#2293) (#2299)
mohammedabdulwahhab Aug 5, 2025
1b145bb
fix: fix broken doc links (#2308)
biswapanda Aug 5, 2025
4e4818f
fix: Copy cuda libraries from devel to runtime stage (#2298)
nv-tusharma Aug 5, 2025
c92c1f4
docs: update deploy readme (#2306)
atchernych Aug 5, 2025
6fce98a
fix: Add common and test dependencies to sglang runtime build (#2279)…
nv-tusharma Aug 5, 2025
035d6d8
fix: Revert the commit for DeepGEMM to fix vLLM WideEP (#2302) (#2325)
krishung5 Aug 6, 2025
167c793
fix: Backport/anish index rst into 0.4.0 - fix links in docs and more…
athreesh Aug 6, 2025
409aa9e
docs: Final fixes to links reported by QA (#2334)
athreesh Aug 6, 2025
71126c7
fix: nil pointer deref in dynamo controller (#2335)
mohammedabdulwahhab Aug 6, 2025
f342c30
docs: address sphinx build errors for docs.nvidia.com (#2346)
athreesh Aug 7, 2025
96d1f15
docs: Address vincent issue with trtllm symlink (#2351)
athreesh Aug 7, 2025
e8b37a6
fix: ARM Flashinfer Versioning for 0.4.0 Release (#2363)
zaristei Aug 8, 2025
b5c9278
fix: Pinned PyTorch version for vLLM container (#2356)
krishung5 Aug 8, 2025
b0c1a24
chore: ATTRIBUTIONS-Go.md (#2355)
dmitry-tokarev-nv Aug 8, 2025
0cf8041
Revert "adjust tag to accomodate flashinfer versioning typo" (#2364)
zaristei Aug 8, 2025
bd8e368
fix: use wheel files for installation in trtllm build (#2372) (#2375)
nv-anants Aug 8, 2025
73bcc3b
fix(build): Pin cuda-python>=12,<13 to avoid trtllm breakage (#2379)
rmccorm4 Aug 8, 2025
aa57c6b
fix: turn off kvbm for al2023 support (#2533)
saturley-hall Aug 21, 2025
3f0a725
docs: add trtllm known issue for al2023 (#2604) (#2612)
nv-anants Aug 21, 2025
d98a791
docs: update trtllm know issue message (#2639) (#2643)
nv-anants Aug 22, 2025
37fca1c
fix: prevent crash looping hello world (#2625)
biswapanda Aug 22, 2025
fix: Kprashanth/trtllm rc4 cherry pick (#2218)
KrishnanPrash authored Jul 31, 2025
commit a9b6b28287d3833753f1fe3bc9bf38bb62eeb27a
5 changes: 4 additions & 1 deletion components/backends/trtllm/engine_configs/agg.yaml
@@ -28,4 +28,7 @@ kv_cache_config:
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
use_cuda_graph: true


cuda_graph_config:
  max_batch_size: 16
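For orientation, the same key migration repeats across all of the engine configs touched below: the flat pre-rc4 fields are replaced by nested sections in the TensorRT-LLM 1.0.0rc4 style. A rough before/after sketch, assembled only from keys and values that appear in these hunks:

# Before (flat keys removed by this commit):
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
kv_cache_dtype: fp8

# After (nested sections added by this commit):
cuda_graph_config:
  enable_padding: true   # the simple agg/decode configs instead set max_batch_size: 16
  batch_sizes:
    - 1
    - 2
kv_cache_config:
  dtype: fp8             # dtype moves under kv_cache_config

Whether additional fields or defaults changed in rc4 is not visible in this diff, so treat the sketch as a naming map rather than a complete config.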
9 changes: 7 additions & 2 deletions components/backends/trtllm/engine_configs/decode.yaml
@@ -16,11 +16,16 @@ tensor_parallel_size: 1
moe_expert_parallel_size: 1
enable_attention_dp: false
max_num_tokens: 8192
max_batch_size: 16
trust_remote_code: true
backend: pytorch
enable_chunked_prefill: true
disable_overlap_scheduler: false
use_cuda_graph: true

cuda_graph_config:
  max_batch_size: 16

kv_cache_config:
  free_gpu_memory_fraction: 0.95

cache_transceiver_config:
  backend: default
@@ -28,23 +28,24 @@ max_num_tokens: 8448
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.30
  dtype: fp8

# Enable the MTP(Multi-Token Prediction) in the model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1

use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256

print_iter_log: true
kv_cache_dtype: fp8
@@ -31,23 +31,24 @@ max_num_tokens: 512
max_seq_len: 8704
kv_cache_config:
  free_gpu_memory_fraction: 0.85
  dtype: fp8

# Enable the MTP(Multi-Token Prediction) in decode model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1

use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
print_iter_log: true
kv_cache_dtype: fp8
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256

print_iter_log: true
@@ -27,8 +27,9 @@ max_num_tokens: 8192
max_seq_len: 8192
kv_cache_config:
  free_gpu_memory_fraction: 0.75
  dtype: fp8

print_iter_log: true
kv_cache_dtype: fp8
disable_overlap_scheduler: true

# Enable the MTP(Multi-Token Prediction) in the prefill model engine
components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml
@@ -31,24 +31,26 @@ kv_cache_config:
  # With dp attention enabled: large ISL at high concurrency may need
  # free_gpu_memory_fraction low to have enough available memory.
  # free_gpu_memory_fraction: 0.30
  dtype: fp8


# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_config:
  enable_padding: true
# NOTE: For larger max batch size, you may want to add larger cuda graph
# batch sizes below to match.
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256

Comment on lines +41 to +55
💡 Verification agent

🧩 Analysis chain

cuda_graph_config key naming is inconsistent with other configs.

This file uses enable_padding, while other configs in this PR (e.g., llama4/eagle_one_model/eagle_agg.yml) use padding_enabled. Mismatched keys can be silently ignored by parsers, leading to runtime behavior differences.

Apply this diff to align naming, and verify with a quick repo-wide check:

-cuda_graph_config:
-  enable_padding: true
+cuda_graph_config:
+  padding_enabled: true
   # NOTE: For larger max batch size, you may want to add larger cuda graph
   # batch sizes below to match.
   batch_sizes:
     - 1
     - 2
     - 4
     - 8
     - 16
     - 32
     - 64
     - 128
     - 256

Then run:


🏁 Script executed:

#!/bin/bash
# Check for both variants to normalize across repo
rg -nP 'cuda_graph_config:\s*(\n\s+.*)*?(padding_enabled|enable_padding)\s*:' -C2

Length of output: 83


🏁 Script executed:

#!/bin/bash
# Search for occurrences of both padding variants across YAML configs
rg -n --type-add 'yaml:*.yml,*.yaml' -t yaml 'enable_padding'
rg -n --type-add 'yaml:*.yml,*.yaml' -t yaml 'padding_enabled'

Length of output: 1147


Unify cuda_graph_config padding key across all configs

All YAML engine configs must use the same boolean key (padding_enabled) to avoid silent parser mismatches. Please rename every occurrence of enable_padding to padding_enabled in the following files:

  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml
  • components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml

Example diff (e.g., for simple/agg.yaml):

 cuda_graph_config:
-  enable_padding: true
+  padding_enabled: true
   # NOTE: For larger max batch size, you may want to add larger cuda graph
   # batch sizes below to match.
   batch_sizes:

After applying these changes, please verify no enable_padding occurrences remain:

rg -n --type-add 'yaml:*.yml,*.yaml' -t yaml 'enable_padding'
🤖 Prompt for AI Agents
components/backends/trtllm/engine_configs/deepseek_r1/simple/agg.yaml lines
41-55: the yaml key "enable_padding" under cuda_graph_config must be renamed to
the canonical boolean key "padding_enabled"; update this file and the other
listed engine config files
(components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_decode.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_agg.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/simple/decode.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_agg.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/dep16_agg.yaml,
components/backends/trtllm/engine_configs/deepseek_r1/mtp/mtp_decode.yaml by
replacing every occurrence of "enable_padding:" with "padding_enabled:"
preserving the boolean value and indentation; after changes, run a recursive
search (e.g., rg -n --type-add 'yaml:*.yml,*.yaml' -t yaml 'enable_padding') to
confirm no occurrences remain.

print_iter_log: true
kv_cache_dtype: fp8
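To make the review comment above concrete: if the TensorRT-LLM version pinned by this release recognizes only one of the two padding spellings, the other is treated as an unknown key and can be dropped without an error, so padding quietly stays at its default. A minimal illustration (which spelling is canonical depends on the pinned TensorRT-LLM version; both names are shown only because both appear in this PR):

cuda_graph_config:
  enable_padding: true      # spelling used in this file
  # padding_enabled: true   # spelling the reviewer reports in other configs in this PR
  batch_sizes:
    - 1
    - 2

Either way, settling on a single spelling across the engine configs (and verifying with the rg check above) avoids the silent-mismatch failure mode the reviewer describes.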
@@ -31,25 +31,27 @@ kv_cache_config:
  # With dp attention enabled: large ISL at high concurrency may need
  # free_gpu_memory_fraction low to have enough available memory.
  # free_gpu_memory_fraction: 0.30
  dtype: fp8

# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: false
use_cuda_graph: true
cuda_graph_padding_enabled: true
# NOTE: For larger max batch size, you may want to add larger cuda graph
# batch sizes below to match.
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256

cuda_graph_config:
  enable_padding: true
  # NOTE: For larger max batch size, you may want to
  # add larger cuda graph batch sizes below to match.
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256

print_iter_log: true
kv_cache_dtype: fp8
@@ -26,12 +26,11 @@ max_seq_len: 8192

kv_cache_config:
  free_gpu_memory_fraction: 0.75
  dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs

# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: true
print_iter_log: true
# NOTE: This dtype must match in both prefill/decode configs
kv_cache_dtype: fp8
print_iter_log: true
@@ -10,18 +10,20 @@ enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448

kv_cache_config:
  free_gpu_memory_fraction: 0.7
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
kv_cache_dtype: fp8
dtype: fp8

cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
@@ -3,33 +3,37 @@
backend: pytorch

# WideEP related settings
moe_backend: WideEP
# moe_max_num_tokens will default to max_num_tokens if left unspecified.
#
# If you want to set this value explicitly, one recommendation is below:
# moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
# 4096 = 256 * 16
# moe_max_num_tokens: 4096
moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
moe_config:
  backend: WIDEEP
  # moe_max_num_tokens will default to max_num_tokens if left unspecified.
  #
  # If you want to set this value explicitly, one recommendation is below:
  # moe_max_num_tokens = max_batch_size * moe_expert_parallel_size
  # 4096 = 256 * 16
  # moe_max_num_tokens: 4096
  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml

tensor_parallel_size: 16
moe_expert_parallel_size: 16

enable_attention_dp: true
max_batch_size: 256
max_num_tokens: 256
max_seq_len: 8448

kv_cache_config:
  free_gpu_memory_fraction: 0.7
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
kv_cache_dtype: fp8
free_gpu_memory_fraction: 0.3
dtype: fp8

cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
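The WideEP hunks apply the same consolidation to the MoE settings: the flat moe_* keys become a nested moe_config section. A sketch using only the keys and path visible in the hunk above:

# Before:
moe_backend: WideEP
moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml

# After:
moe_config:
  backend: WIDEEP
  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml

Note the backend value is also upper-cased from WideEP to WIDEEP in the new form, as shown in the diff.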
@@ -15,8 +15,9 @@
backend: pytorch

# WideEP related settings
moe_backend: WideEP
moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
moe_config:
  backend: WIDEEP
  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml

# TP/EP/PP/DP
tensor_parallel_size: 16
@@ -35,25 +36,28 @@ kv_cache_config:
  # With dp attention enabled: large ISL at high concurrency may need
  # free_gpu_memory_fraction low to have enough available memory.
  free_gpu_memory_fraction: 0.30
  dtype: fp8


# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: false
use_cuda_graph: true
cuda_graph_padding_enabled: true
# NOTE: For larger max batch size, you may want to add larger cuda graph
# batch sizes below to match.
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
cuda_graph_config:
  enable_padding: true
  # NOTE: For larger max batch size, you may want to
  # add larger cuda graph batch sizes below to match.
  batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256


print_iter_log: true
kv_cache_dtype: fp8
@@ -15,8 +15,9 @@
backend: pytorch

# WideEP related settings
moe_backend: WideEP
moe_load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml
moe_config:
  backend: WIDEEP
  load_balancer: /mnt/engine_configs/deepseek_r1/wide_ep/eplb.yaml

# TP/EP/PP/DP
tensor_parallel_size: 16
@@ -29,13 +30,12 @@ max_num_tokens: 8192
max_seq_len: 8192

kv_cache_config:
  free_gpu_memory_fraction: 0.75
  free_gpu_memory_fraction: 0.3
  dtype: fp8 # NOTE: This dtype must match in both prefill/decode configs

# NOTE: pytorch_backend_config section flattened since: https://github.com/NVIDIA/TensorRT-LLM/pull/4603
# NOTE: overlap_scheduler enabled by default since this commit and changed
# config field from 'enable_overlap_scheduler' to 'disable_overlap_scheduler':
# https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0ee0024eda3eeb83a6ba822245a30ab428
disable_overlap_scheduler: true
print_iter_log: true
# NOTE: This dtype must match in both prefill/decode configs
kv_cache_dtype: fp8
print_iter_log: true