Commit 7add6eb: resolve comments
1 parent 7c803d1

File tree

6 files changed (+22 additions, -7 deletions)

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml

Lines changed: 0 additions & 1 deletion

@@ -19,7 +19,6 @@ moe_expert_parallel_size: 8
 max_batch_size: 8
 max_num_tokens: 4096
 disable_overlap_scheduler: true # disable_overlap_scheduler is having acc issue on both aggregated and disaggregated serving
-enable_autotuner: false

 # Enable Speculative Decoding in the model engine
 speculative_config:

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml

Lines changed: 0 additions & 1 deletion

@@ -21,7 +21,6 @@ max_num_tokens: 1024
 # 8704 = 8192 ISL + 512 OSL
 max_seq_len: 8704
 disable_overlap_scheduler: true
-enable_autotuner: false

 # Enable Speculative Decoding in the model engine
 speculative_config:

components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_prefill.yaml

Lines changed: 0 additions & 1 deletion

@@ -21,7 +21,6 @@ max_num_tokens: 8192
 max_seq_len: 8192
 print_iter_log: true
 disable_overlap_scheduler: true
-enable_autotuner: false

 # Enable Speculative Decoding in the model engine
 speculative_config:

components/backends/trtllm/gemma3_sliding_window_attention.md

Lines changed: 3 additions & 3 deletions

@@ -21,11 +21,11 @@ This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding
 VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.

 ## Notes
-* To run Gemma 3 with VSWA, ensure that the container has TensorRT-LLM v1.0.0rc4 installed.
-* To run Gemma 3 with VSWA and KV Routing, ensure that the container is built with the default experimental TRT-LLM commit.
+* To run Gemma 3 with VSWA and KV Routing with KV block reuse, ensure that the container is built using commit ID `c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78` from Tensorrt-LLM.
 ```bash
-./container/build.sh --framework TENSORRTLLM --use-default-experimental-tensorrtllm-commit
+./container/build.sh --framework TENSORRTLLM --tensorrtllm-commit c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78
 ```
+* The 1.0.0rc4 release version of TensorRT-LLM can also run Gemma 3 with VSWA, but KV block reuse cannot be turned on in that version.

 ### Aggregated Serving
 ```bash

components/backends/trtllm/src/dynamo/trtllm/publisher.py

Lines changed: 18 additions & 0 deletions

@@ -418,7 +418,25 @@ def update_max_window_size(self, event):
             f"kv events max_window_size has been updated to {self.max_window_size}"
         )

+    # The global attention layer will emit the KV event with the max_window_size.
+    # We only want to keep the KV event that has the max_window_size to ensure
+    # the accuracy of KV routing.
+    # TRTLLM emits a "created" event at the very beginning when it creates the KV cache,
+    # so we can use the "created" event to identify the max_window_size of the global
+    # attention layer in the model engine.
     def should_drop_event(self, event):
+        # There are two cases for KV event filtering:
+        #
+        # 1. If "window_size" is NOT in the KV event:
+        #    "window_size" was added to KV events only recently, so some older versions of TRTLLM
+        #    might not include it. In this case, the publisher will assume that all events are
+        #    from the global attention layer.
+        #
+        # 2. If "window_size" is present in the KV event:
+        #    The publisher will not drop any KV events until all initial "created" KV events
+        #    have been processed in order to capture the max_window_size.
+        #    After processing all "created" events, the publisher will only accept KV events
+        #    whose window_size is equal to the max_window_size to ensure accurate routing.
         if "window_size" not in event or self.processing_initial_created_events:
             return False

container/build.sh

Lines changed: 1 addition & 1 deletion

@@ -88,7 +88,7 @@ TENSORRTLLM_PIP_WHEEL_DIR="/tmp/trtllm_wheel/"
 # TensorRT-LLM commit to use for building the trtllm wheel if not provided.
 # Important Note: This commit is not used in our CI pipeline. See the CI
 # variables to learn how to run a pipeline with a specific commit.
-DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT="c9eebcb4541d961ab390f0bd0a22e2c89f1bcc78"
+DEFAULT_EXPERIMENTAL_TRTLLM_COMMIT="69e9f6d48944b2ae0124ff57aa59340aa4dfae15"
 TRTLLM_COMMIT=""
 TRTLLM_USE_NIXL_KVCACHE_EXPERIMENTAL="0"
 TRTLLM_GIT_URL=""
