10 changes: 10 additions & 0 deletions examples/tensorrt_llm/README.md
@@ -118,6 +118,16 @@ cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
```

#### Aggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg_router:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
```
Notes:
- The first two inference requests have noticeably higher latency. Send warm-up requests before starting a benchmark.
- Keep `cuda_graph_padding_enabled` set to `false` in the model engine's configuration. There is a known bug, and the fix will be included in the next release of TensorRT-LLM.
- Disaggregated serving support for MTP in Dynamo + TensorRT-LLM is coming soon.
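Since the notes above recommend warm-up requests, here is a minimal warm-up sketch. It assumes the Frontend exposes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000 (the port set in `mtp_agg.yaml`); the endpoint path and request body are illustrative assumptions, not part of this PR.

```bash
#!/bin/sh
# Hypothetical warm-up loop: send two requests before benchmarking,
# because the first two inference requests have higher latency.
URL="http://localhost:8000/v1/chat/completions"
BODY='{"model": "nvidia/DeepSeek-R1-FP4", "messages": [{"role": "user", "content": "warm-up"}], "max_tokens": 8}'
sent=0
for i in 1 2; do
  # Ignore errors so the loop completes even while the server is still loading.
  curl -s -o /dev/null -H 'Content-Type: application/json' -d "$BODY" "$URL" || true
  sent=$((sent + 1))
done
echo "sent $sent warm-up requests"
```

Responses to the warm-up requests can be discarded; only issuing them matters.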

#### Multi-Node Disaggregated Serving

In the following example, we will demonstrate how to run a Disaggregated Serving
29 changes: 29 additions & 0 deletions examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_agg.yaml
@@ -0,0 +1,29 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Frontend:
  served_model_name: "nvidia/DeepSeek-R1-FP4"
  endpoint: dynamo.TensorRTLLMWorker.generate
  port: 8000
  router: round-robin

TensorRTLLMWorker:
  served_model_name: "nvidia/DeepSeek-R1-FP4"
  engine_args: "configs/deepseek_r1/mtp/mtp_agg_llm_api_config.yaml"
  router: round-robin
  ServiceArgs:
    workers: 1
    resources:
      gpu: 4
54 changes: 54 additions & 0 deletions examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_agg_llm_api_config.yaml
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# NOTE: FP4 is only supported on Blackwell and newer GPUs.
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
# You can also specify the full path to locally downloaded weights
# instead of a HuggingFace ID here.

model_name: "nvidia/DeepSeek-R1-FP4"
backend: pytorch
tensor_parallel_size: 4
moe_expert_parallel_size: 4
enable_attention_dp: true
max_batch_size: 256
# 8448 = 8192 ISL + 256 OSL
max_num_tokens: 8448
max_seq_len: 8448
kv_cache_config:
  free_gpu_memory_fraction: 0.30

# Enable MTP (Multi-Token Prediction) in the model engine
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 1

use_cuda_graph: true
# Keep cuda_graph_padding_enabled set to 'false' when MTP is turned on.
# There is a known bug when MTP and cuda_graph_padding_enabled are combined;
# the TensorRT-LLM team is working on a fix for the next release.
cuda_graph_padding_enabled: false
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
- 64
- 128
- 256
print_iter_log: true
kv_cache_dtype: fp8
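The comments above pin down one constraint: with MTP decoding enabled, `cuda_graph_padding_enabled` must stay `false`. A self-contained grep-based sanity check can be sketched as follows; the temporary file and check logic are illustrative assumptions, not part of the Dynamo or TensorRT-LLM tooling.

```bash
#!/bin/sh
# Write a minimal config fragment, then verify the MTP /
# cuda_graph_padding_enabled constraint holds in it.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
speculative_config:
  decoding_type: MTP
cuda_graph_padding_enabled: false
EOF
if grep -q 'decoding_type: MTP' "$cfg" && \
   grep -q 'cuda_graph_padding_enabled: true' "$cfg"; then
  status="ERROR: disable cuda_graph_padding_enabled when MTP is on"
else
  status="config OK"
fi
echo "$status"
rm -f "$cfg"
```

Running such a check against the real engine config before launch would catch the misconfiguration early, instead of hitting the known bug at serving time.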