2 changes: 1 addition & 1 deletion container/build.sh
@@ -88,7 +88,7 @@ TENSORRTLLM_PIP_WHEEL_DIR="/tmp/trtllm_wheel/"
# TensorRT-LLM commit to use for building the trtllm wheel if not provided.
# Important Note: This commit is not used in our CI pipeline. See the CI
# variables to learn how to run a pipeline with a specific commit.
TRTLLM_COMMIT="8cb6163a57226e69d8a85788eff542a440ed9c89"
TRTLLM_COMMIT="137fe35539ea182f1495f5021bfda97c729e50c3"

# TensorRT-LLM PyPI index URL
TENSORRTLLM_INDEX_URL="https://pypi.python.org/simple"
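If you build outside CI and want to sanity-check the new pin first, something like the following works — this assumes the wheel is built from the public NVIDIA/TensorRT-LLM GitHub repository:

```bash
# Confirm the pinned TensorRT-LLM commit exists upstream before starting a
# long wheel build (unauthenticated GitHub API calls are rate limited).
curl -fsS "https://api.github.com/repos/NVIDIA/TensorRT-LLM/commits/137fe35539ea182f1495f5021bfda97c729e50c3" \
  > /dev/null && echo "commit found upstream" || echo "commit not found upstream"
```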
44 changes: 40 additions & 4 deletions examples/tensorrt_llm/README.md
@@ -125,8 +125,6 @@ dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_agg.yaml
```
Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark (a minimal warm-up loop is sketched below).
-- Please keep the `cuda_graph_padding_enabled` setting as `false` in the model engine's configuration. There is a known bug, and the fix will be included in the next release of TensorRT-LLM.
-- MTP support for Disaggregation in Dynamo + TensorRT-LLM is coming soon.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used during benchmarking.

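For example, a minimal warm-up loop could look like the sketch below. It assumes the frontend exposes an OpenAI-compatible `/v1/chat/completions` route on port 8000; adjust the model name to match your `served_model_name`:

```bash
# Send a few short requests so engine warm-up (e.g. CUDA graph capture)
# does not distort the first benchmark measurements.
for i in 1 2 3; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "nvidia/DeepSeek-R1-FP4", "messages": [{"role": "user", "content": "warm-up"}], "max_tokens": 8}' \
    > /dev/null
done
```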
#### Multi-Node Disaggregated Serving
@@ -158,7 +156,7 @@ etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0
# helps to guarantee clean and reproducible results.
```

-Launch graph of Frontend, Processor, and TensorRTLLMWorker (decode) on head node:
+Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:

```bash
cd /workspace/examples/tensorrt_llm
@@ -191,7 +189,7 @@ export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```

Deploy a Prefill worker:
-```
+```bash
cd /workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f ./configs/disagg.yaml --service-name TensorRTLLMPrefillWorker &
```
@@ -224,6 +222,44 @@ Notes:
unset SLURM_JOBID SLURM_JOB_ID SLURM_NODELIST
```

#### Multi-Node Disaggregated Serving with Multi-Token Prediction (MTP) and DeepSeek R1

Most of the steps remain the same as in the example above, but this time we will have `dynamo serve` point to different config files that contain the MTP configuration.

##### Head Node

Start nats/etcd:
```bash
nats-server -js &
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --data-dir /tmp/etcd &
```
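Optionally, confirm both services are accepting connections before moving on — a minimal check assuming the default ports (4222 for NATS, 2379 for etcd) and that `nc` is available in the container:

```bash
# Block until NATS and etcd accept TCP connections on their default ports.
for port in 4222 2379; do
  until nc -z localhost "$port"; do sleep 1; done
  echo "port $port is up"
done
```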

Launch graph of Frontend and TensorRTLLMWorker (decode) on head node:

```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f configs/deepseek_r1/mtp/mtp_disagg.yaml &
```
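Loading DeepSeek R1 can take a while, so it can help to poll until the frontend responds — this assumes the HTTP frontend listens on port 8000 and serves the OpenAI-compatible `/v1/models` route:

```bash
# Poll until the frontend is ready to serve requests.
until curl -fsS http://localhost:8000/v1/models > /dev/null; do sleep 5; done
echo "frontend is ready"
```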

##### Worker Node(s)

Set environment variables pointing at the etcd/nats endpoints on the head node.
```bash
export HEAD_NODE_IP="<head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```
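A quick connectivity check from the worker node, assuming `etcdctl` is available in the container (plain `curl` against etcd's HTTP health endpoint works as a fallback):

```bash
# Verify the worker node can reach etcd on the head node before deploying.
etcdctl --endpoints="${ETCD_ENDPOINTS}" endpoint health \
  || curl -fsS "http://${ETCD_ENDPOINTS}/health"
```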

Deploy a Prefill worker:
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve components.prefill_worker:TensorRTLLMPrefillWorker -f configs/deepseek_r1/mtp/mtp_disagg.yaml --service-name TensorRTLLMPrefillWorker &
```
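Because the worker is launched in the background, a quick way to confirm it came up — assuming the service name appears in the process command line — is:

```bash
# Check the prefill worker process is alive and that GPUs are allocated.
pgrep -af TensorRTLLMPrefillWorker || echo "prefill worker not running"
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
```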

Notes:
- There is a noticeable latency for the first four inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which depends on the dataset or queries used during benchmarking (a rough way to reason about this is sketched below).

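As a rough mental model only (not an exact TensorRT-LLM metric): with `num_nextn_predict_layers: 1`, each decode step emits one verified token plus the draft token when it is accepted, so the expected tokens per step is roughly 1 + α, where α is the draft acceptance rate. The sketch below evaluates that with a hypothetical α = 0.7:

```bash
# Back-of-envelope MTP throughput estimate; alpha is a hypothetical acceptance rate.
awk 'BEGIN { alpha = 0.7; printf "expected tokens/step: %.2f (~%.2fx vs. no MTP)\n", 1 + alpha, 1 + alpha }'
```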
### Client

See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.
@@ -36,10 +36,7 @@ speculative_config:
  num_nextn_predict_layers: 1

use_cuda_graph: true
-# Please keep cuda_graph_padding_enabled setting as 'false' when MTP is turned on.
-# There is known bug with MTP and cuda_graph_padding_enabled.
-# Tensorrt LLM team is working on a fix in the next release.
-cuda_graph_padding_enabled: false
+cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
41 changes: 41 additions & 0 deletions examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_disagg.yaml
@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Frontend:
  served_model_name: "nvidia/DeepSeek-R1-FP4"
  endpoint: dynamo.TensorRTLLMWorker.generate
  port: 8000
  router: round-robin

TensorRTLLMWorker:
  served_model_name: "nvidia/DeepSeek-R1-FP4"
  engine_args: "configs/deepseek_r1/agg_llm_api_config.yaml"
  llmapi-disaggregated-config: "configs/deepseek_r1/mtp/mtp_disagg_llm_api_config.yaml"
  router: round-robin
  remote-prefill: true
  min-prefill-workers: 1
  ServiceArgs:
    workers: 1
    resources:
      gpu: 4

TensorRTLLMPrefillWorker:
  engine_args: "configs/deepseek_r1/agg_llm_api_config.yaml"
  llmapi-disaggregated-config: "configs/deepseek_r1/mtp/mtp_disagg_llm_api_config.yaml"
  router: round-robin
  ServiceArgs:
    workers: 1
    resources:
      gpu: 4
72 changes: 72 additions & 0 deletions examples/tensorrt_llm/configs/deepseek_r1/mtp/mtp_disagg_llm_api_config.yaml
@@ -0,0 +1,72 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# NOTE: FP4 is only supported starting with Blackwell GPUs.
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
# You can also specify the full path to locally downloaded weights
# instead of a HuggingFace ID here.

backend: pytorch

context_servers:
  num_instances: 1
  tensor_parallel_size: 4
  moe_expert_parallel_size: 4
  enable_attention_dp: true
  max_batch_size: 1
  max_num_tokens: 8192
  max_seq_len: 8192
  kv_cache_config:
    free_gpu_memory_fraction: 0.75
  print_iter_log: true
  kv_cache_dtype: fp8
  disable_overlap_scheduler: true
  # Enable MTP (Multi-Token Prediction) in the prefill model engine
  speculative_config:
    decoding_type: MTP
    num_nextn_predict_layers: 1

generation_servers:
  num_instances: 1
  tensor_parallel_size: 4
  moe_expert_parallel_size: 4
  enable_attention_dp: false
  max_batch_size: 256
  # Note: when MTP is enabled and `cuda_graph_batch_sizes` is specified, `max_num_tokens` must satisfy the following formula:
  # max_num_tokens >= max(cuda_graph_batch_sizes) * (num_nextn_predict_layers + 1)
  # This is a known issue in TensorRT-LLM and will be resolved in the next release.
  max_num_tokens: 512
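  # Worked check with the values in this file: max(cuda_graph_batch_sizes) = 256
  # and num_nextn_predict_layers = 1, so the bound is 256 * (1 + 1) = 512,
  # which the max_num_tokens setting above meets exactly.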
  # 8704 = 8192 ISL + 512 OSL
  max_seq_len: 8704
  kv_cache_config:
    free_gpu_memory_fraction: 0.85
  # Enable MTP (Multi-Token Prediction) in the decode model engine
  speculative_config:
    decoding_type: MTP
    num_nextn_predict_layers: 1
  use_cuda_graph: true
  cuda_graph_padding_enabled: true
  cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
  print_iter_log: true
  kv_cache_dtype: fp8