File: components/backends/trtllm/performance_sweeps/README.md (new file, +148 lines)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# TensorRT-LLM Benchmark Scripts for the DeepSeek R1 Model

This directory contains scripts for benchmarking TensorRT-LLM performance with Dynamo using the SLURM job scheduler.

## ⚠️ DISCLAIMER
**These scripts are currently not QA'ed and are provided for demonstration purposes only.**

Please note that:

- These scripts have not undergone formal quality assurance testing
- They were executed on GB200 systems
- They are intended for demonstration and educational purposes
- Use at your own risk in production environments
- Always review and test scripts thoroughly before running in your specific environment
- We are actively working on refining the configuration sweeps.

## Scripts Overview

### Core Scripts

1. `submit.sh` - Main entry point for submitting benchmark jobs for disaggregated configurations, including the WideEP optimization for DEP>=16.
2. `submit_agg.sh` - Main entry point for submitting benchmark jobs for aggregated configurations.
3. `post_process.py` - Scans the genai-perf results and produces a JSON file with an entry for each configuration point.
4. `plot_performance_comparison.py` - Takes the JSON result files for the disaggregated and/or aggregated configuration sweeps and plots a Pareto frontier for easier visualization.

For finer-grained details on how to launch TensorRT-LLM backend workers with DeepSeek R1 on a GB200 SLURM cluster, please refer to [multinode-examples.md](../multinode/multinode-examples.md). This guide shares similar assumptions with the multinode examples guide.

## Usage

### Prerequisites

Before running the scripts, ensure you have:
1. Access to a SLURM cluster
2. Container image of Dynamo with TensorRT-LLM built using instructions from [here](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker).
3. Model files accessible on the cluster
4. Required environment variables set
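
As a quick, optional sanity check of these prerequisites, you can confirm SLURM access and that the model weights are reachable from the login node (the model path below is a placeholder):

```bash
# Check the partitions visible to you and your SLURM account(s)
sinfo -s
sacctmgr -nP show assoc where user=$(whoami) format=account

# Confirm the pre-downloaded model weights are accessible (placeholder path)
ls "<path_to_model_weights>"
```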

### Setup

On the login node of the cluster, set the following variables:

```bash
# Set partition manually based on your slurm cluster's partition names
export SLURM_PARTITION=""

# Set account manually if this command doesn't work on your cluster
export SLURM_ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"

# Set a job name for your benchmarking runs
export SLURM_JOB_NAME=""

# NOTE: IMAGE must be set manually for now
# To build an image, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/components/backends/trtllm#build-docker
export IMAGE="<dynamo_trtllm_image>"

# NOTE: In general, DeepSeek R1 is very large, so it is recommended to
# pre-download the model weights to a shared location (NFS storage,
# HF_CACHE, etc.) and point `MODEL_PATH` below at the pre-downloaded
# weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="<path_to_model_weights>"

# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it is useful to set the served name explicitly.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```
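
Optionally, you can verify the variables are populated before queueing any jobs; a minimal check could look like this:

```bash
# All of these should print non-empty values; ls should list the model files
echo "${SLURM_PARTITION:?SLURM_PARTITION is not set}"
echo "${SLURM_ACCOUNT:?SLURM_ACCOUNT is not set}"
echo "${SLURM_JOB_NAME:?SLURM_JOB_NAME is not set}"
echo "${IMAGE:?IMAGE is not set}"
ls "${MODEL_PATH}" | head
```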

## Launching benchmarking sweeps for different configurations

### Aggregated

```bash
# Queues the SLURM jobs for aggregated configurations for DeepSeek R1.
./submit_agg.sh
```

### Disaggregated (Includes WideEP) - MTP off

```bash
# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 without MTP
./submit.sh mtp0 all
```

### Disaggregated (Includes WideEP) - MTP on

```bash
# Queues the SLURM jobs for disaggregated configurations for DeepSeek R1 with MTP
./submit.sh mtp all
```
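
The submit scripts queue SLURM jobs for the selected configurations; you can track their progress with standard SLURM tooling, for example:

```bash
# Show your queued and running benchmark jobs
squeue -u "$(whoami)" -o "%.12i %.30j %.8T %.10M %.6D %R"

# Cancel a specific job if needed
# scancel <job_id>
```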

## Post-Processing Results

The above jobs use the genai-perf tool to benchmark each configuration point across different concurrency values. The results are stored in `dynamo_disagg-bm-{ISL}-{OSL}/<config-setup>/genai_perf_artifacts` and `dynamo_agg-bm-{ISL}-{OSL}/<config-setup>/genai_perf_artifacts` for the disaggregated and aggregated sweeps, respectively.

After your benchmarking jobs have completed, you can use the `post_process.py` script to aggregate and summarize the results from the generated genai_perf_artifacts.

To run the post-processing script, use:

### Aggregated

```bash
python3 post_process.py dynamo_agg-bm-8150-1024 --output-file agg_result.json
```

### Disaggregated

```bash
python3 post_process.py dynamo_disagg-bm-8150-1024 --output-file disagg_result.json
```
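
The exact schema of the summary files is determined by `post_process.py`, but you can pretty-print them to inspect the per-configuration entries before plotting (the path here assumes the layout used in the plotting example below):

```bash
# Pretty-print the first part of the summarized results
python3 -m json.tool dynamo_disagg-bm-8150-1024/disagg_result.json | head -n 40
```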

## Plotting Performance

You can now use `plot_performance_comparison.py` as shown below to visualize the performance:

```bash
python3 plot_performance_comparison.py dynamo_agg-bm-8150-1024/agg_result.json dynamo_disagg-bm-8150-1024/disagg_result.json -o performance_plot.png
```

This script produces a scatter plot of all the configuration points at each concurrency, showing Output Throughput per GPU versus Output Throughput per User. It also includes the Pareto frontier for both the aggregated and disaggregated setups.

Refer to [Beyond the Buzz: A Pragmatic Take on Inference Disaggregation](https://arxiv.org/html/2506.05508v1) to learn how to interpret these plots.
File: components/backends/trtllm/performance_sweeps/benchmark.slurm (new file, +212 lines)
#!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# Get partition, account, and job name from command line arguments
SLURM_PARTITION=$1
SLURM_ACCOUNT=$2
SLURM_JOB_NAME=$3

# Shift arguments so the rest of the script gets the correct parameters
shift 3

#SBATCH --partition=${SLURM_PARTITION}
#SBATCH --account=${SLURM_ACCOUNT}
#SBATCH --time=04:00:00
#SBATCH --job-name="${SLURM_JOB_NAME}:disaggregated"

if [ -z "$SLURM_PARTITION" ] || [ -z "$SLURM_ACCOUNT" ] || [ -z "$SLURM_JOB_NAME" ]; then
echo "Error: Required parameters not provided:"
echo " SLURM_PARTITION: $SLURM_PARTITION"
echo " SLURM_ACCOUNT: $SLURM_ACCOUNT"
echo " SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "Usage: $0 <partition> <account> <job_name> [other_args...]"
exit 1
fi

MULTI_ROUND="${MULTI_ROUND:-8}"

# set MOUNT_DIR
MOUNT_DIR="${MOUNT_DIR:-${PWD}}"
CONTAINER_NAME=disaggr-test


STREAMING=true
CTX_GPU_FRAC=0.75
CACHE_TRANSCEIVER_MAX_NUM_TOKENS=8448

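# Positional arguments (typically provided by the submit script): context/generation
# server counts and parallelism sizes, batch and token limits, attention-DP flags,
# GPU memory fraction, EPLB slots, MTP size, concurrency list, generation node count,
# and benchmark metadata (kind, model path, served name, image, ISL, OSL).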
num_ctx_servers=$1
ctx_tp_size=$2
ctx_batch_size=$3
ctx_max_num_tokens=$4
ctx_enable_attention_dp=$5
num_gen_servers=$6
gen_tp_size=$7
gen_batch_size=$8
gen_max_num_tokens=$9
gen_enable_attention_dp=${10}
gen_gpu_memory_fraction=${11}
eplb_num_slots=${12}
mtp_size=${13}
concurrency_list=${14}
gen_nodes=${15}
kind=${16}
model_path=${17}
served_model_name=${18}
image=${19}
isl=${20}
osl=${21}

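# Max sequence lengths: context covers the input length plus a small headroom;
# generation covers input + output plus the same headroom.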
ctx_max_seq_len=$((${isl} + 203))
gen_max_seq_len=$((${isl} + ${osl} + 203))

WORK_DIR=${MOUNT_DIR}
LOG_DIR=$WORK_DIR/${kind}-bm-${isl}-${osl}
SCRIPTS_DIR=${WORK_DIR}/
set_clock_cmd="bash ${SCRIPTS_DIR}/set_clock.sh"
mkdir -p ${LOG_DIR}
echo "trying to submit job"

sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_dep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}

echo "concurrency_list: ${concurrency_list}"

ctx_gpus=$((num_ctx_servers * ctx_tp_size))
gen_gpus=$((num_gen_servers * gen_tp_size))

echo "enable_attention_dp: ${ctx_enable_attention_dp}, ${gen_enable_attention_dp}, gpu_memory_fraction: ${gen_gpu_memory_fraction}"

enable_pdl=false
if [ "${gen_enable_attention_dp}" = "false" ]; then
enable_pdl=true
echo "enable_pdl: ${enable_pdl}"
sub_dir=${LOG_DIR}/ctx${num_ctx_servers}_gen${num_gen_servers}_tep${gen_tp_size}_batch${gen_batch_size}_eplb${eplb_num_slots}_mtp${mtp_size}
fi

full_logdir=${sub_dir}
artifacts_dir=${full_logdir}/genai_perf_artifacts
mkdir -p ${artifacts_dir}


# Set clock
srun ${set_clock_cmd}

container_mounts=${MOUNT_DIR}:${MOUNT_DIR},${model_path}:${model_path}

# start the container
srun -l --container-image=${image} \
--container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix \
echo "Container up."

# generate the yaml file
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix --overlap \
-n 1 -N 1 \
python3 ${SCRIPTS_DIR}/scripts/gen_yaml.py --config ${full_logdir}/config.yaml \
--model ${model_path} \
--num_ctx_servers ${num_ctx_servers} \
--ctx_tp_size ${ctx_tp_size} \
--ctx_batch_size ${ctx_batch_size} \
--ctx_max_num_tokens ${ctx_max_num_tokens} \
--ctx_max_seq_len ${ctx_max_seq_len} \
--ctx_free_gpu_memory_fraction ${CTX_GPU_FRAC} \
--cache_transceiver_max_num_tokens ${CACHE_TRANSCEIVER_MAX_NUM_TOKENS} \
--num_gen_servers ${num_gen_servers} \
--gen_tp_size ${gen_tp_size} \
--gen_batch_size ${gen_batch_size} \
--gen_max_num_tokens ${gen_max_num_tokens} \
--gen_max_seq_len ${gen_max_seq_len} \
--gen_gpu_memory_fraction ${gen_gpu_memory_fraction} \
--eplb_num_slots ${eplb_num_slots} \
$(if [ "${gen_enable_attention_dp}" = "true" ]; then echo "--gen_enable_attention_dp"; fi) \
$(if [ "${ctx_enable_attention_dp}" = "true" ]; then echo "--ctx_enable_attention_dp"; fi) \
$(if [ "${mtp_size}" -gt 0 ]; then echo "--mtp_size ${mtp_size}"; fi)

echo "YAML file generated."

nsys_on=""
# nsys_on=${full_logdir}

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))

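# The batch script runs on the first allocated node, so its IP doubles as the
# head-node address used for the etcd and NATS endpoints below.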
export HEAD_NODE="${nodes[0]}"
export HEAD_NODE_IP="$(hostname -i)"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"

# start the server
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix --overlap -N 1 -n 1 \
--oversubscribe \
--overlap \
--container-env ETCD_ENDPOINTS,NATS_SERVER,HEAD_NODE_IP,HEAD_NODE \
-w ${nodes[0]} \
bash ${SCRIPTS_DIR}/scripts/start_frontend.sh &> ${full_logdir}/output_server.log &

# wait for the server to start
sleep 10

PREFILL_COUNT=$(grep 'prefill_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
echo "Prefill Count: $PREFILL_COUNT"

# start the prefill workers
prefill_pids=()
for ((i=1; i<=PREFILL_COUNT; i++)); do
echo "Running Prefill Worker: ${i}"
node_idx=$((i-1))
echo "Running Prefill Nodes: ${nodes[node_idx]}"
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix --overlap -w ${nodes[node_idx]} \
--oversubscribe \
--overlap \
--ntasks 4 \
--nodes 1 \
bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/prefill_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'prefill' &> ${full_logdir}/output_workers.log &
prefill_pids+=($!)
done

DECODE_COUNT=$(grep 'decode_count:' "${full_logdir}/instance_config.yaml" | awk '{print $2}')
echo "Decode Count: $DECODE_COUNT"

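# Each decode server spans gen_nodes / num_gen_servers nodes; decode workers are
# placed on consecutive blocks of nodes immediately after the prefill nodes.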
num_gen_nodes=$((gen_nodes/num_gen_servers))
decode_start_idx=$PREFILL_COUNT
for ((i=1; i<=DECODE_COUNT; i++)); do
echo "Running Decode Worker: ${i}"
decode_node_list=()
for ((j=0; j<num_gen_nodes; j++)); do
node_idx=$((decode_start_idx + (i-1)*num_gen_nodes + j))
decode_node_list+=("${nodes[node_idx]}")
done
decode_nodes_csv=$(IFS=, ; echo "${decode_node_list[*]}")
echo "Running Decode Nodes: ${decode_nodes_csv}"
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix \
-w ${decode_nodes_csv} \
--nodes ${num_gen_nodes} \
--ntasks $gen_tp_size \
--oversubscribe \
--overlap \
bash ${SCRIPTS_DIR}/scripts/start_worker.sh ${full_logdir}/decode_config.yaml "${enable_pdl}" ${ctx_gpus} ${nsys_on} ${served_model_name} ${model_path} 'decode' &> ${full_logdir}/output_workers.log &
done

total_gpus=$((ctx_gpus + gen_gpus))

# start the loadgen
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts},${artifacts_dir}:${artifacts_dir} \
--mpi=pmix --overlap -N 1 -n 1 \
-w ${nodes[0]} \
bash ${SCRIPTS_DIR}/scripts/bench.sh ${served_model_name} ${MULTI_ROUND} ${num_gen_servers} "${concurrency_list}" ${STREAMING} ${full_logdir} ${total_gpus} ${artifacts_dir} ${model_path} ${isl} ${osl} ${kind} > ${full_logdir}/bench.log 2>&1

# try to kill the server and workers
srun -l --container-name=${CONTAINER_NAME} \
--container-mounts=${container_mounts} \
--mpi=pmix --overlap \
kill -9 $(ps aux | grep '[p]ython3' | awk '{print $2}') >/dev/null 2>&1 || true
wait