174 changes: 174 additions & 0 deletions examples/multimodal_disaggregated/README.md
@@ -0,0 +1,174 @@
# Multimodal Disaggregated Serving (Experimental)

This example demonstrates how to set up disaggregated multimodal serving with TensorRT-LLM, where the vision encoder and language model decoder run as separate services for improved scalability and resource utilization.

## ⚠️ Disclaimer

**This is a Proof-of-Concept (POC) and early demonstration with several limitations:**
1. **Model Support**: Limited to LLaVA-Next models only
2. **Modality Support**: Image modality only (no video support yet)
3. **Server Configuration**: Only supports 1 encoder server and 1 LLM server (though the LLM server can have multiple workers via tensor parallelism)

## Overview

Disaggregated multimodal serving separates the multimodal pipeline into distinct components:

- **Encoder Server**: Runs the multimodal encoder, handling image pre-processing and vision encoding
- **LLM Decoder Server**: Processes text generation using the language model
- **Disaggregated Server**: Orchestrates requests between encoder and decoder services

This architecture enables better resource utilization and scalability by allowing independent scaling of vision and language processing components.

## Setup Instructions

### Step 1: Prepare Configuration Files

Create the required configuration files in your working directory:

#### LLM API Configuration (`extra-llm-api-config.yml`)
```bash
# Note: Current multimodal implementation does not support KV cache reuse,
# so we disable it for all cases
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
EOF
```
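
The same file can carry other LLM API options. For example, if you also want to cap how much free GPU memory the decoder's KV cache pool may claim, a `free_gpu_memory_fraction` field can be added; treat this as an optional sketch and verify the field is supported by your TensorRT-LLM version before relying on it.

```bash
# Optional variant: disable block reuse and also limit the KV cache memory pool.
# free_gpu_memory_fraction is assumed to be available in your TensorRT-LLM build.
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
EOF
```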

#### Disaggregated Server Configuration (`disagg_config.yaml`)
```bash
cat > ./disagg_config.yaml << EOF
hostname: localhost
port: 8000
backend: pytorch
multimodal_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF
```
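
Before launching the services, it can be worth confirming the files were written and that the configured ports are free; a minimal check, assuming `ss` from iproute2 is available (`lsof -i` or `netstat` work similarly):

```bash
# Sanity check: both config files exist and ports 8000-8002 are not already taken
ls -l extra-llm-api-config.yml disagg_config.yaml
ss -ltn | grep -E ':800[012]' && echo "WARNING: one of the configured ports is already in use"
```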

### Step 2: Start the Encoder Server

Launch the multimodal encoder server on GPU 0:

```bash
mkdir -p Logs/
CUDA_VISIBLE_DEVICES=0 trtllm-serve encoder llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8001 \
    --backend pytorch \
    &> Logs/log_encoder_0 &
```
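
The encoder can take a little while to load, so a small wait loop keeps the remaining steps from racing ahead. This is a sketch that assumes the server exposes a `/health` endpoint; if your version does not, watch `Logs/log_encoder_0` until it reports that the server is up.

```bash
# Block until the encoder server answers on port 8001 (assumes a /health route)
until curl -sf http://localhost:8001/health > /dev/null; do
    echo "waiting for encoder server..."
    sleep 5
done
```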

### Step 3: Start the LLM Decoder Server

Launch the language model decoder server on GPU 1:

```bash
CUDA_VISIBLE_DEVICES=1 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp1 &
```
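
Loading the language model weights usually takes longer than starting the encoder. Following the log is the simplest way to track progress; the exact "server started" message may differ between versions.

```bash
# Watch the decoder come up (Ctrl+C to stop following once it reports it is serving)
tail -f Logs/log_pd_tp1
```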

### Step 4: Start the Disaggregated Orchestrator

Launch the disaggregated server that coordinates between encoder and decoder:

```bash
trtllm-serve disaggregated_mm -c disagg_config.yaml &> Logs/log_disagg_server &
```
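
At this point all three services should be running. A quick way to confirm before moving on to the tests below (the `/health` route is an assumption; `Logs/log_disagg_server` is the fallback):

```bash
# 8000 = orchestrator, 8001 = encoder, 8002 = decoder
ss -ltn | grep -E ':800[012]'
curl -sf http://localhost:8000/health && echo "orchestrator is up"
```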

## Alternative Setup

Instead of running Steps 2-4 manually, you can start all services at once using the provided script:

```bash
./start_disagg_mm.sh
```

This script starts the encoder server, LLM decoder server, and disaggregated orchestrator with the same configuration as the manual steps above.
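
To tear the deployment down again, whether it was started manually or via the script, stopping the background processes is enough. A blunt but effective sketch (it kills every `trtllm-serve` process on the machine, so adjust if you run other instances):

```bash
pkill -f trtllm-serve
```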

## Multi-GPU Decoder Configuration

For larger models and higher throughput, you can run the decoder with tensor parallelism (TP>1) across multiple GPUs:

```bash
CUDA_VISIBLE_DEVICES=1,2 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --tp_size 2 \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp2 &
```
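
After the TP=2 decoder has started, both of its GPUs should show allocated memory; a quick way to verify that the placement matches `CUDA_VISIBLE_DEVICES`:

```bash
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv
```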

## Testing the Setup

### Basic Functionality Test

Test the setup with a multimodal chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-v1.6-mistral-7b-hf",
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the natural environment in the image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
```
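
The response follows the OpenAI chat-completions schema, so the generated text sits in `choices[0].message.content`. If `jq` is installed, a compact variant of the same request prints only that field:

```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llava-hf/llava-v1.6-mistral-7b-hf",
         "messages": [{"role": "user", "content": [
             {"type": "text", "text": "Describe the natural environment in the image."},
             {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"}}]}],
         "max_tokens": 64, "temperature": 0}' \
    | jq -r '.choices[0].message.content'
```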

### Performance Testing

Use the provided script to load-test the deployment (the disaggregated services from the steps above must already be running):

#### Prerequisites
```bash
pip install genai_perf
```

#### Concurrency Testing
```bash
./test_client_disag_mm.sh --concurrency 1 --port 8000
```

#### Request Rate Testing
```bash
./test_client_disag_mm.sh --request-rate 10 --port 8000
```
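
Both modes write a profile export whose name encodes the ISL/OSL and the load setting (see `PROFILE_EXPORT_FILE` in the script). genai-perf typically places its outputs under an `artifacts/` directory, though the exact layout can vary by version:

```bash
# Locate the exported profiles from previous runs
find . -name 'ISL_64_OSL_64_*.json' 2>/dev/null
```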


## Roadmap & Future Improvements

- **Model Support**: Add support for more multimodal models beyond LLaVA-Next
- **Communication**: NIXL integration for transferring multimodal embeddings between servers
- **Scalability**: Enable support for multiple LLM servers and multimodal servers with a routing manager
- **Parallelism**: Enable data parallelism (DP) in multimodal server
- **Configuration**: Test/verify/enable major parallel configurations in LLM decoder server
- **Optimization**: Performance optimization and tuning
14 changes: 14 additions & 0 deletions examples/multimodal_disaggregated/start_disagg_mm.sh
@@ -0,0 +1,14 @@
#!/bin/bash
mkdir -p Logs/
CUDA_VISIBLE_DEVICES=0 trtllm-serve encoder llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8001 \
    --backend pytorch \
    &> Logs/log_encoder_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp1 &
trtllm-serve disaggregated_mm -c disagg_config.yaml &> Logs/log_disagg_server &
174 changes: 174 additions & 0 deletions examples/multimodal_disaggregated/test_client_disag_mm.sh
@@ -0,0 +1,174 @@
#!/bin/bash

# This script runs genai-perf to profile a multimodal model.
# Supports two modes: concurrency or request_rate

# --- Command Line Arguments Parsing ---
usage() {
    echo "Usage: $0 [--concurrency <value> | --request-rate <value>] --port <port>"
    echo ""
    echo "Options:"
    echo "  --concurrency <value>    Run in concurrency mode with specified concurrency level"
    echo "  --request-rate <value>   Run in request rate mode with specified rate (requests/sec)"
    echo "  --port <port>            Server port number (e.g., 8001, 8003)"
    echo ""
    echo "Examples:"
    echo "  $0 --concurrency 2 --port 8003"
    echo "  $0 --request-rate 15 --port 8001"
    echo "  $0 --concurrency 1 --port 9000"
    exit 1
}

# Initialize variables
MODE=""
CONCURRENCY=""
REQUEST_RATE=""
PORT=""

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --concurrency)
            MODE="concurrency"
            CONCURRENCY="$2"
            shift 2
            ;;
        --request-rate)
            MODE="request_rate"
            REQUEST_RATE="$2"
            shift 2
            ;;
        --port)
            PORT="$2"
            shift 2
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo "Unknown option: $1"
            usage
            ;;
    esac
done

# Validate arguments
if [ -z "$MODE" ]; then
    echo "Error: Must specify either --concurrency or --request-rate"
    usage
fi

if [ -z "$PORT" ]; then
    echo "Error: Must specify --port"
    usage
fi

# Validate PORT
if ! [[ "${PORT}" =~ ^[0-9]+$ ]] || [ "${PORT}" -lt 1 ] || [ "${PORT}" -gt 65535 ]; then
    echo "Error: PORT must be a valid port number (1-65535)"
    echo "You provided: '${PORT}'"
    exit 1
fi

# Validate and set mode-specific values
if [ "${MODE}" = "concurrency" ]; then
    if ! [[ "${CONCURRENCY}" =~ ^[0-9]+$ ]] || [ "${CONCURRENCY}" -lt 1 ]; then
        echo "Error: CONCURRENCY must be a positive integer"
        echo "You provided: '${CONCURRENCY}'"
        exit 1
    fi
    if [ "${CONCURRENCY}" -gt 1 ]; then
        REQUEST_COUNT=$((CONCURRENCY*5))
    else
        REQUEST_COUNT=$((CONCURRENCY*50))
    fi
    echo "Running in CONCURRENCY mode: CONCURRENCY=${CONCURRENCY}, REQUEST_COUNT=${REQUEST_COUNT}, PORT=${PORT}"
elif [ "${MODE}" = "request_rate" ]; then
    if ! [[ "${REQUEST_RATE}" =~ ^[0-9]+$ ]] || [ "${REQUEST_RATE}" -lt 1 ]; then
        echo "Error: REQUEST_RATE must be a positive integer"
        echo "You provided: '${REQUEST_RATE}'"
        exit 1
    fi
    REQUEST_COUNT=$((REQUEST_RATE*10))
    echo "Running in REQUEST_RATE mode: REQUEST_RATE=${REQUEST_RATE}, REQUEST_COUNT=${REQUEST_COUNT}, PORT=${PORT}"
fi

ISL=64
OSL=64

# --- Configuration for genai-perf ---
MODEL_NAME="llava-hf/llava-v1.6-mistral-7b-hf"
TOKENIZER_NAME="llava-hf/llava-v1.6-mistral-7b-hf"
SERVICE_KIND="openai"
ENDPOINT_TYPE="multimodal"
INPUT_FILE="./mm_data_oai.json"
SERVER_URL="localhost:${PORT}"

# Set append name based on port
if [ "${PORT}" = "8000" ]; then
    APPEND_NAME="disagg"
elif [ "${PORT}" = "8002" ]; then
    APPEND_NAME="agg"
else
    APPEND_NAME="port${PORT}"
fi

if [ "${MODE}" = "concurrency" ]; then
    PROFILE_EXPORT_FILE="ISL_${ISL}_OSL_${OSL}_CONCURRENCY_${CONCURRENCY}_${APPEND_NAME}.json"
else
    PROFILE_EXPORT_FILE="ISL_${ISL}_OSL_${OSL}_RATE_${REQUEST_RATE}_${APPEND_NAME}.json"
fi

RANDOM_SEED=123
# Set to true if your endpoint supports streaming and you want to test it
ADD_STREAMING_FLAG=true  # or false

# --- Build the genai-perf command ---
CMD="genai-perf profile"
CMD="${CMD} -m \"${MODEL_NAME}\""
CMD="${CMD} --tokenizer \"${TOKENIZER_NAME}\""
#CMD="${CMD} --service-kind \"${SERVICE_KIND}\""
CMD="${CMD} --endpoint-type \"${ENDPOINT_TYPE}\""
#CMD="${CMD} --input-file \"${INPUT_FILE}\""
CMD="${CMD} --output-tokens-mean ${OSL}"
#CMD="${CMD} --output-tokens-stddev ${OUTPUT_TOKENS_STDDEV}"
CMD="${CMD} --request-count ${REQUEST_COUNT}"
CMD="${CMD} --profile-export-file \"${PROFILE_EXPORT_FILE}\""
CMD="${CMD} --url \"${SERVER_URL}\""
CMD="${CMD} --random-seed ${RANDOM_SEED}"

# --- Mode-specific flags ---
if [ "${MODE}" = "concurrency" ]; then
    CMD="${CMD} --num-prompts ${CONCURRENCY}"
    CMD="${CMD} --concurrency ${CONCURRENCY}"
    echo "Added concurrency flags: --num-prompts ${CONCURRENCY} --concurrency ${CONCURRENCY}"
elif [ "${MODE}" = "request_rate" ]; then
    CMD="${CMD} --request-rate ${REQUEST_RATE}"
    echo "Added request rate flag: --request-rate ${REQUEST_RATE}"
fi

CMD="${CMD} --image-width-mean 512"
CMD="${CMD} --image-width-stddev 0"
CMD="${CMD} --image-height-mean 512"
CMD="${CMD} --image-height-stddev 0"
CMD="${CMD} --image-format png"
CMD="${CMD} --synthetic-input-tokens-mean ${ISL}"
CMD="${CMD} --synthetic-input-tokens-stddev 0"

if [ "${ADD_STREAMING_FLAG}" = true ]; then
    CMD="${CMD} --streaming"
fi
CMD="${CMD} --extra-inputs \"max_tokens:${OSL}\""
CMD="${CMD} --extra-inputs \"min_tokens:${OSL}\""
CMD="${CMD} --extra-inputs \"ignore_eos:true\""
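# Arguments after "--" are assumed to be passed through to the underlying perf analyzer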
CMD="${CMD} -- -v"
CMD="${CMD} --max-threads 1"

# --- Execute the command ---
echo "Executing command:"
echo "${CMD}"
eval "${CMD}"

# Example usage:
# ./test_client_disag_mm.sh --concurrency 2 --port 8003
# ./test_client_disag_mm.sh --request-rate 15 --port 8001
2 changes: 2 additions & 0 deletions tensorrt_llm/__init__.py
@@ -44,6 +44,7 @@ def _add_trt_llm_dll_directory():
from .auto_parallel import AutoParallelConfig, auto_parallel
from .builder import BuildConfig, Builder, BuilderConfig, build
from .disaggregated_params import DisaggregatedParams
from .multimodal_params import MultimodalParams
from .functional import Tensor, constant
from .llmapi import LLM, LlmArgs
from .logger import logger
@@ -101,6 +102,7 @@ def _add_trt_llm_dll_directory():
    'SamplingParams',
    'DisaggregatedParams',
    'KvCacheConfig',
    'MultimodalParams',
    '__version__',
]

3 changes: 2 additions & 1 deletion tensorrt_llm/_torch/distributed/__init__.py
@@ -1,6 +1,6 @@
from tensorrt_llm.functional import AllReduceFusionOp

from .communicator import Distributed, MPIDist, PPComm, TorchDist
from .communicator import Distributed, MPIDist, PPComm, TorchDist, MMEmbeddingComm
from .ops import (AllReduce, AllReduceParams, AllReduceStrategy, MoEAllReduce,
                  allgather, reducescatter, userbuffers_allreduce_finalize)

@@ -17,4 +17,5 @@
"PPComm",
"MPIDist",
"Distributed",
"MMEmbeddingComm",
]