174 changes: 174 additions & 0 deletions examples/multimodal_disaggregated/README.md
@@ -0,0 +1,174 @@
# Multimodal Disaggregated Serving (Experimental)

This example demonstrates how to set up disaggregated multimodal serving with TensorRT-LLM, where the vision encoder and language model decoder run as separate services for improved scalability and resource utilization.

## ⚠️ Disclaimer

**This is a Proof-of-Concept (POC) and early demonstration with several limitations:**
1. **Model Support**: Limited to LLaVA-Next models only
2. **Modality Support**: Image modality only (no video support yet)
3. **Server Configuration**: Only supports 1 encoder server and 1 LLM server (though the LLM server can have multiple workers via tensor parallelism)

## Overview

Disaggregated multimodal serving separates the multimodal pipeline into distinct components:

- **Encoder Server**: Runs the multimodal encoder, handling image pre-processing and vision encoding
- **LLM Decoder Server**: Processes text generation using the language model
- **Disaggregated Server**: Orchestrates requests between encoder and decoder services

This architecture enables better resource utilization and scalability by allowing independent scaling of vision and language processing components.

## Setup Instructions

### Step 1: Prepare Configuration Files

Create the required configuration files in your working directory:

#### LLM API Configuration (`extra-llm-api-config.yml`)
```bash
# Note: Current multimodal implementation does not support KV cache reuse,
# so we disable it for all cases
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
EOF
```
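
The same file can carry other LLM API options. For example, if you also want to cap how much free GPU memory the decoder's KV cache pool may claim, a `free_gpu_memory_fraction` field can be added; treat this as an optional sketch and verify the field is supported by your TensorRT-LLM version before relying on it.

```bash
# Optional variant: disable block reuse and also limit the KV cache memory pool.
# free_gpu_memory_fraction is assumed to be available in your TensorRT-LLM build.
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
EOF
```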

#### Disaggregated Server Configuration (`disagg_config.yaml`)
```bash
cat > ./disagg_config.yaml << EOF
hostname: localhost
port: 8000
backend: pytorch
multimodal_servers:
  num_instances: 1
  urls:
    - "localhost:8001"
generation_servers:
  num_instances: 1
  urls:
    - "localhost:8002"
EOF
```
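
Before launching the services, it can be worth confirming the files were written and that the configured ports are free; a minimal check, assuming `ss` from iproute2 is available (`lsof -i` or `netstat` work similarly):

```bash
# Sanity check: both config files exist and ports 8000-8002 are not already taken
ls -l extra-llm-api-config.yml disagg_config.yaml
ss -ltn | grep -E ':800[012]' && echo "WARNING: one of the configured ports is already in use"
```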

### Step 2: Start the Encoder Server

Launch the multimodal encoder server on GPU 0:

```bash
mkdir -p Logs/
CUDA_VISIBLE_DEVICES=0 trtllm-serve encoder llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8001 \
    --backend pytorch \
    &> Logs/log_encoder_0 &
```
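
The encoder can take a little while to load, so a small wait loop keeps the remaining steps from racing ahead. This is a sketch that assumes the server exposes a `/health` endpoint; if your version does not, watch `Logs/log_encoder_0` until it reports that the server is up.

```bash
# Block until the encoder server answers on port 8001 (assumes a /health route)
until curl -sf http://localhost:8001/health > /dev/null; do
    echo "waiting for encoder server..."
    sleep 5
done
```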

### Step 3: Start the LLM Decoder Server

Launch the language model decoder server on GPU 1:

```bash
CUDA_VISIBLE_DEVICES=1 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp1 &
```
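
Loading the language model weights usually takes longer than starting the encoder. Following the log is the simplest way to track progress; the exact "server started" message may differ between versions.

```bash
# Watch the decoder come up (Ctrl+C to stop following once it reports it is serving)
tail -f Logs/log_pd_tp1
```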

### Step 4: Start the Disaggregated Orchestrator

Launch the disaggregated server that coordinates between encoder and decoder:

```bash
trtllm-serve disaggregated_mm -c disagg_config.yaml &> Logs/log_disagg_server &
```
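
At this point all three services should be running. A quick way to confirm before moving on to the tests below (the `/health` route is an assumption; `Logs/log_disagg_server` is the fallback):

```bash
# 8000 = orchestrator, 8001 = encoder, 8002 = decoder
ss -ltn | grep -E ':800[012]'
curl -sf http://localhost:8000/health && echo "orchestrator is up"
```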

## Alternative Setup

Instead of running Steps 2-4 manually, you can start all services at once using the provided script:

```bash
./start_disagg_mm.sh
```

This script starts the encoder server, LLM decoder server, and disaggregated orchestrator with the same configuration as the manual steps above.
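
To tear the deployment down again, whether it was started manually or via the script, stopping the background processes is enough. A blunt but effective sketch (it kills every `trtllm-serve` process on the machine, so adjust if you run other instances):

```bash
pkill -f trtllm-serve
```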

## Multi-GPU Decoder Configuration

For larger models and higher throughput, you can run the decoder with tensor parallelism (TP>1) across multiple GPUs:

```bash
CUDA_VISIBLE_DEVICES=1,2 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --tp_size 2 \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp2 &
```
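
After the TP=2 decoder has started, both of its GPUs should show allocated memory; a quick way to verify that the placement matches `CUDA_VISIBLE_DEVICES`:

```bash
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv
```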

## Testing the Setup

### Basic Functionality Test

Test the setup with a multimodal chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llava-hf/llava-v1.6-mistral-7b-hf",
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the natural environment in the image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
```
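
The response follows the OpenAI chat-completions schema, so the generated text sits in `choices[0].message.content`. If `jq` is installed, a compact variant of the same request prints only that field:

```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llava-hf/llava-v1.6-mistral-7b-hf",
         "messages": [{"role": "user", "content": [
             {"type": "text", "text": "Describe the natural environment in the image."},
             {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"}}]}],
         "max_tokens": 64, "temperature": 0}' \
    | jq -r '.choices[0].message.content'
```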

### Performance Testing

Use the provided script to load-test the deployment (the disaggregated services from the steps above must already be running):

#### Prerequisites
```bash
pip install genai_perf
```

#### Concurrency Testing
```bash
./test_client_disag_mm.sh --concurrency 1 --port 8000
```

#### Request Rate Testing
```bash
./test_client_disag_mm.sh --request-rate 10 --port 8000
```
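
Both modes write a profile export whose name encodes the ISL/OSL and the load setting (see `PROFILE_EXPORT_FILE` in the script). genai-perf typically places its outputs under an `artifacts/` directory, though the exact layout can vary by version:

```bash
# Locate the exported profiles from previous runs
find . -name 'ISL_64_OSL_64_*.json' 2>/dev/null
```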


## Roadmap & Future Improvements

- **Model Support**: Add support for more multimodal models beyond LLaVA-Next
- **Communication**: NIXL integration for transferring multimodal embeddings between servers
- **Scalability**: Enable support for multiple LLM servers and multimodal servers with a routing manager
- **Parallelism**: Enable data parallelism (DP) in multimodal server
- **Configuration**: Test/verify/enable major parallel configurations in LLM decoder server
- **Optimization**: Performance optimization and tuning
14 changes: 14 additions & 0 deletions examples/multimodal_disaggregated/start_disagg_mm.sh
@@ -0,0 +1,14 @@
#!/bin/bash
mkdir -p Logs/
CUDA_VISIBLE_DEVICES=0 trtllm-serve encoder llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8001 \
    --backend pytorch \
    &> Logs/log_encoder_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve llava-hf/llava-v1.6-mistral-7b-hf \
    --host localhost \
    --port 8002 \
    --backend pytorch \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    &> Logs/log_pd_tp1 &
trtllm-serve disaggregated_mm -c disagg_config.yaml &> Logs/log_disagg_server &
174 changes: 174 additions & 0 deletions examples/multimodal_disaggregated/test_client_disag_mm.sh
@@ -0,0 +1,174 @@
#!/bin/bash

# This script runs genai-perf to profile a multimodal model.
# Supports two modes: concurrency or request_rate

# --- Command Line Arguments Parsing ---
usage() {
    echo "Usage: $0 [--concurrency <value> | --request-rate <value>] --port <port>"
    echo ""
    echo "Options:"
    echo "  --concurrency <value>    Run in concurrency mode with specified concurrency level"
    echo "  --request-rate <value>   Run in request rate mode with specified rate (requests/sec)"
    echo "  --port <port>            Server port number (e.g., 8001, 8003)"
    echo ""
    echo "Examples:"
    echo "  $0 --concurrency 2 --port 8003"
    echo "  $0 --request-rate 15 --port 8001"
    echo "  $0 --concurrency 1 --port 9000"
    exit 1
}

# Initialize variables
MODE=""
CONCURRENCY=""
REQUEST_RATE=""
PORT=""

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --concurrency)
            MODE="concurrency"
            CONCURRENCY="$2"
            shift 2
            ;;
        --request-rate)
            MODE="request_rate"
            REQUEST_RATE="$2"
            shift 2
            ;;
        --port)
            PORT="$2"
            shift 2
            ;;
        -h|--help)
            usage
            ;;
        *)
            echo "Unknown option: $1"
            usage
            ;;
    esac
done

# Validate arguments
if [ -z "$MODE" ]; then
    echo "Error: Must specify either --concurrency or --request-rate"
    usage
fi

if [ -z "$PORT" ]; then
    echo "Error: Must specify --port"
    usage
fi

# Validate PORT
if ! [[ "${PORT}" =~ ^[0-9]+$ ]] || [ "${PORT}" -lt 1 ] || [ "${PORT}" -gt 65535 ]; then
    echo "Error: PORT must be a valid port number (1-65535)"
    echo "You provided: '${PORT}'"
    exit 1
fi

# Validate and set mode-specific values
if [ "${MODE}" = "concurrency" ]; then
    if ! [[ "${CONCURRENCY}" =~ ^[0-9]+$ ]] || [ "${CONCURRENCY}" -lt 1 ]; then
        echo "Error: CONCURRENCY must be a positive integer"
        echo "You provided: '${CONCURRENCY}'"
        exit 1
    fi
    if [ "${CONCURRENCY}" -gt 1 ]; then
        REQUEST_COUNT=$((CONCURRENCY*5))
    else
        REQUEST_COUNT=$((CONCURRENCY*50))
    fi
    echo "Running in CONCURRENCY mode: CONCURRENCY=${CONCURRENCY}, REQUEST_COUNT=${REQUEST_COUNT}, PORT=${PORT}"
elif [ "${MODE}" = "request_rate" ]; then
    if ! [[ "${REQUEST_RATE}" =~ ^[0-9]+$ ]] || [ "${REQUEST_RATE}" -lt 1 ]; then
        echo "Error: REQUEST_RATE must be a positive integer"
        echo "You provided: '${REQUEST_RATE}'"
        exit 1
    fi
    REQUEST_COUNT=$((REQUEST_RATE*10))
    echo "Running in REQUEST_RATE mode: REQUEST_RATE=${REQUEST_RATE}, REQUEST_COUNT=${REQUEST_COUNT}, PORT=${PORT}"
fi

ISL=64
OSL=64

# --- Configuration for genai-perf ---
MODEL_NAME="llava-hf/llava-v1.6-mistral-7b-hf"
TOKENIZER_NAME="llava-hf/llava-v1.6-mistral-7b-hf"
SERVICE_KIND="openai"
ENDPOINT_TYPE="multimodal"
INPUT_FILE="./mm_data_oai.json"
SERVER_URL="localhost:${PORT}"

# Set append name based on port
if [ "${PORT}" = "8000" ]; then
    APPEND_NAME="disagg"
elif [ "${PORT}" = "8002" ]; then
    APPEND_NAME="agg"
else
    APPEND_NAME="port${PORT}"
fi

if [ "${MODE}" = "concurrency" ]; then
    PROFILE_EXPORT_FILE="ISL_${ISL}_OSL_${OSL}_CONCURRENCY_${CONCURRENCY}_${APPEND_NAME}.json"
else
    PROFILE_EXPORT_FILE="ISL_${ISL}_OSL_${OSL}_RATE_${REQUEST_RATE}_${APPEND_NAME}.json"
fi

RANDOM_SEED=123
# Set to true if your endpoint supports streaming and you want to test it
ADD_STREAMING_FLAG=true  # or false

# --- Build the genai-perf command ---
CMD="genai-perf profile"
CMD="${CMD} -m \"${MODEL_NAME}\""
CMD="${CMD} --tokenizer \"${TOKENIZER_NAME}\""
#CMD="${CMD} --service-kind \"${SERVICE_KIND}\""
CMD="${CMD} --endpoint-type \"${ENDPOINT_TYPE}\""
#CMD="${CMD} --input-file \"${INPUT_FILE}\""
CMD="${CMD} --output-tokens-mean ${OSL}"
#CMD="${CMD} --output-tokens-stddev ${OUTPUT_TOKENS_STDDEV}"
CMD="${CMD} --request-count ${REQUEST_COUNT}"
CMD="${CMD} --profile-export-file \"${PROFILE_EXPORT_FILE}\""
CMD="${CMD} --url \"${SERVER_URL}\""
CMD="${CMD} --random-seed ${RANDOM_SEED}"

# --- Mode-specific flags ---
if [ "${MODE}" = "concurrency" ]; then
    CMD="${CMD} --num-prompts ${CONCURRENCY}"
    CMD="${CMD} --concurrency ${CONCURRENCY}"
    echo "Added concurrency flags: --num-prompts ${CONCURRENCY} --concurrency ${CONCURRENCY}"
elif [ "${MODE}" = "request_rate" ]; then
    CMD="${CMD} --request-rate ${REQUEST_RATE}"
    echo "Added request rate flag: --request-rate ${REQUEST_RATE}"
fi

CMD="${CMD} --image-width-mean 512"
CMD="${CMD} --image-width-stddev 0"
CMD="${CMD} --image-height-mean 512"
CMD="${CMD} --image-height-stddev 0"
CMD="${CMD} --image-format png"
CMD="${CMD} --synthetic-input-tokens-mean ${ISL}"
CMD="${CMD} --synthetic-input-tokens-stddev 0"

if [ "${ADD_STREAMING_FLAG}" = true ]; then
    CMD="${CMD} --streaming"
fi
CMD="${CMD} --extra-inputs \"max_tokens:${OSL}\""
CMD="${CMD} --extra-inputs \"min_tokens:${OSL}\""
CMD="${CMD} --extra-inputs \"ignore_eos:true\""
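# Arguments after "--" are assumed to be passed through to the underlying perf analyzer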
CMD="${CMD} -- -v"
CMD="${CMD} --max-threads 1"

# --- Execute the command ---
echo "Executing command:"
echo "${CMD}"
eval "${CMD}"

# Example usage:
# ./test_client_disag_mm.sh --concurrency 2 --port 8003
# ./test_client_disag_mm.sh --request-rate 15 --port 8001
2 changes: 2 additions & 0 deletions tensorrt_llm/__init__.py
@@ -44,6 +44,7 @@ def _add_trt_llm_dll_directory():
from .auto_parallel import AutoParallelConfig, auto_parallel
from .builder import BuildConfig, Builder, BuilderConfig, build
from .disaggregated_params import DisaggregatedParams
from .multimodal_params import MultimodalParams
from .functional import Tensor, constant
from .llmapi import LLM, LlmArgs
from .logger import logger
@@ -101,6 +102,7 @@ def _add_trt_llm_dll_directory():
    'SamplingParams',
    'DisaggregatedParams',
    'KvCacheConfig',
    'MultimodalParams',
    '__version__',
]

3 changes: 2 additions & 1 deletion tensorrt_llm/_torch/distributed/__init__.py
@@ -1,6 +1,6 @@
from tensorrt_llm.functional import AllReduceFusionOp

from .communicator import Distributed, MPIDist, PPComm, TorchDist
from .communicator import Distributed, MPIDist, PPComm, TorchDist, MMEmbeddingComm
from .ops import (AllReduce, AllReduceParams, AllReduceStrategy, MoEAllReduce,
                  allgather, reducescatter, userbuffers_allreduce_finalize)

@@ -17,4 +17,5 @@
"PPComm",
"MPIDist",
"Distributed",
"MMEmbeddingComm",
]