Merged
Coderabbit comments
krishung5 committed Dec 5, 2025
commit 3aa5f29b5e628e98f18530549d64443e05687240
18 changes: 9 additions & 9 deletions docs/backends/sglang/multimodal_sglang_guide.md
@@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using SGLa

SGLang multimodal supports two deployment patterns:

```
```text
AGGREGATED (E->PD):
Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
• 3 components • Vision encoder in Python • NIXL embeddings transfer
@@ -48,7 +48,7 @@ In aggregated mode, encoding happens in a separate worker, but prefill and decod

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
@@ -81,7 +81,7 @@ In disaggregated mode, encoding, prefill, and decode are handled by separate wor

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
@@ -109,7 +109,7 @@ Response → Processor → Frontend
SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

**Request Flow (Important):**
```
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
Entry point for disaggregation!
Expand Down Expand Up @@ -174,13 +174,13 @@ prefill_client = (
All component-to-component communication happens via NATS:

**Aggregated Mode (E→PD):**
```
```text
Processor → Encode Worker → PD Worker
(NATS) (NATS + NIXL embeddings)
```

**Disaggregated Mode (E→P→D):**
```
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
(NATS) (NATS) (NATS)
@@ -195,7 +195,7 @@

**Detailed Message Flow:**

```
```text
Processor → Encode Worker:
- NATS round_robin with SglangMultimodalRequest
- Contains: tokenized input_ids, image URL, sampling params
@@ -219,7 +219,7 @@ Prefill ↔ Decode (via bootstrap):

NIXL is used only for embedding transfer:

```
```python
Encode Worker:
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
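    # Hedged sketch of the typical continuation; the accessor names below are
    # assumptions for illustration, not necessarily the actual `connect` API.
    # The readable's NIXL metadata rides along on the NATS request so the PD
    # worker can pull `precomputed_embeddings` directly over NIXL.
    request.serialized_request = readable.to_serialized()  # hypothetical accessor
    await pd_worker_client.round_robin(request)            # NATS hop to the PD worker
    await readable.wait_for_completion()                   # hypothetical: block until the read finishes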
@@ -372,7 +372,7 @@ if not request.multimodal_input.image_url:
- Request path: `Encode → Decode → Prefill` (Decode calls Prefill)

**Architectural Pattern:**
```
```text
Encode Worker → pd_worker_client → DECODE Worker
prefill_client → PREFILL Worker
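To ground the bootstrap flow this file documents (Encode → Decode → Prefill, with the decode worker as the entry point), here is a minimal, self-contained sketch; every class and method name in it is an illustrative stand-in inferred from the flow above, not the actual Dynamo/SGLang API:

```python
# Hedged sketch of the E->D->P bootstrap handoff. All names are illustrative
# stand-ins, not the real Dynamo/SGLang source.
import asyncio
from dataclasses import dataclass

@dataclass
class SglangMultimodalRequest:      # stand-in for the real request schema
    input_ids: list
    image_url: str
    bootstrap_host: str = ""        # rendezvous info, filled in by decode
    bootstrap_room: int = 0

class PrefillClient:                # stand-in for the NATS round_robin client
    async def round_robin(self, req):
        print(f"prefill worker joined bootstrap room {req.bootstrap_room}")

async def decode_entrypoint(req, prefill_client):
    # Decode is the disaggregation entry point: it allocates the bootstrap
    # rendezvous, triggers prefill over NATS, and streams tokens back.
    req.bootstrap_host, req.bootstrap_room = "decode-0.local", 12345
    prefill = asyncio.create_task(prefill_client.round_robin(req))
    for token in ("Hello", " ", "world"):   # placeholder for the decode loop
        yield token
    await prefill

async def main():
    req = SglangMultimodalRequest([1, 2, 3], "http://example.com/cat.jpg")
    async for tok in decode_entrypoint(req, PrefillClient()):
        print(tok, end="")

asyncio.run(main())
```

The design point the guide stresses is that the decode worker, not prefill, receives the request first, so it can hand the bootstrap rendezvous to prefill before generation begins.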
18 changes: 10 additions & 8 deletions docs/backends/trtllm/multimodal_trtllm_guide.md
@@ -30,7 +30,7 @@ This document provides a comprehensive guide for multimodal inference using Tens

TRT-LLM multimodal supports three deployment patterns:

```
```text
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • worker flag `--modality multimodal` • Easiest setup
@@ -59,7 +59,7 @@ In aggregated mode, all processing (image loading, encoding, prefill, decode) ha

### Architecture

```
```text
HTTP Frontend (Rust)
TRT-LLM Worker (Python - ModelInput.Tokens)
@@ -83,7 +83,7 @@ In disaggregated mode, prefill and decode are handled by separate workers. The p

### Architecture

```
```text
HTTP Frontend (Rust)
Prefill Worker (Python - ModelInput.Tokens)
@@ -112,7 +112,7 @@ In EPD mode, encoding, prefill, and decode are handled by separate workers. The

### Architecture

```
```text
HTTP Frontend (Rust)
Encode Worker (Python - NOT registered, uses MultimodalEncoder)
@@ -172,7 +172,7 @@ TRT-LLM components communicate using NATS messaging:
| Transfer Stage | NATS Message | NIXL Transfer |
|----------------|--------------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) |

Expand All @@ -183,7 +183,7 @@ TRT-LLM components communicate using NATS messaging:
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E->P->D Disaggregated (Precomputed Embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |

**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
@@ -221,7 +221,8 @@ TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding

TRT-LLM supports two formats for embedding files:

**1. Simple Tensor Format**
#### 1. Simple Tensor Format

- Direct tensor saved as `.pt` file
- Example: `llava_next_mm_embed_seashore.pt`
- Contains only the embedding tensor
@@ -232,7 +233,8 @@ embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
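
As a quick sanity check, the simple format loads back as a bare tensor; this is an illustrative snippet reusing the file name from the example above, not part of the original guide:

```python
import torch

emb = torch.load("embedding.pt")    # bare tensor, no wrapper dict
assert emb.shape == (1, 576, 4096)  # [batch, seq_len, hidden_dim]
```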

**2. Dictionary Format with Auxiliary Data**
#### 2. Dictionary Format with Auxiliary Data

- Dictionary containing multiple keys
- Used by models like Llama-4 that require additional metadata
- Must contain `mm_embeddings` key with the main tensor
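
Since the guide names only the required key, here is a minimal sketch of the dictionary format; `mm_embeddings` is documented above, while the auxiliary entry is a hypothetical placeholder rather than Llama-4's actual schema:

```python
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),  # required: main embedding tensor
    "image_grid_hw": (24, 24),                  # hypothetical auxiliary metadata
}
torch.save(embedding_dict, "embedding_dict.pt")
```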
10 changes: 5 additions & 5 deletions docs/backends/vllm/multimodal_vllm_guide.md
@@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using vLLM

vLLM multimodal supports three deployment patterns:

```
```text
SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)):
Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
• 2 components • --connector none • Easiest setup
@@ -61,7 +61,7 @@ In simple aggregated mode, encoding, prefill, and decode happen within the same

### Architecture

```
```text
HTTP Frontend with Rust processor
Worker (Python - ModelInput.Tokens)
@@ -75,7 +75,7 @@ In EPD aggregated mode, encoding happens in a separate worker and prefill and de

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
@@ -101,7 +101,7 @@ In EPD disaggregated mode, encoding, prefill, and decode are handled by separate

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
@@ -130,7 +130,7 @@ Llama 4 models don't support pre-computed embeddings, so they use a combined Enc

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)