Address comments
krishung5 committed Dec 6, 2025
commit 3ec390091bbadf64bbe93dc2ee513436b00dfca5
193 changes: 9 additions & 184 deletions docs/backends/sglang/multimodal_sglang_guide.md
@@ -17,15 +17,15 @@ limitations under the License.

# SGLang Multimodal Guide

This document provides a comprehensive guide to multimodal inference with the SGLang backend in Dynamo. For more details on the multimodal examples, see the [Multimodal Examples Documentation](./multimodal_epd.md).

## Multimodal Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
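
For the supported path (HTTP/HTTPS image URLs), a minimal client sketch follows, assuming the Dynamo frontend exposes an OpenAI-compatible endpoint on `localhost:8000` and a Qwen2-VL deployment; the URL, port, and model name are assumptions to adjust for your setup.

```python
import requests

# Minimal chat-completions request with an image URL, assuming an
# OpenAI-compatible frontend on localhost:8000 (adjust for your deployment).
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    # Must be an http(s) URL; data: URLs are rejected (see below).
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```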

## Architecture Comparison
@@ -290,189 +290,14 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.

**Key Difference:** SGLang P→D uses a bootstrap mechanism for the KV-cache handoff rather than NIXL transfer as in vLLM.

## Gaps and Known Limitations

### 1. No Base64 (Data URL) Support

**Current State:**
- Only HTTP/HTTPS URLs supported for images
- Data URLs (`data:image/jpeg;base64,...`) are **not supported**
- vLLM and TRT-LLM support data URLs; SGLang does not

**Impact:**
- Cannot send embedded images in requests
- Requires external image hosting for all images
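
Since the rejection only surfaces server-side, a client can fail fast with a small guard. The helper below is hypothetical and illustrative, not part of Dynamo:

```python
def assert_supported_image_url(url: str) -> None:
    """Fail fast on URLs the SGLang backend cannot fetch."""
    if url.startswith("data:"):
        raise ValueError(
            "Base64 data URLs are not supported by the SGLang backend; "
            "host the image and pass an http(s) URL instead."
        )
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported image URL scheme: {url}")
```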

### 2. No Pre-computed Embeddings Support

**Current State:**
- No support for pre-computed embeddings (`.pt`, `.pth`, `.bin` files)
- Vision encoder must run for every request
- Cannot bypass encoding the way the TRT-LLM legacy flow can

**Impact:**
- Higher latency for repeated images
- Cannot optimize by pre-computing embeddings offline

### 3. Only Processor Registers with Rust

**Current State:**
- Only the Processor component registers with Dynamo Rust using `ModelInput.Text`
- All workers are internal and do not register
- Different from vLLM/TRT-LLM where workers also register

**Implications:**
- Frontend always routes to Processor (cannot route directly to workers)
- No token-based entry point (no `ModelInput.Tokens` registration for workers)
- More complex multi-component setup required for all multimodal requests
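
For reference, the registration asymmetry looks roughly like the sketch below. The `register_llm` / `ModelInput` names follow this guide's terminology, but exact import paths and signatures vary across Dynamo versions, so treat this as illustrative rather than copy-paste ready:

```python
from dynamo.llm import ModelInput, ModelType, register_llm


async def init(runtime, model_path: str):
    component = runtime.namespace("dynamo").component("processor")
    endpoint = component.endpoint("generate")
    # Only the Processor registers, and it registers as a Text model, so the
    # frontend sends it raw chat requests (no Rust-side tokenization).
    await register_llm(ModelInput.Text, ModelType.Chat, endpoint, model_path)
    # Encode/Prefill/Decode workers expose endpoints to each other but never
    # call register_llm, so the frontend cannot route to them directly.
```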

### 4. All Processing Happens in Python Workers

**Current State:**
- No Rust-based image decoding or preprocessing
- No Rust tokenization (all tokenization in Python Processor)
- Frontend only handles HTTP routing

**Impact:**
- Cannot leverage Rust performance for preprocessing
- All multimodal logic in Python components
- Similar limitation to TRT-LLM

### 5. No Video/Audio Model Support

**Current State:**
- **Video models are NOT supported** - Encode worker only implements image loading and processing
- **Audio models are NOT supported** - No audio encoder implementation
- Only **image modality** is production-ready
- Protocol accepts `video_url` and Processor can forward it, but Encode Worker **only processes `image_url`**

**Why:**
```python
# encode_worker_handler.py only checks for image_url
if not request.multimodal_input.image_url:
    raise ValueError("image_url is required for the encode worker.")
```

**Impact:**
- Cannot run video models like `LLaVA-NeXT-Video-7B-hf`
- Cannot run audio models like `Qwen2-Audio-7B-Instruct`
- Use **vLLM backend** for video/audio support (has full implementation)

**Workaround:**
- For video models: Use vLLM ([`examples/multimodal/launch/video_agg.sh`](../../../examples/multimodal/launch/video_agg.sh))
- For audio models: Use vLLM ([`examples/multimodal/launch/audio_agg.sh`](../../../examples/multimodal/launch/audio_agg.sh))
- Or implement custom video/audio encode worker for SGLang

### 6. Bootstrap Coordination and Routing Complexity

**Current State:**
- Disaggregated mode requires bootstrap coordination between P and D workers
- Uses host/port/room mechanism from SGLang
- **Decode Worker is the entry point** (not Prefill like vLLM)
- Request path: `Encode → Decode → Prefill` (Decode calls Prefill)

**Architectural Pattern:**
```text
Encode Worker → pd_worker_client → DECODE Worker
DECODE Worker → prefill_client   → PREFILL Worker
```
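
The toy sketch below illustrates this coordination pattern: the Decode Worker picks a rendezvous room, forwards the request (with the bootstrap host/port/room triple) to any Prefill worker, then starts decoding against the same triple. `PrefillClient` and `Engine` are stand-ins, not the real Dynamo/SGLang classes:

```python
import asyncio
import random


class PrefillClient:
    """Stand-in for the round-robin prefill_client in the diagram above."""

    async def generate(self, payload: dict) -> None:
        # The payload carries the bootstrap triple so the Prefill worker
        # knows where to rendezvous for the KV-cache handoff.
        print(f"prefill started for room {payload['bootstrap_room']}")


class Engine:
    """Stand-in for the SGLang engine held by the Decode Worker."""

    async def generate(self, prompt: str, bootstrap_room: int):
        # A real engine would first rendezvous on (host, port, room),
        # receive the KV cache, then stream decoded tokens.
        for token in ("KV", "handoff", "done"):
            yield token


async def decode_entry(prompt, prefill, engine, bootstrap_host, bootstrap_port):
    room = random.randint(0, 2**31 - 1)  # unique rendezvous id per request
    await prefill.generate({
        "prompt": prompt,
        "bootstrap_host": bootstrap_host,
        "bootstrap_port": bootstrap_port,
        "bootstrap_room": room,
    })
    async for token in engine.generate(prompt, bootstrap_room=room):
        print(token)


asyncio.run(decode_entry("hi", PrefillClient(), Engine(), "prefill-0", 8998))
```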

**Impact:**
- More complex P→D coordination than vLLM
- Requires network connectivity between P and D workers
- Different debugging model than vLLM

**Routing Implications:**

**Cannot Route Directly to Prefill:**
- Prefill Worker does NOT register with Dynamo
- Frontend cannot route requests to Prefill directly
- All disaggregated requests MUST go through Decode Worker first
- Decode Worker initiates bootstrap coordination with Prefill

**Load Balancing Constraints:**
- Cannot distribute load directly to Prefill workers
- Must load balance at Decode Worker level
- Decode Worker becomes bottleneck for prefill requests
- Different from vLLM where frontend can route to prefill workers directly

**Multi-Instance Limitations:**
- If you scale Prefill workers, Decode must discover them
- Cannot use frontend routing to select specific Prefill worker
- Decode Worker uses `prefill_client.generate()` (round-robin to any prefill)
- Less control over prefill worker selection compared to vLLM

### 7. Manual Token Expansion in Encode Worker

**Current State:**
- Encode worker **manually** expands image tokens from 1 → N based on embedding shape
- Token expansion happens in Python code, not handled by SGLang engine
- Hard-coded logic specific to model architecture

**Code Location:**
```python
# encode_worker_handler.py:144-157
# Find single image token in sequence
image_token_id_index = request.request.token_ids.index(self.image_token_id)

# Get number of patches from embedding shape
num_image_tokens = precomputed_embeddings.shape[1]  # e.g., 576 patches

# Replace 1 token with N tokens
request.request.token_ids = (
    request.request.token_ids[:image_token_id_index]
    + [self.image_token_id] * num_image_tokens
    + request.request.token_ids[image_token_id_index + 1:]
)
```

**Why This Is Error-Prone** (a more defensive sketch follows this list):

1. **Model-Specific Logic Required:**
- Different models have different patch sizes and embedding dimensions
- Number of tokens depends on: image resolution, patch size, pooling strategy
- Must update code for each new model architecture
- Example: LLaVA-style CLIP encoders produce 576 patches, while models such as Qwen2-VL vary the count with image resolution

2. **Assumes Shape[1] is Patch Count:**
- Hard-coded assumption: `num_image_tokens = precomputed_embeddings.shape[1]`
- Works for: `(batch, patches, hidden_dim)` format
- Breaks for: Different embedding formats (e.g., pooled, multi-scale)
- No validation that shape[1] is actually the patch dimension

3. **Single Image Token Assumption:**
- Assumes exactly one image token in sequence
- Fails for: Multiple images, video frames, complex layouts
- `token_ids.index()` raises `ValueError` if the token is absent, and silently matches only the first occurrence when there are several

4. **No Dynamic Resolution Support:**
- Fixed expansion based on embedding shape
- Cannot handle dynamic image resolutions without code changes
- Models with resolution-dependent patch counts need special handling

5. **Tight Coupling with Chat Template:**
- Must know exact image token string from chat template
- Hard-coded token extraction logic (lines 72-87)
- Different templates may use different token formats
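
As referenced above, a more defensive expansion would handle any number of image tokens and validate the embedding layout explicitly. This sketch is illustrative only and not the actual handler code:

```python
import torch


def expand_image_tokens(
    token_ids: list[int],
    image_token_id: int,
    embeddings: list[torch.Tensor],
) -> list[int]:
    """Expand each image placeholder to match its embedding's patch count."""
    positions = [i for i, t in enumerate(token_ids) if t == image_token_id]
    if len(positions) != len(embeddings):
        raise ValueError(
            f"found {len(positions)} image tokens but {len(embeddings)} embeddings"
        )
    out: list[int] = []
    prev = 0
    for pos, emb in zip(positions, embeddings):
        # Validate the assumed (batch, patches, hidden_dim) layout instead of
        # blindly trusting shape[1].
        if emb.dim() != 3:
            raise ValueError(f"expected (batch, patches, hidden), got {tuple(emb.shape)}")
        out += token_ids[prev:pos] + [image_token_id] * emb.shape[1]
        prev = pos + 1
    out += token_ids[prev:]
    return out


# e.g. expand_image_tokens([1, 9, 2], 9, [torch.zeros(1, 4, 8)]) -> [1, 9, 9, 9, 9, 2]
```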

**Impact:**
- **Maintenance burden**: Must update encode worker for each new model
- **Error-prone**: Easy to miscalculate token counts for new architectures
- **No abstraction**: Token expansion logic embedded in handler, not engine
- **Limited flexibility**: Cannot easily support models with variable patch counts
- **Debugging difficulty**: Token count mismatches hard to diagnose

**Comparison with vLLM:**
- vLLM handles token expansion **internally in the engine**
- vLLM workers just pass image data; the engine figures out the token expansion
- More robust and less prone to manual errors

**Workaround:**
- Carefully study each new model's architecture
- Test token expansion with known inputs
- Add extensive logging for token count validation
## Known Limitations

- **No Data URL support** - Only HTTP/HTTPS URLs are accepted; `data:image/...` base64 URLs are rejected
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers

## Supported Models

Expand Down
139 changes: 0 additions & 139 deletions docs/backends/trtllm/multimodal_epd.md

This file was deleted.
