Address comments
krishung5 committed Dec 6, 2025
commit 3ec390091bbadf64bbe93dc2ee513436b00dfca5
193 changes: 9 additions & 184 deletions docs/backends/sglang/multimodal_sglang_guide.md
@@ -17,15 +17,15 @@ limitations under the License.

# SGLang Multimodal Guide

This document provides a comprehensive guide to multimodal inference with the SGLang backend in Dynamo. For more details on the multimodal examples, see the [Multimodal Examples Documentation](./multimodal_epd.md).

## Multimodal Support Matrix

| Modality | Input Format | Aggregated | Disaggregated | Notes |
|----------|--------------|------------|---------------|-------|
| **Image** | HTTP/HTTPS URL | ✅ Yes | ✅ Yes | Vision encoder generates embeddings |
| **Image** | Data URL (Base64) | ❌ No | ❌ No | Not supported |
| **Video** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
| **Audio** | HTTP/HTTPS URL | ❌ No | ❌ No | Not implemented |
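
For the supported path (HTTP/HTTPS image URLs), a minimal client sketch follows, assuming the Dynamo frontend exposes an OpenAI-compatible endpoint on `localhost:8000` and a Qwen2-VL deployment; the URL, port, and model name are assumptions to adjust for your setup.

```python
import requests

# Minimal chat-completions request with an image URL, assuming an
# OpenAI-compatible frontend on localhost:8000 (adjust for your deployment).
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2-VL-7B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    # Must be an http(s) URL; data: URLs are rejected (see below).
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                        },
                    },
                ],
            }
        ],
        "max_tokens": 128,
    },
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])
```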

## Architecture Comparison
@@ -290,189 +290,14 @@ Supported templates: `qwen2-vl`, `llama-3`, `vicuna`, etc.

**Key Difference:** SGLang P→D uses a bootstrap mechanism for the KV-cache handoff rather than NIXL transfer as in vLLM.

## Gaps and Known Limitations

### 1. No Base64 (Data URL) Support

**Current State:**
- Only HTTP/HTTPS URLs supported for images
- Data URLs (`data:image/jpeg;base64,...`) are **not supported**
- vLLM and TRT-LLM support data URLs; SGLang does not

**Impact:**
- Cannot send embedded images in requests
- Requires external image hosting for all images
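
Since the rejection only surfaces server-side, a client can fail fast with a small guard. The helper below is hypothetical and illustrative, not part of Dynamo:

```python
def assert_supported_image_url(url: str) -> None:
    """Fail fast on URLs the SGLang backend cannot fetch."""
    if url.startswith("data:"):
        raise ValueError(
            "Base64 data URLs are not supported by the SGLang backend; "
            "host the image and pass an http(s) URL instead."
        )
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported image URL scheme: {url}")
```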

### 2. No Pre-computed Embeddings Support

**Current State:**
- No support for pre-computed embeddings (`.pt`, `.pth`, `.bin` files)
- Vision encoder must run for every request
- Cannot bypass encoding the way the TRT-LLM legacy flow can

**Impact:**
- Higher latency for repeated images
- Cannot optimize by pre-computing embeddings offline

### 3. Only Processor Registers with Rust

**Current State:**
- Only the Processor component registers with Dynamo Rust using `ModelInput.Text`
- All workers are internal and do not register
- Different from vLLM/TRT-LLM where workers also register

**Implications:**
- Frontend always routes to Processor (cannot route directly to workers)
- No token-based entry point (no `ModelInput.Tokens` registration for workers)
- More complex multi-component setup required for all multimodal requests
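
For reference, the registration asymmetry looks roughly like the sketch below. The `register_llm` / `ModelInput` names follow this guide's terminology, but exact import paths and signatures vary across Dynamo versions, so treat this as illustrative rather than copy-paste ready:

```python
from dynamo.llm import ModelInput, ModelType, register_llm


async def init(runtime, model_path: str):
    component = runtime.namespace("dynamo").component("processor")
    endpoint = component.endpoint("generate")
    # Only the Processor registers, and it registers as a Text model, so the
    # frontend sends it raw chat requests (no Rust-side tokenization).
    await register_llm(ModelInput.Text, ModelType.Chat, endpoint, model_path)
    # Encode/Prefill/Decode workers expose endpoints to each other but never
    # call register_llm, so the frontend cannot route to them directly.
```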

### 4. All Processing Happens in Python Workers

**Current State:**
- No Rust-based image decoding or preprocessing
- No Rust tokenization (all tokenization in Python Processor)
- Frontend only handles HTTP routing

**Impact:**
- Cannot leverage Rust performance for preprocessing
- All multimodal logic in Python components
- Similar limitation to TRT-LLM

### 5. No Video/Audio Model Support

**Current State:**
- **Video models are NOT supported** - Encode worker only implements image loading and processing
- **Audio models are NOT supported** - No audio encoder implementation
- Only **image modality** is production-ready
- Protocol accepts `video_url` and Processor can forward it, but Encode Worker **only processes `image_url`**

**Why:**
```python
# encode_worker_handler.py only checks for image_url
if not request.multimodal_input.image_url:
    raise ValueError("image_url is required for the encode worker.")
```

**Impact:**
- Cannot run video models like `LLaVA-NeXT-Video-7B-hf`
- Cannot run audio models like `Qwen2-Audio-7B-Instruct`
- Use **vLLM backend** for video/audio support (has full implementation)

**Workaround:**
- For video models: Use vLLM ([`examples/multimodal/launch/video_agg.sh`](../../../examples/multimodal/launch/video_agg.sh))
- For audio models: Use vLLM ([`examples/multimodal/launch/audio_agg.sh`](../../../examples/multimodal/launch/audio_agg.sh))
- Or implement custom video/audio encode worker for SGLang

### 6. Bootstrap Coordination and Routing Complexity

**Current State:**
- Disaggregated mode requires bootstrap coordination between P and D workers
- Uses host/port/room mechanism from SGLang
- **Decode Worker is the entry point** (not Prefill like vLLM)
- Request path: `Encode → Decode → Prefill` (Decode calls Prefill)

**Architectural Pattern:**
```text
Encode Worker → pd_worker_client → DECODE Worker
DECODE Worker → prefill_client   → PREFILL Worker
```
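
The toy sketch below illustrates this coordination pattern: the Decode Worker picks a rendezvous room, forwards the request (with the bootstrap host/port/room triple) to any Prefill worker, then starts decoding against the same triple. `PrefillClient` and `Engine` are stand-ins, not the real Dynamo/SGLang classes:

```python
import asyncio
import random


class PrefillClient:
    """Stand-in for the round-robin prefill_client in the diagram above."""

    async def generate(self, payload: dict) -> None:
        # The payload carries the bootstrap triple so the Prefill worker
        # knows where to rendezvous for the KV-cache handoff.
        print(f"prefill started for room {payload['bootstrap_room']}")


class Engine:
    """Stand-in for the SGLang engine held by the Decode Worker."""

    async def generate(self, prompt: str, bootstrap_room: int):
        # A real engine would first rendezvous on (host, port, room),
        # receive the KV cache, then stream decoded tokens.
        for token in ("KV", "handoff", "done"):
            yield token


async def decode_entry(prompt, prefill, engine, bootstrap_host, bootstrap_port):
    room = random.randint(0, 2**31 - 1)  # unique rendezvous id per request
    await prefill.generate({
        "prompt": prompt,
        "bootstrap_host": bootstrap_host,
        "bootstrap_port": bootstrap_port,
        "bootstrap_room": room,
    })
    async for token in engine.generate(prompt, bootstrap_room=room):
        print(token)


asyncio.run(decode_entry("hi", PrefillClient(), Engine(), "prefill-0", 8998))
```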

**Impact:**
- More complex P→D coordination than vLLM
- Requires network connectivity between P and D workers
- Different debugging model than vLLM

**Routing Implications:**

**Cannot Route Directly to Prefill:**
- Prefill Worker does NOT register with Dynamo
- Frontend cannot route requests to Prefill directly
- All disaggregated requests MUST go through Decode Worker first
- Decode Worker initiates bootstrap coordination with Prefill

**Load Balancing Constraints:**
- Cannot distribute load directly to Prefill workers
- Must load balance at Decode Worker level
- Decode Worker becomes bottleneck for prefill requests
- Different from vLLM where frontend can route to prefill workers directly

**Multi-Instance Limitations:**
- If you scale Prefill workers, Decode must discover them
- Cannot use frontend routing to select specific Prefill worker
- Decode Worker uses `prefill_client.generate()` (round-robin to any prefill)
- Less control over prefill worker selection compared to vLLM

### 7. Manual Token Expansion in Encode Worker

**Current State:**
- Encode worker **manually** expands image tokens from 1 → N based on embedding shape
- Token expansion happens in Python code, not handled by SGLang engine
- Hard-coded logic specific to model architecture

**Code Location:**
```python
# encode_worker_handler.py:144-157
# Find single image token in sequence
image_token_id_index = request.request.token_ids.index(self.image_token_id)

# Get number of patches from embedding shape
num_image_tokens = precomputed_embeddings.shape[1]  # e.g., 576 patches

# Replace 1 token with N tokens
request.request.token_ids = (
    request.request.token_ids[:image_token_id_index]
    + [self.image_token_id] * num_image_tokens
    + request.request.token_ids[image_token_id_index + 1:]
)
```

**Why This Is Error-Prone** (a more defensive sketch follows this list):

1. **Model-Specific Logic Required:**
- Different models have different patch sizes and embedding dimensions
- Number of tokens depends on: image resolution, patch size, pooling strategy
- Must update code for each new model architecture
- Example: LLaVA-style CLIP encoders produce 576 patches, while models such as Qwen2-VL vary the count with image resolution

2. **Assumes Shape[1] is Patch Count:**
- Hard-coded assumption: `num_image_tokens = precomputed_embeddings.shape[1]`
- Works for: `(batch, patches, hidden_dim)` format
- Breaks for: Different embedding formats (e.g., pooled, multi-scale)
- No validation that shape[1] is actually the patch dimension

3. **Single Image Token Assumption:**
- Assumes exactly one image token in sequence
- Fails for: Multiple images, video frames, complex layouts
- `token_ids.index()` raises `ValueError` if the token is absent, and silently matches only the first occurrence when there are several

4. **No Dynamic Resolution Support:**
- Fixed expansion based on embedding shape
- Cannot handle dynamic image resolutions without code changes
- Models with resolution-dependent patch counts need special handling

5. **Tight Coupling with Chat Template:**
- Must know exact image token string from chat template
- Hard-coded token extraction logic (lines 72-87)
- Different templates may use different token formats
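
As referenced above, a more defensive expansion would handle any number of image tokens and validate the embedding layout explicitly. This sketch is illustrative only and not the actual handler code:

```python
import torch


def expand_image_tokens(
    token_ids: list[int],
    image_token_id: int,
    embeddings: list[torch.Tensor],
) -> list[int]:
    """Expand each image placeholder to match its embedding's patch count."""
    positions = [i for i, t in enumerate(token_ids) if t == image_token_id]
    if len(positions) != len(embeddings):
        raise ValueError(
            f"found {len(positions)} image tokens but {len(embeddings)} embeddings"
        )
    out: list[int] = []
    prev = 0
    for pos, emb in zip(positions, embeddings):
        # Validate the assumed (batch, patches, hidden_dim) layout instead of
        # blindly trusting shape[1].
        if emb.dim() != 3:
            raise ValueError(f"expected (batch, patches, hidden), got {tuple(emb.shape)}")
        out += token_ids[prev:pos] + [image_token_id] * emb.shape[1]
        prev = pos + 1
    out += token_ids[prev:]
    return out


# e.g. expand_image_tokens([1, 9, 2], 9, [torch.zeros(1, 4, 8)]) -> [1, 9, 9, 9, 9, 2]
```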

**Impact:**
- **Maintenance burden**: Must update encode worker for each new model
- **Error-prone**: Easy to miscalculate token counts for new architectures
- **No abstraction**: Token expansion logic embedded in handler, not engine
- **Limited flexibility**: Cannot easily support models with variable patch counts
- **Debugging difficulty**: Token count mismatches hard to diagnose

**Comparison with vLLM:**
- vLLM handles token expansion **internally in the engine**
- vLLM workers just pass image data; the engine figures out the token expansion
- More robust and less prone to manual errors

**Workaround:**
- Carefully study each new model's architecture
- Test token expansion with known inputs
- Add extensive logging for token count validation
## Known Limitations

- **No Data URL support** - Only HTTP/HTTPS URLs are accepted; `data:image/...` base64 URLs are rejected
- **No pre-computed embeddings** - Cannot use `.pt`, `.pth`, `.bin` embedding files; vision encoder runs for every request
- **No video support** - No video encoder implementation
- **No audio support** - No audio encoder implementation
- **Only Processor registers with Dynamo** - Workers are internal components, frontend routes to Processor only
- **Disaggregated routing** - Decode Worker is the entry point (calls Prefill), cannot route directly to Prefill workers

## Supported Models

Expand Down
139 changes: 0 additions & 139 deletions docs/backends/trtllm/multimodal_epd.md

This file was deleted.
