Merged
Coderabbit comments
krishung5 committed Dec 5, 2025
commit 3aa5f29b5e628e98f18530549d64443e05687240
18 changes: 9 additions & 9 deletions docs/backends/sglang/multimodal_sglang_guide.md
@@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using SGLa

SGLang multimodal supports two deployment patterns:

```
```text
AGGREGATED (E->PD):
Client → Frontend (Rust) → Processor → Encoder [NIXL] → PD Worker → Response
• 3 components • Vision encoder in Python • NIXL embeddings transfer
@@ -48,7 +48,7 @@ In aggregated mode, encoding happens in a separate worker, but prefill and decod

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
@@ -81,7 +81,7 @@ In disaggregated mode, encoding, prefill, and decode are handled by separate wor

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text - REGISTERED)
@@ -109,7 +109,7 @@ Response → Processor → Frontend
SGLang disaggregation uses a bootstrap mechanism for P->D coordination:

**Request Flow (Important):**
```
```text
Client → Frontend → Processor → Encode → DECODE Worker → Prefill Worker
Entry point for disaggregation!
Expand Down Expand Up @@ -174,13 +174,13 @@ prefill_client = (
All component-to-component communication happens via NATS:

**Aggregated Mode (E→PD):**
```
```text
Processor → Encode Worker → PD Worker
(NATS) (NATS + NIXL embeddings)
```

**Disaggregated Mode (E→P→D):**
```
```text
Processor → Encode Worker → DECODE Worker → Prefill Worker
(NATS) (NATS) (NATS)
@@ -195,7 +195,7 @@

**Detailed Message Flow:**

```
```text
Processor → Encode Worker:
- NATS round_robin with SglangMultimodalRequest
- Contains: tokenized input_ids, image URL, sampling params
@@ -219,7 +219,7 @@ Prefill ↔ Decode (via bootstrap):

NIXL is used only for embedding transfer:

```
```python
Encode Worker:
descriptor = connect.Descriptor(precomputed_embeddings)
with connector.create_readable(descriptor) as readable:
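    # Hedged sketch of the typical continuation; the accessor names below are
    # assumptions for illustration, not necessarily the actual `connect` API.
    # The readable's NIXL metadata rides along on the NATS request so the PD
    # worker can pull `precomputed_embeddings` directly over NIXL.
    request.serialized_request = readable.to_serialized()  # hypothetical accessor
    await pd_worker_client.round_robin(request)            # NATS hop to the PD worker
    await readable.wait_for_completion()                   # hypothetical: block until the read finishes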
@@ -372,7 +372,7 @@ if not request.multimodal_input.image_url:
- Request path: `Encode → Decode → Prefill` (Decode calls Prefill)

**Architectural Pattern:**
```
```text
Encode Worker → pd_worker_client → DECODE Worker
prefill_client → PREFILL Worker
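To ground the bootstrap flow this file documents (Encode → Decode → Prefill, with the decode worker as the entry point), here is a minimal, self-contained sketch; every class and method name in it is an illustrative stand-in inferred from the flow above, not the actual Dynamo/SGLang API:

```python
# Hedged sketch of the E->D->P bootstrap handoff. All names are illustrative
# stand-ins, not the real Dynamo/SGLang source.
import asyncio
from dataclasses import dataclass

@dataclass
class SglangMultimodalRequest:      # stand-in for the real request schema
    input_ids: list
    image_url: str
    bootstrap_host: str = ""        # rendezvous info, filled in by decode
    bootstrap_room: int = 0

class PrefillClient:                # stand-in for the NATS round_robin client
    async def round_robin(self, req):
        print(f"prefill worker joined bootstrap room {req.bootstrap_room}")

async def decode_entrypoint(req, prefill_client):
    # Decode is the disaggregation entry point: it allocates the bootstrap
    # rendezvous, triggers prefill over NATS, and streams tokens back.
    req.bootstrap_host, req.bootstrap_room = "decode-0.local", 12345
    prefill = asyncio.create_task(prefill_client.round_robin(req))
    for token in ("Hello", " ", "world"):   # placeholder for the decode loop
        yield token
    await prefill

async def main():
    req = SglangMultimodalRequest([1, 2, 3], "http://example.com/cat.jpg")
    async for tok in decode_entrypoint(req, PrefillClient()):
        print(tok, end="")

asyncio.run(main())
```

The design point the guide stresses is that the decode worker, not prefill, receives the request first, so it can hand the bootstrap rendezvous to prefill before generation begins.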
18 changes: 10 additions & 8 deletions docs/backends/trtllm/multimodal_trtllm_guide.md
@@ -30,7 +30,7 @@ This document provides a comprehensive guide for multimodal inference using Tens

TRT-LLM multimodal supports three deployment patterns:

```
```text
SIMPLE AGGREGATED (agg.sh):
Client → Frontend (Rust) → Worker [image load, encode, P+D] → Response
• 2 components • worker flag `--modality multimodal` • Easiest setup
@@ -59,7 +59,7 @@ In aggregated mode, all processing (image loading, encoding, prefill, decode) ha

### Architecture

```
```text
HTTP Frontend (Rust)
TRT-LLM Worker (Python - ModelInput.Tokens)
@@ -83,7 +83,7 @@ In disaggregated mode, prefill and decode are handled by separate workers. The p

### Architecture

```
```text
HTTP Frontend (Rust)
Prefill Worker (Python - ModelInput.Tokens)
@@ -112,7 +112,7 @@ In EPD mode, encoding, prefill, and decode are handled by separate workers. The

### Architecture

```
```text
HTTP Frontend (Rust)
Encode Worker (Python - NOT registered, uses MultimodalEncoder)
@@ -172,7 +172,7 @@ TRT-LLM components communicate using NATS messaging:
| Transfer Stage | NATS Message | NIXL Transfer |
|----------------|--------------|---------------|
| **Frontend → Prefill** | Request with image URL or embedding path | No |
| **Encode → Prefill (Precomputed Embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (pre-computed embeddings)** | NIXL metadata (pre-computed embeddings) | Yes (Embeddings tensor) |
| **Encode → Prefill (Image URL) (WIP)** | Disaggregated params with multimodal handles | No (Handles via params) |
| **Prefill → Decode** | Disaggregated params | Yes/No (KV cache - UCX or NIXL) |

Expand All @@ -183,7 +183,7 @@ TRT-LLM components communicate using NATS messaging:
|----------|--------|------------|---------------|
| Simple Aggregated | [`examples/backends/trtllm/launch/agg.sh`](../../../examples/backends/trtllm/launch/agg.sh) | ❌ No | All in one worker |
| P->D Disaggregated | [`examples/backends/trtllm/launch/disagg_multimodal.sh`](../../../examples/backends/trtllm/launch/disagg_multimodal.sh) | ⚙️ Optional | Prefill → Decode (KV cache via UCX or NIXL) |
| E->P->D Disaggregated (Precomputed Embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (pre-computed embeddings) | [`examples/backends/trtllm/launch/epd_disagg.sh`](../../../examples/backends/trtllm/launch/epd_disagg.sh) | ✅ Yes | Encoder → Prefill (pre-computed embeddings via NIXL) |
| E->P->D Disaggregated (WIP) | `examples/backends/trtllm/launch/url_epd_disagg.sh` | ❌ No | Encoder → Prefill (multimodal handles via disaggregated_params)<br>Prefill → Decode (KV cache via UCX/NIXL) |

**Note:** NIXL for KV cache transfer is currently beta and only supported on AMD64 (x86_64) architecture.
@@ -221,7 +221,8 @@ TRT-LLM supports providing pre-computed embeddings, bypassing image-to-embedding

TRT-LLM supports two formats for embedding files:

**1. Simple Tensor Format**
#### 1. Simple Tensor Format

- Direct tensor saved as `.pt` file
- Example: `llava_next_mm_embed_seashore.pt`
- Contains only the embedding tensor
@@ -232,7 +233,8 @@ embedding_tensor = torch.rand(1, 576, 4096) # [batch, seq_len, hidden_dim]
torch.save(embedding_tensor, "embedding.pt")
```
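
As a quick sanity check, the simple format loads back as a bare tensor; this is an illustrative snippet reusing the file name from the example above, not part of the original guide:

```python
import torch

emb = torch.load("embedding.pt")    # bare tensor, no wrapper dict
assert emb.shape == (1, 576, 4096)  # [batch, seq_len, hidden_dim]
```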

**2. Dictionary Format with Auxiliary Data**
#### 2. Dictionary Format with Auxiliary Data

- Dictionary containing multiple keys
- Used by models like Llama-4 that require additional metadata
- Must contain `mm_embeddings` key with the main tensor
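
Since the guide names only the required key, here is a minimal sketch of the dictionary format; `mm_embeddings` is documented above, while the auxiliary entry is a hypothetical placeholder rather than Llama-4's actual schema:

```python
import torch

embedding_dict = {
    "mm_embeddings": torch.rand(1, 576, 4096),  # required: main embedding tensor
    "image_grid_hw": (24, 24),                  # hypothetical auxiliary metadata
}
torch.save(embedding_dict, "embedding_dict.pt")
```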
10 changes: 5 additions & 5 deletions docs/backends/vllm/multimodal_vllm_guide.md
@@ -32,7 +32,7 @@ This document provides a comprehensive guide for multimodal inference using vLLM

vLLM multimodal supports three deployment patterns:

```
```text
SIMPLE AGGREGATED ([examples/backends/vllm/launch/agg_multimodal.sh](../../../examples/backends/vllm/launch/agg_multimodal.sh)):
Client → Frontend (Rust processor) → Worker [image load, encode, P+D] → Response
• 2 components • --connector none • Easiest setup
@@ -61,7 +61,7 @@ In simple aggregated mode, encoding, prefill, and decode happen within the same

### Architecture

```
```text
HTTP Frontend with Rust processor
Worker (Python - ModelInput.Tokens)
@@ -75,7 +75,7 @@ In EPD aggregated mode, encoding happens in a separate worker and prefill and de

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
@@ -101,7 +101,7 @@ In EPD disaggregated mode, encoding, prefill, and decode are handled by separate

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)
@@ -130,7 +130,7 @@ Llama 4 models don't support pre-computed embeddings, so they use a combined Enc

### Architecture

```
```text
HTTP Frontend (Rust)
Processor (Python - ModelInput.Text)