feat: Standalone encoder in dynamo trtllm #4668
base: main
Conversation
Signed-off-by: Indrajit Bhosale <[email protected]>
```python
                draft_tokens=disaggregated_params.draft_tokens,
                # E-P Disaggregated Params (for full EPD flow)
                # Use getattr with None default for backward compatibility with text-only requests
                multimodal_embedding_handles=getattr(
```
Maybe stupid question: these already default to None in the DisaggregatedParams definition from TRTLLM: https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/disaggregated_params.py#L37
Why not use `dataclasses.replace(disaggregated_params, opaque_state=opaque_state)`, since `opaque_state` seems to be the only actual difference between the input `disaggregated_params` and what we're returning here?
Similar question for the encode method below.
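The suggestion above can be sketched like this (a minimal, hypothetical stand-in for the TRT-LLM `DisaggregatedParams` dataclass is used here; the real one lives in `tensorrt_llm.disaggregated_params`):

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional

@dataclass
class DisaggregatedParams:
    # Hypothetical stand-in for tensorrt_llm.disaggregated_params.DisaggregatedParams
    request_type: Optional[str] = None
    opaque_state: Optional[bytes] = None
    multimodal_embedding_handles: Optional[list] = None

params = DisaggregatedParams(request_type="context_only")

# Copy every field, overriding only the one that actually changed.
updated = dataclasses.replace(params, opaque_state=b"new-state")

print(updated.opaque_state)   # b'new-state'
print(updated.request_type)   # context_only (carried over unchanged)
```

This avoids re-listing every field by hand and stays correct if TRT-LLM adds fields later.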
```python
            yield {"error": "Dictionary embeddings missing 'mm_embeddings' key"}
        # Handle both tensor and dictionary formats
        if isinstance(loaded_data, dict):
            # Dictionary format (e.g., maverick_mm_embed_seashore_v3.pt)
```
Nit: was this in reference to some local debugging? Should it be removed?
```python
            yield {"ep_disaggregated_params": None}
            return
        if (
            hasattr(ep_disaggregated_params, "multimodal_embedding_handles")
```
Maybe stupid question: aren't we guaranteed that this attribute exists? Couldn't we just check `if ep_disaggregated_params.multimodal_embedding_handles is not None` directly?
Or is the idea that we want to support multiple TRTLLM versions somehow?
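The trade-off being asked about can be shown with toy classes standing in for two hypothetical TRT-LLM versions (not the real library): `getattr` with a default tolerates the field being absent, while direct access assumes it always exists.

```python
class ParamsOld:
    """Hypothetical older TRT-LLM version: field does not exist."""
    pass

class ParamsNew:
    """Hypothetical newer TRT-LLM version: field exists."""
    multimodal_embedding_handles = ["handle-0"]

def handles_compat(p):
    # Version-tolerant: a missing attribute degrades to None instead of raising.
    return getattr(p, "multimodal_embedding_handles", None)

def handles_strict(p):
    # Assumes the attribute always exists; raises AttributeError on old versions.
    return p.multimodal_embedding_handles

assert handles_compat(ParamsOld()) is None        # old version: no crash
assert handles_compat(ParamsNew()) == ["handle-0"]
```

If only one TRT-LLM version is supported, the strict form plus an `is not None` check is the clearer choice.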
```python
        # Tokenize the processed prompt for prefill worker
        if processed_prompt and tokenizer is not None:
            prompt_token_ids = tokenizer.encode(
                processed_prompt, add_special_tokens=False
```
Could you leave a comment explaining why `add_special_tokens` is set to `False`?
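One plausible reason (an assumption on my part, not confirmed by the PR): the chat template has already inserted the special tokens, so re-tokenizing with `add_special_tokens=True` would duplicate the BOS. A toy tokenizer illustrates the duplication:

```python
class ToyTokenizer:
    """Hypothetical stand-in for an HF-style tokenizer with a BOS token."""
    BOS = 1

    def encode(self, text, add_special_tokens=True):
        ids = [ord(c) for c in text]  # fake per-character "tokenization"
        return ([self.BOS] + ids) if add_special_tokens else ids

tok = ToyTokenizer()
already_templated = [ToyTokenizer.BOS]  # the chat template already emitted BOS

with_special = already_templated + tok.encode("hi", add_special_tokens=True)
without_special = already_templated + tok.encode("hi", add_special_tokens=False)

assert with_special[:2] == [1, 1]      # BOS appears twice: wrong
assert without_special.count(1) == 1   # BOS appears exactly once
```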
```python
    async def initialize(self):
        if not self._llm:
            self._llm = self._llm_cls(**self.engine_args)
            if self.disaggregation_mode == DisaggregationMode.ENCODE:
```
Out of curiosity, how is the engine initialized for a prefill worker? Is that encapsulated via engine_args itself somehow? (It might be worth leaving a comment whatever the case is 🙏 )
```python
            self._llm = self._llm_cls(**self.engine_args)
            if self.disaggregation_mode == DisaggregationMode.ENCODE:
                # Initialize the multimodal encoder for full EPD
                max_batch_size = self.engine_args.pop("max_batch_size", 1)
```
Out of curiosity, why was this necessary? Maybe leave a comment? 🙏
```python
                )
                self._llm = MultimodalEncoder(
                    model=model,
                    max_batch_size=max_batch_size,
```
Out of curiosity, why not forward the rest of `self.engine_args`? `MultimodalEncoder` is also an LLM class: https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/llmapi/mm_encoder.py#L16
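The forwarding the comment asks about would look roughly like this. This is a sketch only: `FakeEncoder` is a hypothetical stand-in for `MultimodalEncoder`, whose real constructor may reject some LLM-only kwargs, so the real change would need checking against its signature.

```python
class FakeEncoder:
    """Hypothetical stand-in for tensorrt_llm.llmapi.MultimodalEncoder."""
    def __init__(self, model, max_batch_size=1, **kwargs):
        self.model = model
        self.max_batch_size = max_batch_size
        self.extra = kwargs  # everything forwarded from engine_args

# Illustrative engine_args; keys are assumptions, not from the PR.
engine_args = {"model": "llava-v1.6", "max_batch_size": 8, "tensor_parallel_size": 2}

model = engine_args.pop("model")
max_batch_size = engine_args.pop("max_batch_size", 1)

# Forward the remaining engine_args instead of silently dropping them.
enc = FakeEncoder(model=model, max_batch_size=max_batch_size, **engine_args)

assert enc.extra == {"tensor_parallel_size": 2}  # nothing was lost
```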
```diff
     @property
-    def llm(self):
+    def llm(self) -> Union[LLM, MultimodalEncoder]:
```
Nit: could just be the BaseLLM class https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/llmapi/llm.py#L112
```python
        # Setup disaggregated_params for PREFILL mode
        if self.disaggregation_mode == DisaggregationMode.PREFILL:
            if ep_disaggregated_params:
```
Nit: could these if / elif clauses' contents be moved to helper functions for readability? 🙏
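The refactor being suggested is the usual "one helper per branch" pattern; a minimal sketch under assumed names (none of these functions are from the PR):

```python
def _setup_prefill_params(request: dict) -> dict:
    # Body of the former `if` branch, extracted for readability.
    request["request_type"] = "context_only"
    return request

def _setup_decode_params(request: dict) -> dict:
    # Body of the former `elif` branch.
    request["request_type"] = "generation_only"
    return request

def setup_disaggregated_params(mode: str, request: dict) -> dict:
    # The dispatch itself stays short; each branch is one named call.
    if mode == "prefill":
        return _setup_prefill_params(request)
    elif mode == "decode":
        return _setup_decode_params(request)
    return request

assert setup_disaggregated_params("prefill", {})["request_type"] == "context_only"
```

Each helper then gets its own docstring and can be unit-tested in isolation.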
```python
                out["disaggregated_params"] = asdict(
                    DisaggregatedParamsCodec.encode(output.disaggregated_params)
                # In EPD flow, output.disaggregated_params might be None, use the input params
                logging.info(
```
Similar readability comment here 🙏
Walkthrough
Introduces a disaggregation mode framework for TensorRT-LLM enabling encoder/prefill/decode pipeline separation. Adds mode-aware engine initialization, multimodal encoding support, request routing logic, configuration files, and launch infrastructure for multimodal model deployment.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ 1 failed (warning), ✅ 2 passed
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/trtllm/request_handlers/handlers.py (1)
90-94: Unreachable code after return statement. Lines 90-94 are unreachable because line 85 returns unconditionally after the connector path. This dead code should be removed.
```diff
         yield response
         return
-    else:
-        logging.error("encode handler: no Dynamo NIXL connector found")
-        raise RuntimeError("encode handler: no Dynamo NIXL connector found")
-
-    if not request.get("streaming", False):
-        yield request
-        return
-
-    yield request
+    # No connector available
+    logging.error("encode handler: no Dynamo NIXL connector found")
+    raise RuntimeError("encode handler: no Dynamo NIXL connector found")
```
♻️ Duplicate comments (2)
components/src/dynamo/trtllm/engine.py (1)
44-56: Conditional initialization for ENCODE mode is correct. The MultimodalEncoder initialization properly handles the ENCODE disaggregation mode. Note that `engine_args.pop()` mutates the dictionary, which could cause issues if `engine_args` is used elsewhere after initialization. Since past reviews have already discussed parameter forwarding to MultimodalEncoder, I'll defer to those discussions.
components/src/dynamo/trtllm/encode_helper.py (1)
316-318: Add comment explaining `add_special_tokens=False`. Per previous review feedback, please add a comment explaining why special tokens are disabled during tokenization.
🧹 Nitpick comments (7)
examples/backends/trtllm/launch/epd_multimodal.sh (2)
21-28: Consider debugging-friendly cleanup for troubleshooting. The cleanup trap unconditionally terminates all background processes on EXIT/INT/TERM. Based on learnings from similar launch infrastructure (PR 1730), keeping background processes alive after script exit can enable users to manually connect and debug issues without restarting everything. However, since this script launches ephemeral worker processes (not persistent infrastructure services), the current unconditional cleanup is simpler and may be appropriate for this use case.
Evaluate whether your operational workflows would benefit from selective cleanup or keeping workers alive for debugging attached nodes.
9-11: Consider validating engine configuration paths before launching workers. The script references engine config files that may not exist (e.g., `llava-v1.6-mistral-7b-hf/encode.yaml`). If files are missing, workers will fail with cryptic errors. A simple validation check early in the script could surface issues earlier.
Example validation (optional):

```bash
for config_var in ENCODE_ENGINE_ARGS PREFILL_ENGINE_ARGS DECODE_ENGINE_ARGS; do
  config_path="${!config_var}"
  if [[ ! -f "$config_path" ]]; then
    echo "Error: Config file not found: $config_path" >&2
    exit 1
  fi
done
```

components/src/dynamo/trtllm/multimodal_processor.py (2)
162-162: Instance attribute should be initialized in `__init__`. `previous_decoded_text` is set as an instance attribute here but isn't declared in `__init__`. This works but is unconventional and could cause issues if `create_response_chunk` is called before `process_openai_request`.
Consider initializing in `__init__`:

```diff
 def __init__(
     self,
     model_type: str,
     model_dir: str,
     max_file_size_mb: int,
     tokenizer: Optional[TokenizerProtocol] = None,
     allowed_local_media_path: str = "",
 ):
     self.model_type = model_type
     self.model_dir = model_dir
     self.tokenizer = tokenizer
     self.modality = ""
     self.allowed_local_media_path = allowed_local_media_path
     self.max_file_size_mb = max_file_size_mb
     self.max_file_size_bytes = max_file_size_mb * 1024 * 1024
+    self.previous_decoded_text = ""
```
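For context, the caching pattern that `previous_decoded_text` supports can be sketched independently of the PR (a toy `decode` callable stands in for the real tokenizer; class and method names here are illustrative, not from the codebase):

```python
class DeltaStreamer:
    """Yields only the newly decoded suffix on each call (sketch)."""

    def __init__(self, decode):
        self.decode = decode
        self.previous_decoded_text = ""  # declared up front, never missing

    def next_chunk(self, all_token_ids):
        # Decode the full sequence, then emit only what is new since last call.
        full_text = self.decode(all_token_ids)
        delta = full_text[len(self.previous_decoded_text):]
        self.previous_decoded_text = full_text
        return delta

streamer = DeltaStreamer(decode=lambda ids: "".join(chr(i) for i in ids))
assert streamer.next_chunk([72, 105]) == "Hi"     # first chunk: everything so far
assert streamer.next_chunk([72, 105, 33]) == "!"  # later chunks: only the delta
```

Initializing the cache in `__init__` means the first `next_chunk` call works no matter which method runs first.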
180-184: Simplify key check with `dict.get()`. Per static analysis hint (RUF019), the key check before dictionary access can be simplified.

```diff
-        if "_epd_prompt_token_ids" in request and request["_epd_prompt_token_ids"]:
-            result["prompt_token_ids"] = request["_epd_prompt_token_ids"]
+        prompt_token_ids = request.get("_epd_prompt_token_ids")
+        if prompt_token_ids:
+            result["prompt_token_ids"] = prompt_token_ids
```

components/src/dynamo/trtllm/encode_helper.py (1)
278-278: Potential blocking call in async context. `engine.llm.generate(inputs)` appears to be a synchronous call wrapped in `list()`. In an async handler, this could block the event loop. Consider whether this should be run in an executor or if TRTLLM provides an async variant.
If blocking is confirmed, consider:

```python
import asyncio

encoder_outputs = await asyncio.get_event_loop().run_in_executor(
    None, lambda: list(engine.llm.generate(inputs))
)
```

components/src/dynamo/trtllm/request_handlers/handlers.py (2)
105-113: Consider adding error handling for generator exhaustion. The method breaks after the first response, but if the generator yields no items, `encode_response` remains `None`. The subsequent check handles this, but consider using `async for ... else` or `anext()` for clarity.

```diff
 async def remote_encode_full_epd(self, request: dict):
-    async for res in await self.encode_client.round_robin(request):
-        encode_response = res.data()
-        break
-
-    if not encode_response:
-        raise RuntimeError("Did not receive a response from the encode worker.")
-
-    return encode_response
+    encode_response = None
+    async for res in await self.encode_client.round_robin(request):
+        encode_response = res.data()
+        break
+
+    if not encode_response:
+        raise RuntimeError("Did not receive a response from the encode worker.")
+
+    return encode_response
```

Alternatively, consider extracting the round-robin single-response pattern to a shared helper since `remote_encode_with_nixl` uses the same pattern.
158-185: EPD flow correctly reconstructs DisaggregatedParams and propagates metadata. The logic correctly:
- Calls `remote_encode_full_epd` for image URLs
- Reconstructs `DisaggregatedParams` from the dict response
- Sets `request_type` to `"context_only"` for the prefill phase
- Stores `_epd_processed_prompt` and `_epd_prompt_token_ids` in the request for downstream use

One minor improvement per static analysis: the nested key check on lines 179-181 can be simplified.

```diff
-            if (
-                "prompt_token_ids" in encode_response
-                and encode_response["prompt_token_ids"]
-            ):
+            prompt_token_ids = encode_response.get("prompt_token_ids")
+            if prompt_token_ids:
                 request["_epd_prompt_token_ids"] = encode_response[
                     "prompt_token_ids"
                 ]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- components/src/dynamo/trtllm/constants.py (1 hunk)
- components/src/dynamo/trtllm/encode_helper.py (3 hunks)
- components/src/dynamo/trtllm/engine.py (5 hunks)
- components/src/dynamo/trtllm/main.py (1 hunk)
- components/src/dynamo/trtllm/multimodal_processor.py (2 hunks)
- components/src/dynamo/trtllm/request_handlers/handler_base.py (6 hunks)
- components/src/dynamo/trtllm/request_handlers/handlers.py (4 hunks)
- components/src/dynamo/trtllm/utils/disagg_utils.py (2 hunks)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml (1 hunk)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml (1 hunk)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml (1 hunk)
- examples/backends/trtllm/launch/epd_multimodal.sh (1 hunk)
- examples/backends/trtllm/templates/llava_multimodal.jinja (1 hunk)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Applied to files:
components/src/dynamo/trtllm/multimodal_processor.py
📚 Learning: 2025-07-03T10:14:30.570Z
Learnt from: fsaady
Repo: ai-dynamo/dynamo PR: 1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Applied to files:
examples/backends/trtllm/launch/epd_multimodal.sh
📚 Learning: 2025-06-05T01:10:51.865Z
Learnt from: tanmayv25
Repo: ai-dynamo/dynamo PR: 1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
Applied to files:
components/src/dynamo/trtllm/engine.py
🧬 Code graph analysis (5)
components/src/dynamo/trtllm/main.py (1)
components/src/dynamo/trtllm/engine.py (1): `get_llm_engine` (109-120)
components/src/dynamo/trtllm/multimodal_processor.py (1)
components/src/dynamo/trtllm/utils/disagg_utils.py (1): `decode` (27-50)
examples/backends/trtllm/launch/epd_multimodal.sh (3)
components/src/dynamo/trtllm/engine.py (1): `cleanup` (58-65)
components/src/dynamo/trtllm/utils/disagg_utils.py (2): `encode` (53-77), `decode` (27-50)
components/src/dynamo/trtllm/multimodal_processor.py (1): `decode` (40-41)
components/src/dynamo/trtllm/engine.py (2)
components/src/dynamo/trtllm/constants.py (1): `DisaggregationMode` (7-11)
components/src/dynamo/sglang/args.py (1): `DisaggregationMode` (144-147)
components/src/dynamo/trtllm/request_handlers/handlers.py (2)
components/src/dynamo/trtllm/encode_helper.py (1): `process_encode_request` (191-327)
components/src/dynamo/trtllm/multimodal_processor.py (1): `extract_prompt_and_media` (132-154)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4668/merge) by indrajit96.
examples/backends/trtllm/launch/epd_multimodal.sh
[error] 1-1: Check for executable shebang scripts: the file has a shebang but is not marked executable. Run 'chmod +x examples/backends/trtllm/launch/epd_multimodal.sh' (or 'git add --chmod=+x ...' on Windows if needed).
🪛 Ruff (0.14.7)
components/src/dynamo/trtllm/multimodal_processor.py
180-180: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
components/src/dynamo/trtllm/request_handlers/handlers.py
111-111: Avoid specifying long messages outside the exception class
(TRY003)
180-181: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
components/src/dynamo/trtllm/request_handlers/handler_base.py
191-191: Unused method argument: prefill_result
(ARG002)
457-457: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (30)
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml (2)
22-22: Verify `enable_chunked_prefill` is appropriate for decode-only configuration. In the EPD disaggregated architecture, the decode phase handles tokens post-prefill. The `enable_chunked_prefill: true` setting on line 22 is typically a prefill-phase optimization and may be unrelated or problematic in a decode-only context.
Please confirm:
- Whether this setting is actively used by the TensorRT-LLM decode engine or safely ignored.
- Whether this was intentionally copied from a full-model config and should be removed for decode-only mode.

15-29: Configuration looks reasonable for decode-only workload; verify EPD completeness. The conservative memory allocation (`free_gpu_memory_fraction: 0.30`) and batch size (`max_batch_size: 16`) are appropriate for the decode phase of an EPD pipeline. However, verify that companion `encode.yaml` and `prefill.yaml` configurations exist elsewhere in the PR and are properly integrated into the disaggregated setup.
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml (3)
1-14: License header is present and correct. The Apache 2.0 license header is properly formatted and applied.

26-31: KV cache and cache transceiver configuration appears appropriate for disaggregated prefill. The `free_gpu_memory_fraction: 0.30` is conservative, leaving substantial headroom for input embeddings and compute. The `enable_block_reuse: false` is reasonable for a disaggregated setup where workers are specialized and may not benefit from block reuse across different request flows.
Confirm that the 30% KV cache allocation (line 27) leaves sufficient GPU memory for the prefill worker's input embeddings, attention computation, and output buffering. If memory utilization runs high, consider adjusting the `free_gpu_memory_fraction` upward.

15-24: Configuration parameters and prefill-only constraints are appropriate. The settings are conservatively tuned for a prefill-only disaggregated worker. The comment on line 23 correctly documents that the overlap scheduler is unsupported for prefill-only workers, aligning with the disaggregation mode framework.
All related configuration files (`encode.yaml` and `decode.yaml`) exist in the same directory with consistent memory and parallelism settings across phases. YAML syntax validation confirms all three files are well-formed. The `prefill.yaml` configuration is correctly referenced in launch scripts (e.g., `examples/backends/trtllm/launch/epd_multimodal.sh`). Parameter choices like `max_num_tokens: 8192`, `max_batch_size: 16`, and `enable_chunked_prefill: true` are sensible for prefill-only workers. Note that `decode.yaml` appropriately sets `disable_overlap_scheduler: false` for the decode phase, distinguishing it from the prefill phase.
examples/backends/trtllm/templates/llava_multimodal.jinja (1)
1-13: Template structure looks correct for LLaVA-Mistral format.The template correctly handles the multimodal content structure with
<image>placeholders for image_url types. A few observations:
- Unknown role types are silently ignored - consider adding an else clause with a warning or error for debugging.
- Verify that the BOS token (
<s>) is added elsewhere in the pipeline, as Mistral-based models typically require it at the start.components/src/dynamo/trtllm/utils/disagg_utils.py (2)
44-50: Backward-compatible multimodal parameter handling looks good. Using `getattr` with a `None` default provides backward compatibility with older TensorRT-LLM versions that may not have these fields.

64-77: Consistent encoding of multimodal parameters. The encode path correctly mirrors the decode path for multimodal fields.
components/src/dynamo/trtllm/constants.py (1)
7-11: Clean enum definition for disaggregation modes. The enum clearly defines the four supported modes. Note that the `AGGREGATED` value differs from sglang's equivalent (`"prefill_and_decode"` vs `"agg"` in `components/src/dynamo/sglang/args.py`). If cross-backend consistency is desired, consider aligning the string values.
components/src/dynamo/trtllm/main.py (1)
290-290: Correctly passes disaggregation mode to engine initialization. The change properly forwards `config.disaggregation_mode` to `get_llm_engine`, enabling mode-aware engine initialization.
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml (1)
15-31: Reasonable encode worker configuration. The configuration appropriately sets a lower `free_gpu_memory_fraction` (0.30) for the encode-only worker, which doesn't need as much KV cache memory. The comment explaining why `disable_overlap_scheduler` is required is helpful.
components/src/dynamo/trtllm/multimodal_processor.py (2)
173-184: EPD flow handling looks correct. The logic properly handles the EPD case by using the encoder-provided processed prompt and token IDs.

229-242: Token streaming refactor is correct. The incremental delta calculation using cached `previous_decoded_text` properly handles both first and subsequent chunks.
components/src/dynamo/trtllm/engine.py (2)
58-65: Improved cleanup with try/finally. The cleanup method now properly guards against exceptions during shutdown and ensures `_llm` is set to `None` in all cases.

109-112: Updated context manager signature correctly passes disaggregation mode. The async context manager now properly accepts and forwards the disaggregation mode to `TensorRTLLMEngine`.
components/src/dynamo/trtllm/encode_helper.py (3)
279-300: Good defensive error handling for encoder outputs. The code properly handles cases where encoder outputs are empty or missing disaggregated params, with appropriate logging at different levels (error vs warning).

222-265: Embedding path flow with NIXL looks correct. The logic properly handles both dict and tensor formats, creates NIXL readable operations, and waits for completion before returning.

274-275: Not an issue: Intentional single-image processing design. The code uses `image_urls[0]` as intended. The `default_multimodal_input_loader` function expects a single media item (not a list), as documented on line 305 with the comment "default_multimodal_input_loader returns a list, get the first element." Processing one image at a time through this flow is the designed behavior.
components/src/dynamo/trtllm/request_handlers/handlers.py (5)
6-6: LGTM! The import of `DisaggregatedParams` from `tensorrt_llm.llmapi` is correctly added to support the EPD flow parameter handling.

60-69: LGTM! Good defensive initialization pattern - setting attributes to `None` first, then conditionally populating them from `multimodal_processor`. This prevents `AttributeError` if `multimodal_processor` is not available.

75-85: LGTM! The updated call to `process_encode_request` now passes all required parameters for both NIXL embedding transfer and full EPD encoding paths. The early return on line 85 ensures no further processing after the async generator completes.

145-149: Unused variable `_` from `extract_prompt_and_media`. The text_prompt (assigned to `_`) is extracted but not used in this method. If it's intentionally unused, that's fine, but verify this matches the expected EPD flow where text_prompt handling happens elsewhere.

186-199: LGTM! The `generate_locally` call correctly passes both `embeddings_tensor` and `ep_disaggregated_params`, enabling the base handler to use either NIXL-transferred embeddings or EPD disaggregated params. The single-response assertion is appropriate for prefill mode.
components/src/dynamo/trtllm/request_handlers/handler_base.py (7)
35-35: LGTM! Import of `DisaggregationMode` from constants centralizes the enum definition, improving maintainability.

144-185: Well-structured helper for decoding disaggregated params. The method correctly:
- Extracts EPD metadata from the packed `_epd_metadata` field
- Uses `pop()` to remove metadata from params_dict before decoding
- Sets `request_type` to `"generation_only"` for the decode phase
- Clears `multimodal_embedding_handles` to avoid TRT-LLM validation errors

The docstring clearly explains the return tuple.

253-328: Well-designed helper for encoding and packing disaggregated params. This method addresses the previous review comment about extracting helper functions for readability. It properly:
- Chooses between output and input disaggregated params
- Preserves `multimodal_embedding_handles` and `multimodal_hashes` for EPD flow
- Packs EPD metadata into the params dict for transmission

The logic for preserving handles from input when output doesn't have them (lines 287-302) correctly handles the case where TRT-LLM doesn't propagate these fields through prefill.

365-410: Complex flow but well-structured with clear comments. The initialization flow correctly:
- Normalizes request format (stop_conditions, sampling_options)
- Sets up `disaggregated_params` for PREFILL mode with proper `request_type`
- Decodes params from prefill_result for DECODE mode
- Makes `ep_disaggregated_params` available to the multimodal processor

The comments explaining the Rust frontend's max_tokens handling are helpful for maintainability.

416-442: Mode-based processing correctly routes to appropriate handlers. The branching logic properly separates DECODE mode (using `_prepare_decode_input`) from PREFILL/ENCODE modes (using `multimodal_processor.process_openai_request`). The fallback to `request.get("token_ids")` for text-only flows is appropriate.

463-482: Defensive handling of optional request fields. Good defensive pattern - checking for presence of `sampling_options` and `stop_conditions` before iterating, and checking each value before setting. This handles cases where decode workers may receive minimal requests.

537-543: LGTM - Cleaner refactor using helper method. The PREFILL mode handling is now cleaner with the extracted `_encode_and_pack_disaggregated_params` helper. The `None` check before assignment is correct.
```python
    async def _prepare_decode_input(
        self,
        request: dict,
        epd_metadata: dict,
        prefill_result: Optional[dict],
        embeddings: Any,
        ep_disaggregated_params: Any,
    ) -> Optional[Any]:
        """
        Prepare input for DECODE mode processing.
        Handles EPD flow (with encoder) by extracting prompt and token IDs,
        or falls back to multimodal processor for other flows.
        Args:
            request: The request dictionary
            epd_metadata: EPD metadata extracted from prefill result
            prefill_result: Result from prefill worker
            embeddings: Multimodal embeddings (if any)
            ep_disaggregated_params: Disaggregated params from encoder/prefill
        Returns:
            Processed input ready for the engine, or None if not available
        """
        # Decode worker with generation_only mode
        # Pass the same inputs format as prefill
        # Check epd_metadata (packed by prefill), then prefill_result, then direct request
        epd_prompt = epd_metadata.get("_epd_processed_prompt")
        epd_token_ids = epd_metadata.get("_epd_prompt_token_ids")

        if epd_prompt:
            # In EPD generation-only mode (decode), pass the SAME input format as prefill
            # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
            # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
            if epd_token_ids:
                prompt_token_ids = epd_token_ids

            processed_input = {
                "prompt": epd_prompt,
                "prompt_token_ids": prompt_token_ids,
            }

            # Remove ALL multimodal data from request to avoid TRT-LLM validation error
            # In generation-only mode, ALL multimodal data must be in disaggregated_params only
            mm_keys_to_remove = ["multi_modal_data", "image_data", "mm_data"]
            for key in mm_keys_to_remove:
                if key in request:
                    request.pop(key)
                    logging.debug(
                        f"DECODE: Removed {key} from request (already in disaggregated_params)"
                    )
            return processed_input
        elif self.multimodal_processor:
            # Encode/Prefill worker: Process multimodal content normally
            # In EPD flow, multimodal_processor should be called in PREFILL/ENCODE modes only
            # DECODE mode should skip this and use EPD metadata from prefill
            processed_input = await self.multimodal_processor.process_openai_request(
                request, embeddings, ep_disaggregated_params
            )
            return processed_input
        else:
            logging.debug(
                "DECODE: No multimodal_processor found, using request token_ids"
            )
            return None
```
Unused parameter `prefill_result`.
The static analysis correctly identifies that `prefill_result` is unused within this method. Since EPD metadata is already extracted and passed via `epd_metadata`, consider removing this parameter unless it's intended for future use.
```diff
 async def _prepare_decode_input(
     self,
     request: dict,
     epd_metadata: dict,
-    prefill_result: Optional[dict],
     embeddings: Any,
     ep_disaggregated_params: Any,
 ) -> Optional[Any]:
```

And update the call site at lines 417-423.
| """ | |
| Prepare input for DECODE mode processing. | |
| Handles EPD flow (with encoder) by extracting prompt and token IDs, | |
| or falls back to multimodal processor for other flows. | |
| Args: | |
| request: The request dictionary | |
| epd_metadata: EPD metadata extracted from prefill result | |
| embeddings: Multimodal embeddings (if any) | |
| ep_disaggregated_params: Disaggregated params from encoder/prefill | |
| Returns: | |
| Processed input ready for the engine, or None if not available | |
| """ | |
| # Decode worker with generation_only mode | |
| # Pass the same inputs format as prefill | |
| # Check epd_metadata (packed by prefill), then prefill_result, then direct request | |
| epd_prompt = epd_metadata.get("_epd_processed_prompt") | |
| epd_token_ids = epd_metadata.get("_epd_prompt_token_ids") | |
| if epd_prompt: | |
| # In EPD generation-only mode (decode), pass the SAME input format as prefill | |
| # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...) | |
| # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings | |
| if epd_token_ids: | |
| prompt_token_ids = epd_token_ids | |
| processed_input = { | |
| "prompt": epd_prompt, | |
| "prompt_token_ids": prompt_token_ids, | |
| } | |
| # Remove ALL multimodal data from request to avoid TRT-LLM validation error | |
| # In generation-only mode, ALL multimodal data must be in disaggregated_params only | |
| mm_keys_to_remove = ["multi_modal_data", "image_data", "mm_data"] | |
| for key in mm_keys_to_remove: | |
| if key in request: | |
| request.pop(key) | |
| logging.debug( | |
| f"DECODE: Removed {key} from request (already in disaggregated_params)" | |
| ) | |
| return processed_input | |
| elif self.multimodal_processor: | |
| # Encode/Prefill worker: Process multimodal content normally | |
| # In EPD flow, multimodal_processor should be called in PREFILL/ENCODE modes only | |
| # DECODE mode should skip this and use EPD metadata from prefill | |
| processed_input = await self.multimodal_processor.process_openai_request( | |
| request, embeddings, ep_disaggregated_params | |
| ) | |
| return processed_input | |
| else: | |
| logging.debug( | |
| "DECODE: No multimodal_processor found, using request token_ids" | |
| ) | |
| return None |
🧰 Tools
🪛 Ruff (0.14.7)
191-191: Unused method argument: prefill_result
(ARG002)
🤖 Prompt for AI Agents
components/src/dynamo/trtllm/request_handlers/handler_base.py around lines 187
to 251: the parameter prefill_result is declared but never used in
_prepare_decode_input; remove prefill_result from the method signature and any
type hints, update the docstring args list to drop it, and then update the call
site mentioned (lines ~417-423) to stop passing prefill_result (pass the
remaining args in their same order); ensure imports/types still match and run
tests/type checks to confirm no other callers remain.
```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```
**Potential `UnboundLocalError` if `epd_token_ids` is falsy.**

If `epd_prompt` is truthy but `epd_token_ids` is falsy, the code at line 226 references `prompt_token_ids`, which may not be defined in this scope.
```diff
 if epd_prompt:
     # In EPD generation-only mode (decode), pass the SAME input format as prefill
     # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
     # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
+    prompt_token_ids = None
     if epd_token_ids:
         prompt_token_ids = epd_token_ids
     processed_input = {
         "prompt": epd_prompt,
         "prompt_token_ids": prompt_token_ids,
     }
```

📝 Committable suggestion
Before (current):

```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```

After (suggested):

```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    prompt_token_ids = None
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```
🤖 Prompt for AI Agents
In components/src/dynamo/trtllm/request_handlers/handler_base.py around lines
217-227, the block that builds processed_input can reference prompt_token_ids
when epd_token_ids is falsy, causing an UnboundLocalError; ensure
prompt_token_ids is always defined before use by initializing it (e.g., set
prompt_token_ids = None or to the appropriate existing token variable) before
the if epd_token_ids check or by using a conditional expression when
constructing processed_input so prompt_token_ids is assigned a defined value
even if epd_token_ids is falsy.
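To make the failure mode concrete, here is a minimal standalone repro of the hazard (the helper name and arguments are hypothetical stand-ins, not the actual handler):

```python
def build_processed_input(epd_prompt, epd_token_ids):
    # Mirrors the flagged control flow: prompt_token_ids is only bound
    # inside the inner `if`, so a falsy epd_token_ids leaves it undefined.
    if epd_prompt:
        if epd_token_ids:
            prompt_token_ids = epd_token_ids
        return {
            "prompt": epd_prompt,
            "prompt_token_ids": prompt_token_ids,  # unbound when epd_token_ids is falsy
        }
    return None

try:
    build_processed_input("USER: describe the image", None)
except UnboundLocalError as exc:
    print(f"raised: {type(exc).__name__}")  # -> raised: UnboundLocalError
```

Initializing `prompt_token_ids = None` before the inner `if`, as the suggestion does, removes the hazard.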
```diff
@@ -0,0 +1,67 @@
+#!/bin/bash
```
Make the script executable to pass pre-merge validation.
The pipeline check flags that the file has a shebang but is not marked executable.
Apply the following to fix:
```shell
git add --chmod=+x examples/backends/trtllm/launch/epd_multimodal.sh
```

Alternatively, if using a standard Unix environment:

```shell
chmod +x examples/backends/trtllm/launch/epd_multimodal.sh
```

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4668/merge) by indrajit96.
[error] 1-1: Check for executable shebang scripts: the file has a shebang but is not marked executable. Run 'chmod +x examples/backends/trtllm/launch/epd_multimodal.sh' (or 'git add --chmod=+x ...' on Windows if needed).
🤖 Prompt for AI Agents
In examples/backends/trtllm/launch/epd_multimodal.sh around line 1, the file has
a shebang but is not marked executable which fails pre-merge validation; fix it
by making the file executable before committing (for example, set the executable
bit on the file locally or stage it with the executable bit set in git) and
re-run the pipeline to ensure the check passes.
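For a quick local sanity check of the fix above, the executable bit and shebang behavior can be verified with a throwaway script (a generic sketch using a temp file, not the repo script itself):

```shell
tmp="$(mktemp)"
printf '#!/bin/bash\necho ok\n' > "$tmp"

chmod +x "$tmp"                     # set the executable bit, as the CI check requires
[ -x "$tmp" ] && echo "executable"  # -> executable

"$tmp"                              # the shebang now takes effect; prints "ok"
rm -f "$tmp"
```

Note that `chmod` alone only fixes the working tree; the `git add --chmod=+x` form is what records the mode bit in the index so the pipeline sees it.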
```shell
export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml"}
export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml"}
export ENCODE_ENGINE_ARGS=${ENCODE_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml"}
export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
```
Does the encode worker need a whole dedicated GPU, or can it fit on the same GPU as prefill? Reason I ask is that our CI (at least on the GitLab side + pytest markers) has 1- and 2-GPU tests set up, but I'm not sure about 3-GPU tests (we could use a 4-GPU runner if needed).
Similar to https://github.com/ai-dynamo/dynamo/pull/4668/files#r2582944082, can we enable this script as a CI test like we do for the other scripts?
ex:
dynamo/tests/serve/test_trtllm.py
Lines 109 to 119 in af9ae79
```python
"disaggregated_multimodal": TRTLLMConfig(
    name="disaggregated_multimodal",
    directory=trtllm_dir,
    script_name="disagg_multimodal.sh",
    marks=[pytest.mark.gpu_2, pytest.mark.trtllm, pytest.mark.multimodal],
    model="Qwen/Qwen2-VL-7B-Instruct",
    models_port=8000,
    timeout=900,
    delayed_start=60,
    request_payloads=[multimodal_payload_default()],
),
```
Are we using a custom template because the default llava template is incompatible with dynamo's pre-processing? From my understanding, llava models will expect `image` instead of `image_url`, which is passed to the jinja template (more details: #4501).
```python
# Handle image URLs (full E-PD flow with MultimodalEncoder)
elif image_urls:
    if self.encode_client and self.connector:
        encode_response = await self.remote_encode_full_epd(request)
```
Relatively new to the EPD flow, so please feel free to correct me, but based on this call, does the execution path look like this: Frontend -> Prefill -> Encode -> Prefill -> Decode?

If so, I was wondering how feasible a flow like Frontend -> Encode -> Prefill -> Decode would be. Is this something we would want to do to save an extra network hop per request?
@krishung5 any thoughts on this one?
@KrishnanPrash yes, this makes sense from a performance POV.
The current design is this way due to an initial prefill/decode-first routing limitation (now gone).
I feel we can take this up in a new PR, as it would need some rework in the prefill worker.
We can use this PR for functionality and a new one for optimization.
Also, just wanted to get your thoughts on this:

From my understanding, the current EPD flow looks like this:

1. Frontend -> Prefill Worker
2. Within the prefill worker we make a blocking call to the encode worker (`await self.remote_encode_full_epd(request)`)
3. Encode worker finishes work and yields to the prefill worker

At a request level, we are currently blocking at the encode worker before calling the prefill worker, which makes sense. But if we need to implement a feature like batching in the future for this flow, would the prefill worker be blocked until all the requests are done on the encoder side?
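One relevant detail for the batching question: an `await` inside one request's handler suspends only that request's coroutine, not the event loop, so other in-flight requests keep progressing while one waits on the encoder. A minimal asyncio sketch (all names are hypothetical stand-ins, not the real workers):

```python
import asyncio

async def encode(req):
    # stand-in for the remote encode call the prefill worker awaits
    await asyncio.sleep(0.01)
    return f"emb({req})"

async def prefill(req):
    # awaiting here blocks only this request's coroutine
    emb = await encode(req)
    return f"prefill({emb})"

async def main():
    # three requests in flight concurrently: total wall time is ~0.01s, not ~3 * 0.01s
    return await asyncio.gather(*(prefill(r) for r in ("a", "b", "c")))

print(asyncio.run(main()))  # -> ['prefill(emb(a))', 'prefill(emb(b))', 'prefill(emb(c))']
```

Whether encoder-side batching would stall prefill then depends on how the encoder groups requests, not on the per-request `await` itself.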
```python
    engine=None,
):
    """
    Process embedding request by loading embeddings and creating NIXL readable operation.
```
nit: comments for this function need to be updated.
```python
        )
        await readable_op.wait_for_completion()
        logging.debug("EncodeHelper completed readable operation.")
    elif image_urls and text_prompt:
```
Do we need to add an else case to handle the situation where either `image_urls` or `text_prompt` (or both) is empty?
Overview:
Adds standalone encoder support to the dynamo TRT-LLM backend, enabling the full E-P-D (Encode-Prefill-Decode) disaggregated multimodal serving architecture.
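As a rough mental model for the metadata hand-off in this architecture — helper names below are hypothetical; only the `_epd_processed_prompt` / `_epd_prompt_token_ids` key names come from this PR — the pack/extract round-trip between prefill and decode looks something like:

```python
EPD_KEYS = ("_epd_processed_prompt", "_epd_prompt_token_ids")

def pack_epd_metadata(disagg_params: dict, prompt: str, token_ids: list) -> dict:
    # Prefill side: stash the processed prompt and token ids alongside the
    # disaggregated params so decode can reconstruct its input.
    disagg_params["_epd_processed_prompt"] = prompt
    disagg_params["_epd_prompt_token_ids"] = token_ids
    return disagg_params

def extract_epd_metadata(disagg_params: dict) -> dict:
    # Decode side: pull the keys back out; .get() keeps text-only requests
    # (which never packed them) working, yielding None instead of KeyError.
    return {key: disagg_params.get(key) for key in EPD_KEYS}

params = pack_epd_metadata({}, "USER: <image>\nDescribe the image.", [1, 2, 3])
meta = extract_epd_metadata(params)
print(meta["_epd_prompt_token_ids"])  # -> [1, 2, 3]
```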
Details:
- `PrefillHandler`: Calls encode worker for image URLs and passes `ep_disaggregated_params` through the pipeline
- EPD metadata (`_epd_processed_prompt`, `_epd_prompt_token_ids`) are packed into disaggregated_params for the decode worker
- `epd_multimodal.sh` for running the full E-P-D setup with LLaVA

Where should the reviewer start?
- `components/src/dynamo/trtllm/encode_helper.py` - Core encoding logic and `process_encode_request`
- `components/src/dynamo/trtllm/request_handlers/handlers.py` - `EncodeHandler` and `PrefillHandler.remote_encode_full_epd`
- `components/src/dynamo/trtllm/engine.py` - `MultimodalEncoder` initialization
- `components/src/dynamo/trtllm/request_handlers/handler_base.py` - EPD metadata handling in `_prepare_decode_input` and `_encode_and_pack_disaggregated_params`

Summary by CodeRabbit
New Features
Chores