feat: Standalone encoder in dynamo trtllm #4668
base: main
Conversation
Signed-off-by: Indrajit Bhosale <[email protected]>
```python
                draft_tokens=disaggregated_params.draft_tokens,
                # E-P Disaggregated Params (for full EPD flow)
                # Use getattr with None default for backward compatibility with text-only requests
                multimodal_embedding_handles=getattr(
```
Maybe stupid question: these already default to None in the DisaggregatedParams definition from TRTLLM: https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/disaggregated_params.py#L37
Why not use `dataclasses.replace(disaggregated_params, opaque_state=opaque_state)`, since `opaque_state` seems to be the only actual difference between the input `disaggregated_params` and what we're returning here?
Similar question for the encode method below.
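The suggestion above can be sketched like this (a minimal, hypothetical stand-in for the TRT-LLM `DisaggregatedParams` dataclass is used here; the real one lives in `tensorrt_llm.disaggregated_params`):

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional

@dataclass
class DisaggregatedParams:
    # Hypothetical stand-in for tensorrt_llm.disaggregated_params.DisaggregatedParams
    request_type: Optional[str] = None
    opaque_state: Optional[bytes] = None
    multimodal_embedding_handles: Optional[list] = None

params = DisaggregatedParams(request_type="context_only")

# Copy every field, overriding only the one that actually changed.
updated = dataclasses.replace(params, opaque_state=b"new-state")

print(updated.opaque_state)   # b'new-state'
print(updated.request_type)   # context_only (carried over unchanged)
```

This avoids re-listing every field by hand and stays correct if TRT-LLM adds fields later.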
```python
            yield {"error": "Dictionary embeddings missing 'mm_embeddings' key"}
        # Handle both tensor and dictionary formats
        if isinstance(loaded_data, dict):
            # Dictionary format (e.g., maverick_mm_embed_seashore_v3.pt)
```
Nit: was this in reference to some local debugging? Should it be removed?
```python
            yield {"ep_disaggregated_params": None}
            return
        if (
            hasattr(ep_disaggregated_params, "multimodal_embedding_handles")
```
Maybe stupid question: aren't we guaranteed that this attribute exists? Couldn't we just check `if ep_disaggregated_params.multimodal_embedding_handles is not None` directly?
Or is the idea that we want to support multiple TRTLLM versions somehow?
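The trade-off being asked about can be shown with toy classes standing in for two hypothetical TRT-LLM versions (not the real library): `getattr` with a default tolerates the field being absent, while direct access assumes it always exists.

```python
class ParamsOld:
    """Hypothetical older TRT-LLM version: field does not exist."""
    pass

class ParamsNew:
    """Hypothetical newer TRT-LLM version: field exists."""
    multimodal_embedding_handles = ["handle-0"]

def handles_compat(p):
    # Version-tolerant: a missing attribute degrades to None instead of raising.
    return getattr(p, "multimodal_embedding_handles", None)

def handles_strict(p):
    # Assumes the attribute always exists; raises AttributeError on old versions.
    return p.multimodal_embedding_handles

assert handles_compat(ParamsOld()) is None        # old version: no crash
assert handles_compat(ParamsNew()) == ["handle-0"]
```

If only one TRT-LLM version is supported, the strict form plus an `is not None` check is the clearer choice.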
```python
        # Tokenize the processed prompt for prefill worker
        if processed_prompt and tokenizer is not None:
            prompt_token_ids = tokenizer.encode(
                processed_prompt, add_special_tokens=False
```
Could you leave a comment explaining why `add_special_tokens` is set to `False`?
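One plausible reason (an assumption on my part, not confirmed by the PR): the chat template has already inserted the special tokens, so re-tokenizing with `add_special_tokens=True` would duplicate the BOS. A toy tokenizer illustrates the duplication:

```python
class ToyTokenizer:
    """Hypothetical stand-in for an HF-style tokenizer with a BOS token."""
    BOS = 1

    def encode(self, text, add_special_tokens=True):
        ids = [ord(c) for c in text]  # fake per-character "tokenization"
        return ([self.BOS] + ids) if add_special_tokens else ids

tok = ToyTokenizer()
already_templated = [ToyTokenizer.BOS]  # the chat template already emitted BOS

with_special = already_templated + tok.encode("hi", add_special_tokens=True)
without_special = already_templated + tok.encode("hi", add_special_tokens=False)

assert with_special[:2] == [1, 1]      # BOS appears twice: wrong
assert without_special.count(1) == 1   # BOS appears exactly once
```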
```python
    async def initialize(self):
        if not self._llm:
            self._llm = self._llm_cls(**self.engine_args)
            if self.disaggregation_mode == DisaggregationMode.ENCODE:
```
Out of curiosity, how is the engine initialized for a prefill worker? Is that encapsulated via engine_args itself somehow? (It might be worth leaving a comment whatever the case is 🙏 )
```python
            self._llm = self._llm_cls(**self.engine_args)
            if self.disaggregation_mode == DisaggregationMode.ENCODE:
                # Initialize the multimodal encoder for full EPD
                max_batch_size = self.engine_args.pop("max_batch_size", 1)
```
Out of curiosity, why was this necessary? Maybe leave a comment? 🙏
```python
                )
                self._llm = MultimodalEncoder(
                    model=model,
                    max_batch_size=max_batch_size,
```
Out of curiosity, why not forward the rest of `self.engine_args`? `MultimodalEncoder` is also an LLM class: https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/llmapi/mm_encoder.py#L16
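The forwarding the comment asks about would look roughly like this. This is a sketch only: `FakeEncoder` is a hypothetical stand-in for `MultimodalEncoder`, whose real constructor may reject some LLM-only kwargs, so the real change would need checking against its signature.

```python
class FakeEncoder:
    """Hypothetical stand-in for tensorrt_llm.llmapi.MultimodalEncoder."""
    def __init__(self, model, max_batch_size=1, **kwargs):
        self.model = model
        self.max_batch_size = max_batch_size
        self.extra = kwargs  # everything forwarded from engine_args

# Illustrative engine_args; keys are assumptions, not from the PR.
engine_args = {"model": "llava-v1.6", "max_batch_size": 8, "tensor_parallel_size": 2}

model = engine_args.pop("model")
max_batch_size = engine_args.pop("max_batch_size", 1)

# Forward the remaining engine_args instead of silently dropping them.
enc = FakeEncoder(model=model, max_batch_size=max_batch_size, **engine_args)

assert enc.extra == {"tensor_parallel_size": 2}  # nothing was lost
```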
```diff
     @property
-    def llm(self):
+    def llm(self) -> Union[LLM, MultimodalEncoder]:
```
Nit: could just be the BaseLLM class https://github.com/NVIDIA/TensorRT-LLM/blob/v1.2.0rc4/tensorrt_llm/llmapi/llm.py#L112
```python
        # Setup disaggregated_params for PREFILL mode
        if self.disaggregation_mode == DisaggregationMode.PREFILL:
            if ep_disaggregated_params:
```
Nit: could these if / elif clauses' contents be moved to helper functions for readability? 🙏
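The refactor being suggested is the usual "one helper per branch" pattern; a minimal sketch under assumed names (none of these functions are from the PR):

```python
def _setup_prefill_params(request: dict) -> dict:
    # Body of the former `if` branch, extracted for readability.
    request["request_type"] = "context_only"
    return request

def _setup_decode_params(request: dict) -> dict:
    # Body of the former `elif` branch.
    request["request_type"] = "generation_only"
    return request

def setup_disaggregated_params(mode: str, request: dict) -> dict:
    # The dispatch itself stays short; each branch is one named call.
    if mode == "prefill":
        return _setup_prefill_params(request)
    elif mode == "decode":
        return _setup_decode_params(request)
    return request

assert setup_disaggregated_params("prefill", {})["request_type"] == "context_only"
```

Each helper then gets its own docstring and can be unit-tested in isolation.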
```python
                out["disaggregated_params"] = asdict(
                    DisaggregatedParamsCodec.encode(output.disaggregated_params)
                # In EPD flow, output.disaggregated_params might be None, use the input params
                logging.info(
```
Similar readability comment here 🙏
Walkthrough
Introduces a disaggregation mode framework for TensorRT-LLM enabling encoder/prefill/decode pipeline separation. Adds mode-aware engine initialization, multimodal encoding support, request routing logic, configuration files, and launch infrastructure for multimodal model deployment.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks: ❌ 1 failed (warning), ✅ 2 passed
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/trtllm/request_handlers/handlers.py (1)
90-94: Unreachable code after return statement. Lines 90-94 are unreachable because line 85 returns unconditionally after the connector path. This dead code should be removed.
```diff
         yield response
         return
-    else:
-        logging.error("encode handler: no Dynamo NIXL connector found")
-        raise RuntimeError("encode handler: no Dynamo NIXL connector found")
-
-    if not request.get("streaming", False):
-        yield request
-        return
-
-    yield request
+    # No connector available
+    logging.error("encode handler: no Dynamo NIXL connector found")
+    raise RuntimeError("encode handler: no Dynamo NIXL connector found")
```
♻️ Duplicate comments (2)
components/src/dynamo/trtllm/engine.py (1)
44-56: Conditional initialization for ENCODE mode is correct. The MultimodalEncoder initialization properly handles the ENCODE disaggregation mode. Note that `engine_args.pop()` mutates the dictionary, which could cause issues if `engine_args` is used elsewhere after initialization. Since past reviews have already discussed parameter forwarding to MultimodalEncoder, I'll defer to those discussions.
components/src/dynamo/trtllm/encode_helper.py (1)
316-318: Add comment explaining `add_special_tokens=False`. Per previous review feedback, please add a comment explaining why special tokens are disabled during tokenization.
🧹 Nitpick comments (7)
examples/backends/trtllm/launch/epd_multimodal.sh (2)
21-28: Consider debugging-friendly cleanup for troubleshooting. The cleanup trap unconditionally terminates all background processes on EXIT/INT/TERM. Based on learnings from similar launch infrastructure (PR 1730), keeping background processes alive after script exit can enable users to manually connect and debug issues without restarting everything. However, since this script launches ephemeral worker processes (not persistent infrastructure services), the current unconditional cleanup is simpler and may be appropriate for this use case.
Evaluate whether your operational workflows would benefit from selective cleanup or keeping workers alive for debugging attached nodes.
9-11: Consider validating engine configuration paths before launching workers. The script references engine config files that may not exist (e.g., `llava-v1.6-mistral-7b-hf/encode.yaml`). If files are missing, workers will fail with cryptic errors. A simple validation check early in the script could surface issues earlier.
Example validation (optional):

```bash
for config_var in ENCODE_ENGINE_ARGS PREFILL_ENGINE_ARGS DECODE_ENGINE_ARGS; do
  config_path="${!config_var}"
  if [[ ! -f "$config_path" ]]; then
    echo "Error: Config file not found: $config_path" >&2
    exit 1
  fi
done
```

components/src/dynamo/trtllm/multimodal_processor.py (2)
162-162: Instance attribute should be initialized in `__init__`. `previous_decoded_text` is set as an instance attribute here but isn't declared in `__init__`. This works but is unconventional and could cause issues if `create_response_chunk` is called before `process_openai_request`.
Consider initializing in `__init__`:

```diff
 def __init__(
     self,
     model_type: str,
     model_dir: str,
     max_file_size_mb: int,
     tokenizer: Optional[TokenizerProtocol] = None,
     allowed_local_media_path: str = "",
 ):
     self.model_type = model_type
     self.model_dir = model_dir
     self.tokenizer = tokenizer
     self.modality = ""
     self.allowed_local_media_path = allowed_local_media_path
     self.max_file_size_mb = max_file_size_mb
     self.max_file_size_bytes = max_file_size_mb * 1024 * 1024
+    self.previous_decoded_text = ""
```
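For context, the caching pattern that `previous_decoded_text` supports can be sketched independently of the PR (a toy `decode` callable stands in for the real tokenizer; class and method names here are illustrative, not from the codebase):

```python
class DeltaStreamer:
    """Yields only the newly decoded suffix on each call (sketch)."""

    def __init__(self, decode):
        self.decode = decode
        self.previous_decoded_text = ""  # declared up front, never missing

    def next_chunk(self, all_token_ids):
        # Decode the full sequence, then emit only what is new since last call.
        full_text = self.decode(all_token_ids)
        delta = full_text[len(self.previous_decoded_text):]
        self.previous_decoded_text = full_text
        return delta

streamer = DeltaStreamer(decode=lambda ids: "".join(chr(i) for i in ids))
assert streamer.next_chunk([72, 105]) == "Hi"     # first chunk: everything so far
assert streamer.next_chunk([72, 105, 33]) == "!"  # later chunks: only the delta
```

Initializing the cache in `__init__` means the first `next_chunk` call works no matter which method runs first.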
180-184: Simplify key check with `dict.get()`. Per static analysis hint (RUF019), the key check before dictionary access can be simplified.

```diff
-        if "_epd_prompt_token_ids" in request and request["_epd_prompt_token_ids"]:
-            result["prompt_token_ids"] = request["_epd_prompt_token_ids"]
+        prompt_token_ids = request.get("_epd_prompt_token_ids")
+        if prompt_token_ids:
+            result["prompt_token_ids"] = prompt_token_ids
```

components/src/dynamo/trtllm/encode_helper.py (1)
278-278: Potential blocking call in async context. `engine.llm.generate(inputs)` appears to be a synchronous call wrapped in `list()`. In an async handler, this could block the event loop. Consider whether this should be run in an executor or if TRTLLM provides an async variant.
If blocking is confirmed, consider:

```python
import asyncio

encoder_outputs = await asyncio.get_event_loop().run_in_executor(
    None, lambda: list(engine.llm.generate(inputs))
)
```

components/src/dynamo/trtllm/request_handlers/handlers.py (2)
105-113: Consider adding error handling for generator exhaustion. The method breaks after the first response, but if the generator yields no items, `encode_response` remains `None`. The subsequent check handles this, but consider using `async for ... else` or `anext()` for clarity.

```diff
 async def remote_encode_full_epd(self, request: dict):
-    async for res in await self.encode_client.round_robin(request):
-        encode_response = res.data()
-        break
-
-    if not encode_response:
-        raise RuntimeError("Did not receive a response from the encode worker.")
-
-    return encode_response
+    encode_response = None
+    async for res in await self.encode_client.round_robin(request):
+        encode_response = res.data()
+        break
+
+    if not encode_response:
+        raise RuntimeError("Did not receive a response from the encode worker.")
+
+    return encode_response
```

Alternatively, consider extracting the round-robin single-response pattern to a shared helper since `remote_encode_with_nixl` uses the same pattern.
158-185: EPD flow correctly reconstructs DisaggregatedParams and propagates metadata. The logic correctly:
- Calls `remote_encode_full_epd` for image URLs
- Reconstructs `DisaggregatedParams` from the dict response
- Sets `request_type` to `"context_only"` for the prefill phase
- Stores `_epd_processed_prompt` and `_epd_prompt_token_ids` in the request for downstream use

One minor improvement per static analysis: the nested key check on lines 179-181 can be simplified.

```diff
-            if (
-                "prompt_token_ids" in encode_response
-                and encode_response["prompt_token_ids"]
-            ):
+            prompt_token_ids = encode_response.get("prompt_token_ids")
+            if prompt_token_ids:
                 request["_epd_prompt_token_ids"] = encode_response[
                     "prompt_token_ids"
                 ]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- components/src/dynamo/trtllm/constants.py (1 hunk)
- components/src/dynamo/trtllm/encode_helper.py (3 hunks)
- components/src/dynamo/trtllm/engine.py (5 hunks)
- components/src/dynamo/trtllm/main.py (1 hunk)
- components/src/dynamo/trtllm/multimodal_processor.py (2 hunks)
- components/src/dynamo/trtllm/request_handlers/handler_base.py (6 hunks)
- components/src/dynamo/trtllm/request_handlers/handlers.py (4 hunks)
- components/src/dynamo/trtllm/utils/disagg_utils.py (2 hunks)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml (1 hunk)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml (1 hunk)
- examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml (1 hunk)
- examples/backends/trtllm/launch/epd_multimodal.sh (1 hunk)
- examples/backends/trtllm/templates/llava_multimodal.jinja (1 hunk)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-16T19:47:30.312Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Applied to files:
components/src/dynamo/trtllm/multimodal_processor.py
📚 Learning: 2025-07-03T10:14:30.570Z
Learnt from: fsaady
Repo: ai-dynamo/dynamo PR: 1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.
Applied to files:
examples/backends/trtllm/launch/epd_multimodal.sh
📚 Learning: 2025-06-05T01:10:51.865Z
Learnt from: tanmayv25
Repo: ai-dynamo/dynamo PR: 1391
File: examples/tensorrt_llm/common/base_engine.py:171-176
Timestamp: 2025-06-05T01:10:51.865Z
Learning: In examples/tensorrt_llm/common/base_engine.py, the _init_engine method is called only once during initialization, so direct mutation of the _default_sampling_params object during setup is safe and appropriate.
Applied to files:
components/src/dynamo/trtllm/engine.py
🧬 Code graph analysis (5)
components/src/dynamo/trtllm/main.py (1)
components/src/dynamo/trtllm/engine.py (1): `get_llm_engine` (109-120)
components/src/dynamo/trtllm/multimodal_processor.py (1)
components/src/dynamo/trtllm/utils/disagg_utils.py (1): `decode` (27-50)
examples/backends/trtllm/launch/epd_multimodal.sh (3)
components/src/dynamo/trtllm/engine.py (1): `cleanup` (58-65)
components/src/dynamo/trtllm/utils/disagg_utils.py (2): `encode` (53-77), `decode` (27-50)
components/src/dynamo/trtllm/multimodal_processor.py (1): `decode` (40-41)
components/src/dynamo/trtllm/engine.py (2)
components/src/dynamo/trtllm/constants.py (1): `DisaggregationMode` (7-11)
components/src/dynamo/sglang/args.py (1): `DisaggregationMode` (144-147)
components/src/dynamo/trtllm/request_handlers/handlers.py (2)
components/src/dynamo/trtllm/encode_helper.py (1): `process_encode_request` (191-327)
components/src/dynamo/trtllm/multimodal_processor.py (1): `extract_prompt_and_media` (132-154)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4668/merge) by indrajit96.
examples/backends/trtllm/launch/epd_multimodal.sh
[error] 1-1: Check for executable shebang scripts: the file has a shebang but is not marked executable. Run 'chmod +x examples/backends/trtllm/launch/epd_multimodal.sh' (or 'git add --chmod=+x ...' on Windows if needed).
🪛 Ruff (0.14.7)
components/src/dynamo/trtllm/multimodal_processor.py
180-180: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
components/src/dynamo/trtllm/request_handlers/handlers.py
111-111: Avoid specifying long messages outside the exception class
(TRY003)
180-181: Unnecessary key check before dictionary access
Replace with dict.get
(RUF019)
components/src/dynamo/trtllm/request_handlers/handler_base.py
191-191: Unused method argument: prefill_result
(ARG002)
457-457: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (30)
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml (2)
22-22: Verify `enable_chunked_prefill` is appropriate for decode-only configuration. In the EPD disaggregated architecture, the decode phase handles tokens post-prefill. The `enable_chunked_prefill: true` setting on line 22 is typically a prefill-phase optimization and may be unrelated or problematic in a decode-only context.
Please confirm:
- Whether this setting is actively used by the TensorRT-LLM decode engine or safely ignored.
- Whether this was intentionally copied from a full-model config and should be removed for decode-only mode.

15-29: Configuration looks reasonable for decode-only workload; verify EPD completeness. The conservative memory allocation (`free_gpu_memory_fraction: 0.30`) and batch size (`max_batch_size: 16`) are appropriate for the decode phase of an EPD pipeline. However, verify that companion `encode.yaml` and `prefill.yaml` configurations exist elsewhere in the PR and are properly integrated into the disaggregated setup.
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml (3)
1-14: License header is present and correct. The Apache 2.0 license header is properly formatted and applied.

26-31: KV cache and cache transceiver configuration appears appropriate for disaggregated prefill. The `free_gpu_memory_fraction: 0.30` is conservative, leaving substantial headroom for input embeddings and compute. The `enable_block_reuse: false` is reasonable for a disaggregated setup where workers are specialized and may not benefit from block reuse across different request flows.
Confirm that the 30% KV cache allocation (line 27) leaves sufficient GPU memory for the prefill worker's input embeddings, attention computation, and output buffering. If memory utilization runs high, consider adjusting the `free_gpu_memory_fraction` upward.

15-24: Configuration parameters and prefill-only constraints are appropriate. The settings are conservatively tuned for a prefill-only disaggregated worker. The comment on line 23 correctly documents that the overlap scheduler is unsupported for prefill-only workers, aligning with the disaggregation mode framework.
All related configuration files (`encode.yaml` and `decode.yaml`) exist in the same directory with consistent memory and parallelism settings across phases. YAML syntax validation confirms all three files are well-formed. The `prefill.yaml` configuration is correctly referenced in launch scripts (e.g., `examples/backends/trtllm/launch/epd_multimodal.sh`). Parameter choices like `max_num_tokens: 8192`, `max_batch_size: 16`, and `enable_chunked_prefill: true` are sensible for prefill-only workers. Note that `decode.yaml` appropriately sets `disable_overlap_scheduler: false` for the decode phase, distinguishing it from the prefill phase.
examples/backends/trtllm/templates/llava_multimodal.jinja (1)
1-13: Template structure looks correct for LLaVA-Mistral format.The template correctly handles the multimodal content structure with
<image>placeholders for image_url types. A few observations:
- Unknown role types are silently ignored - consider adding an else clause with a warning or error for debugging.
- Verify that the BOS token (
<s>) is added elsewhere in the pipeline, as Mistral-based models typically require it at the start.components/src/dynamo/trtllm/utils/disagg_utils.py (2)
44-50: Backward-compatible multimodal parameter handling looks good. Using `getattr` with a `None` default provides backward compatibility with older TensorRT-LLM versions that may not have these fields.

64-77: Consistent encoding of multimodal parameters. The encode path correctly mirrors the decode path for multimodal fields.
components/src/dynamo/trtllm/constants.py (1)
7-11: Clean enum definition for disaggregation modes. The enum clearly defines the four supported modes. Note that the `AGGREGATED` value differs from sglang's equivalent (`"prefill_and_decode"` vs `"agg"` in `components/src/dynamo/sglang/args.py`). If cross-backend consistency is desired, consider aligning the string values.
components/src/dynamo/trtllm/main.py (1)
290-290: Correctly passes disaggregation mode to engine initialization. The change properly forwards `config.disaggregation_mode` to `get_llm_engine`, enabling mode-aware engine initialization.
examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml (1)
15-31: Reasonable encode worker configuration. The configuration appropriately sets a lower `free_gpu_memory_fraction` (0.30) for the encode-only worker, which doesn't need as much KV cache memory. The comment explaining why `disable_overlap_scheduler` is required is helpful.
components/src/dynamo/trtllm/multimodal_processor.py (2)
173-184: EPD flow handling looks correct. The logic properly handles the EPD case by using the encoder-provided processed prompt and token IDs.

229-242: Token streaming refactor is correct. The incremental delta calculation using cached `previous_decoded_text` properly handles both first and subsequent chunks.
components/src/dynamo/trtllm/engine.py (2)
58-65: Improved cleanup with try/finally. The cleanup method now properly guards against exceptions during shutdown and ensures `_llm` is set to `None` in all cases.

109-112: Updated context manager signature correctly passes disaggregation mode. The async context manager now properly accepts and forwards the disaggregation mode to `TensorRTLLMEngine`.
components/src/dynamo/trtllm/encode_helper.py (3)
279-300: Good defensive error handling for encoder outputs. The code properly handles cases where encoder outputs are empty or missing disaggregated params, with appropriate logging at different levels (error vs warning).

222-265: Embedding path flow with NIXL looks correct. The logic properly handles both dict and tensor formats, creates NIXL readable operations, and waits for completion before returning.

274-275: Not an issue: Intentional single-image processing design. The code uses `image_urls[0]` as intended. The `default_multimodal_input_loader` function expects a single media item (not a list), as documented on line 305 with the comment "default_multimodal_input_loader returns a list, get the first element." Processing one image at a time through this flow is the designed behavior.
components/src/dynamo/trtllm/request_handlers/handlers.py (5)
6-6: LGTM! The import of `DisaggregatedParams` from `tensorrt_llm.llmapi` is correctly added to support the EPD flow parameter handling.

60-69: LGTM! Good defensive initialization pattern - setting attributes to `None` first, then conditionally populating them from `multimodal_processor`. This prevents `AttributeError` if `multimodal_processor` is not available.

75-85: LGTM! The updated call to `process_encode_request` now passes all required parameters for both NIXL embedding transfer and full EPD encoding paths. The early return on line 85 ensures no further processing after the async generator completes.

145-149: Unused variable `_` from `extract_prompt_and_media`. The text_prompt (assigned to `_`) is extracted but not used in this method. If it's intentionally unused, that's fine, but verify this matches the expected EPD flow where text_prompt handling happens elsewhere.

186-199: LGTM! The `generate_locally` call correctly passes both `embeddings_tensor` and `ep_disaggregated_params`, enabling the base handler to use either NIXL-transferred embeddings or EPD disaggregated params. The single-response assertion is appropriate for prefill mode.
components/src/dynamo/trtllm/request_handlers/handler_base.py (7)
35-35: LGTM! Import of `DisaggregationMode` from constants centralizes the enum definition, improving maintainability.

144-185: Well-structured helper for decoding disaggregated params. The method correctly:
- Extracts EPD metadata from the packed `_epd_metadata` field
- Uses `pop()` to remove metadata from params_dict before decoding
- Sets `request_type` to `"generation_only"` for the decode phase
- Clears `multimodal_embedding_handles` to avoid TRT-LLM validation errors

The docstring clearly explains the return tuple.

253-328: Well-designed helper for encoding and packing disaggregated params. This method addresses the previous review comment about extracting helper functions for readability. It properly:
- Chooses between output and input disaggregated params
- Preserves `multimodal_embedding_handles` and `multimodal_hashes` for EPD flow
- Packs EPD metadata into the params dict for transmission

The logic for preserving handles from input when output doesn't have them (lines 287-302) correctly handles the case where TRT-LLM doesn't propagate these fields through prefill.

365-410: Complex flow but well-structured with clear comments. The initialization flow correctly:
- Normalizes request format (stop_conditions, sampling_options)
- Sets up `disaggregated_params` for PREFILL mode with proper `request_type`
- Decodes params from prefill_result for DECODE mode
- Makes `ep_disaggregated_params` available to the multimodal processor

The comments explaining the Rust frontend's max_tokens handling are helpful for maintainability.

416-442: Mode-based processing correctly routes to appropriate handlers. The branching logic properly separates DECODE mode (using `_prepare_decode_input`) from PREFILL/ENCODE modes (using `multimodal_processor.process_openai_request`). The fallback to `request.get("token_ids")` for text-only flows is appropriate.

463-482: Defensive handling of optional request fields. Good defensive pattern - checking for presence of `sampling_options` and `stop_conditions` before iterating, and checking each value before setting. This handles cases where decode workers may receive minimal requests.

537-543: LGTM - Cleaner refactor using helper method. The PREFILL mode handling is now cleaner with the extracted `_encode_and_pack_disaggregated_params` helper. The `None` check before assignment is correct.
```python
    async def _prepare_decode_input(
        self,
        request: dict,
        epd_metadata: dict,
        prefill_result: Optional[dict],
        embeddings: Any,
        ep_disaggregated_params: Any,
    ) -> Optional[Any]:
        """
        Prepare input for DECODE mode processing.
        Handles EPD flow (with encoder) by extracting prompt and token IDs,
        or falls back to multimodal processor for other flows.
        Args:
            request: The request dictionary
            epd_metadata: EPD metadata extracted from prefill result
            prefill_result: Result from prefill worker
            embeddings: Multimodal embeddings (if any)
            ep_disaggregated_params: Disaggregated params from encoder/prefill
        Returns:
            Processed input ready for the engine, or None if not available
        """
        # Decode worker with generation_only mode
        # Pass the same inputs format as prefill
        # Check epd_metadata (packed by prefill), then prefill_result, then direct request
        epd_prompt = epd_metadata.get("_epd_processed_prompt")
        epd_token_ids = epd_metadata.get("_epd_prompt_token_ids")

        if epd_prompt:
            # In EPD generation-only mode (decode), pass the SAME input format as prefill
            # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
            # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
            if epd_token_ids:
                prompt_token_ids = epd_token_ids

            processed_input = {
                "prompt": epd_prompt,
                "prompt_token_ids": prompt_token_ids,
            }

            # Remove ALL multimodal data from request to avoid TRT-LLM validation error
            # In generation-only mode, ALL multimodal data must be in disaggregated_params only
            mm_keys_to_remove = ["multi_modal_data", "image_data", "mm_data"]
            for key in mm_keys_to_remove:
                if key in request:
                    request.pop(key)
                    logging.debug(
                        f"DECODE: Removed {key} from request (already in disaggregated_params)"
                    )
            return processed_input
        elif self.multimodal_processor:
            # Encode/Prefill worker: Process multimodal content normally
            # In EPD flow, multimodal_processor should be called in PREFILL/ENCODE modes only
            # DECODE mode should skip this and use EPD metadata from prefill
            processed_input = await self.multimodal_processor.process_openai_request(
                request, embeddings, ep_disaggregated_params
            )
            return processed_input
        else:
            logging.debug(
                "DECODE: No multimodal_processor found, using request token_ids"
            )
            return None
```
Unused parameter `prefill_result`.
The static analysis correctly identifies that `prefill_result` is unused within this method. Since EPD metadata is already extracted and passed via `epd_metadata`, consider removing this parameter unless it's intended for future use.
```diff
 async def _prepare_decode_input(
     self,
     request: dict,
     epd_metadata: dict,
-    prefill_result: Optional[dict],
     embeddings: Any,
     ep_disaggregated_params: Any,
 ) -> Optional[Any]:
```

And update the call site at lines 417-423.
| """ | |
| Prepare input for DECODE mode processing. | |
| Handles EPD flow (with encoder) by extracting prompt and token IDs, | |
| or falls back to multimodal processor for other flows. | |
| Args: | |
| request: The request dictionary | |
| epd_metadata: EPD metadata extracted from prefill result | |
| embeddings: Multimodal embeddings (if any) | |
| ep_disaggregated_params: Disaggregated params from encoder/prefill | |
| Returns: | |
| Processed input ready for the engine, or None if not available | |
| """ | |
| # Decode worker with generation_only mode | |
| # Pass the same inputs format as prefill | |
| # Check epd_metadata (packed by prefill), then prefill_result, then direct request | |
| epd_prompt = epd_metadata.get("_epd_processed_prompt") | |
| epd_token_ids = epd_metadata.get("_epd_prompt_token_ids") | |
| if epd_prompt: | |
| # In EPD generation-only mode (decode), pass the SAME input format as prefill | |
| # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...) | |
| # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings | |
| if epd_token_ids: | |
| prompt_token_ids = epd_token_ids | |
| processed_input = { | |
| "prompt": epd_prompt, | |
| "prompt_token_ids": prompt_token_ids, | |
| } | |
| # Remove ALL multimodal data from request to avoid TRT-LLM validation error | |
| # In generation-only mode, ALL multimodal data must be in disaggregated_params only | |
| mm_keys_to_remove = ["multi_modal_data", "image_data", "mm_data"] | |
| for key in mm_keys_to_remove: | |
| if key in request: | |
| request.pop(key) | |
| logging.debug( | |
| f"DECODE: Removed {key} from request (already in disaggregated_params)" | |
| ) | |
| return processed_input | |
| elif self.multimodal_processor: | |
| # Encode/Prefill worker: Process multimodal content normally | |
| # In EPD flow, multimodal_processor should be called in PREFILL/ENCODE modes only | |
| # DECODE mode should skip this and use EPD metadata from prefill | |
| processed_input = await self.multimodal_processor.process_openai_request( | |
| request, embeddings, ep_disaggregated_params | |
| ) | |
| return processed_input | |
| else: | |
| logging.debug( | |
| "DECODE: No multimodal_processor found, using request token_ids" | |
| ) | |
| return None |
🧰 Tools
🪛 Ruff (0.14.7)
191-191: Unused method argument: prefill_result
(ARG002)
🤖 Prompt for AI Agents
components/src/dynamo/trtllm/request_handlers/handler_base.py around lines 187
to 251: the parameter prefill_result is declared but never used in
_prepare_decode_input; remove prefill_result from the method signature and any
type hints, update the docstring args list to drop it, and then update the call
site mentioned (lines ~417-423) to stop passing prefill_result (pass the
remaining args in their same order); ensure imports/types still match and run
tests/type checks to confirm no other callers remain.
```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```
**Potential `UnboundLocalError` if `epd_token_ids` is falsy.**

If `epd_prompt` is truthy but `epd_token_ids` is falsy, the code at line 226 references `prompt_token_ids`, which may not be defined in this scope.
```diff
 if epd_prompt:
     # In EPD generation-only mode (decode), pass the SAME input format as prefill
     # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
     # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
+    prompt_token_ids = None
     if epd_token_ids:
         prompt_token_ids = epd_token_ids
     processed_input = {
         "prompt": epd_prompt,
         "prompt_token_ids": prompt_token_ids,
     }
```

📝 Committable suggestion
Before (current):

```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```

After (suggested):

```python
if epd_prompt:
    # In EPD generation-only mode (decode), pass the SAME input format as prefill
    # This matches TRT-LLM's test: llm_decode.generate(inputs, disaggregated_params=...)
    # The inputs dict provides prompt structure, disaggregated_params provide multimodal embeddings
    prompt_token_ids = None
    if epd_token_ids:
        prompt_token_ids = epd_token_ids
    processed_input = {
        "prompt": epd_prompt,
        "prompt_token_ids": prompt_token_ids,
    }
```
🤖 Prompt for AI Agents
In components/src/dynamo/trtllm/request_handlers/handler_base.py around lines
217-227, the block that builds processed_input can reference prompt_token_ids
when epd_token_ids is falsy, causing an UnboundLocalError; ensure
prompt_token_ids is always defined before use by initializing it (e.g., set
prompt_token_ids = None or to the appropriate existing token variable) before
the if epd_token_ids check or by using a conditional expression when
constructing processed_input so prompt_token_ids is assigned a defined value
even if epd_token_ids is falsy.
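To make the failure mode concrete, here is a minimal standalone repro of the hazard (the helper name and arguments are hypothetical stand-ins, not the actual handler):

```python
def build_processed_input(epd_prompt, epd_token_ids):
    # Mirrors the flagged control flow: prompt_token_ids is only bound
    # inside the inner `if`, so a falsy epd_token_ids leaves it undefined.
    if epd_prompt:
        if epd_token_ids:
            prompt_token_ids = epd_token_ids
        return {
            "prompt": epd_prompt,
            "prompt_token_ids": prompt_token_ids,  # unbound when epd_token_ids is falsy
        }
    return None

try:
    build_processed_input("USER: describe the image", None)
except UnboundLocalError as exc:
    print(f"raised: {type(exc).__name__}")  # -> raised: UnboundLocalError
```

Initializing `prompt_token_ids = None` before the inner `if`, as the suggestion does, removes the hazard.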
```diff
@@ -0,0 +1,67 @@
+#!/bin/bash
```
Make the script executable to pass pre-merge validation.
The pipeline check flags that the file has a shebang but is not marked executable.
Apply the following to fix:
```shell
git add --chmod=+x examples/backends/trtllm/launch/epd_multimodal.sh
```

Alternatively, if using a standard Unix environment:

```shell
chmod +x examples/backends/trtllm/launch/epd_multimodal.sh
```

🧰 Tools
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/4668/merge) by indrajit96.
[error] 1-1: Check for executable shebang scripts: the file has a shebang but is not marked executable. Run 'chmod +x examples/backends/trtllm/launch/epd_multimodal.sh' (or 'git add --chmod=+x ...' on Windows if needed).
🤖 Prompt for AI Agents
In examples/backends/trtllm/launch/epd_multimodal.sh around line 1, the file has
a shebang but is not marked executable which fails pre-merge validation; fix it
by making the file executable before committing (for example, set the executable
bit on the file locally or stage it with the executable bit set in git) and
re-run the pipeline to ensure the check passes.
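For a quick local sanity check of the fix above, the executable bit and shebang behavior can be verified with a throwaway script (a generic sketch using a temp file, not the repo script itself):

```shell
tmp="$(mktemp)"
printf '#!/bin/bash\necho ok\n' > "$tmp"

chmod +x "$tmp"                     # set the executable bit, as the CI check requires
[ -x "$tmp" ] && echo "executable"  # -> executable

"$tmp"                              # the shebang now takes effect; prints "ok"
rm -f "$tmp"
```

Note that `chmod` alone only fixes the working tree; the `git add --chmod=+x` form is what records the mode bit in the index so the pipeline sees it.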
```shell
export PREFILL_ENGINE_ARGS=${PREFILL_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/prefill.yaml"}
export DECODE_ENGINE_ARGS=${DECODE_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/decode.yaml"}
export ENCODE_ENGINE_ARGS=${ENCODE_ENGINE_ARGS:-"$DYNAMO_HOME/examples/backends/trtllm/engine_configs/llava-v1.6-mistral-7b-hf/encode.yaml"}
export PREFILL_CUDA_VISIBLE_DEVICES=${PREFILL_CUDA_VISIBLE_DEVICES:-"0"}
```
Does the encode worker need a whole dedicated GPU, or can it fit on the same GPU as prefill? Reason I ask is that our CI (at least on the GitLab side + pytest markers) has 1- and 2-GPU tests set up, but I'm not sure about 3-GPU tests (we could use a 4-GPU runner if needed).
Similar to https://github.com/ai-dynamo/dynamo/pull/4668/files#r2582944082, can we enable this script as a CI test like we do for the other scripts?
ex:
dynamo/tests/serve/test_trtllm.py
Lines 109 to 119 in af9ae79
```python
"disaggregated_multimodal": TRTLLMConfig(
    name="disaggregated_multimodal",
    directory=trtllm_dir,
    script_name="disagg_multimodal.sh",
    marks=[pytest.mark.gpu_2, pytest.mark.trtllm, pytest.mark.multimodal],
    model="Qwen/Qwen2-VL-7B-Instruct",
    models_port=8000,
    timeout=900,
    delayed_start=60,
    request_payloads=[multimodal_payload_default()],
),
```
Are we using a custom template because the default llava template is incompatible with dynamo's pre-processing? From my understanding, llava models will expect `image` instead of `image_url`, which is passed to the jinja template (more details: #4501).
```python
# Handle image URLs (full E-PD flow with MultimodalEncoder)
elif image_urls:
    if self.encode_client and self.connector:
        encode_response = await self.remote_encode_full_epd(request)
```
Relatively new to the EPD flow, so please feel free to correct me, but based on this call, does the execution path look like this: Frontend -> Prefill -> Encode -> Prefill -> Decode?

If so, I was wondering how feasible a flow like Frontend -> Encode -> Prefill -> Decode would be. Is this something we would want to do to save an extra network hop per request?
@krishung5 any thoughts on this one?
@KrishnanPrash yes, this makes sense from a performance POV.
The current design is this way due to an initial prefill/decode-first routing limitation (now gone).
I feel we can take this up in a new PR, as it would need some rework in the prefill worker.
We can use this PR for functionality and a new one for optimization.
Also, just wanted to get your thoughts on this:

From my understanding, the current EPD flow looks like this:

1. Frontend -> Prefill Worker
2. Within the prefill worker we make a blocking call to the encode worker (`await self.remote_encode_full_epd(request)`)
3. Encode worker finishes work and yields to the prefill worker

At a request level, we are currently blocking at the encode worker before calling the prefill worker, which makes sense. But if we need to implement a feature like batching in the future for this flow, would the prefill worker be blocked until all the requests are done on the encoder side?
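One relevant detail for the batching question: an `await` inside one request's handler suspends only that request's coroutine, not the event loop, so other in-flight requests keep progressing while one waits on the encoder. A minimal asyncio sketch (all names are hypothetical stand-ins, not the real workers):

```python
import asyncio

async def encode(req):
    # stand-in for the remote encode call the prefill worker awaits
    await asyncio.sleep(0.01)
    return f"emb({req})"

async def prefill(req):
    # awaiting here blocks only this request's coroutine
    emb = await encode(req)
    return f"prefill({emb})"

async def main():
    # three requests in flight concurrently: total wall time is ~0.01s, not ~3 * 0.01s
    return await asyncio.gather(*(prefill(r) for r in ("a", "b", "c")))

print(asyncio.run(main()))  # -> ['prefill(emb(a))', 'prefill(emb(b))', 'prefill(emb(c))']
```

Whether encoder-side batching would stall prefill then depends on how the encoder groups requests, not on the per-request `await` itself.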
```python
    engine=None,
):
    """
    Process embedding request by loading embeddings and creating NIXL readable operation.
```
nit: comments for this function need to be updated.
```python
        )
        await readable_op.wait_for_completion()
        logging.debug("EncodeHelper completed readable operation.")
    elif image_urls and text_prompt:
```
Do we need to add an else case to handle the situation where either `image_urls` or `text_prompt` (or both) is empty?
Overview:
Adds standalone encoder support to the dynamo TRT-LLM backend, enabling the full E-P-D (Encode-Prefill-Decode) disaggregated multimodal serving architecture.
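As a rough mental model for the metadata hand-off in this architecture — helper names below are hypothetical; only the `_epd_processed_prompt` / `_epd_prompt_token_ids` key names come from this PR — the pack/extract round-trip between prefill and decode looks something like:

```python
EPD_KEYS = ("_epd_processed_prompt", "_epd_prompt_token_ids")

def pack_epd_metadata(disagg_params: dict, prompt: str, token_ids: list) -> dict:
    # Prefill side: stash the processed prompt and token ids alongside the
    # disaggregated params so decode can reconstruct its input.
    disagg_params["_epd_processed_prompt"] = prompt
    disagg_params["_epd_prompt_token_ids"] = token_ids
    return disagg_params

def extract_epd_metadata(disagg_params: dict) -> dict:
    # Decode side: pull the keys back out; .get() keeps text-only requests
    # (which never packed them) working, yielding None instead of KeyError.
    return {key: disagg_params.get(key) for key in EPD_KEYS}

params = pack_epd_metadata({}, "USER: <image>\nDescribe the image.", [1, 2, 3])
meta = extract_epd_metadata(params)
print(meta["_epd_prompt_token_ids"])  # -> [1, 2, 3]
```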
Details:
- `PrefillHandler`: Calls encode worker for image URLs and passes `ep_disaggregated_params` through the pipeline
- EPD metadata (`_epd_processed_prompt`, `_epd_prompt_token_ids`) are packed into disaggregated_params for the decode worker
- `epd_multimodal.sh` for running the full E-P-D setup with LLaVA

Where should the reviewer start?
- `components/src/dynamo/trtllm/encode_helper.py` - Core encoding logic and `process_encode_request`
- `components/src/dynamo/trtllm/request_handlers/handlers.py` - `EncodeHandler` and `PrefillHandler.remote_encode_full_epd`
- `components/src/dynamo/trtllm/engine.py` - `MultimodalEncoder` initialization
- `components/src/dynamo/trtllm/request_handlers/handler_base.py` - EPD metadata handling in `_prepare_decode_input` and `_encode_and_pack_disaggregated_params`

Summary by CodeRabbit
New Features
Chores