
Conversation

@keivenchang
Contributor

@keivenchang keivenchang commented Nov 16, 2025

Overview:

Enables support for multimodal models (like LLaVA) that use non-standard Jinja2 template tags unrecognized by minijinja. Implements a placeholder replacement strategy where the Rust frontend transforms custom tags for validation, and the Python backend restores them before vLLM processing.

Details:

  • Add replace_non_standard_blocks() in formatters.rs to convert {% generation %} to __JINJA_BLOCK_GENERATION__ for minijinja validation
  • Restore the placeholders to the original Jinja2 tags in chat_processor.py before passing the template to vLLM's custom AssistantTracker extension (a minimal sketch of the round-trip follows this list)
  • Make num_hidden_layers optional in model card parsing for multimodal models that omit it in text_config
  • Enhance eos_token_id parsing to handle both single integer values and arrays in generation_config.json
  • Reduce Cargo.lock dependencies
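
A minimal sketch of the masking step, assuming the regex crate; the function name and the exact pattern here are illustrative, not necessarily the precise code in formatters.rs:

use regex::Regex;

// Illustrative only: mask vLLM-specific block tags such as {% generation %}
// with placeholder tokens that minijinja will treat as plain text.
fn mask_non_standard_blocks(template: &str) -> String {
    let re = Regex::new(r"\{%\s*(generation|endgeneration)\s*%\}").unwrap();
    re.replace_all(template, |caps: &regex::Captures| {
        // {% generation %} -> __JINJA_BLOCK_GENERATION__
        format!("__JINJA_BLOCK_{}__", caps[1].to_uppercase())
    })
    .to_string()
}

The Python side performs the inverse substitution (placeholder back to {% generation %}) before handing the template to vLLM, so vLLM's AssistantTracker extension still sees the original tags.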

Where should the reviewer start?

Start with lib/llm/src/preprocessor/prompt/template/formatters.rs to see the placeholder replacement logic, then review components/src/dynamo/vllm/multimodal_utils/chat_processor.py to see how placeholders are restored before vLLM processing.

Before the fix:

cd examples/backends/vllm/launch/
./agg_multimodal.sh --model llava-hf/llava-1.5-7b-hf
# and then you see this error
2025-11-14T03:20:18.481757Z ERROR dynamo_llm::discovery::watcher: Error adding model from discovery model_name="llava-hf/llava-1.5-7b-hf" namespace="dynamo" error="build_routed_pipeline: syntax error: unknown statement generation (in default:1)"

After the fix, you can test the completions endpoint (the proper prompt format for LLaVA):

curl -X POST http://localhost:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"prompt": "USER: Tell me 3 knock knock jokes. ASSISTANT:",
"max_tokens": 150
}' | jq

/coderabbit profile chill

Summary by CodeRabbit

  • New Features

    • Enhanced support for multimodal language models with flexible configuration parsing.
    • Improved template processing for better compatibility with various model configurations.
  • Bug Fixes

    • Increased robustness of model configuration extraction to handle edge cases and variant data formats.

@keivenchang keivenchang self-assigned this Nov 16, 2025
@keivenchang keivenchang requested review from a team as code owners November 16, 2025 01:51
@github-actions github-actions bot added the fix label Nov 16, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 16, 2025

Walkthrough

The changes improve handling of non-standard Jinja2 templates across multiple components. A regex-based sanitization replaces non-standard tags with placeholders before minijinja validation. Token ID extraction becomes more flexible, handling both single values and arrays. Text config layers become optional for multimodal models.

Changes

  • Jinja2 template sanitization — lib/llm/src/preprocessor/prompt/template/formatters.rs, components/src/dynamo/vllm/multimodal_utils/chat_processor.py
    Introduces a regex-based helper to replace non-standard Jinja2 tags with __JINJA_BLOCK_<TAG>__ placeholders before validation. Applied during template registration and chat template preprocessing to prevent minijinja validation failures on backend-specific blocks.
  • Configuration extraction flexibility — lib/llm/src/model_card.rs
    Makes HFTextConfig.num_hidden_layers optional for multimodal models. Refactors eos_token_id extraction to handle mixed types (single numbers or arrays) from both config.json and generation_config.json using serde_json::Value for flexible parsing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Regex pattern correctness: Verify the non-standard tag detection regex correctly identifies and extracts tag names without false positives or negatives
  • Token ID type conversion: Ensure the flexible eos_token_id parsing correctly handles edge cases (empty arrays, nested structures, invalid types) and maintains type safety
  • Optional field implications: Confirm that making num_hidden_layers optional doesn't introduce downstream errors when the field is None

Poem

🐰 Templates dance with placeholders bright,
Non-standard blocks now hidden from sight,
Token IDs flow like streams of sand—
Single or many, all firmly planned!
With optional layers, the multimodal way,
Our configs now flourish in flexible array!

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title accurately and specifically describes the main change: adding support for multimodal models with non-standard Jinja2 tags, which is the core objective of the entire pull request.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Description check — ✅ Passed: The pull request description is comprehensive and follows the template structure, with all required sections including Overview, Details, and Where to start.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
lib/llm/src/preprocessor/prompt/template/formatters.rs (2)

51-51: Consider compiling the regex once for better performance.

The regex is compiled on every function call. For a frequently used function, consider using lazy_static or std::sync::OnceLock to compile it once.

Apply this pattern:

use std::sync::OnceLock;

static BLOCK_REGEX: OnceLock<Regex> = OnceLock::new();

fn replace_non_standard_blocks(template: &str) -> String {
    let re = BLOCK_REGEX.get_or_init(|| {
        Regex::new(r"\{%\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*%\}").unwrap()
    });
    // ... rest of the function
}

59-63: Note: Case information is lost during round-trip transformation.

The placeholder uses to_uppercase() (line 61) and the Python restoration uses .lower() (chat_processor.py line 157), which means original case is not preserved. For example, {% Generation %} would become {% generation %} after the round-trip.

This is acceptable if Jinja2 tag names are conventionally lowercase, but document this behavior if case preservation is ever needed.

components/src/dynamo/vllm/multimodal_utils/chat_processor.py (1)

152-152: Consider moving the import to the top of the file.

While importing re inside the conditional block is functionally correct, Python convention is to place all imports at the module level for better visibility and consistency.

Apply this change:

 import json
+import re
 import time
 from typing import AsyncIterator, List, Optional, Protocol, Union, runtime_checkable

And remove line 152.

lib/llm/src/model_card.rs (1)

673-730: Consider extracting the duplicated eos_token_id parsing logic.

The logic to parse eos_token_id as either a number or array (lines 678-701 and 711-730) is duplicated. This could be refactored into a helper function to improve maintainability.

Extract to a helper function:

fn parse_eos_token_id(v: &serde_json::Value) -> Option<Vec<TokenIdType>> {
    if v.is_number() {
        v.as_number()
            .and_then(|n| n.as_u64())
            .map(|n| vec![n as TokenIdType])
    } else if v.is_array() {
        let arr = v.as_array().unwrap(); // Safety: We just checked
        Some(
            arr.iter()
                .filter_map(|inner_v| {
                    inner_v
                        .as_number()
                        .and_then(|n| n.as_u64())
                        .map(|n| n as TokenIdType)
                })
                .collect(),
        )
    } else {
        None
    }
}

Then use it in both places:

let final_eos_token_ids: Vec<TokenIdType> = config
    .eos_token_id
    .as_ref()
    .or(text_config.eos_token_id.as_ref())
    .and_then(parse_eos_token_id)
    .or_else(|| {
        crate::file_json_field::<serde_json::Value>(&gencfg_path, "eos_token_id")
            .ok()
            .and_then(|v| parse_eos_token_id(&v))
    })
    .ok_or_else(|| {
        anyhow::anyhow!(
            "missing eos_token_id in config.json and generation_config.json, cannot load"
        )
    })?;
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e120ed and 879b1d5.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • components/src/dynamo/vllm/multimodal_utils/chat_processor.py (1 hunks)
  • lib/llm/src/model_card.rs (2 hunks)
  • lib/llm/src/preprocessor/prompt/template/formatters.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3067
File: lib/llm/src/preprocessor/prompt/template/oai.rs:87-134
Timestamp: 2025-09-16T19:47:30.312Z
Learning: In Dynamo, multimodal requests (containing image_url or other non-text content) are processed through a completely different workflow than text-only requests, so the may_be_fix_msg_content function in lib/llm/src/preprocessor/prompt/template/oai.rs will only encounter text-only content arrays.
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 2778
File: components/backends/vllm/src/dynamo/vllm/args.py:150-151
Timestamp: 2025-08-29T12:32:52.257Z
Learning: In the custom_jinja_template implementation, KrishnanPrash chose to defer file validation to maintain consistency with how other template files (chat_template.jinja and tokenizer_config.json) are processed in the template loading pipeline, rather than validating the file path during argument parsing.
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3165
File: components/backends/sglang/src/dynamo/sglang/args.py:201-202
Timestamp: 2025-09-22T18:09:23.513Z
Learning: The Rust validation for custom_jinja_template paths is already implemented in lib/bindings/python/rust/lib.rs using PathBuf::from() and path.exists() checks with PyFileNotFoundError. Both vLLM and SGLang benefit from this validation since they both call register_llm(). The missing piece is path expansion (~ and environment variables) in the Python argument processing before passing to the Rust layer.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3051
File: container/templates/Dockerfile.trtllm.j2:424-437
Timestamp: 2025-09-16T17:16:03.785Z
Learning: keivenchang prioritizes maintaining exact backward compatibility during migration/refactoring PRs, even when bugs are identified in the original code. Fixes should be deferred to separate PRs after the migration is complete.
📚 Learning: 2025-09-22T18:09:23.513Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3165
File: components/backends/sglang/src/dynamo/sglang/args.py:201-202
Timestamp: 2025-09-22T18:09:23.513Z
Learning: Both vLLM and SGLang implementations of --custom-jinja-template pass the path directly to register_llm() without any validation. KrishnanPrash suggested implementing early validation in the Rust layer (lib/bindings/python/rust/lib.rs) using PathBuf::from() and path.exists() checks with PyFileNotFoundError for consistent error handling across both backends.

Applied to files:

  • lib/llm/src/preprocessor/prompt/template/formatters.rs
📚 Learning: 2025-09-22T18:09:23.513Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3165
File: components/backends/sglang/src/dynamo/sglang/args.py:201-202
Timestamp: 2025-09-22T18:09:23.513Z
Learning: KrishnanPrash suggested adding early validation for custom Jinja template paths in the Rust layer (lib/bindings/python/rust/lib.rs) to benefit both vLLM and SGLang workflows, using PathBuf::from() and path.exists() checks with appropriate PyFileNotFoundError handling.

Applied to files:

  • lib/llm/src/preprocessor/prompt/template/formatters.rs
📚 Learning: 2025-09-22T18:09:23.513Z
Learnt from: KrishnanPrash
Repo: ai-dynamo/dynamo PR: 3165
File: components/backends/sglang/src/dynamo/sglang/args.py:201-202
Timestamp: 2025-09-22T18:09:23.513Z
Learning: The Rust validation for custom_jinja_template paths is already implemented in lib/bindings/python/rust/lib.rs using PathBuf::from() and path.exists() checks with PyFileNotFoundError. Both vLLM and SGLang benefit from this validation since they both call register_llm(). The missing piece is path expansion (~ and environment variables) in the Python argument processing before passing to the Rust layer.

Applied to files:

  • lib/llm/src/preprocessor/prompt/template/formatters.rs
📚 Learning: 2025-09-02T16:46:54.015Z
Learnt from: GuanLuo
Repo: ai-dynamo/dynamo PR: 2714
File: lib/llm/src/discovery/model_entry.rs:38-42
Timestamp: 2025-09-02T16:46:54.015Z
Learning: In lib/llm/src/discovery/model_entry.rs, GuanLuo prefers not to add serde defaults for model_type and model_input fields to keep the specification explicit and avoid user errors, relying on atomic deployment strategy to avoid backward compatibility issues.

Applied to files:

  • lib/llm/src/model_card.rs
🧬 Code graph analysis (1)
lib/llm/src/model_card.rs (1)
lib/llm/src/lib.rs (1)
  • file_json_field (63-100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
🔇 Additional comments (6)
lib/llm/src/preprocessor/prompt/template/formatters.rs (3)

128-133: LGTM! Appropriate sanitization before template validation.

The template is correctly sanitized before being added to the minijinja Environment, allowing validation to pass even with non-standard blocks.


155-157: LGTM! Consistent sanitization applied to all templates.

The same sanitization is applied to templates in the map, ensuring consistent handling across different template types.


17-71: Remove review comment; the concern does not reflect actual vLLM template syntax.

vLLM's generation block is implemented as {% generation %} ... {% endgeneration %} without call-time arguments, not with arguments as the review suggests. minijinja does not expose custom block tag extensions in its public API, so the function's purpose is specifically to preprocess templates by removing non-standard blocks to allow minijinja validation of the remaining syntax. The regex correctly matches simple block tags like {% generation %}. Code search found no non-standard blocks with arguments in actual templates.

Likely an incorrect or invalid review comment.

lib/llm/src/model_card.rs (3)

705-730: LGTM! Enhanced eos_token_id parsing supports both single values and arrays.

The updated logic correctly handles both single integer and array formats for eos_token_id in generation_config.json, improving flexibility for different model configurations.


695-700: Good error handling for invalid eos_token_id formats.

The error logging when eos_token_id is neither a number nor an array provides helpful debugging information.


624-625: No issues found — field is not consumed by any code.

The num_hidden_layers field appears only at its definition and is never accessed anywhere in the codebase. Since no code reads this field, making it Option<usize> has no impact on existing consumers. The change is safe, and deserialization will work correctly as serde handles Option types natively.

@KrishnanPrash
Contributor

KrishnanPrash commented Nov 16, 2025

Thank you for fixing the EPD flow. Just a few comments about the other multimodal flow.

My mental model for multimodal models in Dynamo+vLLM:
  • Dynamo+vLLM EPD Path:
    • This flow is invoked with the command python -m dynamo.vllm --model <MODEL_NAME> --multimodal-processor
    • The model is registered with register_llm(ModelInput.Text, …). This means the inference request is accepted and passed as-is to the backend.
    • We skip pre-processing (applying chat template + tokenization) in Rust frontend
    • We rely on vLLM to internally handle this [Ref].
  • Dynamo+vLLM Normal Path:
    • This flow is invoked with the command python -m dynamo.vllm --model <MODEL_NAME>
    • This model is registered with register_llm(ModelInput.Tokens, …). This means that the inference request is accepted, the pre-processing happens on the frontend (chat template + tokenization), and we give the backend a PreprocessedRequest Object.
    • Dynamo Rust frontend applies chat template + tokenize

If we use the Normal Path in Dynamo + vLLM with this current PR, an inference request like this:

 curl -X POST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "llava-hf/llava-1.5-7b-hf",
  "messages": [
    {"role": "user", "content": "Who is Jensen Huang?"},
    {"role": "assistant", "content": "Answer with clear detail. Do not omit details."}
  ]
}' | jq

After the application of the chat template, the rendered prompt will look like this:

USER: Who is Jensen Huang? ASSISTANT: __JINJA_BLOCK_GENERATION__Answer with clear detail. Do not omit details. __JINJA_BLOCK_ENDGENERATION__

The same cleanup that we do for the EPD flow is, in my opinion, needed for the regular flow as well. But instead of re-introducing the custom block tags, we need to remove them altogether.

This could look something like this:
In oai.rs:

fn render() {
   ....
   let rendered = tmpl.render(&ctx)?;
   
   let cleaned = super::formatters::remove_placeholder_blocks(&rendered);

   Ok(cleaned)
}

In formatters.rs:

fn remove_placeholder_blocks(rendered: &str) -> String {
    use regex::Regex;

    let re = Regex::new(r"__JINJA_BLOCK_[A-Z_]+__").unwrap();
    re.replace_all(rendered, "").to_string()
}

Or we could make it more specific and only remove the generation tags? Let me know what you think.
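
For illustration, the narrower variant could look something like this (sketch only, not in the PR; it assumes the placeholder naming above):

fn remove_generation_placeholders(rendered: &str) -> String {
    use regex::Regex;

    // Strip only the generation markers; leave any other placeholder intact.
    let re = Regex::new(r"__JINJA_BLOCK_(END)?GENERATION__").unwrap();
    re.replace_all(rendered, "").to_string()
}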

@KrishnanPrash
Contributor

Put up #4380 with some additional changes I needed to get this working. Still need to clean the PR up and do some more testing. Will ping you when it is ready for review.

@grahamking
Contributor

grahamking commented Nov 17, 2025

Does this imply that tokenization is being done by vllm on the backend?

If yes, do we need to do anything with the template in the frontend? The frontend becomes a proxy; maybe we can pass the request straight through. To do that, call register_llm with ModelInput.Text.

@krishung5
Contributor

Can we update the doc here to remove the LLaVA limitation: https://github.com/ai-dynamo/dynamo/blob/main/docs/backends/vllm/multimodal.md

Contributor

@GuanLuo GuanLuo left a comment


Overall LGTM. But as Graham brought up, I believe the multi-modal path does register the model with ModelInput.Text, since it needs to do something with the raw text (unless this part has been ported to the frontend preprocessor). I am curious whether the chat template validation is performed in the frontend regardless of the model input type.

- Add remove_known_non_jinja2_tags to strip {% generation %} tags before minijinja validation
- Fixes LLaVA and other models using vLLM-specific template extensions
- Make num_hidden_layers optional in model card parsing for multimodal compatibility
- Handle eos_token_id as single value or array in generation_config.json

Signed-off-by: Keiven Chang <[email protected]>
Contributor

@KrishnanPrash KrishnanPrash left a comment


Looks good to me 👍 . Great Work Here 💯

@keivenchang
Contributor Author

keivenchang commented Nov 19, 2025

@grahamking

Does this imply that tokenization is being done by vllm on the backend?

I'm learning this as I'm reading along (@KrishnanPrash can correct me) but in watcher.rs, it looks like:

  • tokenization is being done by vLLM on the backend when ModelInput.Tokens is used (lines 360-458)
  • passthrough, when ModelInput.Text is used (lines 459-494)

If yes, do we need to do anything with the template in the frontend? The frontend becomes a proxy, maybe we can pass the request straight though. To do that call register_llm with ModelInput.Text.

Can you just arbitrarily register ModelInput type? I thought you need to match it with whatever the model is (e.g. Text if multimodal, otherwise Tokens). @KrishnanPrash?

@keivenchang keivenchang merged commit f395b64 into main Nov 19, 2025
32 of 36 checks passed
@keivenchang keivenchang deleted the keivenchang/MDC-fix-on-main-nvbugs5662072 branch November 19, 2025 22:55
@KrishnanPrash
Contributor

KrishnanPrash commented Nov 20, 2025

tokenization is being done by vLLM on the backend when ModelInput.Tokens is used (lines 360-458)
passthrough, when ModelInput.Text is used (lines 459-494)

  • ModelInput.Tokens => Means Dynamo's Rust Frontend is doing pre-processing (applying chat template + tokenization) and the backend gets token_ids
  • ModelInput.Text => Dynamo's frontend does NONE of the pre-processing and the backend is responsible for pre-processing before calling the underlying engine. This is what Dynamo's EPD flow currently uses.

Can you just arbitrarily register ModelInput type? I thought you need to match it with whatever the model is (e.g. Text if multimodal, otherwise Tokens). @KrishnanPrash?

Copying something from a previous PR comment. Let me know if this addresses your doubts.

  • Dynamo+vLLM EPD Path:
    • This flow is invoked with the command python -m dynamo.vllm --model <MODEL_NAME> --multimodal-processor
    • The model is registered with register_llm(ModelInput.Text, …). This means the inference request is accepted and passed as-is to the backend.
    • We skip pre-processing (applying chat template + tokenization) in Rust frontend
    • We rely on vLLM to internally handle this [Ref].
  • Dynamo+vLLM Normal Path:
    • This flow is invoked with the command python -m dynamo.vllm --model <MODEL_NAME>
    • This model is registered with register_llm(ModelInput.Tokens, …). This means that the inference request is accepted, the pre-processing happens on the frontend (chat template + tokenization), and we give the backend a PreprocessedRequest Object.
    • Dynamo Rust frontend applies chat template + tokenize

zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Nov 22, 2025
zxue2 pushed a commit to zxue2/dynamo that referenced this pull request Dec 11, 2025
