
[BUG]: Multimodal requests sent in dynamo are incompatible with llava chat templates. #4501

@KrishnanPrash

Describe the Bug

Multimodal inference requests sent to llava-hf/llava-1.5-7b-hf cause the model to crash with the following error:

2025-11-20T08:30:58.906682Z  INFO dynamo_llm::discovery::watcher: added model model_name="llava-hf/llava-1.5-7b-hf" namespace="dynamo"
...
2025-11-20T08:31:13.806872Z  INFO http_client.get_http_client: Shared HTTP client initialized with timeout=30.0s
2025-11-20T08:31:14.115455Z  INFO _client._send_single_request: HTTP Request: GET http:/... "HTTP/1.1 200 OK"
...
Traceback (most recent call last):
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 370, in generate
    q = await self.add_request(
        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 284, in add_request
    prompt_str, request = self.processor.process_inputs(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 377, in process_inputs
    processed_inputs: ProcessorInputs = self.input_preprocessor.preprocess(
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 644, in preprocess
    return self._process_decoder_only_prompt(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 614, in _process_decoder_only_prompt
    prompt_comps = self._prompt_to_llm_inputs(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 388, in _prompt_to_llm_inputs
    return self._process_tokens(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 317, in _process_tokens
    inputs = self._process_multimodal(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 242, in _process_multimodal
    mm_input = mm_processor.apply(
               ^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 2045, in apply
    prompt_ids, prompt, mm_placeholders = self._maybe_apply_prompt_updates(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 1997, in _maybe_apply_prompt_updates
    ) = self._apply_prompt_updates(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/multimodal/processing.py", line 1919, in _apply_prompt_updates
    assert update_idx is not None, (
           ^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Failed to apply prompt replacement for mm_items['image'][0]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/components/src/dynamo/vllm/handlers.py", line 319, in generate
    async for tok in self.generate_tokens(
  File "/workspace/components/src/dynamo/vllm/handlers.py", line 232, in generate_tokens
    async for res in gen:
  File "/opt/dynamo/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 420, in generate
    raise EngineGenerateError() from e
vllm.v1.engine.exceptions.EngineGenerateError

My mental model for what is going wrong: the AssertionError: Failed to apply prompt replacement for mm_items['image'][0] means that vLLM tried to expand/replace the <image> tokens that should have been inserted during chat template application, but could not find any. In other words, the chat template failed to emit an <image> token when it was applied to the inference request.
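
A quick way to see this, as a minimal sketch: rendering just the image loop from the llava template (quoted under Additional Context below, simplified to one line here for illustration) over dynamo's "image_url"-shaped content produces no <image> token, while a part labeled "image" does.

from jinja2 import Template

# Simplified excerpt of the llava chat template's image loop; only the
# selectattr filter behavior matters for this illustration.
tmpl = Template(
    "{% for content in message['content'] "
    "| selectattr('type', 'equalto', 'image') %}<image>{% endfor %}"
)

# Content shaped the way the dynamo frontend currently passes it.
dynamo_style = {"content": [
    {"type": "text", "text": "What is in this image?"},
    {"type": "image_url",
     "image_url": {"url": "http://images.cocodataset.org/test2017/000000155781.jpg"}},
]}

# Content shaped the way the template expects it.
template_style = {"content": [
    {"type": "text", "text": "What is in this image?"},
    {"type": "image"},
]}

print(repr(tmpl.render(message=dynamo_style)))    # '' -> no <image> token emitted
print(repr(tmpl.render(message=template_style)))  # '<image>'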

Steps to Reproduce

# Assuming etcd/NATS are already started.
python -m dynamo.frontend &
python -m dynamo.vllm --model llava-hf/llava-1.5-7b-hf --max-model-len 4096 &

Then, we send an inference request that looks like:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "http://images.cocodataset.org/test2017/000000155781.jpg"
          }
        }
      ]
    }
  ],
  "model": "llava-hf/llava-1.5-7b-hf"
}'

Additional Context

The issue becomes clear if you look at the llava-hf/llava-1.5-7b-hf chat template:

...
  {# Render all images first #}
  {% for content in message['content'] | selectattr('type', 'equalto', 'image') %}
    <image>
  {% endfor %}
...

Currently the dynamo frontend applies the chat template to the following messages object:

  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {
         "type": "image_url",
         "image_url": { "url": "http://images.cocodataset.org/test2017/000000155781.jpg"}
        }
      ]

If we look closely, dynamo passes the image content part with "type": "image_url", but the chat template only renders an <image> token for parts whose "type" is "image", so the image part is silently skipped and no placeholder is emitted.
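
One possible direction, sketched under the assumption that the fix belongs in the frontend's preprocessing (the normalize_for_llava_template helper below is hypothetical, not an existing dynamo or vLLM API): remap OpenAI-style "image_url" content parts to the {"type": "image"} shape before applying the chat template, so the template's selectattr loop can emit the <image> placeholder that vLLM later replaces.

from typing import Any


def normalize_for_llava_template(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: rewrite OpenAI-style 'image_url' content parts
    into the {'type': 'image'} shape the llava chat template matches on."""
    normalized = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            normalized.append(msg)
            continue
        new_parts = []
        for part in content:
            if part.get("type") == "image_url":
                # The template only needs the type marker to emit <image>;
                # the image URL itself is carried separately on the
                # multimodal data path.
                new_parts.append({"type": "image"})
            else:
                new_parts.append(part)
        normalized.append({**msg, "content": new_parts})
    return normalized

With the request from the reproduction above, the user message's content would then contain a {"type": "image"} part, the template's image loop would render <image>, and the prompt replacement in vllm/multimodal/processing.py would have a placeholder to apply.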
