feat: Add vLLM multimodal audio support #2760
Conversation
Walkthrough
Adds audio modality support to the multimodal example: a new audio encoder worker, an audio loader utility, protocol/model updates for audio embeddings, processor and PD worker adjustments, a new launch script for audio aggregation, and README documentation describing the aggregated audio-serving workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as Client
    participant F as Ingress Frontend
    participant P as Processor
    participant AE as AudioEncodeWorker
    participant PD as PD/LLM Worker
    U->>F: HTTP request (messages incl. audio_url)
    F->>P: vLLMMultimodalRequest
    Note right of P: Tokenize prompt<br/>Extract audio_url
    P->>AE: Encode request (audio_url)
    AE->>AE: Load audio (AudioLoader)
    AE->>AE: Compute audio embeddings
    AE-->>P: Embeddings descriptor
    P->>PD: Prefill/Decode request<br/>(multi_modal_data with audio)
    PD-->>F: Streamed tokens/results
    F-->>U: Streamed response
    rect rgba(200, 230, 255, 0.25)
        Note over AE,PD: New/updated interactions for audio modality
    end
```
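To make the flow above concrete, here is a minimal client sketch against the aggregated audio deployment. It assumes the frontend exposes an OpenAI-compatible chat completions endpoint on port 8000 (as in audio_agg.sh) and that the audio file is temporarily carried via an image_url content part, as the README note further down explains; the model name is a placeholder, not taken from this PR.

```bash
# Hypothetical smoke test of the aggregated audio pipeline (not part of this PR).
# Assumptions: frontend on localhost:8000, OpenAI-compatible /v1/chat/completions route,
# audio carried via image_url (temporary workaround), placeholder model name.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-Audio",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Transcribe the audio."},
        {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"}}
      ]
    }],
    "max_tokens": 64,
    "stream": false
  }'
```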
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
👋 Hi yuekaizhang! Thank you for contributing to ai-dynamo/dynamo.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
examples/multimodal/launch/agg.sh (2)
61-64: Wire up the audio encoder worker (and quote args).
Currently launches components/encode_worker.py, not the new audio encoder; also unquoted args risk word-splitting. Switch based on model name and quote.
```diff
-# run E/P/D workers
-CUDA_VISIBLE_DEVICES=0 python3 components/encode_worker.py --model $MODEL_NAME &
-CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+# run E/P/D workers
+if [[ "${MODEL_NAME,,}" == *"audio"* ]]; then
+  CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model "$MODEL_NAME" &
+else
+  CUDA_VISIBLE_DEVICES=0 python3 components/encode_worker.py --model "$MODEL_NAME" &
+fi
+CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model "$MODEL_NAME" --worker-type prefill &
```
55-66: Background failure handling: exit fast on first crash.
set -e won’t catch background job failures. Use a wait loop to detect first failing PID and then kill others.

```diff
-# Wait for all background processes to complete
-wait
+# Wait for all background processes to complete; exit on first failure
+pids=($(jobs -p))
+status=0
+for pid in "${pids[@]}"; do
+  if ! wait "$pid"; then
+    status=$?
+    break
+  fi
+done
+exit "$status"
```

examples/multimodal/utils/model.py (1)
44-52: Fix dtype handling, None-guard, and return typing for audio path.
- Don’t force bf16; respect embeddings_dtype to avoid mismatch with the encoder/PD buffer.
- Guard None before .to(...)
- Return type annotation doesn’t match actual values (audio: List[tensor], video: np.ndarray). Relax to Dict[str, Any].
```diff
-def construct_mm_data(
+def construct_mm_data(
     model: str,
     embeddings_dtype: torch.dtype,
     image_embeds: Optional[torch.Tensor] = None,
     video_numpy: Optional[Any] = None,
     image_grid_thw: Optional[List[Any]] = None,
     audio_embeds: Optional[torch.Tensor] = None,
-) -> Dict[str, torch.Tensor | Dict[str, Any]]:
+) -> Dict[str, Any]:
 @@
-    if model == SupportedModels.QWEN_2_AUDIO_7B:
-        audio_embeds = audio_embeds.to(torch.bfloat16)
-        assert audio_embeds.ndim == 2, "Audio embeddings must be 2D"
-        return {"audio": [audio_embeds]}
+    if model == SupportedModels.QWEN_2_AUDIO_7B:
+        if audio_embeds is None:
+            raise ValueError("No audio embeddings provided.")
+        audio_embeds = audio_embeds.to(embeddings_dtype)
+        assert audio_embeds.ndim == 2, "Audio embeddings must be 2D"
+        return {"audio": [audio_embeds]}
```

Also applies to: 53-56
examples/multimodal/components/worker.py (2)
248-252: Set audio embeddings dtype explicitly.
Align PD buffer dtype with the model/audio path (and your model.construct_mm_data call). Recommend bf16 for audio to avoid double casting and potential precision loss.
- if "video" in self.engine_args.model.lower(): + if "video" in self.engine_args.model.lower(): self.EMBEDDINGS_DTYPE = torch.uint8 - else: + elif "audio" in self.engine_args.model.lower(): + self.EMBEDDINGS_DTYPE = torch.bfloat16 + else: self.EMBEDDINGS_DTYPE = torch.float16
285-312: Guard the URL path for audio/video; current else-branch assumes image and will mis-handle audio_url.
If audio_url survives to PD (e.g., encoder didn’t clear it), this code attempts to load an image with None URL. Make the URL branch explicit.
```diff
-        if (
+        if (
             request.multimodal_input.image_url is None
             and request.multimodal_input.video_url is None
             and request.multimodal_input.audio_url is None
         ):
 @@
-        else:
-            # Use PIL image instead of image embeddings
-            multi_modal_data = {
-                "image": await self.image_loader.load_image(
-                    request.multimodal_input.image_url
-                )
-            }
+        else:
+            if request.multimodal_input.image_url is not None:
+                # Use PIL image instead of image embeddings
+                multi_modal_data = {
+                    "image": await self.image_loader.load_image(
+                        request.multimodal_input.image_url
+                    )
+                }
+            elif request.multimodal_input.video_url is not None:
+                raise ValueError(
+                    "video_url cannot be consumed by PD worker; provide precomputed embeddings"
+                )
+            elif request.multimodal_input.audio_url is not None:
+                raise ValueError(
+                    "audio_url cannot be consumed by PD worker; provide precomputed embeddings"
+                )
```

Also applies to: 319-331
🧹 Nitpick comments (14)
examples/multimodal/launch/agg.sh (2)
39-53: Add prompt-template handling for audio models or enforce explicit template.
No template branch for Qwen2-Audio; script exits for audio by default. Either add a model-specific template or make --prompt-template mandatory when MODEL includes "Audio".
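A minimal sketch of the second option (requiring an explicit template for audio models) is shown below; it reuses the script's existing MODEL_NAME and PROMPT_TEMPLATE variables, but the check itself is an assumption rather than code from this PR:

```bash
# Hypothetical guard (not in the PR): fail fast with a clear message when an audio
# model is selected without an explicit prompt template.
if [[ "${MODEL_NAME,,}" == *"audio"* && -z "${PROMPT_TEMPLATE:-}" ]]; then
    echo "Error: --prompt-template is required for audio models (e.g. Qwen2-Audio)." >&2
    exit 1
fi
```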
58-60: Quote PROMPT_TEMPLATE and MODEL_NAME.
Prevent word-splitting or globbing in args.

```diff
-python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
+python3 components/processor.py --model "$MODEL_NAME" --prompt-template "$PROMPT_TEMPLATE" &
```

examples/multimodal/utils/protocol.py (1)
110-117: Validate audio_url scheme.
Add a light validator to reject non-http(s) URLs early (mirrors typical image_url hygiene).

```diff
 class AudioURLDetail(BaseModel):
-    url: str
+    url: str
+
+    @field_validator("url")
+    @classmethod
+    def _http_only(cls, v: str) -> str:
+        if not (v.startswith("http://") or v.startswith("https://")):
+            raise ValueError("audio_url must start with http:// or https://")
+        return v
```

examples/multimodal/components/worker.py (1)
274-284: Descriptor/embeddings creation assumes embeddings_shape is set.
If upstream forgets to set embeddings_shape for audio, this will break. Add a precondition check with a clearer error.

```diff
-        embeddings = torch.empty(
+        if request.embeddings_shape is None:
+            raise ValueError("embeddings_shape must be set for non-URL multimodal inputs")
+        embeddings = torch.empty(
             request.embeddings_shape,
             dtype=self.EMBEDDINGS_DTYPE,
             device=self.EMBEDDINGS_DEVICE,
         )
```

examples/multimodal/README.md (3)
511-514: Tighten wording and fix minor grammar.
- Use “passes” (singular) and parallel phrasing.
Apply:
```diff
-- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
-- frontend: HTTP endpoint to handle incoming requests.
+- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
+- frontend: Provides an HTTP endpoint to handle incoming requests.
```
543-568: Align the example payload with the note.
If keeping the temporary image_url hack, explicitly label it in the example to prevent users from switching to audio_url and breaking the flow. Also, consider lowering max_tokens from 6000 unless known-safe.
Apply:
- { - "type": "image_url", - "image_url": { - "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav" - } - } + { + "type": "image_url", + "image_url": { + "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav" + } + } + // Note: using image_url to carry audio temporarily.
543-545: Port inconsistency across sections.
Audio uses port 8000 (matches audio_agg.sh), earlier sections use 8080. Add a one-liner callout at the start of the audio section or standardize ports across examples.
Apply:
```diff
-### Client
+### Client
+Note: The audio example serves on port 8000 (audio_agg.sh). Other examples may use 8080.
```

examples/multimodal/launch/audio_agg.sh (3)
24-29: Help text mentions unrelated templates.
This launcher is audio-specific; drop LLaVA/Qwen2.5/Phi3V template note to avoid confusion.
Apply:
- echo " --prompt-template <template> Specify the multi-modal prompt template to use. LLaVA 1.5 7B, Qwen2.5-VL, and Phi3V models have predefined templates." + echo " --prompt-template <template> Specify the multimodal prompt template to use for the selected audio model."
55-60: Quote args; unify python invocation.
Quote MODEL_NAME to avoid word-splitting and use python3 consistently.
Apply:
```diff
-python -m dynamo.frontend --http-port 8000 &
+python3 -m dynamo.frontend --http-port 8000 &
-python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
+python3 components/processor.py --model "$MODEL_NAME" --prompt-template "$PROMPT_TEMPLATE" &
-CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model "$MODEL_NAME" &
+CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model "$MODEL_NAME" --worker-type prefill &
```
57-60: Guard for single-GPU environments (optional). Defaults assume two GPUs (0 and 1); either document this requirement or add a runtime check (e.g. via nvidia-smi) to warn when fewer GPUs are available.
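One possible shape for that optional runtime check is sketched below; it assumes nvidia-smi is on PATH and is not code from this PR:

```bash
# Hypothetical pre-flight check (not in the PR): warn when fewer than two GPUs are visible,
# since the launcher defaults pin the encoder to GPU 0 and the PD worker to GPU 1.
GPU_COUNT=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
if [[ "${GPU_COUNT}" -lt 2 ]]; then
    echo "Warning: audio_agg.sh expects 2 GPUs by default, but only ${GPU_COUNT} detected." >&2
fi
```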
examples/multimodal/utils/audio_loader.py (3)

65-71: Honor sampling_rate and move CPU-bound load off-loop (already good).
Also ensure we pass the desired SR.
Apply:
```diff
-        def _load_audio():
-            return librosa.load(audio_data_stream, sr=16000)
+        def _load_audio():
+            return librosa.load(audio_data_stream, sr=sampling_rate or 16000)
```
73-81: Make cache mutation atomic.
Wrap eviction and insert with a lock.
Apply:
```diff
-        if parsed_url.scheme in ("http", "https"):
-            audio_url_lower = audio_url.lower()
-            # Cache the audio for future use, and evict the oldest audio if the cache is full
-            if self._cache_queue.full():
-                oldest_audio_url = await self._cache_queue.get()
-                del self._audio_cache[oldest_audio_url]
-
-            self._audio_cache[audio_url_lower] = (audio_data, sr)
-            await self._cache_queue.put(audio_url_lower)
+        if parsed_url.scheme in ("http", "https"):
+            audio_url_lower = audio_url.lower()
+            async with self._lock:
+                if self._cache_queue.full():
+                    oldest_audio_url = await self._cache_queue.get()
+                    self._audio_cache.pop(oldest_audio_url, None)
+                self._audio_cache[audio_url_lower] = (audio_data, sr)
+                await self._cache_queue.put(audio_url_lower)
```
44-49: Cache lookup should also be locked (optional).
Prevents torn reads under concurrent evictions.
Apply:
```diff
-        if audio_url_lower in self._audio_cache:
-            logger.debug(f"Audio found in cache for URL: {audio_url}")
-            return self._audio_cache[audio_url_lower]
+        async with self._lock:
+            if audio_url_lower in self._audio_cache:
+                logger.debug(f"Audio found in cache for URL: {audio_url}")
+                return self._audio_cache[audio_url_lower]
```

examples/multimodal/components/audio_encode_worker.py (1)
39-55: Unused cupy/numpy selection block.
DEVICE/array_module are unused. Remove or wire into compute.
Apply:
```diff
-try:
-    import cupy as array_module
-
-    if not array_module.cuda.is_available():
-        raise ImportError("CUDA is not available.")
-    DEVICE = "cuda"
-    logger.info("Using cupy for array operations (GPU mode).")
-except ImportError as e:
-    logger.warning(f"Failed to import cupy, falling back to numpy: {e}.")
-    import numpy as array_module
-
-    DEVICE = "cpu"
+# (Optional) Remove cupy/numpy runtime selection; not used in this module.
```
📜 Review details
📒 Files selected for processing (9)
- examples/multimodal/README.md (1 hunks)
- examples/multimodal/components/audio_encode_worker.py (1 hunks)
- examples/multimodal/components/processor.py (1 hunks)
- examples/multimodal/components/worker.py (3 hunks)
- examples/multimodal/launch/agg.sh (1 hunks)
- examples/multimodal/launch/audio_agg.sh (1 hunks)
- examples/multimodal/utils/audio_loader.py (1 hunks)
- examples/multimodal/utils/model.py (2 hunks)
- examples/multimodal/utils/protocol.py (3 hunks)
🔇 Additional comments (5)
examples/multimodal/utils/protocol.py (1)
128-129: Schema extension looks consistent with the pipeline.
Audio content union + MultiModalInput.audio_url + embeddings_shape 2-tuple are coherent with the PD worker and model helpers.
Please confirm the engine-side expectation for audio embeddings shape tuple order (e.g., [T, D]) matches the audio encoder output.
Also applies to: 145-149, 155-157
examples/multimodal/utils/model.py (1)
31-32: Add QWEN_2_AUDIO_7B constant — LGTM.
Clearly scoped addition.
examples/multimodal/components/worker.py (2)
306-312: Audio path construction — LGTM (given encoder provides 2D).
The new branch routes embeddings to construct_mm_data correctly.
248-256: No action needed: audio encoder and worker both use torch.float16.
The audio model is loaded with torch_dtype=torch.float16 and the worker's EMBEDDINGS_DTYPE is set to torch.float16 when not handling video, so there's no RDMA type mismatch.

examples/multimodal/components/processor.py (1)
257-263: Required multimodal input validation — LGTM.
Clear and correct inclusion of audio_url.
krishung5
left a comment
Thank you for enabling audio support! General workflow lgtm, left a few comments/questions.
Could we also add some pytest coverage here: https://github.com/ai-dynamo/dynamo/blob/main/tests/serve/test_vllm.py.
@krishung5 Would you mind checking the latest commit? Also, the current recipe depends on vllm-project/vllm#23625. What do you think about adding the test case in another PR after Dynamo updates the vLLM commit tag?
whoisj
left a comment
LGTM, at least all of the nixl_connector stuff looks correct.
Approved assuming all tests pass and @yuekaizhang has done some form of validation for both the aggregated and disaggregated use cases.
Hi @whoisj and @krishung5, I've just added a test case to this PR, which depends on vllm v0.10.2rc1 (from https://github.com/vllm-project/vllm/releases/tag/v0.10.2rc1). I checked the latest vllm pip index and saw that only version 0.10.1 is available. Before the official v0.10.2 is released, what do you think about merging this PR first? This would help avoid potential code conflicts.
Hi @yuekaizhang, sorry for the late reply, I was OOO for the past two weeks. Some questions: btw, I think vllm released v0.10.2 on the pip index. Could you also help resolve the conflicts in the PR and rerun a pipeline? I can help on the CI side if you don't have access to GitLab. Thank you!
@krishung5 Hi, I have rebased the code. Also, I have verified the correctness of the whole model serving and client test process. Would you mind helping check it again?
/ok to test 2e3a2b3
Signed-off-by: Yuekai Zhang <[email protected]>
@krishung5 Sorry for the delay, would you mind helping check the CI/CD again? Many thanks!
Signed-off-by: Yuekai Zhang <[email protected]>
/ok to test 5ab7e9c
/ok to test 00484b8
/ok to test 00f8463
Signed-off-by: krishung5 <[email protected]>
/ok to test bcf5e9e
Signed-off-by: Yuekai Zhang <[email protected]>
Signed-off-by: krishung5 <[email protected]>
Co-authored-by: Kris Hung <[email protected]>
Signed-off-by: Daiyaan <[email protected]>
Cherry-pick of PR #2760 to release/0.7.1. Adds audio multimodal support for vLLM including:
- AudioEncodeWorker for processing audio inputs
- Audio aggregated and disaggregated serving examples
- Audio loader utilities
- Documentation for audio multimodal serving

Original PR: #2760
Original author: Yuekai Zhang <[email protected]>
Co-authored-by: Kris Hung <[email protected]>
Signed-off-by: Dan Gil <[email protected]>
Overview:
This PR adds a Qwen2-Audio multimodal example based on vLLM.
Details:
Where should the reviewer start?
audio_encode_worker.py and audio_agg.sh should be a good start.
@krishung5 I was wondering if you have some time to review the PR, many thanks.
Summary by CodeRabbit
New Features
Documentation