feat: Add vLLM multimodal audio support #2760
Conversation
Walkthrough
Adds audio modality support to the multimodal example: a new audio encoder worker, an audio loader utility, protocol/model updates for audio embeddings, processor and PD worker adjustments, a new launch script for audio aggregation, and README documentation describing the aggregated audio-serving workflow.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as Client
    participant F as Ingress Frontend
    participant P as Processor
    participant AE as AudioEncodeWorker
    participant PD as PD/LLM Worker
    U->>F: HTTP request (messages incl. audio_url)
    F->>P: vLLMMultimodalRequest
    Note right of P: Tokenize prompt<br/>Extract audio_url
    P->>AE: Encode request (audio_url)
    AE->>AE: Load audio (AudioLoader)
    AE->>AE: Compute audio embeddings
    AE-->>P: Embeddings descriptor
    P->>PD: Prefill/Decode request<br/>(multi_modal_data with audio)
    PD-->>F: Streamed tokens/results
    F-->>U: Streamed response
    rect rgba(200, 230, 255, 0.25)
        Note over AE,PD: New/updated interactions for audio modality
    end
```
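To make the flow above concrete, here is a minimal client sketch against the aggregated audio deployment. It assumes the frontend exposes an OpenAI-compatible chat completions endpoint on port 8000 (as in audio_agg.sh) and that the audio file is temporarily carried via an image_url content part, as the README note further down explains; the model name is a placeholder, not taken from this PR.

```bash
# Hypothetical smoke test of the aggregated audio pipeline (not part of this PR).
# Assumptions: frontend on localhost:8000, OpenAI-compatible /v1/chat/completions route,
# audio carried via image_url (temporary workaround), placeholder model name.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-Audio",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Transcribe the audio."},
        {"type": "image_url", "image_url": {"url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav"}}
      ]
    }],
    "max_tokens": 64,
    "stream": false
  }'
```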
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
👋 Hi yuekaizhang! Thank you for contributing to ai-dynamo/dynamo.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
examples/multimodal/launch/agg.sh (2)
61-64: Wire up the audio encoder worker (and quote args).
Currently launches components/encode_worker.py, not the new audio encoder; also unquoted args risk word-splitting. Switch based on model name and quote.
```diff
-# run E/P/D workers
-CUDA_VISIBLE_DEVICES=0 python3 components/encode_worker.py --model $MODEL_NAME &
-CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+# run E/P/D workers
+if [[ "${MODEL_NAME,,}" == *"audio"* ]]; then
+  CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model "$MODEL_NAME" &
+else
+  CUDA_VISIBLE_DEVICES=0 python3 components/encode_worker.py --model "$MODEL_NAME" &
+fi
+CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model "$MODEL_NAME" --worker-type prefill &
```
55-66: Background failure handling: exit fast on first crash.
set -e won’t catch background job failures. Use a wait loop to detect first failing PID and then kill others.

```diff
-# Wait for all background processes to complete
-wait
+# Wait for all background processes to complete; exit on first failure
+pids=($(jobs -p))
+status=0
+for pid in "${pids[@]}"; do
+  if ! wait "$pid"; then
+    status=$?
+    break
+  fi
+done
+exit "$status"
```

examples/multimodal/utils/model.py (1)
44-52: Fix dtype handling, None-guard, and return typing for audio path.
- Don’t force bf16; respect embeddings_dtype to avoid mismatch with the encoder/PD buffer.
- Guard None before .to(...)
- Return type annotation doesn’t match actual values (audio: List[tensor], video: np.ndarray). Relax to Dict[str, Any].
```diff
-def construct_mm_data(
+def construct_mm_data(
     model: str,
     embeddings_dtype: torch.dtype,
     image_embeds: Optional[torch.Tensor] = None,
     video_numpy: Optional[Any] = None,
     image_grid_thw: Optional[List[Any]] = None,
     audio_embeds: Optional[torch.Tensor] = None,
-) -> Dict[str, torch.Tensor | Dict[str, Any]]:
+) -> Dict[str, Any]:
 @@
-    if model == SupportedModels.QWEN_2_AUDIO_7B:
-        audio_embeds = audio_embeds.to(torch.bfloat16)
-        assert audio_embeds.ndim == 2, "Audio embeddings must be 2D"
-        return {"audio": [audio_embeds]}
+    if model == SupportedModels.QWEN_2_AUDIO_7B:
+        if audio_embeds is None:
+            raise ValueError("No audio embeddings provided.")
+        audio_embeds = audio_embeds.to(embeddings_dtype)
+        assert audio_embeds.ndim == 2, "Audio embeddings must be 2D"
+        return {"audio": [audio_embeds]}
```

Also applies to: 53-56
examples/multimodal/components/worker.py (2)
248-252: Set audio embeddings dtype explicitly.
Align PD buffer dtype with the model/audio path (and your model.construct_mm_data call). Recommend bf16 for audio to avoid double casting and potential precision loss.
- if "video" in self.engine_args.model.lower(): + if "video" in self.engine_args.model.lower(): self.EMBEDDINGS_DTYPE = torch.uint8 - else: + elif "audio" in self.engine_args.model.lower(): + self.EMBEDDINGS_DTYPE = torch.bfloat16 + else: self.EMBEDDINGS_DTYPE = torch.float16
285-312: Guard the URL path for audio/video; current else-branch assumes image and will mis-handle audio_url.
If audio_url survives to PD (e.g., encoder didn’t clear it), this code attempts to load an image with None URL. Make the URL branch explicit.
```diff
-        if (
+        if (
             request.multimodal_input.image_url is None
             and request.multimodal_input.video_url is None
             and request.multimodal_input.audio_url is None
         ):
 @@
-        else:
-            # Use PIL image instead of image embeddings
-            multi_modal_data = {
-                "image": await self.image_loader.load_image(
-                    request.multimodal_input.image_url
-                )
-            }
+        else:
+            if request.multimodal_input.image_url is not None:
+                # Use PIL image instead of image embeddings
+                multi_modal_data = {
+                    "image": await self.image_loader.load_image(
+                        request.multimodal_input.image_url
+                    )
+                }
+            elif request.multimodal_input.video_url is not None:
+                raise ValueError(
+                    "video_url cannot be consumed by PD worker; provide precomputed embeddings"
+                )
+            elif request.multimodal_input.audio_url is not None:
+                raise ValueError(
+                    "audio_url cannot be consumed by PD worker; provide precomputed embeddings"
+                )
```

Also applies to: 319-331
🧹 Nitpick comments (14)
examples/multimodal/launch/agg.sh (2)
39-53: Add prompt-template handling for audio models or enforce explicit template.
No template branch for Qwen2-Audio; script exits for audio by default. Either add a model-specific template or make --prompt-template mandatory when MODEL includes "Audio".
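A minimal sketch of the second option (requiring an explicit template for audio models) is shown below; it reuses the script's existing MODEL_NAME and PROMPT_TEMPLATE variables, but the check itself is an assumption rather than code from this PR:

```bash
# Hypothetical guard (not in the PR): fail fast with a clear message when an audio
# model is selected without an explicit prompt template.
if [[ "${MODEL_NAME,,}" == *"audio"* && -z "${PROMPT_TEMPLATE:-}" ]]; then
    echo "Error: --prompt-template is required for audio models (e.g. Qwen2-Audio)." >&2
    exit 1
fi
```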
58-60: Quote PROMPT_TEMPLATE and MODEL_NAME.
Prevent word-splitting or globbing in args.

```diff
-python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
+python3 components/processor.py --model "$MODEL_NAME" --prompt-template "$PROMPT_TEMPLATE" &
```

examples/multimodal/utils/protocol.py (1)
110-117: Validate audio_url scheme.
Add a light validator to reject non-http(s) URLs early (mirrors typical image_url hygiene).

```diff
 class AudioURLDetail(BaseModel):
-    url: str
+    url: str
+
+    @field_validator("url")
+    @classmethod
+    def _http_only(cls, v: str) -> str:
+        if not (v.startswith("http://") or v.startswith("https://")):
+            raise ValueError("audio_url must start with http:// or https://")
+        return v
```

examples/multimodal/components/worker.py (1)
274-284: Descriptor/embeddings creation assumes embeddings_shape is set.
If upstream forgets to set embeddings_shape for audio, this will break. Add a precondition check with a clearer error.

```diff
-        embeddings = torch.empty(
+        if request.embeddings_shape is None:
+            raise ValueError("embeddings_shape must be set for non-URL multimodal inputs")
+        embeddings = torch.empty(
             request.embeddings_shape,
             dtype=self.EMBEDDINGS_DTYPE,
             device=self.EMBEDDINGS_DEVICE,
         )
```

examples/multimodal/README.md (3)
511-514: Tighten wording and fix minor grammar.
- Use “passes” (singular) and parallel phrasing.
Apply:
```diff
-- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
-- frontend: HTTP endpoint to handle incoming requests.
+- processor: Tokenizes the prompt and passes it to the AudioEncodeWorker.
+- frontend: Provides an HTTP endpoint to handle incoming requests.
```
543-568: Align the example payload with the note.
If keeping the temporary image_url hack, explicitly label it in the example to prevent users from switching to audio_url and breaking the flow. Also, consider lowering max_tokens from 6000 unless known-safe.
Apply:
- { - "type": "image_url", - "image_url": { - "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav" - } - } + { + "type": "image_url", + "image_url": { + "url": "https://raw.githubusercontent.com/yuekaizhang/Triton-ASR-Client/main/datasets/mini_en/wav/1221-135766-0002.wav" + } + } + // Note: using image_url to carry audio temporarily.
543-545: Port inconsistency across sections.
Audio uses port 8000 (matches audio_agg.sh), earlier sections use 8080. Add a one-liner callout at the start of the audio section or standardize ports across examples.
Apply:
```diff
-### Client
+### Client
+Note: The audio example serves on port 8000 (audio_agg.sh). Other examples may use 8080.
```

examples/multimodal/launch/audio_agg.sh (3)
24-29: Help text mentions unrelated templates.
This launcher is audio-specific; drop LLaVA/Qwen2.5/Phi3V template note to avoid confusion.
Apply:
- echo " --prompt-template <template> Specify the multi-modal prompt template to use. LLaVA 1.5 7B, Qwen2.5-VL, and Phi3V models have predefined templates." + echo " --prompt-template <template> Specify the multimodal prompt template to use for the selected audio model."
55-60: Quote args; unify python invocation.
Quote MODEL_NAME to avoid word-splitting and use python3 consistently.
Apply:
```diff
-python -m dynamo.frontend --http-port 8000 &
+python3 -m dynamo.frontend --http-port 8000 &
-python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
+python3 components/processor.py --model "$MODEL_NAME" --prompt-template "$PROMPT_TEMPLATE" &
-CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model "$MODEL_NAME" &
+CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model "$MODEL_NAME" --worker-type prefill &
```
57-60: Guard for single-GPU environments (optional). Defaults assume two GPUs (0 and 1); either document this requirement or add a runtime check (e.g. via nvidia-smi) to warn when fewer GPUs are available.
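One possible shape for that optional runtime check is sketched below; it assumes nvidia-smi is on PATH and is not code from this PR:

```bash
# Hypothetical pre-flight check (not in the PR): warn when fewer than two GPUs are visible,
# since the launcher defaults pin the encoder to GPU 0 and the PD worker to GPU 1.
GPU_COUNT=$(nvidia-smi --list-gpus 2>/dev/null | wc -l)
if [[ "${GPU_COUNT}" -lt 2 ]]; then
    echo "Warning: audio_agg.sh expects 2 GPUs by default, but only ${GPU_COUNT} detected." >&2
fi
```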
examples/multimodal/utils/audio_loader.py (3)

65-71: Honor sampling_rate and move CPU-bound load off-loop (already good).
Also ensure we pass the desired SR.
Apply:
```diff
-        def _load_audio():
-            return librosa.load(audio_data_stream, sr=16000)
+        def _load_audio():
+            return librosa.load(audio_data_stream, sr=sampling_rate or 16000)
```
73-81: Make cache mutation atomic.
Wrap eviction and insert with a lock.
Apply:
```diff
-        if parsed_url.scheme in ("http", "https"):
-            audio_url_lower = audio_url.lower()
-            # Cache the audio for future use, and evict the oldest audio if the cache is full
-            if self._cache_queue.full():
-                oldest_audio_url = await self._cache_queue.get()
-                del self._audio_cache[oldest_audio_url]
-
-            self._audio_cache[audio_url_lower] = (audio_data, sr)
-            await self._cache_queue.put(audio_url_lower)
+        if parsed_url.scheme in ("http", "https"):
+            audio_url_lower = audio_url.lower()
+            async with self._lock:
+                if self._cache_queue.full():
+                    oldest_audio_url = await self._cache_queue.get()
+                    self._audio_cache.pop(oldest_audio_url, None)
+                self._audio_cache[audio_url_lower] = (audio_data, sr)
+                await self._cache_queue.put(audio_url_lower)
```
44-49: Cache lookup should also be locked (optional).
Prevents torn reads under concurrent evictions.
Apply:
```diff
-        if audio_url_lower in self._audio_cache:
-            logger.debug(f"Audio found in cache for URL: {audio_url}")
-            return self._audio_cache[audio_url_lower]
+        async with self._lock:
+            if audio_url_lower in self._audio_cache:
+                logger.debug(f"Audio found in cache for URL: {audio_url}")
+                return self._audio_cache[audio_url_lower]
```

examples/multimodal/components/audio_encode_worker.py (1)
39-55: Unused cupy/numpy selection block.
DEVICE/array_module are unused. Remove or wire into compute.
Apply:
```diff
-try:
-    import cupy as array_module
-
-    if not array_module.cuda.is_available():
-        raise ImportError("CUDA is not available.")
-    DEVICE = "cuda"
-    logger.info("Using cupy for array operations (GPU mode).")
-except ImportError as e:
-    logger.warning(f"Failed to import cupy, falling back to numpy: {e}.")
-    import numpy as array_module
-
-    DEVICE = "cpu"
+# (Optional) Remove cupy/numpy runtime selection; not used in this module.
```
📜 Review details
📒 Files selected for processing (9)
- examples/multimodal/README.md (1 hunks)
- examples/multimodal/components/audio_encode_worker.py (1 hunks)
- examples/multimodal/components/processor.py (1 hunks)
- examples/multimodal/components/worker.py (3 hunks)
- examples/multimodal/launch/agg.sh (1 hunks)
- examples/multimodal/launch/audio_agg.sh (1 hunks)
- examples/multimodal/utils/audio_loader.py (1 hunks)
- examples/multimodal/utils/model.py (2 hunks)
- examples/multimodal/utils/protocol.py (3 hunks)
🔇 Additional comments (5)
examples/multimodal/utils/protocol.py (1)
128-129: Schema extension looks consistent with the pipeline.
Audio content union + MultiModalInput.audio_url + embeddings_shape 2-tuple are coherent with the PD worker and model helpers.
Please confirm the engine-side expectation for audio embeddings shape tuple order (e.g., [T, D]) matches the audio encoder output.
Also applies to: 145-149, 155-157
examples/multimodal/utils/model.py (1)
31-32: Add QWEN_2_AUDIO_7B constant — LGTM.
Clearly scoped addition.
examples/multimodal/components/worker.py (2)
306-312: Audio path construction — LGTM (given encoder provides 2D).
The new branch routes embeddings to construct_mm_data correctly.
248-256: No action needed: audio encoder and worker both use torch.float16.
The audio model is loaded with torch_dtype=torch.float16 and the worker's EMBEDDINGS_DTYPE is set to torch.float16 when not handling video, so there's no RDMA type mismatch.

examples/multimodal/components/processor.py (1)
257-263: Required multimodal input validation — LGTM.
Clear and correct inclusion of audio_url.
krishung5
left a comment
Thank you for enabling audio support! General workflow lgtm, left a few comments/questions.
Could we also add some pytest coverage here: https://github.com/ai-dynamo/dynamo/blob/main/tests/serve/test_vllm.py.
@krishung5 Would you mind checking the latest commit? Also, the current recipe depends on vllm-project/vllm#23625. What do you think about adding the test case in another PR after Dynamo updates the vLLM commit tag?
whoisj
left a comment
LGTM, at least all of the nixl_connector stuff looks correct.
Approved assuming all tests pass and @yuekaizhang has done some form of validation for both the aggregated and disaggregated use cases.
Hi @whoisj and @krishung5, I've just added a test case to this PR, which depends on vllm v0.10.2rc1 (from https://github.com/vllm-project/vllm/releases/tag/v0.10.2rc1). I checked the latest vllm pip index and saw that only version 0.10.1 is available. Before the official v0.10.2 is released, what do you think about merging this PR first? This would help avoid potential code conflicts.
Hi @yuekaizhang, sorry for the late reply, I was OOO for the past two weeks. Some questions: btw, I think vllm released v0.10.2 on the pip index. Could you also help resolve the conflicts in the PR and rerun a pipeline? I can help on the CI side if you don't have access to GitLab. Thank you!
@krishung5 Hi, I have rebased the code. Also, I have verified the correctness of the whole model serving and client test process. Would you mind helping check it again?
/ok to test 2e3a2b3
Signed-off-by: Yuekai Zhang <[email protected]>
@krishung5 Sorry for the delay, would you mind helping check the CI/CD again? Many thanks!
Signed-off-by: Yuekai Zhang <[email protected]>
/ok to test 5ab7e9c
/ok to test 00484b8
/ok to test 00f8463
Signed-off-by: krishung5 <[email protected]>
/ok to test bcf5e9e
Signed-off-by: Yuekai Zhang <[email protected]>
Signed-off-by: krishung5 <[email protected]>
Co-authored-by: Kris Hung <[email protected]>
Signed-off-by: Daiyaan <[email protected]>
Cherry-pick of PR #2760 to release/0.7.1. Adds audio multimodal support for vLLM including:
- AudioEncodeWorker for processing audio inputs
- Audio aggregated and disaggregated serving examples
- Audio loader utilities
- Documentation for audio multimodal serving

Original PR: #2760
Original author: Yuekai Zhang <[email protected]>
Co-authored-by: Kris Hung <[email protected]>
Signed-off-by: Dan Gil <[email protected]>
Overview:
This PR adds a Qwen2-Audio multimodal example based on vLLM.
Details:
Where should the reviewer start?
audio_encode_worker.py and audio_agg.sh should be a good start.
@krishung5 I was wondering if you have some time to review the PR, many thanks.
Summary by CodeRabbit
New Features
Documentation