
Conversation

@dmitry-tokarev-nv (Contributor) commented Aug 25, 2025

Overview:

cherry-pick #2641

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Default deployments and scripts now target Qwen/Qwen3-0.6B.
    • Helm chart adds component-aware startup, ports, and health checks.
    • vLLM runtime now includes Prometheus.
    • New multimodal LLaVA aggregated deployment example.
    • Frontend auto-skips tokenizer init when appropriate.
  • Bug Fixes

    • More robust streaming decode handling with clearer errors.
  • Documentation

    • Updated versions (TensorRT-LLM rc6, vLLM 0.10.1.1), guides, and examples; simplified multimodal/Eagle docs; HiCache flag updated.
  • Tests

    • Adjusted timeouts and markers; updated model references.
  • Chores

    • Container/base image and dependency bumps; UCX pinned; GPU resource placement refined.

@dmitry-tokarev-nv changed the base branch from main to release/0.4.1 August 25, 2025 17:22
@dmitry-tokarev-nv changed the title from "Dtokarev cp vllm 0.10.1.1" to "chore: vllm 0.10.1.1" Aug 25, 2025
@github-actions bot added the chore label Aug 25, 2025
@coderabbitai bot (Contributor) commented Aug 25, 2025

Caution

Review failed

Failed to post review comments.

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 135dc82 and 28c061a.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (59)
  • Cargo.toml (0 hunks)
  • README.md (2 hunks)
  • components/backends/sglang/README.md (1 hunks)
  • components/backends/sglang/deploy/agg.yaml (1 hunks)
  • components/backends/sglang/deploy/agg_router.yaml (1 hunks)
  • components/backends/sglang/deploy/disagg-multinode.yaml (1 hunks)
  • components/backends/sglang/deploy/disagg.yaml (2 hunks)
  • components/backends/sglang/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/sglang/docs/multinode-examples.md (1 hunks)
  • components/backends/sglang/docs/sgl-hicache-example.md (3 hunks)
  • components/backends/sglang/launch/agg.sh (1 hunks)
  • components/backends/sglang/launch/agg_router.sh (2 hunks)
  • components/backends/sglang/launch/disagg.sh (2 hunks)
  • components/backends/sglang/slurm_jobs/scripts/h100.sh (1 hunks)
  • components/backends/sglang/slurm_jobs/scripts/worker_setup.py (1 hunks)
  • components/backends/sglang/src/dynamo/sglang/args.py (1 hunks)
  • components/backends/sglang/src/dynamo/sglang/request_handlers/decode_handler.py (1 hunks)
  • components/backends/trtllm/README.md (1 hunks)
  • components/backends/trtllm/deploy/README.md (0 hunks)
  • components/backends/trtllm/deploy/agg.yaml (1 hunks)
  • components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
  • components/backends/trtllm/deploy/disagg.yaml (2 hunks)
  • components/backends/trtllm/deploy/disagg_router.yaml (2 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml (0 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yaml (1 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml (0 hunks)
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml (0 hunks)
  • components/backends/trtllm/gemma3_sliding_window_attention.md (1 hunks)
  • components/backends/trtllm/gpt-oss.md (1 hunks)
  • components/backends/trtllm/launch/agg.sh (1 hunks)
  • components/backends/trtllm/launch/agg_router.sh (1 hunks)
  • components/backends/trtllm/launch/disagg.sh (1 hunks)
  • components/backends/trtllm/launch/disagg_router.sh (1 hunks)
  • components/backends/trtllm/llama4_plus_eagle.md (1 hunks)
  • components/backends/vllm/deploy/agg_router.yaml (1 hunks)
  • container/Dockerfile (2 hunks)
  • container/Dockerfile.kvbm (1 hunks)
  • container/Dockerfile.sglang (1 hunks)
  • container/Dockerfile.sglang-wideep (2 hunks)
  • container/Dockerfile.trtllm (5 hunks)
  • container/Dockerfile.vllm (3 hunks)
  • container/build.sh (4 hunks)
  • container/deps/vllm/install_vllm.sh (2 hunks)
  • deploy/helm/chart/templates/deployment.yaml (5 hunks)
  • deploy/helm/chart/templates/grove-podgangset.yaml (3 hunks)
  • deploy/helm/chart/templates/service.yaml (1 hunks)
  • docs/support_matrix.md (1 hunks)
  • examples/multimodal/deploy/agg_llava.yaml (1 hunks)
  • examples/runtime/hello_world/client.py (1 hunks)
  • examples/runtime/hello_world/deploy/hello_world.yaml (3 hunks)
  • lib/async-openai-macros/Cargo.toml (0 hunks)
  • lib/async-openai-macros/src/lib.rs (0 hunks)
  • lib/async-openai/Cargo.toml (1 hunks)
  • pyproject.toml (1 hunks)
  • tests/kvbm/test_determinism.py (1 hunks)
  • tests/serve/test_sglang.py (2 hunks)
  • tests/serve/test_vllm.py (1 hunks)
💤 Files with no reviewable changes (7)
  • Cargo.toml
  • components/backends/trtllm/deploy/README.md
  • components/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yaml
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_decode.yaml
  • lib/async-openai-macros/src/lib.rs
  • components/backends/trtllm/engine_configs/llama4/eagle_one_model/eagle_agg.yml
  • lib/async-openai-macros/Cargo.toml
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-07-28T17:00:07.968Z
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.

Applied to files:

  • components/backends/sglang/deploy/agg_router.yaml
  • components/backends/trtllm/deploy/agg_router.yaml
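
The fail-fast `&&` chaining this learning describes can be approximated in Python. A minimal sketch, assuming illustrative command names and arguments (only `clear_namespace` itself comes from the learning):

```python
import subprocess

# Equivalent of `clear_namespace ... && start-router`: check=True raises on a
# nonzero exit, so the router is never started if namespace clearing fails.
subprocess.run(["clear_namespace", "--namespace", "dynamo"], check=True)
subprocess.run(["python3", "-m", "dynamo.frontend", "--router-mode", "kv"], check=True)
```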
📚 Learning: 2025-07-03T10:14:30.570Z
Learnt from: fsaady
PR: ai-dynamo/dynamo#1730
File: examples/sglang/slurm_jobs/scripts/worker_setup.py:230-244
Timestamp: 2025-07-03T10:14:30.570Z
Learning: In examples/sglang/slurm_jobs/scripts/worker_setup.py, background processes (like nats-server, etcd) are intentionally left running even if later processes fail. This design choice allows users to manually connect to nodes and debug issues without having to restart the entire SLURM job from scratch, providing operational flexibility for troubleshooting in cluster environments.

Applied to files:

  • components/backends/sglang/slurm_jobs/scripts/worker_setup.py
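
A minimal Python sketch of the pattern this learning describes, assuming an illustrative worker command (only nats-server and etcd are named in the learning):

```python
import subprocess

# Background services are started detached and deliberately NOT torn down when
# a later step fails, so the node stays inspectable without restarting the
# entire SLURM job.
nats = subprocess.Popen(["nats-server"])
etcd = subprocess.Popen(["etcd"])
try:
    subprocess.run(["python3", "-m", "dynamo.sglang.worker"], check=True)  # illustrative
except subprocess.CalledProcessError as exc:
    # Intentionally no nats.terminate() / etcd.terminate() here.
    print(f"worker failed ({exc}); leaving nats-server and etcd running for debugging")
```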
📚 Learning: 2025-07-01T15:33:53.262Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#1444
File: tests/fault_tolerance/configs/agg_tp_1_dp_8.yaml:31-38
Timestamp: 2025-07-01T15:33:53.262Z
Learning: In fault tolerance test configurations, the `resources` section under `ServiceArgs` specifies resources per individual worker, not total resources for all workers. So `workers: 8` with `gpu: '1'` means 8 workers × 1 GPU each = 8 GPUs total.

Applied to files:

  • components/backends/vllm/deploy/agg_router.yaml
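
The per-worker semantics are easy to get wrong, so here is the learning's arithmetic as a small sketch (the dict shape mirrors the YAML, not any real API):

```python
# `resources` under ServiceArgs applies to EACH worker, not to the pool.
service_args = {"workers": 8, "resources": {"gpu": "1"}}
total_gpus = service_args["workers"] * int(service_args["resources"]["gpu"])
assert total_gpus == 8  # 8 workers x 1 GPU each
```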
📚 Learning: 2025-08-18T16:52:15.659Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.

Applied to files:

  • container/deps/vllm/install_vllm.sh
📚 Learning: 2025-08-18T16:52:15.659Z
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable is an official vLLM environment variable that, when exported, automatically triggers vLLM's build system to use the specified precompiled wheel instead of building from source. This works even with standard `uv pip install .` commands without requiring explicit reference to the variable in the install command. The vLLM build system internally detects and uses this environment variable.

Applied to files:

  • container/deps/vllm/install_vllm.sh
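
A hedged sketch of what these two learnings describe; the wheel URL is a placeholder, not a value taken from this PR:

```python
import os
import subprocess

# Exporting VLLM_PRECOMPILED_WHEEL_LOCATION is enough: vLLM's build system
# detects it and installs the precompiled wheel even under a plain
# `uv pip install .`, with no explicit reference in the install command.
env = dict(
    os.environ,
    VLLM_PRECOMPILED_WHEEL_LOCATION="https://example.com/vllm-0.10.1.1-cp38-abi3-manylinux1_x86_64.whl",
)
subprocess.run(["uv", "pip", "install", "."], cwd="vllm", env=env, check=True)
```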
🪛 Shellcheck (0.10.0)
container/build.sh

[warning] 62-62: TRTLLM_BASE_IMAGE_TAG appears unused. Verify use (or export if used externally).

(SC2034)

🪛 YAMLlint (1.37.1)
examples/multimodal/deploy/agg_llava.yaml

[error] 68-68: no new line character at the end of file

(new-line-at-end-of-file)

🪛 LanguageTool
docs/support_matrix.md

[grammar] ~70-~70: There might be a mistake here.
Context: ... | | NIXL | 0.4.1 ...

(QB_NEW_EN)


[grammar] ~73-~73: There might be a mistake here.
Context: ... | > [!Important] > Specific versions of TensorRT-LLM supp...

(QB_NEW_EN)


[grammar] ~80-~80: There might be a mistake here.
Context: ...on** | Architecture | Status | | :------------------------ | :---------...

(QB_NEW_EN)


[grammar] ~81-~81: There might be a mistake here.
Context: ...---- | :--------------- | :----------- | | Amazon Linux | 2023 ...

(QB_NEW_EN)

components/backends/sglang/docs/sgl-hicache-example.md

[grammar] ~26-~26: There might be a mistake here.
Context: ...-hicache-ratio**: The ratio of the size of host KV cache memory pool to the size o...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...f host KV cache memory pool to the size of device pool. Lower this number if your ...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
components/backends/trtllm/gpt-oss.md

217-217: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


222-222: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

components/backends/trtllm/llama4_plus_eagle.md

33-33: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Trigger CI Pipeline

Walkthrough

This PR updates defaults and docs to switch models to Qwen/Qwen3-0.6B, bumps several container/tooling versions (TensorRT-LLM rc6, PyTorch 25.06, UCX v1.19.0, vLLM 0.10.1.1), restructures Helm templates for component-type-aware commands/probes, adds SGLang CLI behavior (auto skip_tokenizer_init) and decode error handling, removes a local Rust proc-macro crate, and adjusts examples/tests.

Changes

• Rust workspace/macros (Cargo.toml, lib/async-openai-macros/Cargo.toml, lib/async-openai-macros/src/lib.rs, lib/async-openai/Cargo.toml): Remove the local proc-macro crate from the workspace, delete its sources, and switch async-openai to async-openai-macros = "0.1.0" from crates.io.
• SGLang model/defaults & docs (components/backends/sglang/deploy/*.yaml, components/backends/sglang/launch/*.sh, components/backends/sglang/README.md, components/backends/sglang/docs/*): Replace model references with Qwen/Qwen3-0.6B; minor entrypoint tweak in the multinode doc; the HiCache example switches --hicache-size to --hicache-ratio; SLURM h100 decode CUDA graph batch 256→128; ingress start command updated to python3 -m dynamo.frontend.
• SGLang code behavior (components/backends/sglang/src/dynamo/sglang/args.py, .../request_handlers/decode_handler.py): After parsing, default skip_tokenizer_init=True with a warning when unset; add a KeyError guard for output_ids, raising ValueError with guidance (see the sketch after the sequence diagram below).
• TRTLLM model/defaults & docs (components/backends/trtllm/deploy/*.yaml, components/backends/trtllm/launch/*.sh, components/backends/trtllm/README.md, .../deploy/README.md, components/backends/trtllm/gpt-oss.md, components/backends/trtllm/llama4_plus_eagle.md): Switch model to Qwen/Qwen3-0.6B; remove MTP experimental notes; condense multimodal docs; add a readiness/health verification flow; simplify Eagle guidance.
• TRTLLM engine configs, Llama4/Eagle (components/backends/trtllm/engine_configs/llama4/...): Delete Eagle one-model configs; adjust Eagle agg/decode/prefill YAMLs (parallel sizes, batch sizes, tokens, cuda_graph config, speculative params, memory fractions; remove some keys).
• vLLM adjustments (container/Dockerfile.vllm, container/deps/vllm/install_vllm.sh, components/backends/vllm/deploy/agg_router.yaml): Bump the vLLM ref to 0.10.1.1; fix the install script help; add the Prometheus binary to the runtime; move the GPU limit to worker level in agg_router.
• Helm chart templates (deploy/helm/chart/templates/deployment.yaml, .../grove-podgangset.yaml, .../service.yaml): Component-type-aware command/args defaults, env, and ports; frontend/worker-specific liveness/readiness probes; condition the Service on componentType: frontend; add terminationDelay to PodGangSet.
• Container/base versions & pins (container/Dockerfile*, container/build.sh, pyproject.toml, README.md, docs/support_matrix.md): Bump the TRTLLM base to 25.06 and the wheel to 1.0.0rc6; pin UCX to v1.19.0 across Dockerfiles; minor Torch/TorchVision bumps; update notes/docs; switch some instructions to uv pip; add a cuda-python pin note; add a support-matrix caution for AL2023.
• Examples (examples/runtime/hello_world/client.py, examples/runtime/hello_world/deploy/hello_world.yaml, examples/multimodal/deploy/agg_llava.yaml): The hello_world client gains an infinite loop with exponential backoff (a retry sketch follows this list); the deployment sets backendFramework=vllm and simplifies readiness and args; add a LLaVA vLLM multimodal deployment example.
• Tests (tests/kvbm/test_determinism.py, tests/serve/test_sglang.py, tests/serve/test_vllm.py): Narrow pytest markers to kvbm; switch the test model to Qwen and relax one assertion; increase the vLLM test timeout from 300 to 500 seconds.
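
For the hello_world client change above, a minimal sketch of the retry shape, assuming a connect() hook and backoff constants that are not taken from the actual client:

```python
import asyncio

async def run_forever(connect) -> None:
    """Infinite loop with capped exponential backoff (constants illustrative)."""
    delay = 1.0
    while True:
        try:
            await connect()
            delay = 1.0  # reset backoff after a successful call
        except ConnectionError as exc:
            print(f"request failed ({exc}); retrying in {delay:.0f}s")
            await asyncio.sleep(delay)
            delay = min(delay * 2, 60.0)  # cap the backoff
```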

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Frontend as Dynamo Frontend
  participant SGLang as SGLang Worker
  Note over Frontend,SGLang: New default: skip_tokenizer_init=true when using frontend
  User->>Frontend: Request (chat/completions)
  Frontend->>Frontend: Tokenize/Detokenize (frontend)
  Frontend->>SGLang: Generate stream (skip tokenizer init)
  SGLang-->>Frontend: Stream chunks (may omit output_ids)
  alt Missing output_ids
    SGLang-->>Frontend: Error (ValueError with keys and guidance)
    Frontend-->>User: Error response
  else Normal stream
    Frontend-->>User: Stream tokens
  end
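The two failure-relevant behaviors in the diagram can be sketched as follows; function and field names here are assumptions, not the PR's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def default_skip_tokenizer_init(args) -> None:
    # args.py behavior: when unset, default to True (the frontend already
    # tokenizes/detokenizes) and warn so users can override explicitly.
    if args.skip_tokenizer_init is None:
        logger.warning("skip_tokenizer_init not set; defaulting to True")
        args.skip_tokenizer_init = True

def output_ids_from(chunk: dict) -> list[int]:
    # decode_handler.py behavior: guard the output_ids lookup and re-raise
    # as a ValueError that lists the available keys plus guidance.
    try:
        return chunk["output_ids"]
    except KeyError:
        raise ValueError(
            f"stream chunk has no 'output_ids' (keys: {sorted(chunk)}); "
            "ensure the worker was started with skip_tokenizer_init enabled"
        ) from None
```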

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Poem

I hopped through charts and YAML seas,
Swapped DeepSeek’s leaves for Qwen’s breeze.
UCX pinned, the wheels now spin,
Frontends probe and workers grin.
Macros packed and shipped away—
Streams now whisper what to say.
— a busy bun on release day 🐇✨


@dmitry-tokarev-nv merged commit fd8b52f into release/0.4.1 Aug 25, 2025
5 of 11 checks passed
@dmitry-tokarev-nv deleted the dtokarev-cp-vllm-0.10.1.1 branch August 25, 2025 20:44
@coderabbitai bot mentioned this pull request Aug 26, 2025