
Conversation

@biswapanda (Contributor) commented Aug 26, 2025

Overview:

Cherry pick: #2727

The liveness and readiness checks for the hello world example should exit 0.
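A minimal sketch of what the corrected probes could look like, assuming exec-style probes in examples/runtime/hello_world/deploy/hello_world.yaml; the exact manifest layout is an assumption, but the probe shape follows standard Kubernetes semantics:

```yaml
# Sketch only: probes whose command always exits 0, so the hello_world pod
# reports live/ready without needing a real health endpoint.
livenessProbe:
  exec:
    command: ["sh", "-c", "exit 0"]
  periodSeconds: 10
readinessProbe:
  exec:
    command: ["sh", "-c", "exit 0"]
  periodSeconds: 10
```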

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Per-component /dev/shm configuration via new SharedMemory field in CRDs/Helm (see the spec sketch after this summary); frontend/worker-aware defaults for commands, env, ports, and probes.
    • Readiness gating for SGLang requests until model registration completes.
    • New multimodal example deployment (LLaVA, aggregated).
  • Bug Fixes

    • Safer handling of missing output tokens in SGLang decode stream.
    • Reduced CUDA graph batch size on H100 job script; increased readiness timeouts.
  • Documentation

    • New Quickstart (local), Installation, Examples gallery; major docs reorg.
    • Updated health-check guidance; numerous model examples switched to Qwen/Qwen3-0.6B.
  • Chores

    • Version bumps: TensorRT-LLM 1.0.0rc6, vLLM 0.10.1.1, UCX v1.19.0; images include Prometheus.
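For the /dev/shm feature above, here is a hedged sketch of how the new sharedMemory field might be written on a component spec. It is based only on the disabled/size shape and the 8Gi default described in the walkthrough below; the enclosing spec structure is an assumption, not the CRD's verbatim schema.

```yaml
# Sketch only: per-component shared-memory settings (shape per the walkthrough:
# sharedMemory with disabled/size, defaulting to an 8Gi /dev/shm mount).
# The surrounding DynamoComponentDeployment structure is assumed.
spec:
  sharedMemory:
    disabled: false   # set true to opt the component out of /dev/shm
    size: 8Gi         # stated default size
```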

copy-pr-bot bot commented Aug 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@biswapanda biswapanda changed the base branch from main to release/0.4.1 August 26, 2025 22:59
@biswapanda biswapanda closed this Aug 26, 2025
coderabbitai bot (Contributor) commented Aug 26, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The PR updates docs and configs broadly, switches default/example models to Qwen/Qwen3-0.6B, bumps multiple container/dependency versions, removes an internal Rust macro crate, adds shared-memory configuration to CRDs/operator, enhances Helm templates for frontend/worker roles, and introduces SGLang worker readiness gating and tokenizer-init behavior changes.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Licenses**<br>`ATTRIBUTIONS-Go.md` | Adds MIT/BSD-3-Clause attributions for testify and go-difflib (duplicated blocks). |
| **Rust workspace/macros**<br>`Cargo.toml`, `lib/async-openai-macros/*`, `lib/async-openai/Cargo.toml` | Removes local proc-macro crate; updates dependency to crates.io `async-openai-macros = "0.1.0"`. |
| **SGLang runtime behavior**<br>`components/backends/sglang/src/dynamo/sglang/{args.py,main.py,register.py,request_handlers/decode_handler.py}`, `components/backends/sglang/slurm_jobs/scripts/worker_setup.py` | Forces `--skip-tokenizer-init true` with warning; adds readiness gate before serving; `register_llm_with_runtime_config` now returns bool; defensive handling for missing `output_ids`; frontend start cmd updated. |
| **SGLang deploy/docs/model switch**<br>`components/backends/sglang/{README.md,deploy/*,launch/*.sh,docs/*}` | Switches example/deploy model refs to Qwen/Qwen3-0.6B; link fixes; doc updates (HiCache flag `--hicache-size` → `--hicache-ratio`); multinode doc tweaks. |
| **TRTLLM deploy/engine configs**<br>`components/backends/trtllm/{README.md,deploy/*,engine_configs/llama4/**,llama4_plus_eagle.md,gpt-oss.md}` | Switches model refs to Qwen; removes/adjusts Eagle configs (deletions and parameter changes); streamlines multimodal docs; adds readiness guidance. |
| **vLLM updates**<br>`components/backends/vllm/deploy/README.md`, `components/backends/vllm/deploy/agg_router.yaml` | Moves GPU limit to service level; doc links updated; adds architecture links. |
| **Operator & CRDs: shared memory + backend detection**<br>`deploy/cloud/helm/crds/templates/nvidia.com_*.yaml`, `deploy/cloud/operator/api/v1alpha1/*`, `deploy/cloud/operator/internal/{consts/consts.go,dynamo/graph.go}`, `deploy/cloud/operator/config/crd/bases/nvidia.com_*.yaml`, `deploy/cloud/operator/api/v1alpha1/zz_generated.deepcopy.go`, `deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go`, `deploy/cloud/operator/internal/dynamo/graph_test.go` | Adds `sharedMemory` spec (disabled/size) to CRDs/types; defaults `/dev/shm` and 8Gi; generates deepcopy; integrates per-component shm volume/mount; refines env merge precedence; introduces `BackendFrameworkNoop` and relaxed detection. |
| **Helm templates (frontend/worker-aware)**<br>`deploy/helm/chart/templates/{deployment.yaml,grove-podgangset.yaml,service.yaml}` | Adds componentType-aware defaults for command/args, env, ports, and probes; services render for frontend; adds readiness/liveness behavior per role; `terminationDelay` in PodGangSet. |
| **Container builds & deps**<br>`container/Dockerfile*`, `container/build.sh`, `container/deps/vllm/install_vllm.sh` | Pins UCX to v1.19.0; updates TRT-LLM base/runtime tags and PyTorch/dep versions; copies Prometheus into runtime images; updates vLLM ref/wheel; bumps NGC/TRT-LLM vars in build script. |
| **Docs restructure & links**<br>`docs/**/*`, `README.md`, `deploy/inference-gateway/README.md`, `docs/conf.py` | Major docs reorg (index, sections, includes, install/quickstart); updates support matrix; removes legacy pages; Sphinx config overhauled; adds example gallery and installation/architecture pages. |
| **Examples**<br>`examples/runtime/hello_world/{README.md,client.py,deploy/hello_world.yaml}`, `examples/basics/multinode/README.md`, `examples/multimodal/deploy/agg_llava.yaml` | Adds retry loop in client; adjusts probes/args and `backendFramework` in hello_world; minor code snippet fix; adds vLLM multimodal LLaVA aggregated deployment manifest. |
| **Tests**<br>`tests/kvbm/test_determinism.py`, `tests/serve/{test_sglang.py,test_vllm.py}` | Updates model refs in SGLang tests; relaxes chat content assertion; extends vLLM timeout; simplifies kvbm markers. |
| **Python packaging**<br>`pyproject.toml` | Bumps optional deps: `tensorrt-llm` to 1.0.0rc6; `vllm[flashinfer]` to 0.10.1.1. |
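To ground the operator row above: a sharedMemory setting like the one sketched earlier typically materializes as a memory-backed emptyDir mounted at /dev/shm. This is the standard Kubernetes idiom, not the operator's verbatim template; the volume and container names here are hypothetical.

```yaml
# Sketch of plausible rendered pod-spec output: a memory-backed emptyDir
# capped at the configured size and mounted at /dev/shm.
spec:
  volumes:
    - name: shm              # hypothetical volume name
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi       # default size per the walkthrough
  containers:
    - name: worker           # hypothetical container name
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
```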

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Frontend as Frontend (dynamo.frontend)
  participant Worker as SGLang Worker
  participant Runtime as Runtime Registry

  rect rgba(200,220,255,0.25)
    note over Worker: Startup
    Worker->>Worker: parse args
    alt skip_tokenizer_init not set
      Worker->>Worker: warn and set skip_tokenizer_init=true
    end
    par
      Worker->>Runtime: register_llm_with_runtime_config()
      Runtime-->>Worker: success (bool)
    and
      Worker->>Worker: start endpoints (generate via gate)
    end
    alt registration failed
      Worker->>Worker: shutdown runtime, raise error
    else registration succeeded
      Worker->>Worker: set ready_event
    end
  end

  rect rgba(200,255,200,0.25)
    note over Client,Worker: Request flow after readiness
    Client->>Frontend: /v1/chat/completions
    Frontend->>Worker: dyn://sglang.generate (queued until ready)
    Worker-->>Frontend: stream tokens
    Frontend-->>Client: response
  end

  rect rgba(255,230,200,0.25)
    note over Worker: Decode stream safety
    Worker->>Worker: process stream
    alt output_ids missing
      Worker->>Worker: raise ValueError (descriptive)
    end
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes


Poem

A bunny taps the keys with glee,
Swaps DeepSeek paths for Qwen3,
Charts grow wise to shm’s new size,
Workers wait till regs arise.
Docs realign, containers shine—
Hop, hop! Releases hop in time.
(carrot-shaped commits ☺)

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.2.2)

Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/product/migration-guide for migration instructions.
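This failure usually means the repository's golangci-lint config predates the v2 schema, which requires an explicit version field; an empty value produces exactly this error. A minimal sketch of the header a 2.x binary expects at the top of .golangci.yml (assuming the project uses the YAML config format):

```yaml
# Sketch: golangci-lint v2 configs must declare their schema version;
# a missing/empty value triggers the "unsupported version" error above.
version: "2"
```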

