Skip to content

Conversation

@atchernych
Copy link
Contributor

@atchernych atchernych commented Aug 6, 2025

Overview:

Move probes from the backward-compatible to new locations and add the StartupProbe.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Chores
    • Added startup health checks to the Frontend service across multiple deployment configurations.
    • Adjusted liveness and readiness health check intervals and thresholds to improve monitoring and responsiveness during startup and operation.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 6, 2025

Walkthrough

The deployment YAML files for various backend Frontend services were updated to add a startupProbe and modify the parameters of existing livenessProbe and readinessProbe health checks. These changes adjust probe timings, thresholds, and methods across sglang, trtllm, and vllm backend configurations.

Changes

Cohort / File(s) Change Summary
sglang Frontend Probes
components/backends/sglang/deploy/agg.yaml, components/backends/sglang/deploy/agg_router.yaml, components/backends/sglang/deploy/disagg.yaml
Added startupProbe, adjusted livenessProbe (removed initial delay, increased period), and tightened readinessProbe (shorter period, lower timeout and failure threshold, removed initial delay) for Frontend service.
trtllm Frontend Probes
components/backends/trtllm/deploy/agg.yaml, components/backends/trtllm/deploy/agg_router.yaml, components/backends/trtllm/deploy/disagg.yaml, components/backends/trtllm/deploy/disagg_router.yaml
Added startupProbe, increased livenessProbe period and timeout, and made readinessProbe more frequent and sensitive (removed initial delay, reduced period, adjusted timeout and failure threshold) for Frontend service.
vllm Frontend Probes
components/backends/vllm/deploy/agg.yaml, components/backends/vllm/deploy/agg_router.yaml, components/backends/vllm/deploy/disagg.yaml, components/backends/vllm/deploy/disagg_planner.yaml, components/backends/vllm/deploy/disagg_router.yaml
Introduced startupProbe, removed livenessProbe initial delay and increased period, and made readinessProbe more frequent and strict (removed initial delay, reduced period, timeout, and failure threshold) for Frontend service.

Sequence Diagram(s)

sequenceDiagram
    participant Kubelet
    participant Frontend

    Kubelet->>Frontend: startupProbe (HTTP GET /health)
    Note right of Kubelet: Every 5s, timeout 5s, threshold 60
    alt Startup success
        Kubelet->>Frontend: livenessProbe (HTTP GET /health or exec)
        Note right of Kubelet: Every 10s, timeout 5s
        Kubelet->>Frontend: readinessProbe (HTTP GET /health or exec)
        Note right of Kubelet: Every 15s, timeout 5s, threshold 3
    else Startup fails
        Kubelet--xFrontend: Restart container
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Poem

In YAML fields the probes now grow,
Startup, liveness, readiness in tow.
With every check, our Frontend thrives,
Healthier pods—no more surprise!
The bunnies cheer, their tails a-flutter,
For robust checks make uptime utter.
🐇✨

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (3)
components/backends/sglang/deploy/agg.yaml (1)

22-24: Liveness interval doubled – verify detection latency

periodSeconds was increased from 5 → 10 s. This slows down failure detection and recovery. If the pod is expected to crash‐loop quickly after the startup phase (e.g. OOM on first request), you may prefer to keep the original 5 s to minimise MTTR.

components/backends/sglang/deploy/disagg.yaml (1)

22-24: Confirm longer liveness period is acceptable

As with the aggregate variant, doubling the liveness interval increases the time to notice failures. Please confirm this aligns with SLOs.

components/backends/sglang/deploy/agg_router.yaml (1)

22-24: Re-evaluate liveness interval change (5 → 10 s)

See earlier comments – ensure the increased interval still meets router availability requirements.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6be5c19 and 2380ad0.

📒 Files selected for processing (12)
  • components/backends/sglang/deploy/agg.yaml (2 hunks)
  • components/backends/sglang/deploy/agg_router.yaml (2 hunks)
  • components/backends/sglang/deploy/disagg.yaml (2 hunks)
  • components/backends/trtllm/deploy/agg.yaml (2 hunks)
  • components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
  • components/backends/trtllm/deploy/disagg.yaml (1 hunks)
  • components/backends/trtllm/deploy/disagg_router.yaml (1 hunks)
  • components/backends/vllm/deploy/agg.yaml (2 hunks)
  • components/backends/vllm/deploy/agg_router.yaml (2 hunks)
  • components/backends/vllm/deploy/disagg.yaml (2 hunks)
  • components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
  • components/backends/vllm/deploy/disagg_router.yaml (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Learnt from: biswapanda
PR: ai-dynamo/dynamo#1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.
Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.
📚 Learning: in vllm worker deployments, startup probes (with longer periods and higher failure thresholds like p...
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

  • components/backends/vllm/deploy/disagg_planner.yaml
  • components/backends/vllm/deploy/agg.yaml
  • components/backends/sglang/deploy/agg_router.yaml
  • components/backends/vllm/deploy/agg_router.yaml
  • components/backends/trtllm/deploy/agg_router.yaml
  • components/backends/sglang/deploy/agg.yaml
  • components/backends/vllm/deploy/disagg.yaml
  • components/backends/trtllm/deploy/disagg_router.yaml
  • components/backends/trtllm/deploy/agg.yaml
  • components/backends/trtllm/deploy/disagg.yaml
  • components/backends/vllm/deploy/disagg_router.yaml
  • components/backends/sglang/deploy/disagg.yaml
📚 Learning: in vllm worker deployments, grep-based log checks for "vllmworker.*has been initialized" are appropr...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.

Applied to files:

  • components/backends/vllm/deploy/disagg_planner.yaml
  • components/backends/vllm/deploy/agg.yaml
  • components/backends/sglang/deploy/agg_router.yaml
  • components/backends/vllm/deploy/agg_router.yaml
  • components/backends/trtllm/deploy/agg_router.yaml
  • components/backends/sglang/deploy/agg.yaml
  • components/backends/vllm/deploy/disagg.yaml
  • components/backends/trtllm/deploy/disagg_router.yaml
  • components/backends/trtllm/deploy/agg.yaml
  • components/backends/trtllm/deploy/disagg.yaml
  • components/backends/vllm/deploy/disagg_router.yaml
  • components/backends/sglang/deploy/disagg.yaml
📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...
Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.

Applied to files:

  • components/backends/sglang/deploy/agg_router.yaml
  • components/backends/vllm/deploy/agg_router.yaml
  • components/backends/trtllm/deploy/agg_router.yaml
  • components/backends/trtllm/deploy/disagg_router.yaml
  • components/backends/vllm/deploy/disagg_router.yaml
📚 Learning: the stopsignal field under lifecycle in dynamocomponentdeployment crds is autogenerated due to kuber...
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

  • components/backends/sglang/deploy/agg.yaml
  • components/backends/trtllm/deploy/agg.yaml
  • components/backends/sglang/deploy/disagg.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (9)
components/backends/vllm/deploy/disagg_planner.yaml (2)

32-32: Liveness period change is sane

Increasing to 10 s aligns with worker patterns and reduces noise without hurting detection time.


41-43: Readiness probe tuning looks good

Tighter period/timeout/failure thresholds are appropriate once a startup probe is in place.

components/backends/vllm/deploy/agg.yaml (2)

22-22: Liveness-probe period bump LGTM


31-33: Readiness-probe timing LGTM

components/backends/vllm/deploy/disagg.yaml (1)

35-37: Readiness tweaks approved

components/backends/vllm/deploy/agg_router.yaml (2)

22-24: Liveness period update looks fine


31-33: Readiness probe params OK

components/backends/vllm/deploy/disagg_router.yaml (2)

25-27: Liveness probe change approved


34-36: Readiness probe tuning approved

@pull-request-size pull-request-size bot added size/XL and removed size/L labels Aug 7, 2025
@atchernych atchernych marked this pull request as draft August 8, 2025 01:59
@atchernych
Copy link
Contributor Author

dup

@atchernych atchernych closed this Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants