fix: Change FrontEnd Health thresholds #2336

atchernych · 2025-08-06T17:44:38Z

Overview:

Move probes from the backward-compatible to new locations and add the StartupProbe.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Chores
- Added startup health checks to the Frontend service across multiple deployment configurations.
- Adjusted liveness and readiness health check intervals and thresholds to improve monitoring and responsiveness during startup and operation.

coderabbitai · 2025-08-06T17:52:18Z

Walkthrough

The deployment YAML files for various backend Frontend services were updated to add a startupProbe and modify the parameters of existing livenessProbe and readinessProbe health checks. These changes adjust probe timings, thresholds, and methods across sglang, trtllm, and vllm backend configurations.

Changes

Cohort / File(s)	Change Summary
sglang Frontend Probes `components/backends/sglang/deploy/agg.yaml`, `components/backends/sglang/deploy/agg_router.yaml`, `components/backends/sglang/deploy/disagg.yaml`	Added `startupProbe`, adjusted `livenessProbe` (removed initial delay, increased period), and tightened `readinessProbe` (shorter period, lower timeout and failure threshold, removed initial delay) for Frontend service.
trtllm Frontend Probes `components/backends/trtllm/deploy/agg.yaml`, `components/backends/trtllm/deploy/agg_router.yaml`, `components/backends/trtllm/deploy/disagg.yaml`, `components/backends/trtllm/deploy/disagg_router.yaml`	Added `startupProbe`, increased `livenessProbe` period and timeout, and made `readinessProbe` more frequent and sensitive (removed initial delay, reduced period, adjusted timeout and failure threshold) for Frontend service.
vllm Frontend Probes `components/backends/vllm/deploy/agg.yaml`, `components/backends/vllm/deploy/agg_router.yaml`, `components/backends/vllm/deploy/disagg.yaml`, `components/backends/vllm/deploy/disagg_planner.yaml`, `components/backends/vllm/deploy/disagg_router.yaml`	Introduced `startupProbe`, removed `livenessProbe` initial delay and increased period, and made `readinessProbe` more frequent and strict (removed initial delay, reduced period, timeout, and failure threshold) for Frontend service.

Sequence Diagram(s)

sequenceDiagram
    participant Kubelet
    participant Frontend

    Kubelet->>Frontend: startupProbe (HTTP GET /health)
    Note right of Kubelet: Every 5s, timeout 5s, threshold 60
    alt Startup success
        Kubelet->>Frontend: livenessProbe (HTTP GET /health or exec)
        Note right of Kubelet: Every 10s, timeout 5s
        Kubelet->>Frontend: readinessProbe (HTTP GET /health or exec)
        Note right of Kubelet: Every 15s, timeout 5s, threshold 3
    else Startup fails
        Kubelet--xFrontend: Restart container
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

feat: update DynamoGraphDeployments for vllm_v1 #1890: Also modifies Kubernetes deployment YAMLs to adjust health probes for Frontend services, directly overlapping in scope and intent.
fix: Adjust frontEnd thresholds #2288: Adjusts livenessProbe thresholds for Frontend, related by probe configuration changes but less comprehensive.
chore: sglang k8s health/live, update doc #2272: Modifies probe settings for sglang deployments, including Frontend and worker services, indicating related configuration changes.

Poem

In YAML fields the probes now grow,
Startup, liveness, readiness in tow.
With every check, our Frontend thrives,
Healthier pods—no more surprise!
The bunnies cheer, their tails a-flutter,
For robust checks make uptime utter.
🐇✨

Note

⚡️ Unit Test Generation is now available in beta!

Learn more here, or try it out under "Finishing Touches" below.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 12

🧹 Nitpick comments (3)

components/backends/sglang/deploy/agg.yaml (1)

22-24: Liveness interval doubled – verify detection latency

periodSeconds was increased from 5 → 10 s. This slows down failure detection and recovery. If the pod is expected to crash‐loop quickly after the startup phase (e.g. OOM on first request), you may prefer to keep the original 5 s to minimise MTTR.

components/backends/sglang/deploy/disagg.yaml (1)

22-24: Confirm longer liveness period is acceptable

As with the aggregate variant, doubling the liveness interval increases the time to notice failures. Please confirm this aligns with SLOs.

components/backends/sglang/deploy/agg_router.yaml (1)

22-24: Re-evaluate liveness interval change (5 → 10 s)

See earlier comments – ensure the increased interval still meets router availability requirements.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6be5c19 and 2380ad0.

📒 Files selected for processing (12)

components/backends/sglang/deploy/agg.yaml (2 hunks)
components/backends/sglang/deploy/agg_router.yaml (2 hunks)
components/backends/sglang/deploy/disagg.yaml (2 hunks)
components/backends/trtllm/deploy/agg.yaml (2 hunks)
components/backends/trtllm/deploy/agg_router.yaml (1 hunks)
components/backends/trtllm/deploy/disagg.yaml (1 hunks)
components/backends/trtllm/deploy/disagg_router.yaml (1 hunks)
components/backends/vllm/deploy/agg.yaml (2 hunks)
components/backends/vllm/deploy/agg_router.yaml (2 hunks)
components/backends/vllm/deploy/disagg.yaml (2 hunks)
components/backends/vllm/deploy/disagg_planner.yaml (2 hunks)
components/backends/vllm/deploy/disagg_router.yaml (2 hunks)

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Learnt from: biswapanda
PR: ai-dynamo/dynamo#1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Learnt from: KrishnanPrash
PR: ai-dynamo/dynamo#2217
File: components/backends/trtllm/engine_configs/deepseek_r1/wide_ep/wide_ep_prefill.yaml:18-0
Timestamp: 2025-07-31T11:26:48.422Z
Learning: TRTLLM LLM-API expects all caps for backend field names in configuration files. When migrating TRTLLM configurations, backend values like "WideEP" should be changed to "WIDEEP" to comply with the API requirements.

📚 Learning: in vllm worker deployments, startup probes (with longer periods and higher failure thresholds like p...

Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.

Applied to files:

components/backends/vllm/deploy/disagg_planner.yaml
components/backends/vllm/deploy/agg.yaml
components/backends/sglang/deploy/agg_router.yaml
components/backends/vllm/deploy/agg_router.yaml
components/backends/trtllm/deploy/agg_router.yaml
components/backends/sglang/deploy/agg.yaml
components/backends/vllm/deploy/disagg.yaml
components/backends/trtllm/deploy/disagg_router.yaml
components/backends/trtllm/deploy/agg.yaml
components/backends/trtllm/deploy/disagg.yaml
components/backends/vllm/deploy/disagg_router.yaml
components/backends/sglang/deploy/disagg.yaml

📚 Learning: in vllm worker deployments, grep-based log checks for "vllmworker.*has been initialized" are appropr...

Learnt from: biswapanda
PR: ai-dynamo/dynamo#1890
File: examples/vllm/deploy/agg.yaml:63-70
Timestamp: 2025-07-14T23:01:16.218Z
Learning: In vLLM worker deployments, grep-based log checks for "VllmWorker.*has been initialized" are appropriate for readiness probes to verify worker startup, but should not be used for liveness probes which need to detect ongoing worker health.

Applied to files:

components/backends/vllm/deploy/disagg_planner.yaml
components/backends/vllm/deploy/agg.yaml
components/backends/sglang/deploy/agg_router.yaml
components/backends/vllm/deploy/agg_router.yaml
components/backends/trtllm/deploy/agg_router.yaml
components/backends/sglang/deploy/agg.yaml
components/backends/vllm/deploy/disagg.yaml
components/backends/trtllm/deploy/disagg_router.yaml
components/backends/trtllm/deploy/agg.yaml
components/backends/trtllm/deploy/disagg.yaml
components/backends/vllm/deploy/disagg_router.yaml
components/backends/sglang/deploy/disagg.yaml

📚 Learning: in components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally d...

Learnt from: biswapanda
PR: ai-dynamo/dynamo#2137
File: components/backends/sglang/deploy/agg_router.yaml:0-0
Timestamp: 2025-07-28T17:00:07.968Z
Learning: In components/backends/sglang/deploy/agg_router.yaml, the clear_namespace command is intentionally designed to block the router from starting if it fails (using &&). This is a deliberate design decision where namespace clearing is a critical prerequisite and the router should not start with an uncleared namespace.

Applied to files:

components/backends/sglang/deploy/agg_router.yaml
components/backends/vllm/deploy/agg_router.yaml
components/backends/trtllm/deploy/agg_router.yaml
components/backends/trtllm/deploy/disagg_router.yaml
components/backends/vllm/deploy/disagg_router.yaml

📚 Learning: the stopsignal field under lifecycle in dynamocomponentdeployment crds is autogenerated due to kuber...

Learnt from: julienmancuso
PR: ai-dynamo/dynamo#2012
File: deploy/cloud/helm/crds/templates/nvidia.com_dynamocomponentdeployments.yaml:1178-1180
Timestamp: 2025-07-18T16:05:05.534Z
Learning: The stopSignal field under lifecycle in DynamoComponentDeployment CRDs is autogenerated due to Kubernetes library upgrades (k8s.io/api and k8s.io/apimachinery from v0.32.3 to v0.33.1), not a manual design decision by the user.

Applied to files:

components/backends/sglang/deploy/agg.yaml
components/backends/trtllm/deploy/agg.yaml
components/backends/sglang/deploy/disagg.yaml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: Build and Test - vllm
GitHub Check: pre-merge-rust (.)

🔇 Additional comments (9)

components/backends/vllm/deploy/disagg_planner.yaml (2)

32-32: Liveness period change is sane

Increasing to 10 s aligns with worker patterns and reduces noise without hurting detection time.

41-43: Readiness probe tuning looks good

Tighter period/timeout/failure thresholds are appropriate once a startup probe is in place.

components/backends/vllm/deploy/agg.yaml (2)

22-22: Liveness-probe period bump LGTM

31-33: Readiness-probe timing LGTM

components/backends/vllm/deploy/disagg.yaml (1)

35-37: Readiness tweaks approved

components/backends/vllm/deploy/agg_router.yaml (2)

22-24: Liveness period update looks fine

31-33: Readiness probe params OK

components/backends/vllm/deploy/disagg_router.yaml (2)

25-27: Liveness probe change approved

34-36: Readiness probe tuning approved

components/backends/sglang/deploy/agg_router.yaml

components/backends/sglang/deploy/agg.yaml

components/backends/sglang/deploy/disagg.yaml

components/backends/trtllm/deploy/agg_router.yaml

components/backends/trtllm/deploy/agg.yaml

components/backends/vllm/deploy/agg_router.yaml

components/backends/vllm/deploy/agg.yaml

components/backends/vllm/deploy/disagg_planner.yaml

components/backends/vllm/deploy/disagg_router.yaml

components/backends/vllm/deploy/disagg.yaml

atchernych · 2025-08-14T15:59:49Z

dup

Change FrontEnd Health thresholds

2380ad0

pull-request-size bot added the size/L label Aug 6, 2025

copy-pr-bot bot temporarily deployed to GITLAB August 6, 2025 17:44 Inactive

github-actions bot added the fix label Aug 6, 2025

copy-pr-bot bot temporarily deployed to GITLAB August 6, 2025 17:45 Inactive

coderabbitai bot reviewed Aug 6, 2025

View reviewed changes

fix identation

2d7d8c9

copy-pr-bot bot temporarily deployed to GITLAB August 6, 2025 18:15 Inactive

copy-pr-bot bot temporarily deployed to GITLAB August 6, 2025 18:20 Inactive

biswapanda requested a review from dmitry-tokarev-nv August 6, 2025 18:34

move probes from backrd compat to new locations

1135bb7

pull-request-size bot added size/XL and removed size/L labels Aug 7, 2025

copy-pr-bot bot temporarily deployed to GITLAB August 7, 2025 18:21 Inactive

copy-pr-bot bot temporarily deployed to GITLAB August 7, 2025 18:22 Inactive

atchernych marked this pull request as draft August 8, 2025 01:59

atchernych closed this Aug 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Change FrontEnd Health thresholds #2336

fix: Change FrontEnd Health thresholds #2336

Uh oh!

atchernych commented Aug 6, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Aug 6, 2025

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

atchernych commented Aug 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix: Change FrontEnd Health thresholds #2336

fix: Change FrontEnd Health thresholds #2336

Uh oh!

Conversation

atchernych commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 6, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

atchernych commented Aug 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

atchernych commented Aug 6, 2025 •

edited

Loading