Skip to content

Conversation

julienmancuso
Copy link
Contributor

@julienmancuso julienmancuso commented Oct 3, 2025

Overview:

fix some defaults

Summary by CodeRabbit

  • New Features

    • Increased default NATS max payload to 10MB for larger message support.
  • Chores

    • Extended worker startup probe tolerance to 2 hours to prevent premature restarts during slow startups.
    • Increased default termination/teardown delay from 15m to 4h for smoother shutdowns.
  • Documentation

    • Expanded etcd configuration guidance, including notes on external etcd and reference links.
    • Clarified NATS configuration, added reference link, and noted that settings should be prefixed with “nats.”.
    • Minor text improvements for clarity without structural changes.

Signed-off-by: Julien Mancuso <[email protected]>
@julienmancuso julienmancuso requested a review from a team as a code owner October 3, 2025 22:02
@github-actions github-actions bot added the fix label Oct 3, 2025
Copy link
Contributor

coderabbitai bot commented Oct 3, 2025

Walkthrough

Updates adjust startup probe FailureThresholds from 60 to 720 for worker-related pods, change a Helm default groveTerminationDelay from 15m to 4h, and expand Helm comments/documentation for etcd and NATS, including adding a 10MB max_payload setting in NATS config. Tests are updated to match probe threshold changes.

Changes

Cohort / File(s) Summary of Changes
Helm Platform Docs & Values
deploy/cloud/helm/platform/README.md, deploy/cloud/helm/platform/values.yaml
Updated docs: expanded etcd/NATS guidance and links; clarified prefixing for NATS. Changed dynamo-operator.dynamo.groveTerminationDelay default to 4h. In values, set dynamo.platform.dynamo.groveTerminationDelay: 4h; added NATS max_payload: 10485760 (10MB); comments expanded.
Dynamo Worker Probe Threshold
deploy/cloud/operator/internal/dynamo/component_worker.go
Increased StartupProbe.FailureThreshold from 60 to 720 for worker container.
Tests: Probe Threshold Updates
deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go, deploy/cloud/operator/internal/dynamo/graph_test.go
Updated test expectations: StartupProbe.FailureThreshold from 60 to 720 across affected leader/worker and worker-related cases; readiness thresholds unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I thump my paws: delays now long, not brief—
Two hours to wake, no hurried leaf.
Helm scrolls whisper: NATS and etcd aligned,
Ten meg payloads, values refined.
I hop through tests that now agree—
Carrots committed, CI set free.

Pre-merge checks

❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description currently only includes the Overview section with a single line and is missing the required Details, Where should the reviewer start?, and Related Issues sections from the repository’s template. Please fill out the template by adding a Details section that describes the specific changes in each file, a Where should the reviewer start? section highlighting key files or tests, and a Related Issues section linking any GitHub issues this PR addresses.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title Check ❓ Inconclusive The title “fix: fix some defaults” is too vague to convey the primary changes, as it does not specify which defaults were adjusted or the scope of the fixes, making it unclear to reviewers what the core update entails. Please update the title to clearly reflect the main changes, for example: “Increase startup probe failure thresholds and update Helm chart default values,” so that reviewers immediately understand the focus of the pull request.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c48f49a and 7090cc2.

📒 Files selected for processing (5)
  • deploy/cloud/helm/platform/README.md (2 hunks)
  • deploy/cloud/helm/platform/values.yaml (4 hunks)
  • deploy/cloud/operator/internal/controller/dynamocomponentdeployment_controller_test.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/component_worker.go (1 hunks)
  • deploy/cloud/operator/internal/dynamo/graph_test.go (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2124
File: components/backends/vllm/deploy/disagg.yaml:54-60
Timestamp: 2025-07-25T22:34:11.384Z
Learning: In vLLM worker deployments, startup probes (with longer periods and higher failure thresholds like periodSeconds: 10, failureThreshold: 60) are used to handle the slow model loading startup phase, while liveness probes are intentionally kept aggressive (periodSeconds: 5, failureThreshold: 1) for quick failure detection once the worker is operational. This pattern separates startup concerns from operational health monitoring in GPU-heavy workloads.
Learnt from: julienmancuso
PR: ai-dynamo/dynamo#1474
File: deploy/cloud/operator/internal/controller/dynamocomponent_controller.go:1308-1312
Timestamp: 2025-06-11T21:29:28.650Z
Learning: User julienmancuso expects replies in English; avoid switching languages unless explicitly requested.
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3417/merge) by julienmancuso.
deploy/cloud/helm/platform/values.yaml

[error] 336-336: Trailing whitespace detected by pre-commit hook 'trailing-whitespace' and fixed. Process exited with code 1.

🪛 markdownlint-cli2 (0.18.1)
deploy/cloud/helm/platform/README.md

137-137: Bare URL used

(MD034, no-bare-urls)


138-138: Bare URL used

(MD034, no-bare-urls)


139-139: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Julien Mancuso <[email protected]>
Signed-off-by: Julien Mancuso <[email protected]>
@julienmancuso julienmancuso merged commit 73b0cdb into main Oct 7, 2025
17 checks passed
@julienmancuso julienmancuso deleted the jsm/fixDefaults branch October 7, 2025 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants