Skip to content

Conversation

@tzulingk
Copy link
Contributor

@tzulingk tzulingk commented Dec 9, 2025

Overview:

Move K8s fault tolerance tests from PR CI to nightly CI workflow to reduce resource contention and PR merge delays.

Details:

The deploy-test-fault-tolerance job runs 3 sequential K8s deployment tests (vllm, trtllm, sglang) that simulate pod failures in disaggregated prefill/decode scenarios. Running these tests on every PR causes significant delays due to:

  • Resource Contention: The K8s cluster has limited capacity, and these tests require significant resources
  • Pod Scheduling Delays: Pods are very hard to get ready during peak CI hours, often taking 1-2+ hours just to schedule
  • Sequential Execution: Tests run sequentially (max-parallel: 1) to avoid Helm ClusterRole conflicts, multiplying the wait time
  • Total Impact: Combined with pod scheduling + test execution, PRs experience ~3 hour delays waiting for these tests

See related discussion: https://nvidia.slack.com/archives/C08UQCG3RNV/p1764963329433749?thread_ts=1764946443.307469&cid=C08UQCG3RNV

Changes Made:

  • Moved job to nightly CI (.github/workflows/nightly-ci.yml):
  • Added new fault-tolerance-tests job after e2e tests (line ~656)
  • Runs 3 test scenarios: vllm, trtllm, sglang (disaggregated prefill/decode with pod failure simulation)
  • Uses nightly image tags (nightly-{framework}-amd64) instead of SHA-based tags
  • Pulls pre-built images from AWS ECR
  • Uses @v4 checkout action (matches nightly workflow pattern)
  • Namespace prefix: gh-nightly-{run_id}-ft-{framework} for easy identification
  • Includes build status check to fail fast if builds failed
  • Added to results-summary job dependencies for reporting

Removed from backends workflow (.github/workflows/container-validation-backends.yml):

  • Deleted entire deploy-test-fault-tolerance job (~180 lines)
  • Job no longer runs on PRs, pushes to main, or scheduled runs in this workflow

Where should the reviewer start?

  1. .github/workflows/nightly-ci.yml (lines 656-846):
  • New fault-tolerance-tests job with 3 framework test scenarios
  • Note use of nightly image tags and ECR registry
    Build status check pattern (consistent with other nightly test jobs)
  1. .github/workflows/nightly-ci.yml (line 853):
  • Added fault-tolerance-tests to results-summary dependencies
  1. .github/workflows/container-validation-backends.yml:
  • Verify deploy-test-fault-tolerance job is completely removed
  • Confirm no schedule trigger (was temporarily added, then removed)

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-1156

Summary by CodeRabbit

  • Chores
    • Enhanced automated testing infrastructure with scheduled daily fault tolerance test execution.

✏️ Tip: You can customize this high-level summary in your review settings.

@tzulingk tzulingk requested a review from a team as a code owner December 9, 2025 04:22
@tzulingk tzulingk changed the title Move k8 fault tolerance to nightly feat: Move k8 fault tolerance to nightly Dec 9, 2025
@github-actions github-actions bot added the feat label Dec 9, 2025
@tzulingk tzulingk requested review from nnshah1 and rmccorm4 December 9, 2025 04:25
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 9, 2025

Walkthrough

A GitHub Actions workflow file was modified to add a daily schedule trigger (8 AM UTC) and update job execution conditions to run the deploy-test-fault-tolerance job on schedule or when explicitly requested via run_deploy_operator flag.

Changes

Cohort / File(s) Summary
Workflow Scheduling & Conditions
.github/workflows/container-validation-backends.yml
Added daily cron schedule trigger (0 8 * * *). Updated deploy-test-fault-tolerance job condition to execute on schedule or explicit run_deploy_operator request.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify cron schedule syntax is correct for intended daily execution time
  • Confirm conditional logic properly combines schedule and manual trigger flags
  • Review any potential overlap with existing has_code_changes condition behavior

Poem

🐰 A schedule awakens each morning at eight,
Daily fault tolerance tests won't be late!
With flags and with conditions, the workflow runs free,
Automation and testing—how grand it will be! 🚀

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title 'feat: Move k8 fault tolerance to nightly' clearly and specifically describes the main change: moving Kubernetes fault tolerance tests from regular PR CI to a nightly schedule.
Description check ✅ Passed The PR description is comprehensive and well-structured, following the template with clear Overview, Details, Changes Made, and Where to Start sections.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ca1679 and 5f60d44.

📒 Files selected for processing (1)
  • .github/workflows/container-validation-backends.yml (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-02T18:13:40.065Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.065Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • .github/workflows/container-validation-backends.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: operator (amd64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: lychee
  • GitHub Check: pre-commit
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
.github/workflows/container-validation-backends.yml (1)

19-20: Schedule trigger correctly configured for nightly runs.

The cron expression 0 8 * * * is valid and will trigger daily at 08:00 UTC as intended.

Copy link
Contributor

@nv-anants nv-anants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the workflow to nightly-ci yaml file please. It would show up separately otherwise

@pvijayakrish pvijayakrish self-requested a review December 9, 2025 17:32
Copy link
Contributor

@pvijayakrish pvijayakrish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add it to the existing nightly. Also I see lot of redundant build and test jobs across the container-validation-backends.yml‎ and nightly-ci.yml.

@tzulingk tzulingk enabled auto-merge (squash) December 9, 2025 23:03
Copy link
Contributor

@pvijayakrish pvijayakrish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where was this changes verified? Is there a pipeline run?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants