
fix(recipes): drop hook-succeeded from torch-distributed runtime#719

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-runtime-hook on Apr 30, 2026
Conversation

@yuanchen8911 (Contributor) commented Apr 30, 2026

Summary

Remove hook-succeeded from the torch-distributed ClusterTrainingRuntime's Helm hook delete policy so the resource persists after install instead of being deleted, unblocking any TrainJob that references it (e.g. the pytorch-mnist demo in demos/cuj1-eks.md).

Motivation / Context

recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml declared its delete policy as:

"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded

Helm interprets hook-succeeded literally — after the post-install hook runs successfully, the resource is deleted. So every install creates the CTR and immediately deletes it.
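For reference, a minimal sketch of the offending annotation as it stood (surrounding manifest fields elided):

```yaml
metadata:
  annotations:
    # hook-succeeded tells Helm to delete this resource once the
    # post-install/post-upgrade hook completes successfully.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```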

Symptom (reproduced on a fresh eks/h100/training/ubuntu/kubeflow deploy):

admission webhook "validator.trainjob.trainer.kubeflow.org" denied the request:
ClusterTrainingRuntime.trainer.kubeflow.org "torch-distributed" not found:
specified clusterTrainingRuntime must be created before the TrainJob is created

Bug origin: the manifest was introduced in #94 (Feb 2026) with this delete policy already present; the file move in #114 preserved its content unchanged. Confirmed via git log -p on both paths.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

Drop only ,hook-succeeded from the delete policy. The other hook annotations are retained:

  • helm.sh/hook: post-install,post-upgrade — required by pkg/recipe.TestManifestHelmHooksRequired, which enforces every CR-typed manifest in recipes/components/*/manifests/ carry a helm.sh/hook annotation (or an explicit aicr/skip-hook-validation: "true" opt-out).
  • helm.sh/hook-weight: "5" — unchanged.
  • helm.sh/hook-delete-policy: before-hook-creation — keeps re-install idempotent (Helm deletes the previous CTR before re-applying on helm upgrade).

Without hook-succeeded, the CTR persists between installs.
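Putting the retained values together, the resulting annotation block reads as follows (a sketch assembled from the values quoted above, with surrounding fields elided):

```yaml
metadata:
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    # before-hook-creation alone: Helm deletes the previous instance before
    # re-applying on upgrade, but keeps the resource after the hook succeeds.
    "helm.sh/hook-delete-policy": before-hook-creation
```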

Testing

This is a YAML-only manifest change with no Go code touched, but it interacts with pkg/recipe.TestManifestHelmHooksRequired, so I ran the pkg/recipe tests in addition to lint:

yamllint -c .yamllint.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok  github.com/NVIDIA/aicr/pkg/recipe  1.110s

End-to-end on a real EKS H100 cluster (aicr1):

  1. Deployed full eks/h100/training/ubuntu/kubeflow bundle.
  2. Stripped hook-succeeded from the rendered CTR manifest, applied via kubectl apply (proxy for the fix).
  3. kubectl get clustertrainingruntime returns torch-distributed (previously: No resources found).
  4. Submitted the pytorch-mnist TrainJob from demos/cuj1-eks.md — admission accepts, runtime resolves, worker pod schedules and runs to completion (1-epoch MNIST, accuracy 0.74).
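Step 2 above can be approximated with a short script; this is an illustrative sketch (the stand-in manifest and temp-file path are hypothetical, not the exact rendered CTR):

```shell
# Write a stand-in for the rendered CTR manifest (annotation values from this
# PR), then strip only the hook-succeeded entry, as in step 2 above.
cat > /tmp/ctr.yaml <<'EOF'
metadata:
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
EOF
# Drop the hook-succeeded token while keeping before-hook-creation.
sed -i 's/before-hook-creation,hook-succeeded/before-hook-creation/' /tmp/ctr.yaml
grep 'hook-delete-policy' /tmp/ctr.yaml
```

The edited file would then be applied with kubectl apply as a proxy for the Helm-rendered fix.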

Risk Assessment

  • Low — a single token (,hook-succeeded) removed from one Helm hook annotation; isolated to a single manifest.

Rollout notes: On clusters that previously deployed with the broken hook, the CTR is currently absent (deleted by hook-succeeded). On the next helm upgrade, Helm will create the CTR fresh as a hook resource — there is no broken state to adopt. No migration steps required.

Checklist

  • Tests pass locally (go test ./pkg/recipe/, yamllint)
  • Linter passes (yamllint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — TestManifestHelmHooksRequired exists and now passes)
  • I updated docs if user-facing behavior changed (N/A — silent runtime fix; the demo it unblocks is documented in demos/cuj1-eks.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai (Bot) commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between b7f7b15 and 7c4d02a.

📒 Files selected for processing (1)
  • recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml

📝 Walkthrough

A single-line annotation in recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml was changed: the metadata.annotations entry helm.sh/hook-delete-policy no longer includes hook-succeeded and now contains only before-hook-creation. No other manifest fields or specifications were modified.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: 4 passed
  • Title check: ✅ Passed. The title accurately describes the main change: removing hook-succeeded from the torch-distributed runtime's Helm hook delete policy, which directly addresses the bug preventing the ClusterTrainingRuntime from persisting.
  • Description check: ✅ Passed. The description is comprehensive and directly related to the changeset, explaining the bug, root cause, fix rationale, testing approach, and risk assessment with concrete evidence from end-to-end testing.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


The torch-distributed ClusterTrainingRuntime declared its delete policy as
hook-delete-policy: before-hook-creation,hook-succeeded. Helm interprets
hook-succeeded literally — after the post-install hook runs successfully,
the resource is deleted. So every install would create the CTR and
immediately delete it, leaving the cluster with no torch-distributed
runtime.

Symptom: any TrainJob referencing runtimeRef.name=torch-distributed (e.g.
the pytorch-mnist demo in demos/cuj1-eks.md) is rejected by the trainer
admission webhook with "ClusterTrainingRuntime torch-distributed not found".

Fix: drop only ,hook-succeeded. Keep the helm.sh/hook annotation (project
convention enforced by pkg/recipe.TestManifestHelmHooksRequired) and
before-hook-creation (idempotent re-install). Without hook-succeeded, the
CTR persists between installs.

Verified end-to-end on a real EKS H100 cluster: with the fix applied the
CTR is present after install, the demo TrainJob is admitted, and a
1-epoch pytorch-mnist run completes (accuracy 0.74).

The bug has existed since the manifest was first introduced in NVIDIA#94
(Feb 2026); confirmed by git log -p on both the original embedded path
and the current recipes/ path.
@yuanchen8911 force-pushed the fix/kubeflow-trainer-runtime-hook branch from b7f7b15 to 7c4d02a on April 30, 2026 16:41
@yuanchen8911 changed the title from "fix(recipes): drop hook annotations from torch-distributed runtime" to "fix(recipes): drop hook-succeeded from torch-distributed runtime" on Apr 30, 2026
@yuanchen8911 requested a review from mchmarny on April 30, 2026 16:43
@yuanchen8911 enabled auto-merge (squash) on April 30, 2026 16:45
@yuanchen8911 requested a review from lockwobr on April 30, 2026 16:46
@yuanchen8911 merged commit d66ba76 into NVIDIA:main on Apr 30, 2026
48 checks passed
