
fix(recipes): drop hook-succeeded from torch-distributed runtime#719

Merged
yuanchen8911 merged 2 commits into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-runtime-hook on Apr 30, 2026
Conversation

@yuanchen8911 (Contributor) commented Apr 30, 2026

Summary

Remove hook-succeeded from the torch-distributed ClusterTrainingRuntime's Helm hook delete policy so the resource persists after install instead of being deleted, unblocking any TrainJob that references it (e.g. the pytorch-mnist demo in demos/cuj1-eks.md).

Motivation / Context

recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml declared its delete policy as:

"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded

Helm interprets hook-succeeded literally — after the post-install hook runs successfully, the resource is deleted. So every install creates the CTR and immediately deletes it.
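For reference, a minimal sketch of the offending annotation as it stood (surrounding manifest fields elided):

```yaml
metadata:
  annotations:
    # hook-succeeded tells Helm to delete this resource once the
    # post-install/post-upgrade hook completes successfully.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```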

Symptom (reproduced on a fresh eks/h100/training/ubuntu/kubeflow deploy):

admission webhook "validator.trainjob.trainer.kubeflow.org" denied the request:
ClusterTrainingRuntime.trainer.kubeflow.org "torch-distributed" not found:
specified clusterTrainingRuntime must be created before the TrainJob is created

Bug origin: the manifest was introduced in #94 (Feb 2026) with this delete policy already present; the file move in #114 preserved its content unchanged. Confirmed via git log -p on both paths.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

Implementation Notes

Drop only ,hook-succeeded from the delete policy. The other hook annotations are retained:

  • helm.sh/hook: post-install,post-upgrade — required by pkg/recipe.TestManifestHelmHooksRequired, which enforces every CR-typed manifest in recipes/components/*/manifests/ carry a helm.sh/hook annotation (or an explicit aicr/skip-hook-validation: "true" opt-out).
  • helm.sh/hook-weight: "5" — unchanged.
  • helm.sh/hook-delete-policy: before-hook-creation — keeps re-install idempotent (Helm deletes the previous CTR before re-applying on helm upgrade).

Without hook-succeeded, the CTR persists between installs.
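Putting the retained values together, the resulting annotation block reads as follows (a sketch assembled from the values quoted above, with surrounding fields elided):

```yaml
metadata:
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    # before-hook-creation alone: Helm deletes the previous instance before
    # re-applying on upgrade, but keeps the resource after the hook succeeds.
    "helm.sh/hook-delete-policy": before-hook-creation
```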

Testing

This is a YAML-only manifest change with no Go code touched, but it interacts with pkg/recipe.TestManifestHelmHooksRequired, so I ran the pkg/recipe tests in addition to lint:

yamllint -c .yamllint.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok  github.com/NVIDIA/aicr/pkg/recipe  1.110s

End-to-end on a real EKS H100 cluster (aicr1):

  1. Deployed full eks/h100/training/ubuntu/kubeflow bundle.
  2. Stripped hook-succeeded from the rendered CTR manifest, applied via kubectl apply (proxy for the fix).
  3. kubectl get clustertrainingruntime returns torch-distributed (previously: No resources found).
  4. Submitted the pytorch-mnist TrainJob from demos/cuj1-eks.md — admission accepts, runtime resolves, worker pod schedules and runs to completion (1-epoch MNIST, accuracy 0.74).
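Step 2 above can be approximated with a short script; this is an illustrative sketch (the stand-in manifest and temp-file path are hypothetical, not the exact rendered CTR):

```shell
# Write a stand-in for the rendered CTR manifest (annotation values from this
# PR), then strip only the hook-succeeded entry, as in step 2 above.
cat > /tmp/ctr.yaml <<'EOF'
metadata:
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-weight": "5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
EOF
# Drop the hook-succeeded token while keeping before-hook-creation.
sed -i 's/before-hook-creation,hook-succeeded/before-hook-creation/' /tmp/ctr.yaml
grep 'hook-delete-policy' /tmp/ctr.yaml
```

The edited file would then be applied with kubectl apply as a proxy for the Helm-rendered fix.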

Risk Assessment

  • Low — a single token (,hook-succeeded) removed from one Helm hook annotation; isolated to a single manifest.

Rollout notes: On clusters that previously deployed with the broken hook, the CTR is currently absent (deleted by hook-succeeded). On the next helm upgrade, Helm will create the CTR fresh as a hook resource — there is no broken state to adopt. No migration steps required.

Checklist

  • Tests pass locally (go test ./pkg/recipe/, yamllint)
  • Linter passes (yamllint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — TestManifestHelmHooksRequired exists and now passes)
  • I updated docs if user-facing behavior changed (N/A — silent runtime fix; the demo it unblocks is documented in demos/cuj1-eks.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai (Bot) commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between b7f7b15 and 7c4d02a.

📒 Files selected for processing (1)
  • recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml

📝 Walkthrough

A single-line annotation in recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml was changed: the metadata.annotations entry helm.sh/hook-delete-policy no longer includes hook-succeeded and now contains only before-hook-creation. No other manifest fields or specifications were modified.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks: 4 passed
  • Title check: ✅ Passed. The title accurately describes the main change: removing hook-succeeded from the torch-distributed runtime's Helm hook delete policy, which directly addresses the bug preventing the ClusterTrainingRuntime from persisting.
  • Description check: ✅ Passed. The description is comprehensive and directly related to the changeset, explaining the bug, root cause, fix rationale, testing approach, and risk assessment with concrete evidence from end-to-end testing.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


The torch-distributed ClusterTrainingRuntime declared its delete policy as
hook-delete-policy: before-hook-creation,hook-succeeded. Helm interprets
hook-succeeded literally — after the post-install hook runs successfully,
the resource is deleted. So every install would create the CTR and
immediately delete it, leaving the cluster with no torch-distributed
runtime.

Symptom: any TrainJob referencing runtimeRef.name=torch-distributed (e.g.
the pytorch-mnist demo in demos/cuj1-eks.md) is rejected by the trainer
admission webhook with "ClusterTrainingRuntime torch-distributed not found".

Fix: drop only ,hook-succeeded. Keep the helm.sh/hook annotation (project
convention enforced by pkg/recipe.TestManifestHelmHooksRequired) and
before-hook-creation (idempotent re-install). Without hook-succeeded, the
CTR persists between installs.

Verified end-to-end on a real EKS H100 cluster: with the fix applied the
CTR is present after install, the demo TrainJob is admitted, and a
1-epoch pytorch-mnist run completes (accuracy 0.74).

The bug has existed since the manifest was first introduced in NVIDIA#94
(Feb 2026); confirmed by git log -p on both the original embedded path
and the current recipes/ path.
@yuanchen8911 force-pushed the fix/kubeflow-trainer-runtime-hook branch from b7f7b15 to 7c4d02a on April 30, 2026 16:41
@yuanchen8911 changed the title from "fix(recipes): drop hook annotations from torch-distributed runtime" to "fix(recipes): drop hook-succeeded from torch-distributed runtime" on Apr 30, 2026
@yuanchen8911 requested a review from mchmarny on April 30, 2026 16:43
@yuanchen8911 enabled auto-merge (squash) on April 30, 2026 16:45
@yuanchen8911 requested a review from lockwobr on April 30, 2026 16:46
@yuanchen8911 merged commit d66ba76 into NVIDIA:main on Apr 30, 2026
48 checks passed
