fix(recipes): drop hook-succeeded from torch-distributed runtime#719
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA single-line annotation in Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
The torch-distributed ClusterTrainingRuntime declared its delete policy as hook-delete-policy: before-hook-creation,hook-succeeded. Helm interprets hook-succeeded literally — after the post-install hook runs successfully, the resource is deleted. So every install would create the CTR and immediately delete it, leaving the cluster with no torch-distributed runtime. Symptom: any TrainJob referencing runtimeRef.name=torch-distributed (e.g. the pytorch-mnist demo in demos/cuj1-eks.md) is rejected by the trainer admission webhook with "ClusterTrainingRuntime torch-distributed not found". Fix: drop only ,hook-succeeded. Keep the helm.sh/hook annotation (project convention enforced by pkg/recipe.TestManifestHelmHooksRequired) and before-hook-creation (idempotent re-install). Without hook-succeeded, the CTR persists between installs. Verified end-to-end on a real EKS H100 cluster: with the fix applied the CTR is present after install, the demo TrainJob is admitted, and a 1-epoch pytorch-mnist run completes (accuracy 0.74). The bug has existed since the manifest was first introduced in NVIDIA#94 (Feb 2026); confirmed by git log -p on both the original embedded path and the current recipes/ path.
b7f7b15 to
7c4d02a
Compare
Summary
Remove
hook-succeededfrom thetorch-distributedClusterTrainingRuntime's Helm hook delete policy so the resource persists after install instead of being deleted, unblocking any TrainJob that references it (e.g. thepytorch-mnistdemo indemos/cuj1-eks.md).Motivation / Context
recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yamldeclared its delete policy as:Helm interprets
hook-succeededliterally — after the post-install hook runs successfully, the resource is deleted. So every install creates the CTR and immediately deletes it.Symptom (reproduced on a fresh
eks/h100/training/ubuntu/kubeflowdeploy):Bug origin: the manifest was introduced in #94 (Feb 2026) with this delete policy already present; the move-in #114 preserved the file content unchanged. Confirmed via
git log -pon both paths.Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
pkg/recipe)Implementation Notes
Drop only
,hook-succeededfrom the delete policy. The other hook annotations are retained:helm.sh/hook: post-install,post-upgrade— required bypkg/recipe.TestManifestHelmHooksRequired, which enforces every CR-typed manifest inrecipes/components/*/manifests/carry ahelm.sh/hookannotation (or an explicitaicr/skip-hook-validation: "true"opt-out).helm.sh/hook-weight: "5"— unchanged.helm.sh/hook-delete-policy: before-hook-creation— keeps re-install idempotent (Helm deletes the previous CTR before re-applying onhelm upgrade).Without
hook-succeeded, the CTR persists between installs.Testing
YAML manifest change with no Go code touched, but it interacts with
pkg/recipe.TestManifestHelmHooksRequired, so I ranpkg/recipetests in addition to lint:End-to-end on a real EKS H100 cluster (aicr1):
eks/h100/training/ubuntu/kubeflowbundle.hook-succeededfrom the rendered CTR manifest, applied viakubectl apply(proxy for the fix).kubectl get clustertrainingruntimereturnstorch-distributed(previously:No resources found).pytorch-mnistTrainJob fromdemos/cuj1-eks.md— admission accepts, runtime resolves, worker pod schedules and runs to completion (1-epoch MNIST, accuracy 0.74).Risk Assessment
Rollout notes: On clusters that previously deployed with the broken hook, the CTR is currently absent (deleted by
hook-succeeded). On the nexthelm upgrade, Helm will create the CTR fresh as a hook resource — there is no broken state to adopt. No migration steps required.Checklist
go test ./pkg/recipe/, yamllint)yamllint)TestManifestHelmHooksRequiredexists and now passes)demos/cuj1-eks.md)git commit -S)