Merged
fix(recipes): drop hook-succeeded from torch-distributed runtime
The torch-distributed ClusterTrainingRuntime (CTR) declared its delete policy as
helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded. Helm applies
hook-succeeded literally: once the post-install or post-upgrade hook succeeds,
Helm deletes the hook resource. Here the hook resource is the CTR itself, so
every install created the CTR and immediately deleted it, leaving the cluster
with no torch-distributed runtime.

Symptom: any TrainJob referencing runtimeRef.name=torch-distributed (e.g.
the pytorch-mnist demo in demos/cuj1-eks.md) is rejected by the trainer
admission webhook with "ClusterTrainingRuntime torch-distributed not found".

Fix: drop only ,hook-succeeded. Keep the helm.sh/hook annotation (project
convention enforced by pkg/recipe.TestManifestHelmHooksRequired) and
before-hook-creation (idempotent re-install). Without hook-succeeded, the
CTR persists between installs.
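
A minimal sketch, not part of this commit, of how a regression like this could be caught before install. It assumes the annotation value is available as a plain string; the helper name deleted_after_success is hypothetical and is not the project's pkg/recipe test.

```python
def deleted_after_success(delete_policy: str) -> bool:
    """Return True if the helm.sh/hook-delete-policy value includes hook-succeeded.

    Hypothetical helper for illustration only. The value is a comma-separated
    list of policies, e.g. "before-hook-creation,hook-succeeded".
    """
    policies = {p.strip() for p in delete_policy.split(",") if p.strip()}
    return "hook-succeeded" in policies


# Values taken from the diff in this commit.
old_policy = "before-hook-creation,hook-succeeded"
new_policy = "before-hook-creation"

print(deleted_after_success(old_policy))  # True: the resource is removed after the hook succeeds
print(deleted_after_success(new_policy))  # False: the resource persists between installs
```

A check of this shape only makes sense for hook resources that must outlive the hook, such as the CTR; for one-shot hook Jobs, hook-succeeded is usually the intended cleanup behavior.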

Verified end-to-end on a real EKS H100 cluster: with the fix applied the
CTR is present after install, the demo TrainJob is admitted, and a
1-epoch pytorch-mnist run completes (accuracy 0.74).

The bug has existed since the manifest was first introduced in #94
(Feb 2026); confirmed by git log -p on both the original embedded path
and the current recipes/ path.
yuanchen8911 committed Apr 30, 2026
commit 7c4d02a29ecfc8a60da8ccdbe67d3ec10fa71c3e
@@ -21,7 +21,7 @@ metadata:
   annotations:
     "helm.sh/hook": post-install,post-upgrade
     "helm.sh/hook-weight": "5"
-    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
+    "helm.sh/hook-delete-policy": before-hook-creation
 spec:
   mlPolicy:
     numNodes: 1