fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724
Commit 1d77c49
The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md carry per-cluster scheduling boilerplate (`podTemplateOverrides` with cluster-specific tolerations) so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat this; each demo has to be edited to match per-cluster vocabulary; and the override mechanism keeps changing upstream (PodTemplateOverrides was deprecated in v2.1 and replaced by RuntimePatches in v2.2; see kubeflow/trainer#3309).

Move the per-cluster scheduling into the runtime instead. AICR's existing `nodeScheduling.accelerated` bundler injection (already used by gpu-operator, nfd, nodewright-customizations, kgateway) writes the CLI flag values into the chart's values.yaml at the listed paths. kubeflow-trainer was the only manifestFiles-using component without an `accelerated:` block. This commit adds it and templates the torch-distributed ClusterTrainingRuntime to consume the injected values, mirroring nodewright-customizations/manifests/tuning.yaml.

Three coordinated changes:

1. recipes/registry.yaml: add a `nodeScheduling.accelerated` block to the kubeflow-trainer entry. It targets the top-level keys `acceleratedNodeSelector` and `acceleratedTolerations`.

2. recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml: replace the static pod-spec scheduling region with Helm template directives:

       {{- $kft := index .Values "kubeflow-trainer" }}
       {{- with $kft.acceleratedNodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 20 }}
       {{- end }}
       {{- with $kft.acceleratedTolerations }}
       tolerations:
         {{- toYaml . | nindent 20 }}
       {{- end }}

   `index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). The bundler renders this template at bundle time; the artifact in `bundle/<NNN>-kubeflow-trainer-post/templates/` is plain YAML with concrete values substituted.

3. demos/cuj1-eks.md and demos/cuj1-gke.md: drop the entire `podTemplateOverrides` block. The demo TrainJob is now just `trainer:` + `runtimeRef:`.

The result is API-version-agnostic: it works identically on kubeflow-trainer v2.1 (PodTemplateOverrides era) and v2.2+ (RuntimePatches era), because the TrainJob no longer overrides anything and the runtime carries the scheduling.

Validated end-to-end on a real EKS H100 cluster: helm-upgrade kubeflow-trainer-post → CTR live with baked tolerations + nodeSelector → bare pytorch-mnist TrainJob admits, schedules with the correct tolerations + nodeSelector inherited from the runtime, trains to completion (accuracy=0.7424 in 21s).

`pkg/recipe.TestManifestHelmHooksRequired` still passes: the `helm.sh/hook` annotations are preserved.
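
For illustration, here is a sketch of the values shape that the `index .Values "kubeflow-trainer"` lookup would read. The two top-level key names come from the commit message. The label and taint values are hypothetical placeholders, not the values AICR injects on any real cluster.

```yaml
# Sketch of the values the bundler is described as injecting.
# The nesting under "kubeflow-trainer" follows the ComponentName convention
# described in the commit message; the concrete label/taint are assumptions.
kubeflow-trainer:
  acceleratedNodeSelector:
    example.com/node-role: gpu-worker   # hypothetical node label
  acceleratedTolerations:
    - key: example.com/gpu              # hypothetical taint key
      operator: Exists
      effect: NoSchedule
```

With the scheduling carried by the runtime, a demo TrainJob might reduce to something like the following. The name `pytorch-mnist` and the `torch-distributed` runtime are taken from the commit message. The `apiVersion`, the `numNodes` field, and the GPU request are assumptions about the kubeflow-trainer API and should be checked against the installed CRD version.

```yaml
# Minimal TrainJob sketch: only trainer + runtimeRef, no scheduling overrides.
apiVersion: trainer.kubeflow.org/v1alpha1   # assumed API version
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed                 # ClusterTrainingRuntime templated by this change
  trainer:
    numNodes: 1                             # assumed field; adjust to the demo's needs
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1                   # assumed GPU request
```

Because neither `podTemplateOverrides` nor `runtimePatches` appears in the job, the same manifest should apply unchanged across the trainer versions discussed above.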