fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724
The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md carry per-cluster
scheduling boilerplate (`podTemplateOverrides` with cluster-specific
tolerations) so the resulting pods land on AICR's tainted GPU nodes.
Each TrainJob author has to repeat this; each demo has to be edited
for per-cluster vocabulary; and the override mechanism keeps changing
upstream (PodTemplateOverrides was deprecated in v2.1, replaced by
RuntimePatches in v2.2 — kubeflow/trainer#3309).
Move the per-cluster scheduling into the runtime instead. AICR's
existing `nodeScheduling.accelerated` bundler injection (already used
by gpu-operator, nfd, nodewright-customizations, kgateway) writes the
CLI flag values into the chart's values.yaml at the listed paths.
kubeflow-trainer was the only manifestFiles-using component without an
`accelerated:` block. This commit adds it and templates the
torch-distributed ClusterTrainingRuntime to consume the injected
values, mirroring nodewright-customizations/manifests/tuning.yaml.
Three coordinated changes:
1. recipes/registry.yaml — add `nodeScheduling.accelerated` block to
the kubeflow-trainer entry. Targets top-level keys
`acceleratedNodeSelector` and `acceleratedTolerations`.
2. recipes/components/kubeflow-trainer/manifests/
torch-distributed-cluster-training-runtime.yaml — replace the
static pod-spec scheduling region with Helm template directives:
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
`index .Values "kubeflow-trainer"` matches the bundler's
`manifest.RenderInput.Values` shape (values nested under
ComponentName). The bundler renders this template at bundle time —
the artifact in `bundle/<NNN>-kubeflow-trainer-post/templates/`
is plain YAML with concrete values substituted.
3. demos/cuj1-eks.md and demos/cuj1-gke.md — drop the entire
`podTemplateOverrides` block. Demo TrainJob is just `trainer:` +
`runtimeRef:`.
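To make the end state concrete, here is a sketch of what the bare demo TrainJob looks like after this change. The `pytorch-mnist` name comes from the validation run described in this commit; the field names follow the kubeflow-trainer v2 TrainJob API, and the node count and GPU request are illustrative values, not taken from the demos:

```yaml
# Bare demo TrainJob: no podTemplateOverrides, no runtimePatches.
# Scheduling (nodeSelector + tolerations) is inherited from the
# torch-distributed ClusterTrainingRuntime baked at bundle time.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed
  trainer:
    numNodes: 1            # illustrative
    resourcesPerNode:
      limits:
        nvidia.com/gpu: 1  # illustrative
```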
API-version-agnostic: works on kubeflow-trainer v2.1 (PodTemplateOverrides
era) and v2.2+ (RuntimePatches era) identically, because the TrainJob
no longer overrides anything — the runtime carries the scheduling.
Validated end-to-end on a real EKS H100 cluster:
helm-upgrade kubeflow-trainer-post → CTR live with baked tolerations
+ nodeSelector → bare pytorch-mnist TrainJob admits, schedules with
the correct tolerations + nodeSelector inherited from the runtime,
trains to completion (accuracy=0.7424 in 21s).
`pkg/recipe.TestManifestHelmHooksRequired` still passes — the
`helm.sh/hook` annotations are preserved.
Summary
Bake the cluster-aware `nodeSelector` + `tolerations` into the `torch-distributed` ClusterTrainingRuntime, using AICR's existing `nodeScheduling.accelerated` bundler injection. Demos go back to bare-bones TrainJobs (no `podTemplateOverrides`, no `runtimePatches`).
Motivation / Context
The pytorch demo TrainJobs in `demos/cuj1-{eks,gke}.md` currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat it; each demo has to be edited for per-cluster vocabulary; and the override mechanism keeps changing upstream (`PodTemplateOverrides` deprecated in v2.1 → replaced by `RuntimePatches` in v2.2; see kubeflow/trainer#3309).
This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via `nodeScheduling.accelerated` paths declared in `recipes/registry.yaml` — already used by `gpu-operator`, `nfd`, `nodewright-customizations`, and `kgateway`. `kubeflow-trainer` was the only manifestFiles-using component without an `accelerated:` block. This PR adds it.
End state for users: same `--accelerated-node-selector` / `--accelerated-node-toleration` CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.
API-version-agnostic. Works on kubeflow-trainer v2.1 (`PodTemplateOverrides` era) and v2.2+ (`RuntimePatches` era) identically, because the TrainJob no longer overrides anything — the runtime carries the scheduling.
Fixes: N/A
Related: kubeflow/trainer#3309
Type of Change
Component(s) Affected
Implementation Notes
Three coordinated changes (4 files, +34/-12 net):
`recipes/registry.yaml` — add `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry. `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). Identical pattern to `gpu-operator` (`daemonsets.nodeSelector` / `daemonsets.tolerations`); just chose top-level for readability.
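Based on the path names stated above, the new registry block plausibly looks like the following. This is a sketch: the surrounding registry entry schema (the `name:` key and any sibling fields) is assumed, not copied from the repo; only the `nodeSelectorPaths` / `tolerationPaths` values are taken from this PR's description:

```yaml
# Sketch of the kubeflow-trainer entry in recipes/registry.yaml
# (other fields of the entry elided; only the new block is shown).
- name: kubeflow-trainer
  nodeScheduling:
    accelerated:
      nodeSelectorPaths: [acceleratedNodeSelector]
      tolerationPaths: [acceleratedTolerations]
```

The `gpu-operator` entry follows the same pattern but targets nested keys (`daemonsets.nodeSelector` / `daemonsets.tolerations`); top-level keys were chosen here for readability.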
`recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:
```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```
`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.
The bundler renders this template at bundle time, so the `bundle/<NNN>-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.
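As a hypothetical illustration of that rendered artifact: if the bundle were built with an accelerated node selector of `node.kubernetes.io/instance-type: p5.48xlarge` and a toleration for a `nvidia.com/gpu=present:NoSchedule` taint (these specific labels and taints are invented for the example, not taken from the PR), the templated region of the CTR would come out as plain YAML like:

```yaml
# Rendered output in bundle/<NNN>-kubeflow-trainer-post/templates/
# — concrete values substituted, no template directives remain.
nodeSelector:
  node.kubernetes.io/instance-type: p5.48xlarge
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```

Because the `{{- with }}` guards skip empty values, a bundle built without the accelerated flags simply omits both keys.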
What stays the same:
Testing
```bash
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
OK

go test ./pkg/recipe/
ok      github.com/NVIDIA/aicr/pkg/recipe       0.845s
```
End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):
Risk Assessment
Rollout notes: Existing clusters re-bundling get the new templated CTR on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work — those override mechanisms are additive, this PR just removes the need for them in the AICR-standard demo flow.
Checklist