fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724

Merged: yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-v2.2-durable on Apr 30, 2026

Conversation

yuanchen8911 (Contributor) commented Apr 30, 2026

Summary

Bake the cluster-aware nodeSelector + tolerations into the torch-distributed ClusterTrainingRuntime, using AICR's existing nodeScheduling.accelerated bundler injection. Demos go back to bare-bones TrainJobs (no podTemplateOverrides, no runtimePatches).

Motivation / Context

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat it; each demo has to be edited for each cluster's scheduling vocabulary; and the override mechanism keeps changing upstream (PodTemplateOverrides deprecated in v2.1 → replaced by RuntimePatches in v2.2; see kubeflow/trainer#3309).

This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via nodeScheduling.accelerated paths declared in recipes/registry.yaml, a mechanism also used by gpu-operator, nfd, nodewright-customizations, and kgateway. kubeflow-trainer was the only manifestFiles-using component without an accelerated: block. This PR adds it.

End state for users: same --accelerated-node-selector / --accelerated-node-toleration CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.
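
For illustration, a bundle-time invocation might look like the sketch below. The bundler command and the flag-value syntax are assumptions; only the two `--accelerated-*` flag names come from this PR, and the values shown match the EKS cluster used in the Testing section.

```bash
# Illustrative only: <bundler-command> and the key=value[:Effect] syntax are
# placeholders; the flag names are the real AICR bundle-time flags.
<bundler-command> \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
  --accelerated-node-toleration dedicated=worker-workload:NoExecute
```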

API-version-agnostic. Works on kubeflow-trainer v2.1 (PodTemplateOverrides era) and v2.2+ (RuntimePatches era) identically, because the TrainJob no longer overrides anything — the runtime carries the scheduling.

Fixes: N/A
Related: kubeflow/trainer#3309

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (`pkg/recipe`)

Implementation Notes

Three coordinated changes (4 files, +34/-12 net):

  1. `recipes/registry.yaml` — add `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry. `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). Identical pattern to `gpu-operator` (`daemonsets.nodeSelector` / `daemonsets.tolerations`); just chose top-level for readability.
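
A sketch of what that registry entry could look like; the surrounding component fields are elided and the exact layout of the kubeflow-trainer entry is assumed, only the `nodeScheduling.accelerated` keys and path values come from this PR:

```yaml
# recipes/registry.yaml (illustrative excerpt; surrounding fields elided)
- name: kubeflow-trainer
  # ...existing component definition...
  nodeScheduling:
    accelerated:
      nodeSelectorPaths: [acceleratedNodeSelector]
      tolerationPaths: [acceleratedTolerations]
```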

  2. `recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:

```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```

`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.

The bundler renders this template at bundle time, so the `bundle/<NNN>-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.
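
For illustration, with the EKS values from the Testing section below, the rendered scheduling excerpt in the bundled artifact would look roughly like this (a sketch of the field layout, not verbatim bundler output):

```yaml
# Rendered bundle artifact, scheduling excerpt (illustrative)
nodeSelector:
  nodeGroup: gpu-worker
tolerations:
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoSchedule
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoExecute
```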

  3. `demos/cuj1-eks.md` and `demos/cuj1-gke.md` — drop the entire `podTemplateOverrides` block. Demo TrainJob is just `trainer:` + `runtimeRef:`; see the sketch below.
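
A minimal sketch of the resulting bare-bones TrainJob shape (the apiVersion, metadata, and trainer fields here are illustrative, not copied from the demos):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # Kubeflow Trainer v2 API group (assumed here)
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed    # scheduling is inherited from this ClusterTrainingRuntime
  trainer:
    numNodes: 1                # illustrative; the demos define their own trainer spec
```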

What stays the same:

  • `helm.sh/hook` annotations (still required by `pkg/recipe.TestManifestHelmHooksRequired`).
  • Bundler CLI flags (`--accelerated-node-selector`, `--accelerated-node-toleration`).
  • No bundler Go changes; no new patterns; no precedent broken.

Testing

```bash
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok  github.com/NVIDIA/aicr/pkg/recipe  0.845s
```

End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):

  1. `helm upgrade kubeflow-trainer-post` from this branch's bundle → the ClusterTrainingRuntime (CTR) is live with the baked-in tolerations and nodeSelector.
  2. Apply the bare-bones TrainJob from `demos/cuj1-eks.md` literally (no `podTemplateOverrides`, no `runtimePatches`). Admission accepts it; the pod is scheduled onto a GPU node with the `dedicated=worker-workload:NoSchedule|NoExecute` tolerations and `nodeGroup=gpu-worker` nodeSelector inherited from the runtime; `pytorch-mnist` runs to completion in 21s with `accuracy=0.7424`.

Risk Assessment

  • Low — Isolated change, validated end-to-end, easy to revert.

Rollout notes: Existing clusters that re-bundle get the new templated CTR on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work — those override mechanisms are additive, and this PR simply removes the need for them in the AICR-standard demo flow.

Checklist

  • Tests pass locally (`go test ./pkg/recipe/`, `yamllint`)
  • Linter passes (`yamllint`)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — uses existing `nodeScheduling.accelerated` injection paths covered by existing bundler tests)
  • I updated docs if user-facing behavior changed (`demos/cuj1-{eks,gke}.md` updated)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (`git commit -S`)

coderabbitai (bot) commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between c904fed and 1d77c49.

📒 Files selected for processing (4)
  • demos/cuj1-eks.md
  • demos/cuj1-gke.md
  • recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
  • recipes/registry.yaml

📝 Walkthrough

Examples no longer inject nodeSelector and tolerations via runtimePatches/podTemplateOverrides; they rely on runtimeRef to the torch-distributed ClusterTrainingRuntime and on scheduling values applied at bundle time (--accelerated-node-selector, --accelerated-node-toleration). The ClusterTrainingRuntime Helm template now conditionally renders nodeSelector and tolerations from acceleratedNodeSelector and acceleratedTolerations. The component registry (recipes/registry.yaml) gained a nodeScheduling.accelerated section to expose those Helm values for per-cluster scheduling injection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Title check (⚠️ Warning): The title describes a fix for kubeflow-trainer v2.2.0 API changes, but the changeset is actually about refactoring demo scheduling constraints from podTemplateOverrides/runtimePatches into the torch-distributed ClusterTrainingRuntime via bundler injection, which is orthogonal to upstream API breakage. Resolution: revise the title to reflect the actual change, e.g. 'fix(recipes): bake cluster scheduling into kubeflow-trainer runtime' or 'refactor(demos): move scheduling from podTemplateOverrides to bundler injection'.

✅ Passed checks (3 passed)

  • Description check (✅ Passed): The description comprehensively explains the motivation (API churn from v2.1 → v2.2), the implementation (registry.yaml, Helm templating, demo updates), testing results, and risk assessment, all directly related to the changeset.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 30, 2026 22:29
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 30, 2026 22:29
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 30, 2026 22:29
github-actions (bot) commented:

@yuanchen8911 this PR now has merge conflicts with main. Please rebase to resolve them.

@yuanchen8911 yuanchen8911 requested a review from lockwobr April 30, 2026 22:30
@yuanchen8911 yuanchen8911 changed the title from "fix(recipes): bake AICR scheduling into torch-distributed runtime" to "fix(recipes): handle kubeflow-trainer v2.2.0 API changes" on Apr 30, 2026

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md carry per-cluster
scheduling boilerplate (`podTemplateOverrides` with cluster-specific
tolerations) so the resulting pods land on AICR's tainted GPU nodes.
Each TrainJob author has to repeat this; each demo has to be edited
for each cluster's scheduling vocabulary; and the override mechanism
keeps changing upstream (PodTemplateOverrides was deprecated in v2.1,
replaced by RuntimePatches in v2.2 — kubeflow/trainer#3309).

Move the per-cluster scheduling into the runtime instead. AICR's
existing `nodeScheduling.accelerated` bundler injection (already used
by gpu-operator, nfd, nodewright-customizations, kgateway) writes the
CLI flag values into the chart's values.yaml at the listed paths.
kubeflow-trainer was the only manifestFiles-using component without an
`accelerated:` block. This commit adds it and templates the
torch-distributed ClusterTrainingRuntime to consume the injected
values, mirroring nodewright-customizations/manifests/tuning.yaml.

Three coordinated changes:

1. recipes/registry.yaml — add `nodeScheduling.accelerated` block to
   the kubeflow-trainer entry. Targets top-level keys
   `acceleratedNodeSelector` and `acceleratedTolerations`.

2. recipes/components/kubeflow-trainer/manifests/
   torch-distributed-cluster-training-runtime.yaml — replace the
   static pod-spec scheduling region with Helm template directives:

       {{- $kft := index .Values "kubeflow-trainer" }}
       {{- with $kft.acceleratedNodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 20 }}
       {{- end }}
       {{- with $kft.acceleratedTolerations }}
       tolerations:
         {{- toYaml . | nindent 20 }}
       {{- end }}

   `index .Values "kubeflow-trainer"` matches the bundler's
   `manifest.RenderInput.Values` shape (values nested under
   ComponentName). The bundler renders this template at bundle time —
   the artifact in `bundle/<NNN>-kubeflow-trainer-post/templates/`
   is plain YAML with concrete values substituted.

3. demos/cuj1-eks.md and demos/cuj1-gke.md — drop the entire
   `podTemplateOverrides` block. Demo TrainJob is just `trainer:` +
   `runtimeRef:`.

API-version-agnostic: works on kubeflow-trainer v2.1 (PodTemplateOverrides
era) and v2.2+ (RuntimePatches era) identically, because the TrainJob
no longer overrides anything — the runtime carries the scheduling.

Validated end-to-end on a real EKS H100 cluster:
helm upgrade kubeflow-trainer-post → CTR live with baked tolerations
+ nodeSelector → bare pytorch-mnist TrainJob admits, schedules with
the correct tolerations + nodeSelector inherited from the runtime,
trains to completion (accuracy=0.7424 in 21s).

`pkg/recipe.TestManifestHelmHooksRequired` still passes — the
`helm.sh/hook` annotations are preserved.
@yuanchen8911 yuanchen8911 merged commit 604a324 into NVIDIA:main Apr 30, 2026
85 checks passed
