fix(recipes): handle kubeflow-trainer v2.2.0 API changes #724

Merged: yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/kubeflow-trainer-v2.2-durable on Apr 30, 2026

Conversation

yuanchen8911 (Contributor) commented Apr 30, 2026

Summary

Bake the cluster-aware nodeSelector + tolerations into the torch-distributed ClusterTrainingRuntime, using AICR's existing nodeScheduling.accelerated bundler injection. Demos go back to bare-bones TrainJobs (no podTemplateOverrides, no runtimePatches).

Motivation / Context

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md currently carry per-cluster scheduling boilerplate so the resulting pods land on AICR's tainted GPU nodes. Each TrainJob author has to repeat it; each demo has to be edited for each cluster's scheduling vocabulary; and the override mechanism keeps changing upstream (PodTemplateOverrides deprecated in v2.1 → replaced by RuntimePatches in v2.2; see kubeflow/trainer#3309).

This PR moves the per-cluster scheduling into the runtime itself. The bundler already supports this via nodeScheduling.accelerated paths declared in recipes/registry.yaml, a mechanism also used by gpu-operator, nfd, nodewright-customizations, and kgateway. kubeflow-trainer was the only manifestFiles-using component without an accelerated: block. This PR adds it.

End state for users: same --accelerated-node-selector / --accelerated-node-toleration CLI flags at bundle time. Different cluster, different vocabulary, same demo TrainJob YAML.
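
For illustration, a bundle-time invocation might look like the sketch below. The bundler command and the flag-value syntax are assumptions; only the two `--accelerated-*` flag names come from this PR, and the values shown match the EKS cluster used in the Testing section.

```bash
# Illustrative only: <bundler-command> and the key=value[:Effect] syntax are
# placeholders; the flag names are the real AICR bundle-time flags.
<bundler-command> \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
  --accelerated-node-toleration dedicated=worker-workload:NoExecute
```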

API-version-agnostic. Works on kubeflow-trainer v2.1 (PodTemplateOverrides era) and v2.2+ (RuntimePatches era) identically, because the TrainJob no longer overrides anything — the runtime carries the scheduling.

Fixes: N/A
Related: kubeflow/trainer#3309

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (`pkg/recipe`)

Implementation Notes

Three coordinated changes (4 files, +34/-12 net):

  1. `recipes/registry.yaml` — add `nodeScheduling.accelerated` block to the `kubeflow-trainer` component entry. `nodeSelectorPaths: [acceleratedNodeSelector]` and `tolerationPaths: [acceleratedTolerations]` (top-level keys). Identical pattern to `gpu-operator` (`daemonsets.nodeSelector` / `daemonsets.tolerations`); just chose top-level for readability.
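
A sketch of what that registry entry could look like; the surrounding component fields are elided and the exact layout of the kubeflow-trainer entry is assumed, only the `nodeScheduling.accelerated` keys and path values come from this PR:

```yaml
# recipes/registry.yaml (illustrative excerpt; surrounding fields elided)
- name: kubeflow-trainer
  # ...existing component definition...
  nodeScheduling:
    accelerated:
      nodeSelectorPaths: [acceleratedNodeSelector]
      tolerationPaths: [acceleratedTolerations]
```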

  2. `recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml` — replace the static pod-spec scheduling region with Helm template directives:

```yaml
{{- $kft := index .Values "kubeflow-trainer" }}
{{- with $kft.acceleratedNodeSelector }}
nodeSelector:
{{- toYaml . | nindent 20 }}
{{- end }}
{{- with $kft.acceleratedTolerations }}
tolerations:
{{- toYaml . | nindent 20 }}
{{- end }}
```

`index .Values "kubeflow-trainer"` matches the bundler's `manifest.RenderInput.Values` shape (values nested under ComponentName). Same access pattern as `nodewright-customizations/manifests/tuning.yaml`.

The bundler renders this template at bundle time, so the `bundle/<NNN>-kubeflow-trainer-post/templates/` artifact is plain YAML with concrete values substituted — Helm at install time just applies it as-is.
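
For illustration, with the EKS values from the Testing section below, the rendered scheduling excerpt in the bundled artifact would look roughly like this (a sketch of the field layout, not verbatim bundler output):

```yaml
# Rendered bundle artifact, scheduling excerpt (illustrative)
nodeSelector:
  nodeGroup: gpu-worker
tolerations:
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoSchedule
  - key: dedicated
    operator: Equal
    value: worker-workload
    effect: NoExecute
```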

  3. `demos/cuj1-eks.md` and `demos/cuj1-gke.md` — drop the entire `podTemplateOverrides` block. Demo TrainJob is just `trainer:` + `runtimeRef:`; see the sketch below.
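
A minimal sketch of the resulting bare-bones TrainJob shape (the apiVersion, metadata, and trainer fields here are illustrative, not copied from the demos):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1   # Kubeflow Trainer v2 API group (assumed here)
kind: TrainJob
metadata:
  name: pytorch-mnist
spec:
  runtimeRef:
    name: torch-distributed    # scheduling is inherited from this ClusterTrainingRuntime
  trainer:
    numNodes: 1                # illustrative; the demos define their own trainer spec
```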

What stays the same:

  • `helm.sh/hook` annotations (still required by `pkg/recipe.TestManifestHelmHooksRequired`).
  • Bundler CLI flags (`--accelerated-node-selector`, `--accelerated-node-toleration`).
  • No bundler Go changes; no new patterns; no precedent broken.

Testing

```bash
yamllint -c .yamllint.yaml \
  recipes/registry.yaml \
  recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
# OK

go test ./pkg/recipe/
# ok  github.com/NVIDIA/aicr/pkg/recipe  0.845s
```

End-to-end on a real EKS H100 cluster (kubeflow-trainer v2.2.0):

  1. `helm upgrade kubeflow-trainer-post` from this branch's bundle → the ClusterTrainingRuntime (CTR) is live with the baked-in tolerations and nodeSelector.
  2. Apply the bare-bones TrainJob from `demos/cuj1-eks.md` literally (no `podTemplateOverrides`, no `runtimePatches`). Admission accepts it; the pod is scheduled onto a GPU node with the `dedicated=worker-workload:NoSchedule|NoExecute` tolerations and `nodeGroup=gpu-worker` nodeSelector inherited from the runtime; `pytorch-mnist` runs to completion in 21s with `accuracy=0.7424`.

Risk Assessment

  • Low — Isolated change, validated end-to-end, easy to revert.

Rollout notes: Existing clusters that re-bundle get the new templated CTR on the next `helm upgrade kubeflow-trainer-post`. Backwards-compatible: TrainJobs that still use `podTemplateOverrides` (v2.1) or `runtimePatches` (v2.2) continue to work — those override mechanisms are additive, and this PR simply removes the need for them in the AICR-standard demo flow.

Checklist

  • Tests pass locally (`go test ./pkg/recipe/`, `yamllint`)
  • Linter passes (`yamllint`)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — uses existing `nodeScheduling.accelerated` injection paths covered by existing bundler tests)
  • I updated docs if user-facing behavior changed (`demos/cuj1-{eks,gke}.md` updated)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (`git commit -S`)

coderabbitai (bot) commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between c904fed and 1d77c49.

📒 Files selected for processing (4)
  • demos/cuj1-eks.md
  • demos/cuj1-gke.md
  • recipes/components/kubeflow-trainer/manifests/torch-distributed-cluster-training-runtime.yaml
  • recipes/registry.yaml

📝 Walkthrough

Examples no longer inject nodeSelector and tolerations via runtimePatches/podTemplateOverrides; they rely on runtimeRef to the torch-distributed ClusterTrainingRuntime and on scheduling values applied at bundle time (--accelerated-node-selector, --accelerated-node-toleration). The ClusterTrainingRuntime Helm template now conditionally renders nodeSelector and tolerations from acceleratedNodeSelector and acceleratedTolerations. The component registry (recipes/registry.yaml) gained a nodeScheduling.accelerated section to expose those Helm values for per-cluster scheduling injection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Title check (⚠️ Warning): The title describes a fix for kubeflow-trainer v2.2.0 API changes, but the changeset is actually about refactoring demo scheduling constraints from podTemplateOverrides/runtimePatches into the torch-distributed ClusterTrainingRuntime via bundler injection, which is orthogonal to upstream API breakage. Resolution: revise the title to reflect the actual change, e.g. 'fix(recipes): bake cluster scheduling into kubeflow-trainer runtime' or 'refactor(demos): move scheduling from podTemplateOverrides to bundler injection'.

✅ Passed checks (3 passed)

  • Description check (✅ Passed): The description comprehensively explains the motivation (API churn from v2.1 → v2.2), the implementation (registry.yaml, Helm templating, demo updates), testing results, and risk assessment, all directly related to the changeset.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 30, 2026 22:29
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 30, 2026 22:29
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 30, 2026 22:29
github-actions (bot) commented:

@yuanchen8911 this PR now has merge conflicts with main. Please rebase to resolve them.

@yuanchen8911 yuanchen8911 requested a review from lockwobr April 30, 2026 22:30
@yuanchen8911 yuanchen8911 changed the title from "fix(recipes): bake AICR scheduling into torch-distributed runtime" to "fix(recipes): handle kubeflow-trainer v2.2.0 API changes" on Apr 30, 2026

The pytorch demo TrainJobs in demos/cuj1-{eks,gke}.md carry per-cluster
scheduling boilerplate (`podTemplateOverrides` with cluster-specific
tolerations) so the resulting pods land on AICR's tainted GPU nodes.
Each TrainJob author has to repeat this; each demo has to be edited
for each cluster's scheduling vocabulary; and the override mechanism
keeps changing upstream (PodTemplateOverrides was deprecated in v2.1,
replaced by RuntimePatches in v2.2 — kubeflow/trainer#3309).

Move the per-cluster scheduling into the runtime instead. AICR's
existing `nodeScheduling.accelerated` bundler injection (already used
by gpu-operator, nfd, nodewright-customizations, kgateway) writes the
CLI flag values into the chart's values.yaml at the listed paths.
kubeflow-trainer was the only manifestFiles-using component without an
`accelerated:` block. This commit adds it and templates the
torch-distributed ClusterTrainingRuntime to consume the injected
values, mirroring nodewright-customizations/manifests/tuning.yaml.

Three coordinated changes:

1. recipes/registry.yaml — add `nodeScheduling.accelerated` block to
   the kubeflow-trainer entry. Targets top-level keys
   `acceleratedNodeSelector` and `acceleratedTolerations`.

2. recipes/components/kubeflow-trainer/manifests/
   torch-distributed-cluster-training-runtime.yaml — replace the
   static pod-spec scheduling region with Helm template directives:

       {{- $kft := index .Values "kubeflow-trainer" }}
       {{- with $kft.acceleratedNodeSelector }}
       nodeSelector:
         {{- toYaml . | nindent 20 }}
       {{- end }}
       {{- with $kft.acceleratedTolerations }}
       tolerations:
         {{- toYaml . | nindent 20 }}
       {{- end }}

   `index .Values "kubeflow-trainer"` matches the bundler's
   `manifest.RenderInput.Values` shape (values nested under
   ComponentName). The bundler renders this template at bundle time —
   the artifact in `bundle/<NNN>-kubeflow-trainer-post/templates/`
   is plain YAML with concrete values substituted.

3. demos/cuj1-eks.md and demos/cuj1-gke.md — drop the entire
   `podTemplateOverrides` block. Demo TrainJob is just `trainer:` +
   `runtimeRef:`.

API-version-agnostic: works on kubeflow-trainer v2.1 (PodTemplateOverrides
era) and v2.2+ (RuntimePatches era) identically, because the TrainJob
no longer overrides anything — the runtime carries the scheduling.

Validated end-to-end on a real EKS H100 cluster:
helm upgrade kubeflow-trainer-post → CTR live with baked tolerations
+ nodeSelector → bare pytorch-mnist TrainJob admits, schedules with
the correct tolerations + nodeSelector inherited from the runtime,
trains to completion (accuracy=0.7424 in 21s).

`pkg/recipe.TestManifestHelmHooksRequired` still passes — the
`helm.sh/hook` annotations are preserved.
@yuanchen8911 yuanchen8911 merged commit 604a324 into NVIDIA:main Apr 30, 2026
85 checks passed
