Changelog

v2.1.0 (2025-11-07)

This is the Kubeflow Trainer v2.1.0 release.

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"

You can now install the controller manager with Helm charts 🚀

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

feat(cache): KEP-2655: Adding default runtime with cache and example (#2928 by @akshaychitneni)
feat(cache): KEP-2655 - Supporting readiness probes on cache nodes (#2920 by @akshaychitneni)
feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)

LLM Post-Training

feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)

Kueue Enhancements

Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)

Volcano Scheduler

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
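
As a rough illustration of what RFC 1035 name validation means for a TrainJob name, here is a minimal sketch. It assumes the Kubernetes DNS-1035 label convention: lowercase letters, digits, and '-', starting with a letter, ending with a letter or digit, and at most 63 characters. The helper name is_dns1035_label is hypothetical, and the exact rule applied by the Trainer webhook may differ.

```python
import re

# Kubernetes-style DNS-1035 label (assumed convention, not taken from Trainer's code):
# starts with a lowercase letter, ends with a lowercase letter or digit,
# and contains only lowercase letters, digits, or '-'.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if name is a valid DNS-1035 label (hypothetical helper)."""
    return (
        len(name) <= DNS1035_MAX_LENGTH
        and DNS1035_LABEL.fullmatch(name) is not None
    )


if __name__ == "__main__":
    # Print each candidate name next to its validation result.
    for candidate in ["pytorch-ddp", "2-gpus", "Train_Job", "trailing-"]:
        print(candidate, is_dns1035_label(candidate))
```

Under this convention a name such as pytorch-ddp passes, while names that start with a digit, contain uppercase letters or underscores, or end with a hyphen are rejected.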

Bug Fixes

[release-2.1] fix(ci): Fix the Kubeflow SDK installation with Docker (#2927 by @andreyvelich)
fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
fix: charts dependencies (#2892 by @ls-2018)
fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
fix: read only permission for PRs (#2829 by @jaiakash)
fix: read only permission for PRs (#2827 by @jaiakash)
fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
fix: teraform for oci gpu based vm (#2810 by @jaiakash)
fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
fix: update kubeflow sdk reference (#2780 by @kramaranya)
fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)

Misc

[release-2.1] feat: Adding local execution example notebook (#2924 by @Fiona-Waters)
feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
[release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
[release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
feat(operator): Add validation for required containers in replicatedJobs (#2722 by @Electronic-Waste)
feat: add controller manager configuration helm chart (#2895 by @kapil27)
chore(ci): Enable Kubernetes API Linter (#2858 by @astefanutti)
feat(runtimes): implement clusterTrainingRuntime deprecation process (#2791 by @tdn21)
feat: add HF token and allow gpu workflow to run from pull request target (#2818 by @jaiakash)
feat(docs): KEP-2442-Support JAX Training Runtime (#2643 by @mahdikhashan)
chore(test): Support e2e cluster setup with Podman (#2861 by @astefanutti)
chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876 by @Electronic-Waste)
feat(docs): Update Trainer diagram and SDK release (#2867 by @andreyvelich)
feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864 by @andreyvelich)
fix(docs): Update the release document to push all changes (#2865 by @andreyvelich)
chore: Install released version of Kubeflow SDK (#2857 by @kramaranya)
chore(ci): Ignore generated files in .gitattributes (#2855 by @andreyvelich)
feat: Add a public function to create runtime info objects (#2837 by @kaisoz)
chore(test): add uts for coscheduling plugin. (#2582 by @IRONICBo)
feat(ci): Add Trivy Vulnerability Scan (#2826 by @andreyvelich)
chore: merge test cases using PodSpecOverrides into a single case (#2822 by @toVersus)
chore(runtimes): update torchtune CTRs with multiple dependson feature in jobset v0.9.0 (#2823 by @Electronic-Waste)
chore(operator): Bump JobSet to v0.9.0 version (#2821 by @andreyvelich)
feat(docs): How to release Python API modules (#2786 by @andreyvelich)
feat: support for managing gpu enabled self runner infra (#2762 by @jaiakash)
chore: Nominate @astefanutti as Kubeflow Trainer approver (#2808 by @andreyvelich)
chore: deflake test to ensure runtime is created before creating trainjob (#2807 by @toVersus)
feat: KEP-2432: GPU Testing for LLM Blueprints (#2689 by @jaiakash)
chore(docs): Add license scan report and status (#2788 by @fossabot)
chore: Remove tool.hatch.build.targets.wheel from pyproject (#2803 by @kramaranya)
chore: Add unit tests for pkg/apply (#2479 by @akagami-harsh)
chore(runtimes): Remove MPI pi Runtime (#2760 by @andreyvelich)
chore(runtimes): Update packages in DeepSpeed runtime and fix T5 example (#2781 by @andreyvelich)
feat: run workflows on /ok-to-test label (#2639 by @milinddethe15)
feat: Add security contexts to controller managers (#2759 by @kunal-511)
feat(docs): Introduce latest news to the README (#2769 by @andreyvelich)
feat(runtimes): Add Framework Label to the Runtimes (#2761 by @andreyvelich)
feat(runtimes): Remove command from the Runtimes with CustomTrainer (#2754 by @andreyvelich)
feat(docs): Kubeflow Trainer ROADMAP 2025 (#2748 by @andreyvelich)
chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 (#2743 by @andreyvelich)
chore: update github runners to oci gh arc runners (#2739 by @koksay)
feat(operator): force trainjob name to be compliant with RFC 1035 for jobset (#2734 by @rudeigerc)
chore(ci): Add GitHub action to verify PR titles (#2724 by @andreyvelich)
feat(docs): Guide to report security vulnerability (#2718 by @andreyvelich)
chore: Upgrade JobSet to version 0.8.2 (#2726 by @astefanutti)
Add Red Hat to ADOPTERS.md (#2714 by @terrytangyuan)
chore(docs): Add Changelog for v2.0.0-rc.1 (#2709 by @andreyvelich)
chore(docs): Update Release Guide (#2710 by @andreyvelich)
chore: Copy generated CRDs into Helm charts (#2703 by @astefanutti)
feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670 by @Electronic-Waste)
feat: Mutable PodSpecOverrides for suspended TrainJob (#2683 by @astefanutti)
chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2695 by @tenzen-y)
chore: Remove the vendor specific parameters (#2691 by @tenzen-y)
KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2382 by @Doris-xm)
chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2685 by @andreyvelich)
chore(helm): Sync ClusterRule in Helm chart (#2686 by @astefanutti)
Add Changelog for Trainer v2.0.0-rc.0 (#2666 by @kramaranya)
feat(initializer): Updated base image to Debian image and changed install commands compatible with Debian image (#2528 by @Debabrata47)

Full Changelog

v2.1.0-rc.1 (2025-11-03)

New Features

feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
[release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
[release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)

Bug Fixes

[release-2.1] fix(manifests): Fix boolean values defaulting in Helm charts (#2914 by @astefanutti)
[release-2.1] fix(runtimes): Update pip version in the MLX runtime (#2910 by @andreyvelich)

Full Changelog

v2.1.0-rc.0 (2025-10-21)

Breaking Changes

feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)

LLM Post-Training

feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)

Kueue Enhancements

Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)

Volcano Scheduler

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)

Bug Fixes

fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
fix: charts dependencies (#2892 by @ls-2018)
fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
fix: read only permission for PRs (#2829 by @jaiakash)
fix: read only permission for PRs (#2827 by @jaiakash)
fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
fix: teraform for oci gpu based vm (#2810 by @jaiakash)
fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
fix: update kubeflow sdk reference (#2780 by @kramaranya)
fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)

Misc

feat(operator): Add validation for required containers in replicatedJobs (#2722 by @Electronic-Waste)
feat: add controller manager configuration helm chart (#2895 by @kapil27)
chore(ci): Enable Kubernetes API Linter (#2858 by @astefanutti)
feat(runtimes): implement clusterTrainingRuntime deprecation process (#2791 by @tdn21)
feat: add HF token and allow gpu workflow to run from pull request target (#2818 by @jaiakash)
feat(docs): KEP-2442-Support JAX Training Runtime (#2643 by @mahdikhashan)
chore(test): Support e2e cluster setup with Podman (#2861 by @astefanutti)
chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876 by @Electronic-Waste)
feat(docs): Update Trainer diagram and SDK release (#2867 by @andreyvelich)
feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864 by @andreyvelich)
fix(docs): Update the release document to push all changes (#2865 by @andreyvelich)
chore: Install released version of Kubeflow SDK (#2857 by @kramaranya)
chore(ci): Ignore generated files in .gitattributes (#2855 by @andreyvelich)
feat: Add a public function to create runtime info objects (#2837 by @kaisoz)
chore(test): add uts for coscheduling plugin. (#2582 by @IRONICBo)
feat(ci): Add Trivy Vulnerability Scan (#2826 by @andreyvelich)
chore: merge test cases using PodSpecOverrides into a single case (#2822 by @toVersus)
chore(runtimes): update torchtune CTRs with multiple dependson feature in jobset v0.9.0 (#2823 by @Electronic-Waste)
chore(operator): Bump JobSet to v0.9.0 version (#2821 by @andreyvelich)
feat(docs): How to release Python API modules (#2786 by @andreyvelich)
feat: support for managing gpu enabled self runner infra (#2762 by @jaiakash)
chore: Nominate @astefanutti as Kubeflow Trainer approver (#2808 by @andreyvelich)
chore: deflake test to ensure runtime is created before creating trainjob (#2807 by @toVersus)
feat: KEP-2432: GPU Testing for LLM Blueprints (#2689 by @jaiakash)
chore(docs): Add license scan report and status (#2788 by @fossabot)
chore: Remove tool.hatch.build.targets.wheel from pyproject (#2803 by @kramaranya)
chore: Add unit tests for pkg/apply (#2479 by @akagami-harsh)
chore(runtimes): Remove MPI pi Runtime (#2760 by @andreyvelich)
chore(runtimes): Update packages in DeepSpeed runtime and fix T5 example (#2781 by @andreyvelich)
feat: run workflows on /ok-to-test label (#2639 by @milinddethe15)
feat: Add security contexts to controller managers (#2759 by @kunal-511)
feat(docs): Introduce latest news to the README (#2769 by @andreyvelich)
feat(runtimes): Add Framework Label to the Runtimes (#2761 by @andreyvelich)
feat(runtimes): Remove command from the Runtimes with CustomTrainer (#2754 by @andreyvelich)
feat(docs): Kubeflow Trainer ROADMAP 2025 (#2748 by @andreyvelich)
chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 (#2743 by @andreyvelich)
chore: update github runners to oci gh arc runners (#2739 by @koksay)
feat(operator): force trainjob name to be compliant with RFC 1035 for jobset (#2734 by @rudeigerc)
chore(ci): Add GitHub action to verify PR titles (#2724 by @andreyvelich)
feat(docs): Guide to report security vulnerability (#2718 by @andreyvelich)
chore: Upgrade JobSet to version 0.8.2 (#2726 by @astefanutti)
Add Red Hat to ADOPTERS.md (#2714 by @terrytangyuan)
chore(docs): Add Changelog for v2.0.0-rc.1 (#2709 by @andreyvelich)
chore(docs): Update Release Guide (#2710 by @andreyvelich)
chore: Copy generated CRDs into Helm charts (#2703 by @astefanutti)
feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670 by @Electronic-Waste)
feat: Mutable PodSpecOverrides for suspended TrainJob (#2683 by @astefanutti)
chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2695 by @tenzen-y)
chore: Remove the vendor specific parameters (#2691 by @tenzen-y)
KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2382 by @Doris-xm)
chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2685 by @andreyvelich)
chore(helm): Sync ClusterRule in Helm chart (#2686 by @astefanutti)
Add Changelog for Trainer v2.0.0-rc.0 (#2666 by @kramaranya)
feat(initializer): Updated base image to Debian image and changed install commands compatible with Debian image (#2528 by @Debabrata47)

Full Changelog

v2.0.1 (2025-09-29)

New Features

[release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

[release-2.0] fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2863 by @andreyvelich)
[release-2.0] fix(ci): Add latest image tag only for the master branch (#2862 by @andreyvelich)
[release-2.0] fix: update examples to reflect func_args now being unpacked (#2815) (#2853 by @astefanutti)
[release-2.0] fix(examples): Update get_job_logs() API in examples (#2813) (#2852 by @astefanutti)
[release-2.0] feat(runtimes): Add Framework Label to the Runtimes (#2761) (#2851 by @astefanutti)
[release-2.0] fix(examples): Update the argument for Runtime framework (#2766) (#2850 by @astefanutti)
[release-2.0] fix: update kubeflow sdk reference (#2780) (#2847 by @astefanutti)
[release-2.0] fix(api): Fix license path for Kubeflow Trainer Python API (#2772 by @andreyvelich)

Full Changelog

v2.0.0 (2025-07-17)

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Breaking Changes

Migrate SDK to the kubeflow/sdk repository (#2657 by @eoinfennessy)
KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implemenet MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

[release-2.0] fix(manifests): add rbac config of events for event recorders (#2733 by @rudeigerc)
[release-2.0] fix(manifests): fix position of labels of dataset-initializer from pod to job (#2720 by @rudeigerc)
[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint (#2489 by @szaher)
fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
Fix missing external types in apply configurations (#2429 by @astefanutti)
Fix API Group for Torch Runtime (#2424 by @andreyvelich)
Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
ControlPlane: Fix flaky integraion testings due to missing the latest version of object (#2414 by @tenzen-y)

Misc

[release-2.0] chore: update github runners to oci gh arc runners (#2741 by @koksay)
[release-2.0] feat(operator): force trainjob name to be compliant with RFC 1035 for jobset (#2736 by @rudeigerc)
[release-2.0] chore: Upgrade JobSet to version 0.8.2 (#2727 by @google-oss-robot)
[release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
[release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
[cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
[release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
[release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
[release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
[Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)
[release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
[release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)
Tag Docker images with GitHub release tags (#2662 by @kramaranya)
feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
[chore] update stale action version to latest (#2642 by @mahdikhashan)
Remove TrainJobCreated condition (#2621 by @astefanutti)
ci: refactor build-push-images workflow (#2607 by @milinddethe15)
Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
ci: add k8s v1.32 for tests env (#2613 by @milinddethe15)
chore(deps): bump torch from 2.5.0 to 2.6.0 in /cmd/runtimes/deepspeed (#2606 by @dependabot[bot])
chore(deps): bump golang.org/x/net from 0.36.0 to 0.38.0 (#2602 by @dependabot[bot])
test(runtime): add UT for jobset runtime valid function. (#2562 by @Harshal292004)
Add Helm chart for kubeflow trainer (#2435 by @ChenYi015)
chore(test): Removed the no longer needed github-trigger-rerun-test.yaml (#2589 by @hbelmiro)
Add PodNetwork plugin to KEP-2170 Job Pipeline Framework description (#2578 by @tenzen-y)
chore(docs): Update Slack channel (#2569 by @andreyvelich)
docs: update CONTRIBUTING.md for Kubeflow Trainer V2 (#2561 by @muzzlol)
test(runtime): add UT for torch runtime valid function. (#2560 by @IRONICBo)
feat(doc): add Runtime API design in KEP-2401. (#2501 by @Electronic-Waste)
Update CONTRIBUTING.md (#2512 by @MuhammedgitAli)
feat: add replicatedJobs.replicas validations in validateReplicatedJobs function. (#2533 by @IRONICBo)
Construct Trainer based on trainer.kubeflow.org/trainjob-ancestor-step label (#2548 by @tenzen-y)
chore: Enable GCI for golangci-lint (#2540 by @tenzen-y)
[feature] merge GHCR and DockerHub CI jobs (#2537 by @ashwinr64)
feat(controller): Refactor the Initializer APIs of TrainJob (#2523 by @andreyvelich)
Migrate InfoOptions.podSpecReplias and info.Scheduler.TotalRequests to info.TemplateSpec.PodSet (#2524 by @tenzen-y)
[feature] pull images in manifest from ghcr (#2529 by @mahdikhashan)
[feature] migrate images to ghcr (#2455 by @mahdikhashan)
KEP-2170: Adding validation webhook for v2 trainjob (#2307 by @akshaychitneni)
Migrate Info.Trainer to Info.TemplateSpec.PodSet (#2520 by @tenzen-y)
Implement E2E for OpenMPI workload (#2500 by @tenzen-y)
Bump golang.org/x/net from 0.33.0 to 0.36.0 (#2514 by @dependabot[bot])
Move TrainJob marker defaulting and validation integration tests to test/integration/webhooks pkg (#2486 by @tenzen-y)
feat(controller): Integrate DependsOn API (#2484 by @andreyvelich)
Store E2E manifests to artifacts directory (#2478 by @tenzen-y)
Use large runner for building container image (#2475 by @tenzen-y)
chore(test): Upload artifacts from dir (#2473 by @andreyvelich)
Implement UTs for PlainML plugin (#2469 by @tenzen-y)
chore(test): Add E2E tests for Kubeflow Trainer (#2470 by @andreyvelich)
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design (#2439 by @tenzen-y)
Replace Kueue PodRequests helper with core k/k one (#2461 by @tenzen-y)
KEP-2170: Use SSA to reconcile TrainJob components (#2431 by @astefanutti)
Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2451 by @dependabot[bot])
Use the correct apiVersion name (#2444 by @runzhen)
Add 'KEP Usage' KEP and template link (#2423 by @anishasthana)
KEP-2170: Add validation to Torch numProcPerNode field (#2409 by @astefanutti)
update migration url on readme file (#2436 by @varodrig)
IntegraionTests: Waiting for expected conditions before emulate JobSet controller manager (#2425 by @tenzen-y)
Nominate @Electronic-Waste as a reviewer (#2427 by @andreyvelich)
Update the naming conventions for Kubeflow Trainer (#2415 by @andreyvelich)
Rename paddlepaddle_defaults.go file name (#2399 by @ChristianZaccaria)
Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2391 by @dependabot[bot])
KEP-2170: Add unit and Integration tests for model and dataset initializers (#2323 by @seanlaii)
Testing CI in JAX example (#2385 by @saileshd1402)
Upgrade huggingface_hub to v0.27.x in dataset initializer v2 (#2379 by @astefanutti)
Add Changelog for Training Operator v1.9.0-rc.0 (#2380 by @andreyvelich)
Add release branch to the image push trigger (#2376 by @andreyvelich)

Full Changelog

v2.0.0-rc.1 (2025-07-03)

New Features

[release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
[release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
[Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

[release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
[cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
[release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
[release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
[release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
[release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

Full Changelog

v2.0.0-rc.0 (2025-06-10)

Breaking Changes

KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP-2170: Adding CEL validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implement MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

Remove SDK (#2657 by @eoinfennessy)
feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint (#2489 by @szaher)
fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
Fix missing external types in apply configurations (#2429 by @astefanutti)
Fix API Group for Torch Runtime (#2424 by @andreyvelich)
Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
ControlPlane: Fix flaky integration tests due to missing the latest version of object (#2414 by @tenzen-y)

Misc

Tag Docker images with GitHub release tags (#2662 by @kramaranya)
feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
[chore] update stale action version to latest (#2642 by @mahdikhashan)
Remove TrainJobCreated condition (#2621 by @astefanutti)
ci: refactor build-push-images workflow (#2607 by @milinddethe15)
Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
ci: add k8s v1.32 for tests env (#2613 by @milinddethe15)
chore(deps): bump torch from 2.5.0 to 2.6.0 in /cmd/runtimes/deepspeed (#2606 by @dependabot[bot])
chore(deps): bump golang.org/x/net from 0.36.0 to 0.38.0 (#2602 by @dependabot[bot])
test(runtime): add UT for jobset runtime valid function. (#2562 by @Harshal292004)
Add Helm chart for kubeflow trainer (#2435 by @ChenYi015)
chore(test): Removed the no longer needed github-trigger-rerun-test.yaml (#2589 by @hbelmiro)
Add PodNetwork plugin to KEP-2170 Job Pipeline Framework description (#2578 by @tenzen-y)
chore(docs): Update Slack channel (#2569 by @andreyvelich)
docs: update CONTRIBUTING.md for Kubeflow Trainer V2 (#2561 by @muzzlol)
test(runtime): add UT for torch runtime valid function. (#2560 by @IRONICBo)
feat(doc): add Runtime API design in KEP-2401. (#2501 by @Electronic-Waste)
Update CONTRIBUTING.md (#2512 by @MuhammedgitAli)
feat: add replicatedJobs.replicas validations in validateReplicatedJobs function. (#2533 by @IRONICBo)
Construct Trainer based on trainer.kubeflow.org/trainjob-ancestor-step label (#2548 by @tenzen-y)
chore: Enable GCI for golangci-lint (#2540 by @tenzen-y)
[feature] merge GHCR and DockerHub CI jobs (#2537 by @ashwinr64)
feat(controller): Refactor the Initializer APIs of TrainJob (#2523 by @andreyvelich)
Migrate InfoOptions.podSpecReplias and info.Scheduler.TotalRequests to info.TemplateSpec.PodSet (#2524 by @tenzen-y)
[feature] pull images in manifest from ghcr (#2529 by @mahdikhashan)
[feature] migrate images to ghcr (#2455 by @mahdikhashan)
KEP-2170: Adding validation webhook for v2 trainjob (#2307 by @akshaychitneni)
Migrate Info.Trainer to Info.TemplateSpec.PodSet (#2520 by @tenzen-y)
Implement E2E for OpenMPI workload (#2500 by @tenzen-y)
Bump golang.org/x/net from 0.33.0 to 0.36.0 (#2514 by @dependabot[bot])
Move TrainJob marker defaulting and validation integration tests to test/integration/webhooks pkg (#2486 by @tenzen-y)
feat(controller): Integrate DependsOn API (#2484 by @andreyvelich)
Store E2E manifests to artifacts directory (#2478 by @tenzen-y)
Use large runner for building container image (#2475 by @tenzen-y)
chore(test): Upload artifacts from dir (#2473 by @andreyvelich)
Implement UTs for PlainML plugin (#2469 by @tenzen-y)
chore(test): Add E2E tests for Kubeflow Trainer (#2470 by @andreyvelich)
KEP-2170: Add Kubeflow Trainer Pipeline Framework Design (#2439 by @tenzen-y)
Replace Kueue PodRequests helper with core k/k one (#2461 by @tenzen-y)
KEP-2170: Use SSA to reconcile TrainJob components (#2431 by @astefanutti)
Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2451 by @dependabot[bot])
Use the correct apiVersion name (#2444 by @runzhen)
Add 'KEP Usage' KEP and template link (#2423 by @anishasthana)
KEP-2170: Add validation to Torch numProcPerNode field (#2409 by @astefanutti)
update migration url on readme file (#2436 by @varodrig)
IntegrationTests: Waiting for expected conditions before emulating JobSet controller manager (#2425 by @tenzen-y)
Nominate @Electronic-Waste as a reviewer (#2427 by @andreyvelich)
Update the naming conventions for Kubeflow Trainer (#2415 by @andreyvelich)
Rename paddlepaddle_defaults.go file name (#2399 by @ChristianZaccaria)
Bump golang.org/x/net from 0.30.0 to 0.33.0 (#2391 by @dependabot[bot])
KEP-2170: Add unit and Integration tests for model and dataset initializers (#2323 by @seanlaii)
Testing CI in JAX example (#2385 by @saileshd1402)
Upgrade huggingface_hub to v0.27.x in dataset initializer v2 (#2379 by @astefanutti)
Add Changelog for Training Operator v1.9.0-rc.0 (#2380 by @andreyvelich)
Add release branch to the image push trigger (#2376 by @andreyvelich)

Full Changelog

v1.9.0 (2025-01-21)

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration tests (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @google-oss-robot)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @google-oss-robot)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
Remove zw0610 from approvers (#2343 by @zw0610)
Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
Refine the server-side apply installation args (#2337 by @tenzen-y)
Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
Update tf job examples to tf v2 (#2270 by @YosiElias)
Remove Prometheus Monitoring doc (#2301 by @sophie0730)
Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
[SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
[SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
[SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
[SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
[SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
[SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
[SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
[SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
Update README and out-of-date docs (#2252 by @andreyvelich)
Clean up Go modules (#2238 by @tenzen-y)
Change isort profile to black for full compatibility (#2234 by @Ygnas)
Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
Implement pre-commit hooks (#2184 by @droctothorpe)
Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
Update JAX integration proposal (#2165 by @sandipanpanda)
Update release document (#2153 by @andreyvelich)
update volcano to v1.9.0 (#2148 by @lowang-bh)
Update Slack Invitation (#2142 by @andreyvelich)
Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
Add GitHub Issue Template (#2129 by @andreyvelich)
Update the images to the latest tag in master branch (#2128 by @johnugeorge)
Updated Github Action Workflows as per issue #2117 (#2123 by @hkiiita)
changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
chore(fix): isort xgboost (#2098 by @harshithbelagur)
Fix isort on examples/pytorch (#2094 by @marcmaliar)

Full Changelog

v1.9.0-rc.0 (2025-01-07)

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration tests (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @google-oss-robot)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @google-oss-robot)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
Remove zw0610 from approvers (#2343 by @zw0610)
Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
Refine the server-side apply installation args (#2337 by @tenzen-y)
Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
Update tf job examples to tf v2 (#2270 by @YosiElias)
Remove Prometheus Monitoring doc (#2301 by @sophie0730)
Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
[SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
[SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
[SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
[SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
[SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
[SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
[SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
[SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
Update README and out-of-date docs (#2252 by @andreyvelich)
Clean up Go modules (#2238 by @tenzen-y)
Change isort profile to black for full compatibility (#2234 by @Ygnas)
Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
Implement pre-commit hooks (#2184 by @droctothorpe)
Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
Update JAX integration proposal (#2165 by @sandipanpanda)
Update release document (#2153 by @andreyvelich)
update volcano to v1.9.0 (#2148 by @lowang-bh)
Update Slack Invitation (#2142 by @andreyvelich)
Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
Add GitHub Issue Template (#2129 by @andreyvelich)
Update the images to the latest tag in master branch (#2128 by @johnugeorge)
Updated Github Action Workflows as per issue #2117 (#2123 by @hkiiita)
changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
chore(fix): isort xgboost (#2098 by @harshithbelagur)
Fix isort on examples/pytorch (#2094 by @marcmaliar)

Full Changelog

v1.8.1 (2024-09-10)

Bug Fixes

[Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

Full Changelog

v1.8.0 (2024-07-15)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)

New Features

LLM Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)

Bug Fixes

[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)
Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) to get the prope… (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
Avoid modifying log level globally (#1944 by @droctothorpe)
Add @andreyvelich to Approvers (#1941 by @andreyvelich)
Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
Build MXJob examples in CI (#1927 by @tenzen-y)
Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
Build XGBoostJob example images in CI (#1913 by @tenzen-y)
Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
Adding Yuki to Approvers (#1901 by @johnugeorge)
docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
Add Training WG Community Call (#1900 by @andreyvelich)
update full change list in changelog (#1895 by @lowang-bh)
update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
Add Stale GitHub Action (#1893 by @andreyvelich)
Refactor core/pod tests (#1890 by @tenzen-y)
Remove klog v1 (#1886 by @tenzen-y)

Full Changelog

v1.8.0-rc.1 (2024-06-25)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

Bug Fixes

[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)

Full Changelog

v1.8.0-rc.0 (2024-04-28)

Breaking Changes

Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)

New Features

LLM Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)

Bug Fixes

Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) to get the prope… (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)

Misc

Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elastic mode for pytorchjob (#1948 by @kuizhiqing)
Avoid modifying log level globally (#1944 by @droctothorpe)
Add @andreyvelich to Approvers (#1941 by @andreyvelich)
Merge v1.7 branch changes to Main (#1940 by @johnugeorge)
Increase the root volume size on the github runner when building container images (#1931 by @tenzen-y)
Check podGroup CRD for the volcano and the scheduler-plugins as default. (#1929 by @Syulin7)
Use a community hosted image in MXJob E2E (#1928 by @tenzen-y)
Build MXJob examples in CI (#1927 by @tenzen-y)
Bump k8s.io/* deps to 1.28 (#1920 by @afritzler)
Replace XGBoost image for E2E with community hosted (#1922 by @tenzen-y)
Creating service account where appropriate for MPI Job (#1917 by @rpemsel)
Build XGBoostJob example images in CI (#1913 by @tenzen-y)
Manage kube-delivery image from training-operator and update it (#1909 by @rpemsel)
Adding Yuki to Approvers (#1901 by @johnugeorge)
docs: Remove reference to tf-operator specific design doc (#1903 by @terrytangyuan)
Add Training WG Community Call (#1900 by @andreyvelich)
update full change list in changelog (#1895 by @lowang-bh)
update volcano scheduler to 1.8.0 (#1894 by @lowang-bh)
Changelog updated for 1.7.0 rc0 release (#1892 by @johnugeorge)
Add Stale GitHub Action (#1893 by @andreyvelich)
Refactor core/pod tests (#1890 by @tenzen-y)
Remove klog v1 (#1886 by @tenzen-y)

Full Changelog

v1.7.0-rc.0 (2023-07-07)

Full Changelog

Breaking Changes

Upgrade Scheduler Plugins version to v0.25.7 https://github.com/kubeflow/training-operator/pull/1824 (tenzen-y)
Upgrade the kubernetes dependencies to v1.27 https://github.com/kubeflow/training-operator/pull/1834 (tenzen-y)

New Features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug Fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid depending on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

v1.6.0 (2023-03-21)

Note: Since scheduler-plugins has changed its API group from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1769

Note: The latest Python SDK (v1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702

Full Changelog

New Features

Support for k8s v1.25 in CI #1684 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
Adopting coscheduling plugin #1724 (tenzen-y)
Support for Paddlepaddle #1675 (kuizhiqing)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
[SDK] Create Unify Training Client #1719 (andreyvelich)

Bug Fixes

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Fix XGBoost conditions bug #1737 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode does not work #1562 (cheimu)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Fix status lost #1697 (ggaaooppeenngg)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)

Misc

Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
Fix Python installation in CI #1759 (tenzen-y)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Update join Slack link #1750 (Syulin7)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
Add Yuki to reviewer group #1739 (johnugeorge)
Trim down CRD descriptions #1735 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Removing deprecated Job Labels #1702 (johnugeorge)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Update deployment.yaml #1668 (OmriShiv)
Upgrade Go version to v1.19 #1663 (tenzen-y)
Upgrade kubernetes version for test #1667 (tenzen-y)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
Update the cmd to support MPI operator in README #1656 (denkensk)

Closed issues

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API(master) is not compatible with old job #1725
Support coscheduling plugin #1722
Number of worker threads used by the controller can't be configured #1706
Conformance: Training tests #1698
PyTorch and MPI Operator pulls hardcoded initContainer #1696
PaddlePaddle Training: why can't find pods #1694
Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
[SDK] Create unify client for all Training Job types #1691
Support Kubernetes v1.25 #1682
panic happened when add podgroup watch #1679
OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
Change Kubernetes version for test #1665
Support for multiplatform container image (amd64 and arm64) #1664
Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
After setting hostNetwork to true, mpi does not work #1657
What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
When will MPIJob support v2beta1 version? #1653
Kubernetes HPA doesn't work with elastic PytorchJob #1645
training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
Training operator fails to create HPA for TorchElastic jobs #1626
Release v1.5.0 tracking #1622
upgrade client-go #1599
training-operator may need to monitor PodGroup #1574
Error: invalid memory address or nil pointer dereference #1553
The pytorchJob training is slow #1532
pytorch elastic scheduler error #1504

v1.4.0-rc.0 (2022-01-26)

Full Changelog

Features and Improvements

Display coverage % in GitHub actions list #1442
Add Go test to CI #1436

Fixed Bugs

[bug] Missing init container in PyTorchJob #1482
Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381

Closed Issues

Restore KUBEFLOW_NAMESPACE options #1522
Improve test coverage #1497
swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
PytorchJob DDP training will stop if I delete a worker pod #1478
Write down e2e failure debug process #1467
How can I add the Priorityclass to the TFjob? #1466
github.com/go-logr/zapr.(*zapLogger).Error #1444
Podgroup is constantly created and deleted after tfjob is success or failure #1426
Cut official release of 1.3.0 #1425
Add "not maintained" notice to other operator repos #1423
Python SDK for Kubeflow Training Operator #1380

Merged Pull Requests

Update manifests with latest image tag #1527 (johnugeorge)
add option for mpi kubectl delivery #1525 (zw0610)
restore option namespace in launch arguments #1524 (zw0610)
remove unused scripts #1521 (zw0610)
remove ChanYiLin from approvers #1513 (ChanYiLin)
add StacktraceLevel for zapr #1512 (qiankunli)
add unit tests for tensorflow controller #1511 (zw0610)
add the example of MPIJob #1508 (hackerboy01)
Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
Fix a typo in mpi controller log #1495 (LuBingtan)
feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
chore: Fix GitHub Actions script #1491 (tenzen-y)
chore: Fix misspelling in tfjob #1490 (tenzen-y)
chore: Update OWNERS #1489 (gaocegege)
Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
fix comments for mpi-controller #1485 (hackerboy01)
add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
Add MPI job to README now that it's supported #1480 (terrytangyuan)
add mpi doc #1477 (zw0610)
Set Go version of base image to 1.17 #1476 (tenzen-y)
update label for tf-controller #1474 (zw0610)
Add Akuity to the list of adopters #1473 (terrytangyuan)
Add PR template with doc checklist #1470 (andreyvelich)
Add e2e failure debugging guidance #1469 (Jeffwan)
chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
Minor edits to README.md #1460 (terrytangyuan)
add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
feat(pytorch): Support elastic training #1453 (gaocegege)
fix: generate printer columns for job crds #1451 (henrysecond1)
Fix README typo #1450 (davidxia)
consistent naming for better readability #1449 (pramodrj07)
Fix set scheduler error #1448 (qiankunli)
Add CI to run the tests for Go #1440 (tenzen-y)
fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
Generate a single swagger.json file for all frameworks #1437 (alembiewski)
Update links and files with the new URL #1434 (andreyvelich)
chore: update CHANGELOG.md #1432 (Jeffwan)
Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
Add Cisco to Adopters List #1421 (andreyvelich)
Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
docs: Move myself to approvers #1419 (terrytangyuan)
fix hyperlinks in the 'overview' section #1418 (pramodrj07)
docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
Update scripts to generate sdk for all frameworks #1389 (Jeffwan)