Changelog

v2.1.0 (2025-11-07)

This is the Kubeflow Trainer v2.1.0 release.

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"

You can now install the controller manager with Helm charts 🚀

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

  • feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
  • feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
  • chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
  • chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
  • Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

LLM Post-Training

Kueue Enhancements

Volcano Scheduler

  • feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
  • feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

  • feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
  • feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
  • feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
  • feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
  • feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
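The RFC 1035 validation entry above refers to the DNS label rule that Kubernetes applies to names that end up in Service hostnames. The changelog does not show the operator's actual check, so the following is only a minimal sketch of the general Kubernetes DNS-1035 label rule; the function name `is_dns1035_label` is hypothetical and is not part of the Kubeflow Trainer API.

```python
import re

# Kubernetes' DNS-1035 label rule: lowercase alphanumerics or '-',
# must start with a letter, must end with an alphanumeric character,
# and be at most 63 characters long.
DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if `name` is a valid DNS-1035 label (hypothetical helper)."""
    if len(name) > DNS1035_MAX_LENGTH:
        return False
    # fullmatch rejects the empty string and any trailing '-'.
    return DNS1035_LABEL.fullmatch(name) is not None
```

Under this rule a name such as `pytorch-job-1` passes, while `1job` (leading digit), `Job` (uppercase), and `job-` (trailing hyphen) are rejected.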

Bug Fixes

Misc

Full Changelog

v2.1.0-rc.1 (2025-11-03)

New Features

  • feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
  • [release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
  • [release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)

Bug Fixes

  • [release-2.1] fix(manifests): Fix boolean values defaulting in Helm charts (#2914 by @astefanutti)
  • [release-2.1] fix(runtimes): Update pip version in the MLX runtime (#2910 by @andreyvelich)

Full Changelog

v2.1.0-rc.0 (2025-10-21)

Breaking Changes

  • feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
  • feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
  • chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
  • chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
  • Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

LLM Post-Training

Kueue Enhancements

Volcano Scheduler

  • feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
  • feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

  • feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
  • feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
  • feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
  • feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
  • feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)

Bug Fixes

Misc

Full Changelog

v2.0.1 (2025-09-29)

New Features

  • [release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

  • [release-2.0] fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2863 by @andreyvelich)
  • [release-2.0] fix(ci): Add latest image tag only for the master branch (#2862 by @andreyvelich)
  • [release-2.0] fix: update examples to reflect func_args now being unpacked (#2815) (#2853 by @astefanutti)
  • [release-2.0] fix(examples): Update get_job_logs() API in examples (#2813) (#2852 by @astefanutti)
  • [release-2.0] feat(runtimes): Add Framework Label to the Runtimes (#2761) (#2851 by @astefanutti)
  • [release-2.0] fix(examples): Update the argument for Runtime framework (#2766) (#2850 by @astefanutti)
  • [release-2.0] fix: update kubeflow sdk reference (#2780) (#2847 by @astefanutti)
  • [release-2.0] fix(api): Fix license path for Kubeflow Trainer Python API (#2772 by @andreyvelich)

Full Changelog

v2.0.0 (2025-07-17)

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

Misc

Full Changelog

v2.0.0-rc.1 (2025-07-03)

New Features

  • [release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
  • [release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
  • [Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

  • [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
  • [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
  • [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
  • [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

  • [release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
  • [cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
  • [release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
  • [release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
  • [release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
  • [release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

Full Changelog

v2.0.0-rc.0 (2025-06-10)

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

Misc

Full Changelog

v1.9.0 (2025-01-21)

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Training V2

Bug Fixes

Misc

Full Changelog

v1.9.0-rc.0 (2025-01-07)

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Training V2

Bug Fixes

Misc

Full Changelog

v1.8.1 (2024-09-10)

Bug Fixes

  • [Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
  • [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
  • Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

Full Changelog

v1.8.0 (2024-07-15)

Breaking Changes

New Features

LLM Fine-Tuning API

Control Plane Updates

SDK Improvements

Bug Fixes

Misc

Full Changelog

v1.8.0-rc.1 (2024-06-25)

Breaking Changes

Bug Fixes

Misc

  • Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)

Full Changelog

v1.8.0-rc.0 (2024-04-28)

Breaking Changes

New Features

LLM Fine-Tuning API

Control Plane Updates

SDK Improvements

Bug Fixes

Misc

Full Changelog

v1.7.0-rc.0 (2023-07-07)

Full Changelog

Breaking Changes

New Features

Bug Fixes

  • Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

v1.6.0 (2023-03-21)

Note: Because scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1769

Note: The latest Python SDK 1.6 version does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702

Full Changelog

New Features

Bug Fixes

Misc

Closed issues

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job #1725
  • Support coscheduling plugin #1722
  • Number of worker threads used by the controller can't be configured #1706
  • Conformance: Training tests #1698
  • PyTorch and MPI Operator pulls hardcoded initContainer #1696
  • PaddlePaddle Training: why can't find pods #1694
  • Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
  • [SDK] Create unify client for all Training Job types #1691
  • Support Kubernetes v1.25 #1682
  • panic happened when add podgroup watch #1679
  • OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
  • There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
  • Change Kubernetes version for test #1665
  • Support for multiplatform container imege (amd64 and arm64) #1664
  • Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
  • After setting hostNetwork to true, mpi does not work #1657
  • What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
  • When will MPIJob support v2beta1 version? #1653
  • Kubernetes HPA doesn't work with elastic PytorchJob #1645
  • training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
  • Training operator fails to create HPA for TorchElastic jobs #1626
  • Release v1.5.0 tracking #1622
  • upgrade client-go #1599
  • trainning-operator may need to monitor PodGroup #1574
  • Error: invalid memory address or nil pointer dereference #1553
  • The pytorchJob training is slow #1532
  • pytorch elastic scheduler error #1504

v1.4.0-rc.0 (2022-01-26)

Full Changelog

Features and Improvements

  • Display coverage % in GitHub actions list #1442
  • Add Go test to CI #1436

Fixed Bugs

  • [bug] Missing init container in PyTorchJob #1482
  • Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381

Closed Issues

  • Restore KUBEFLOW_NAMESPACE options #1522
  • Improve test coverage #1497
  • swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
  • PytorchJob DDP training will stop if I delete a worker pod #1478
  • Write down e2e failure debug process #1467
  • How can i add the Priorityclass to the TFjob? #1466
  • github.com/go-logr/zapr.(*zapLogger).Error #1444
  • Podgroup is constantly created and deleted after tfjob is success or failure #1426
  • Cut official release of 1.3.0 #1425
  • Add "not maintained" notice to other operator repos #1423
  • Python SDK for Kubeflow Training Operator #1380

Merged Pull Requests

v1.3.0 (2021-10-03)

Full Changelog

Fixed Bugs

  • Unable to specify pod template metadata for TFJob #1403

v1.3.0-rc.2 (2021-09-21)

Full Changelog

Fixed Bugs

  • Missing Pod label for Service selector #1399

v1.3.0-rc.1 (2021-09-15)

Full Changelog

Fixed Bugs

  • [bug] Reconcilation fails when upgrading common to 0.3.6 #1394

Merged Pull Requests

v1.3.0-rc.0 (2021-08-31)

Full Changelog

Merged Pull Requests

v1.3.0-alpha.3 (2021-08-29)

Full Changelog

Closed Issues

  • Update guidance to install all-in-one operator in README.md #1386

Merged Pull Requests

v1.2.1 (2021-08-27)

Full Changelog

v1.3.0-alpha.2 (2021-08-15)

Full Changelog

v1.3.0-alpha.1 (2021-08-13)

Full Changelog

v1.2.0 (2021-08-03)

Full Changelog

v1.1.0 (2021-03-20)

Full Changelog

v1.0.1-rc.5 (2021-02-09)

Full Changelog

v1.0.1-rc.4 (2021-02-04)

Full Changelog

v1.0.1-rc.3 (2021-01-27)

Full Changelog

v1.0.1-rc.2 (2021-01-27)

Full Changelog

v1.0.1-rc.1 (2021-01-18)

Full Changelog

v1.0.1-rc.0 (2020-12-22)

Full Changelog

v1.0.0-rc.0 (2019-06-24)

Full Changelog

v0.5.3 (2019-06-03)

Full Changelog

v0.5.2 (2019-05-23)

Full Changelog

v0.5.1 (2019-05-15)

Full Changelog

v0.5.0 (2019-03-26)

Full Changelog

v0.4.0 (2019-02-13)

Full Changelog

v0.4.0-rc.1 (2018-11-28)

Full Changelog

v0.4.0-rc.0 (2018-11-19)

Full Changelog

v0.3.0 (2018-09-22)

Full Changelog

v0.2.0-rc1 (2018-06-21)

Full Changelog

v0.1.0 (2018-03-29)

Full Changelog