[SDK] Add resources per worker for Create Job API#1990
google-oss-prow[bot] merged 8 commits into kubeflow:master
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed aef4735 to 64039fc (compare).
Pull Request Test Coverage Report for Build 7571809122
💛 - Coveralls
/hold cancel
```python
# PyTorchJob constants
PYTORCHJOB_KIND = "PyTorchJob"
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"
```
@johnugeorge What do you mean by override here?
I just made it a string type rather than an object to reduce the number of typing errors in Pylance.
sdk/python/test/e2e/utils.py (Outdated)

```diff
 conditions = client.get_job_conditions(job=job)
-if len(conditions) != 3:
+# If Job is complete fast, it has 2 conditions: Created and Succeeded.
+if len(conditions) != 3 and len(conditions) != 2:
```
This looks a bit odd. Can we clean up this check?
@johnugeorge Do you want to remove this check?
I noticed that PyTorchJob has just 2 conditions (e.g. Created and Succeeded) if you run it with an image that executes very fast, e.g. docker.io/hello-world.
Isn't that a bug which needs to be resolved separately?
Probably. @tenzen-y @kuizhiqing What are your thoughts here?
The main problem is that when the reconciliation loop starts, the training Pod has already Succeeded, so we never add the Running status to the PyTorchJob.
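The race described above can be captured in a tiny sketch of the relaxed e2e check; the helper name is hypothetical and only illustrates the two valid condition sequences:

```python
# Hedged sketch: a fast-finishing PyTorchJob may skip the Running condition
# entirely (Pod Succeeded before the first reconcile), so the test should
# accept both condition sequences. Helper name is hypothetical.
def conditions_look_valid(condition_types):
    full_path = ["Created", "Running", "Succeeded"]
    fast_path = ["Created", "Succeeded"]  # Pod finished before first reconcile
    return condition_types in (full_path, fast_path)

print(conditions_look_valid(["Created", "Succeeded"]))             # True
print(conditions_look_valid(["Created", "Running", "Succeeded"]))  # True
print(conditions_look_valid(["Created"]))                          # False
```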
```diff
     storage_config: Dict[str, Optional[str]],
 ):
-    if pvc_name is None or namespace is None or storage_size is None:
+    if pvc_name is None or namespace is None or "size" not in storage_config is None:
```
Last condition needs correction.
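As a side note on why the flagged condition misbehaves: in Python, `"size" not in storage_config is None` parses as a chained comparison, i.e. `("size" not in storage_config) and (storage_config is None)`, which is effectively always False when a dict is passed. A minimal sketch; the corrected check shown is an assumption for illustration, not necessarily the PR's final fix:

```python
storage_config = {"name": "test-pvc"}  # example dict without a "size" key

# Chained comparison: equivalent to
# ("size" not in storage_config) and (storage_config is None)
buggy = "size" not in storage_config is None
# One possible correction: treat a missing or unset "size" entry as an error.
fixed = storage_config.get("size") is None

print(buggy)  # False, even though "size" is missing
print(fixed)  # True
```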
```python
namespace = namespace or self.namespace
```

```python
if isinstance(resources_per_worker, dict):
```
@andreyvelich How are these validations stopping the user from running the training on CPUs?
I was wrong. It isn't stopping the user from running the train API on CPUs, but we validate that cpu and memory are set in the resources_per_worker parameter, which might not be required.
E.g. the user can specify only the number of GPUs in resources_per_worker.
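A minimal sketch of the kind of relaxed validation being discussed; the helper name and behavior are assumptions for illustration, not the SDK's actual code:

```python
def normalize_resources_per_worker(resources):
    """Hypothetical helper: accept None, an empty dict, or a dict with any
    subset of resource keys (cpu/memory optional, GPU-only allowed)."""
    if resources is None:
        return {}
    if not isinstance(resources, dict):
        raise ValueError("resources_per_worker must be a dict")
    # Stringify values so both int and str quantities are accepted.
    return {key: str(value) for key, value in resources.items()}

print(normalize_resources_per_worker({"nvidia.com/gpu": 1}))  # {'nvidia.com/gpu': '1'}
print(normalize_resources_per_worker({}))                     # {}
```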
@andreyvelich Then shall we change the default value of resources_per_worker? It is None currently; what if the user passes an empty dict?
@deepanker13 It will be fine if the user passes an empty dict, since Kubernetes will assign the resources automatically.
E.g. this works for me:

```python
TrainingClient().create_job(
    resources_per_worker={},
    name="test-empty",
    num_workers=1,
    base_image="docker.io/hello-world",
)
```

As I said, if we find that we need additional validation in the future, we can always add it in separate PRs.
@deepanker13 Can you complete the review?
@johnugeorge I've made the changes for the condition check in the test that we discussed.
```python
spec=models.KubeflowOrgV1PyTorchJobSpec(
    run_policy=models.KubeflowOrgV1RunPolicy(clean_pod_policy=None),
    pytorch_replica_specs={},
    elastic_policy=elastic_policy,
```
Should we check that elastic_policy is not an empty dict? Otherwise, default env variables will get appended:
https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L115
Actually, if elastic_policy is None, Python doesn't assign a value to the PyTorchJob spec, and we don't set default values.
If the user accidentally sets elastic_policy={}, our controller will fail with an invalid spec error:

```
E0118 15:47:43.737955 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 398 [running]:
```
As you can see, elastic_policy has the type KubeflowOrgV1ElasticPolicy, so the user should set an appropriate instance value, similar to other parameters (e.g. worker_pod_template_spec).
Right now, we don't even use elastic_policy in our public APIs: https://github.com/kubeflow/training-operator/blob/174b050cce88342b29bf3e098a67e5afa9d3fb9a/sdk/python/kubeflow/training/api/training_client.py#L237.
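A minimal sketch of the None-vs-empty behavior described in this thread; the helper is hypothetical and only illustrates why omitting the field for falsy values keeps a partially-filled policy from reaching the controller:

```python
def resolve_elastic_policy(elastic_policy):
    """Hypothetical guard: treat None and other falsy values (e.g. {}) as
    'not set', so the elastic_policy field is simply omitted from the
    PyTorchJob spec rather than sent as an empty policy."""
    if not elastic_policy:
        return None
    return elastic_policy

print(resolve_elastic_policy(None))  # None: field omitted from the spec
print(resolve_elastic_policy({}))    # None: empty dict treated as unset
```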
/lgtm
@deepanker13: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
* [SDK] Add resources for create Job API
* Fix unbound var
* Assign values in get pod template
* Add torchrun issue
* Test to create PyTorchJob from Image
* Fix e2e to create from image
* Fix condition
* Modify check test conditions
Blocked by: #1988.
/hold
I added the `resources_per_worker` parameter to the `create_job` API. Also, this has some refactoring for our SDK utils functions:

* No validation in the `train` API for resources per worker; let's add it in the future if that is required. We might have users who want to do fine-tuning with the `train` API on CPUs.
* `get_pod_template_spec` to return the Pod template spec, `get_container_spec` to return the Container spec, `get_command_using_train_func` to return args and command for the train function.

Please take a look.
/assign @deepanker13 @johnugeorge @tenzen-y @droctothorpe @kuizhiqing
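The refactored helpers named in the description could compose roughly like this; the signatures and dict-based return values are assumptions for illustration (the real SDK builds Kubernetes client model objects):

```python
# Hypothetical, simplified versions of the refactored SDK helpers; plain
# dicts stand in for the Kubernetes client model classes.
def get_command_using_train_func(train_func_name):
    # Command and args used to run the training function inside the container.
    command = ["bash", "-c"]
    args = [f"python -c 'from train import {train_func_name}; {train_func_name}()'"]
    return command, args

def get_container_spec(name, image, command=None, args=None, resources=None):
    # Container spec for a single training worker.
    return {"name": name, "image": image, "command": command,
            "args": args, "resources": resources or {}}

def get_pod_template_spec(containers):
    # Pod template spec wrapping the worker containers.
    return {"spec": {"containers": containers}}

command, args = get_command_using_train_func("train")
container = get_container_spec("pytorch", "docker.io/hello-world", command, args)
template = get_pod_template_spec([container])
print(template["spec"]["containers"][0]["name"])  # pytorch
```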