[SDK] Add resources per worker for Create Job API#1990
google-oss-prow[bot] merged 8 commits into kubeflow:master
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed aef4735 to 64039fc (compare).
Pull Request Test Coverage Report for Build 7571809122
💛 - Coveralls
/hold cancel
```python
# PyTorchJob constants
PYTORCHJOB_KIND = "PyTorchJob"
PYTORCHJOB_MODEL = "KubeflowOrgV1PyTorchJob"
```
@johnugeorge What do you mean by override here?
I just made it a string type rather than an object to reduce the number of typing errors in Pylance.
sdk/python/test/e2e/utils.py (Outdated)

```diff
 conditions = client.get_job_conditions(job=job)
-if len(conditions) != 3:
+# If Job is complete fast, it has 2 conditions: Created and Succeeded.
+if len(conditions) != 3 and len(conditions) != 2:
```
This looks a bit odd. Can we clean up this check?
@johnugeorge Do you want to remove this check?
I noticed that PyTorchJob has just 2 conditions (e.g. Created and Succeeded) if you run it with an image that executes very fast, e.g. docker.io/hello-world.
Isn't that a bug which needs to be resolved separately?
Probably. @tenzen-y @kuizhiqing What are your thoughts here?
The main problem is that when the reconciliation loop starts, the training Pod has already Succeeded, so we never add the Running status to the PyTorchJob.
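The race described above can be captured in a tiny sketch of the relaxed e2e check; the helper name is hypothetical and only illustrates the two valid condition sequences:

```python
# Hedged sketch: a fast-finishing PyTorchJob may skip the Running condition
# entirely (Pod Succeeded before the first reconcile), so the test should
# accept both condition sequences. Helper name is hypothetical.
def conditions_look_valid(condition_types):
    full_path = ["Created", "Running", "Succeeded"]
    fast_path = ["Created", "Succeeded"]  # Pod finished before first reconcile
    return condition_types in (full_path, fast_path)

print(conditions_look_valid(["Created", "Succeeded"]))             # True
print(conditions_look_valid(["Created", "Running", "Succeeded"]))  # True
print(conditions_look_valid(["Created"]))                          # False
```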
```diff
     storage_config: Dict[str, Optional[str]],
 ):
-    if pvc_name is None or namespace is None or storage_size is None:
+    if pvc_name is None or namespace is None or "size" not in storage_config is None:
```
Last condition needs correction.
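As a side note on why the flagged condition misbehaves: in Python, `"size" not in storage_config is None` parses as a chained comparison, i.e. `("size" not in storage_config) and (storage_config is None)`, which is effectively always False when a dict is passed. A minimal sketch; the corrected check shown is an assumption for illustration, not necessarily the PR's final fix:

```python
storage_config = {"name": "test-pvc"}  # example dict without a "size" key

# Chained comparison: equivalent to
# ("size" not in storage_config) and (storage_config is None)
buggy = "size" not in storage_config is None
# One possible correction: treat a missing or unset "size" entry as an error.
fixed = storage_config.get("size") is None

print(buggy)  # False, even though "size" is missing
print(fixed)  # True
```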
```python
namespace = namespace or self.namespace
```

```python
if isinstance(resources_per_worker, dict):
```
@andreyvelich How are these validations stopping the user from running the training on CPUs?
I was wrong. It isn't stopping the user from running the train API on CPUs, but we validate that cpu and memory are set in the resources_per_worker parameter, which might not be required.
E.g. the user can specify only the number of GPUs in resources_per_worker.
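A minimal sketch of the kind of relaxed validation being discussed; the helper name and behavior are assumptions for illustration, not the SDK's actual code:

```python
def normalize_resources_per_worker(resources):
    """Hypothetical helper: accept None, an empty dict, or a dict with any
    subset of resource keys (cpu/memory optional, GPU-only allowed)."""
    if resources is None:
        return {}
    if not isinstance(resources, dict):
        raise ValueError("resources_per_worker must be a dict")
    # Stringify values so both int and str quantities are accepted.
    return {key: str(value) for key, value in resources.items()}

print(normalize_resources_per_worker({"nvidia.com/gpu": 1}))  # {'nvidia.com/gpu': '1'}
print(normalize_resources_per_worker({}))                     # {}
```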
@andreyvelich Then shall we change the default value of resources_per_worker? It is None currently; what if the user passes an empty dict?
@deepanker13 It will be fine if the user passes an empty dict, since Kubernetes will assign the resources automatically.
E.g. this works for me:

```python
TrainingClient().create_job(
    resources_per_worker={},
    name="test-empty",
    num_workers=1,
    base_image="docker.io/hello-world",
)
```

As I said, if we find that we need additional validation in the future, we can always add it in separate PRs.
@deepanker13 Can you complete the review?
@johnugeorge I've made the changes for the condition check in the test that we discussed.
```python
spec=models.KubeflowOrgV1PyTorchJobSpec(
    run_policy=models.KubeflowOrgV1RunPolicy(clean_pod_policy=None),
    pytorch_replica_specs={},
    elastic_policy=elastic_policy,
```
Should we check that elastic_policy is not an empty dict? Otherwise, default env variables will get appended:
https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L115
Actually, if elastic_policy is None, Python doesn't assign a value to the PyTorchJob spec, and we don't set default values.
If the user accidentally sets elastic_policy={}, our controller will fail with an invalid spec error:

```
E0118 15:47:43.737955 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 398 [running]:
```
As you can see, elastic_policy has the type KubeflowOrgV1ElasticPolicy, so the user should set an appropriate instance value, similar to other parameters (e.g. worker_pod_template_spec).
Right now, we don't even use elastic_policy in our public APIs: https://github.com/kubeflow/training-operator/blob/174b050cce88342b29bf3e098a67e5afa9d3fb9a/sdk/python/kubeflow/training/api/training_client.py#L237.
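A minimal sketch of the None-vs-empty behavior described in this thread; the helper is hypothetical and only illustrates why omitting the field for falsy values keeps a partially-filled policy from reaching the controller:

```python
def resolve_elastic_policy(elastic_policy):
    """Hypothetical guard: treat None and other falsy values (e.g. {}) as
    'not set', so the elastic_policy field is simply omitted from the
    PyTorchJob spec rather than sent as an empty policy."""
    if not elastic_policy:
        return None
    return elastic_policy

print(resolve_elastic_policy(None))  # None: field omitted from the spec
print(resolve_elastic_policy({}))    # None: empty dict treated as unset
```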
/lgtm
@deepanker13: changing LGTM is restricted to collaborators.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
* [SDK] Add resources for create Job API
* Fix unbound var
* Assign values in get pod template
* Add torchrun issue
* Test to create PyTorchJob from Image
* Fix e2e to create from image
* Fix condition
* Modify check test conditions
Blocked by: #1988.
/hold
I added the `resources_per_worker` parameter to the `create_job` API. Also, this has some refactoring for our SDK utils functions:

* No validation in the `train` API for resources per worker; let's add it in the future if that is required. We might have users who want to do fine-tuning with the `train` API on CPUs.
* `get_pod_template_spec` to return the Pod template spec, `get_container_spec` to return the Container spec, `get_command_using_train_func` to return args and command for the train function.

Please take a look.
/assign @deepanker13 @johnugeorge @tenzen-y @droctothorpe @kuizhiqing
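The refactored helpers named in the description could compose roughly like this; the signatures and dict-based return values are assumptions for illustration (the real SDK builds Kubernetes client model objects):

```python
# Hypothetical, simplified versions of the refactored SDK helpers; plain
# dicts stand in for the Kubernetes client model classes.
def get_command_using_train_func(train_func_name):
    # Command and args used to run the training function inside the container.
    command = ["bash", "-c"]
    args = [f"python -c 'from train import {train_func_name}; {train_func_name}()'"]
    return command, args

def get_container_spec(name, image, command=None, args=None, resources=None):
    # Container spec for a single training worker.
    return {"name": name, "image": image, "command": command,
            "args": args, "resources": resources or {}}

def get_pod_template_spec(containers):
    # Pod template spec wrapping the worker containers.
    return {"spec": {"containers": containers}}

command, args = get_command_using_train_func("train")
container = get_container_spec("pytorch", "docker.io/hello-world", command, args)
template = get_pod_template_spec([container])
print(template["spec"]["containers"][0]["name"])  # pytorch
```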