Adding Training image needed for train api#1963
Adding Training image needed for train api#1963google-oss-prow[bot] merged 14 commits intokubeflow:masterfrom
Conversation
Pull Request Test Coverage Report for Build 7493861403
💛 - Coveralls |
sdk/python/kubeflow/training/training_container/hf_llm_training.py
Outdated
Show resolved
Hide resolved
sdk/python/kubeflow/training/training_container/hf_llm_training.py
Outdated
Show resolved
Hide resolved
sdk/python/kubeflow/training/training_container/hf_llm_training.py
Outdated
Show resolved
Hide resolved
|
@andreyvelich @tenzen-y if it is good to go, can we merge this? |
854d6fd to
4642b9e
Compare
tenzen-y
left a comment
There was a problem hiding this comment.
Then, can you update the following line?
platforms: ${{ matrix.platforms }}
tenzen-y
left a comment
There was a problem hiding this comment.
@deepanker13 Thanks!
/lgtm
/assign @andreyvelich
andreyvelich
left a comment
There was a problem hiding this comment.
Thank you @deepanker13!
I left a few comments
| @@ -0,0 +1,18 @@ | |||
| # Use an official Pytorch runtime as a parent image | |||
| FROM nvcr.io/nvidia/pytorch:23.12-py3 | |||
There was a problem hiding this comment.
Do we need to use PyTorch image from NVIDIA for this trainer ?
Would it be better to take official PyTorch image similar to what we use in SDK ?
docker.io/pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime
There was a problem hiding this comment.
as suggested by @tenzen-y
#1963 (comment)
There was a problem hiding this comment.
I see. @tenzen-y Do you know if PyTorch has any official image that we can use that is supported on all platforms ?
There was a problem hiding this comment.
@andreyvelich As I remember correctly, the PyTorch doesn't provide images with multiple architecture platforms with GPU. So, we need to use the NVIDIA official images.
| def setup_peft_model(model, lora_config): | ||
| # Set up the PEFT model | ||
| lora_config = LoraConfig(**json.loads(lora_config)) | ||
| print(lora_config) | ||
| model = get_peft_model(model, lora_config) | ||
| return model |
There was a problem hiding this comment.
Are we are going to have PEFT config always for this trainer ?
@johnugeorge @deepanker13
There was a problem hiding this comment.
loraconfig can be omitted by user, it is handled by setting empty loraconfig as default value in the data class
There was a problem hiding this comment.
Sounds good, @deepanker13 Should we verify if lora_config is set ?
| parser.add_argument("--transformer_type", help="model transformer type") | ||
| parser.add_argument("--model_dir", help="directory containing model") | ||
| parser.add_argument("--dataset_dir", help="directory contaning dataset") | ||
| parser.add_argument("--dataset_name", help="dataset name") |
There was a problem hiding this comment.
We add dataset_name argument for users who want to use this Trainer without SDK client ?
I am asking because in SDK client we always download dataset in storage initializer and store it in Trainer volume.
So we don't need to provide name.
There was a problem hiding this comment.
in the same dataset_dir there can be multiple datasets, right?
There was a problem hiding this comment.
But can we use train API to download more than one dataset ?
E.g. in your example, you just download ultrachat_10k dataset.
There was a problem hiding this comment.
yes, if I run with a different datasetname, it will work fine.
@andreyvelich
There was a problem hiding this comment.
Yeah, but for every API execution you create a new PyTorchJob and a new Trainer image will be spin up.
So dataset is always represent single name, isn't ?
examples/sdk/train_api.py
Outdated
| client.train( | ||
| name="hf-test", | ||
| num_workers=2, | ||
| num_procs_per_worker=0, |
There was a problem hiding this comment.
for cpu only training
There was a problem hiding this comment.
Hmm, but can torchrun be used with CPUs ?
E.g. maybe I want to run torchrun --nproc-per-node=2 where I use 2 CPU per node.
cc @johnugeorge
…lly, adding jupyter notebook
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
7f76de3 to
f520329
Compare
andreyvelich
left a comment
There was a problem hiding this comment.
That's amazing, thank you @deepanker13!
/lgtm
/assign @johnugeorge
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deepanker13, johnugeorge The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Partially Fixes Train/Fine-tune API Proposal for LLMs #1945
Checklist: