
Conversation

@zhisbug (Collaborator) commented on Jul 5, 2023

Related issue number (if applicable)

#166
#588
#256

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

@merrymercy (Member) commented on Jul 19, 2023

@zhisbug I made some minor changes to the train/test split; could you rebase? #2018

@likejazz commented on Aug 2, 2023

c84a002
Does it work correctly?
I've run this script on 4 nodes in my Slurm cluster, but I see the same loss on all nodes. Is this expected?

{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}

And after 1 epoch of training, I see different losses between the 2-node and 4-node runs.

# after 1 epoch of training
2 nodes, loss: 2.0670151710510254
4 nodes, loss: 2.393033027648926

Sure, 4 nodes trained faster, but I don't think the model saw all the data; the 4-node run shows a higher loss compared to the 2-node run.
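
A note for anyone hitting the same question: as far as I can tell, in data-parallel training the Hugging Face Trainer averages the per-process loss across all ranks before logging, so every node printing the same value is expected. A minimal sketch of that kind of reduction (assuming torch.distributed is already initialized, as it is inside the Trainer; the helper name is made up for illustration):

import torch
import torch.distributed as dist

def mean_loss_across_ranks(local_loss: torch.Tensor) -> float:
    """Average a scalar loss over all data-parallel ranks.

    A simplified stand-in for the reduction the Trainer performs before
    logging, which is why every node reports the same number.
    """
    loss = local_loss.detach().clone()
    # Sum the per-rank losses, then divide by the number of ranks.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return (loss / dist.get_world_size()).item()

The 2-node vs. 4-node gap may be a separate issue: with a fixed per-device batch size, the 4-node run has a larger global batch and fewer optimizer steps per epoch, so the losses after one epoch are not directly comparable.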

@luffycodes commented on Aug 5, 2023

Hello, I am trying to use the suggested changes to fine-tune vicuna-33b.
Just curious: why is this change necessary at line 1498 of the internal transformers code (transformers/src/trainer.py)?

if model.dtype == torch.float16:
    self.model = model = model.float()

Code: https://github.com/lm-sys/FastChat/blob/hao-fix-fsdp-save/fastchat/train/train_30b_patch.md?plain=1#L13
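
For what it's worth, my understanding is that model.float() casts every parameter and buffer to fp32 in place, so the patched line simply upcasts a model that was loaded in fp16 back to full precision. A tiny stand-in example (using an arbitrary nn.Linear rather than the real model):

import torch
import torch.nn as nn

# Stand-in module, used only to show what the cast does.
layer = nn.Linear(4, 4).half()
print(layer.weight.dtype)  # torch.float16

# Same pattern as the patched Trainer line: upcast fp16 weights to fp32.
if layer.weight.dtype == torch.float16:
    layer = layer.float()
print(layer.weight.dtype)  # torch.float32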

I am also asking because the patch refers to a change in the internal transformers code (main branch), which has since changed, so I cannot find where to apply it in the newer transformers code.

Was it this version of transformers: https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/trainer.py#L1498 ?

If you could please let me know the tag (e.g. v4.29.1), I can find the right place to make the change.
Also, this change seems to be for FSDP users. If I am using DeepSpeed, where should I make the equivalent change?

Thank you :)

@merrymercy (Member) commented

Closed due to being stale. Most of the changes are already in the latest main.
#1255
#2390

merrymercy closed this on Sep 11, 2023
merrymercy deleted the hao-fix-fsdp-save branch on September 17, 2023 at 18:33
