
Conversation

@zhisbug (Collaborator) commented on Jul 5, 2023

Related issue number (if applicable)

#166
#588
#256

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

@merrymercy (Member) commented on Jul 19, 2023

@zhisbug I made some minor changes to the train/test split; could you rebase? #2018

@likejazz commented on Aug 2, 2023

c84a002
Does it work correctly?
I've run this script on 4 nodes in my Slurm cluster, but I see the same loss on all nodes. Is this expected?

{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}
{'loss': 2.3814, 'learning_rate': 0.0, 'epoch': 0.97}

And after 1 epoch of training, I see different losses between the 2-node and 4-node runs.

# after 1 epoch of training
2 nodes, loss: 2.0670151710510254
4 nodes, loss: 2.393033027648926

Sure, 4 nodes trained faster, but I don't think the model saw all the data; the 4-node run shows a higher loss compared to the 2-node run.
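
A note for anyone hitting the same question: as far as I can tell, in data-parallel training the Hugging Face Trainer averages the per-process loss across all ranks before logging, so every node printing the same value is expected. A minimal sketch of that kind of reduction (assuming torch.distributed is already initialized, as it is inside the Trainer; the helper name is made up for illustration):

import torch
import torch.distributed as dist

def mean_loss_across_ranks(local_loss: torch.Tensor) -> float:
    """Average a scalar loss over all data-parallel ranks.

    A simplified stand-in for the reduction the Trainer performs before
    logging, which is why every node reports the same number.
    """
    loss = local_loss.detach().clone()
    # Sum the per-rank losses, then divide by the number of ranks.
    dist.all_reduce(loss, op=dist.ReduceOp.SUM)
    return (loss / dist.get_world_size()).item()

The 2-node vs. 4-node gap may be a separate issue: with a fixed per-device batch size, the 4-node run has a larger global batch and fewer optimizer steps per epoch, so the losses after one epoch are not directly comparable.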

@luffycodes commented on Aug 5, 2023

Hello, I am trying to use the suggested changes to fine-tune vicuna-33b.
Just curious: why is this change necessary at line 1498 of the internal transformers code (transformers/src/trainer.py)?

if model.dtype == torch.float16:
    self.model = model = model.float()

Code: https://github.com/lm-sys/FastChat/blob/hao-fix-fsdp-save/fastchat/train/train_30b_patch.md?plain=1#L13
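
For what it's worth, my understanding is that model.float() casts every parameter and buffer to fp32 in place, so the patched line simply upcasts a model that was loaded in fp16 back to full precision. A tiny stand-in example (using an arbitrary nn.Linear rather than the real model):

import torch
import torch.nn as nn

# Stand-in module, used only to show what the cast does.
layer = nn.Linear(4, 4).half()
print(layer.weight.dtype)  # torch.float16

# Same pattern as the patched Trainer line: upcast fp16 weights to fp32.
if layer.weight.dtype == torch.float16:
    layer = layer.float()
print(layer.weight.dtype)  # torch.float32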

I am also asking because the patch refers to a change in the internal transformers code (main branch), which has since changed, so I cannot find where to apply it in the newer transformers code.

Was it this version of transformers: https://github.com/huggingface/transformers/blob/v4.29.1/src/transformers/trainer.py#L1498 ?

If you could please let me know the tag (e.g. v4.29.1), I can find the right place to make the change.
Also, this change seems to be for FSDP users. If I am using DeepSpeed, where should I make the equivalent change?

Thank you :)

@merrymercy (Member) commented

Closed due to being stale. Most of the changes are already in the latest main.
#1255
#2390

merrymercy closed this on Sep 11, 2023
merrymercy deleted the hao-fix-fsdp-save branch on September 17, 2023 at 18:33
