enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4` #2626
BenjaminBossan merged 6 commits into huggingface:main from kaixuanliu:sft-fsdp
Conversation
It depends on the latest GPTQModel implementation. One reminder: please use transformers 4.52.4; the latest transformers has a bug with this example, which I am looking into.
Hi @BenjaminBossan, please help review, thanks!
BenjaminBossan left a comment
Thanks for the PR. I haven't tested GPTQmodel training yet, but it's nice to know that you could make it work.
Before proceeding, I wanted to discuss the renaming of use_4bit_quantization > use_bnb_4bit_quantization. I understand the idea, but if the model is not already pre-quantized, bnb is the only supported option (there is no use_gptq_4bit_quantization), so I think we can keep the name. If we wanted to change it, it would have to be a consistent change (e.g. use_8bit_quantization would also need renaming).
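For context, a rough sketch of how these flags might be declared in the example's argument dataclass; the dataclass layout here is an assumption for illustration only, and just the flag names come from this discussion and the command line at the end of the thread:

from dataclasses import dataclass, field

@dataclass
class QuantizationArguments:  # hypothetical container name, for illustration only
    # Generic names are kept: bitsandbytes is the only backend for quantizing a
    # non-pre-quantized model on the fly, so no bnb_ prefix is needed.
    use_4bit_quantization: bool = field(
        default=False, metadata={"help": "Load the model in 4-bit using bitsandbytes"}
    )
    use_8bit_quantization: bool = field(
        default=False, metadata={"help": "Load the model in 8-bit using bitsandbytes"}
    )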
Thanks for the advice, let's change back to the original naming, as I noticed a lot of other places use it.
Thanks for adjusting the PR. I can confirm that the original error is fixed through this PR.
I still cannot successfully train, as I get an NCCL error, but that is most likely unrelated to this PR or GPTQ. But still, just in case, could you please share your package versions (one way to collect them is sketched after this list) for:
- torch
- transformers
- trl
- accelerate
- CUDA version
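One quick way to collect these (a small sketch, assuming a standard Python environment; the last line reads the CUDA version torch was built against):

from importlib.metadata import version, PackageNotFoundError
import torch

# Print the versions requested above; "n/a" if a package is not installed.
for pkg in ("torch", "transformers", "trl", "accelerate"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "n/a")
print("CUDA (torch build):", torch.version.cuda)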
examples/sft/utils.py (outdated)
  uses_fsdp = os.environ.get("ACCELERATE_USE_FSDP", "false").lower() == "true"
- if (bnb_config is not None) and uses_fsdp and uses_transformers_4_46:
+ if (
+     (bnb_config is not None or (hasattr(model, "hf_quantizer") and model.hf_quantizer is not None))
For better readability, let's assign this line to a variable like is_quantized, WDYT?
Good advice! I have adjusted the code accordingly.
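For reference, a minimal sketch of the refactor discussed above, shown as a fragment in the context of the function from utils.py (names taken from the diff; the exact merged code may differ):

# Name the combined quantization check instead of inlining it in the condition.
is_quantized = bnb_config is not None or getattr(model, "hf_quantizer", None) is not None
uses_fsdp = os.environ.get("ACCELERATE_USE_FSDP", "false").lower() == "true"
if is_quantized and uses_fsdp and uses_transformers_4_46:
    ...  # FSDP-specific handling for quantized models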
enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
I use Intel XPU to do the finetuning. Here is my package version info:
Along with the accelerate config file I used:
Hi @BenjaminBossan, I tested the case on A100 and it works as well. Here is my package version info and cmd line:
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
BenjaminBossan left a comment
> Hi, I tested the case on A100, it can work as well. Here is my package version info:
Thanks for sharing your settings and invocation. I indeed had an issue in my accelerate settings. After some adjustments, I ended up with:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
fsdp_activation_checkpointing: false
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: NO_PREFETCH
fsdp_cpu_ram_efficient_loading: false
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_reshard_after_forward: FULL_SHARD
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: ''
fsdp_use_orig_params: false
fsdp_version: 1
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

This allowed me to make some progress, but then I encountered an error in GPTQModel:
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/forks/transformers/src/transformers/models/llama/modeling_llama.py", line 242, in forward
[rank1]: query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/work/forks/peft/src/peft/tuners/lora/gptq.py", line 84, in forward
[rank1]: result = self.quant_linear_module(x)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 154, in forward
[rank1]: out = self._forward(x, out_shape)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 160, in _forward
[rank1]: weights = self.dequantize_weight(num_itr=num_itr).to(x.dtype)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
[rank1]: return fn(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 441, in dequantize_weight
[rank1]: zeros = t.bitwise_right_shift(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
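The traceback boils down to a bitwise op inside dequantize_weight receiving one tensor on cuda:1 and one on cpu. A standalone illustration of that failure class (not GPTQModel's actual code; names, shapes, and values here are made up):

import torch

# Mixing a CPU tensor with a CUDA tensor in a binary op reproduces the same
# "Expected all tensors to be on the same device" RuntimeError.
if torch.cuda.is_available():
    packed_zeros = torch.randint(0, 2**31 - 1, (4, 4), dtype=torch.int32)  # stayed on CPU
    shift_bits = torch.arange(0, 16, 4, dtype=torch.int32, device="cuda")  # lives on the GPU
    try:
        torch.bitwise_right_shift(packed_zeros, shift_bits)
    except RuntimeError as err:
        print(err)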
Anyway, if it works for you, we can still merge; I'm just posting this here in case you or anyone else who reads this knows a solution :)
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Hi, I met the same issue as you and have fixed it in ModelCloud/GPTQModel#1642. Please use the latest GPTQModel (built from source code).
I'm glad that I asked, this indeed resolved my problem. Given this, I have a suggestion: how about extending the README of this example with a new section on GPTQModel training? It should mention the minimum GPTQModel version required (or installing from source) and include a working invocation (like the one you provided above). Possibly let's also add a working …
Hi @BenjaminBossan, I have added the README part, please help review. I double checked the existing …
BenjaminBossan left a comment
Thanks for the addition. I have a suggestion for a different wording in the README paragraph, WDYT?
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Hi @BenjaminBossan, using transformers version 4.53.1 is OK.
BenjaminBossan left a comment
Thanks for enabling GPTQ in the sft example and iterating through the changes. The PR LGTM.
(Note: CI issues are unrelated and examples are not covered anyway, so the PR is good to merge)
Besides fixes, this includes an example script that uses `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`.

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Example cmd line:
accelerate launch --config_file "fsdp_config.yaml" train.py \
  --seed 100 \
  --model_name_or_path "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4" \
  --dataset_name "smangrul/ultrachat-10k-chatml" \
  --chat_template_format "chatml" \
  --add_special_tokens False \
  --append_concat_token False \
  --splits "train,test" \
  --max_seq_len 2048 \
  --num_train_epochs 1 \
  --logging_steps 5 \
  --log_level "info" \
  --logging_strategy "steps" \
  --eval_strategy "epoch" \
  --save_strategy "epoch" \
  --bf16 True \
  --packing True \
  --learning_rate 1e-4 \
  --lr_scheduler_type "cosine" \
  --weight_decay 1e-4 \
  --warmup_ratio 0.0 \
  --max_grad_norm 1.0 \
  --output_dir "llama-sft-lora-fsdp" \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --gradient_checkpointing True \
  --use_reentrant False \
  --dataset_text_field "content" \
  --use_flash_attn False \
  --use_peft_lora True \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.1 \
  --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" \
  --use_4bit_quantization False