
enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4` #2626

Merged
BenjaminBossan merged 6 commits into huggingface:main from kaixuanliu:sft-fsdp on Jul 7, 2025
Conversation

@kaixuanliu (Contributor)

Example cmd line:
accelerate launch --config_file "fsdp_config.yaml" train.py --seed 100 --model_name_or_path "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

@kaixuanliu (Contributor, Author)

This depends on the latest GPTQModel implementation. One reminder: please use transformers 4.52.4; the latest transformers release has a bug affecting this example, which I am looking into.

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, please help review, thanks!

@BenjaminBossan (Member) left a comment

Thanks for the PR. I haven't tested GPTQmodel training yet, but it's nice to know that you could make it work.

Before proceeding, I wanted to discuss the renaming of use_4bit_quantization > use_bnb_4bit_quantization. I understand the idea, but if the model is not already pre-quantized, bnb is the only supported option (there is no use_gptq_4bit_quantization), so I think we can keep the name. If we wanted to change it, it would have to be a consistent change (e.g. use_8bit_quantization would also need renaming).
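
For context, a minimal sketch of how such a flag typically gates the bitsandbytes path (the function name and quantization defaults here are assumptions for illustration, not taken from the example script):

import torch
from transformers import BitsAndBytesConfig

def build_bnb_config(use_4bit_quantization: bool):
    # Pre-quantized checkpoints (e.g. GPTQ) never enter this branch, which is
    # why the generic flag name can safely refer to bnb only.
    if not use_4bit_quantization:
        return None
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )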

@kaixuanliu (Contributor, Author)

Thanks for the advice; let's change back to the original naming, since I noticed a lot of other places use use_4bit_quantization. It's better to keep it consistent.

@BenjaminBossan (Member) left a comment

Thanks for adjusting the PR. I can confirm that the original error is fixed through this PR.

I still cannot successfully train, as I get an NCCL error, but that is most likely unrelated to this PR or GPTQ. But still, just in case, could you please share your package versions for:

  • torch
  • transformers
  • trl
  • accelerate
  • CUDA version
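
In case it helps, a throwaway snippet (not part of the example script) that collects these versions; note that torch.version.cuda is None on non-CUDA builds such as XPU:

import torch, transformers, trl, accelerate

for module in (torch, transformers, trl, accelerate):
    print(f"{module.__name__:<14}{module.__version__}")
print(f"{'CUDA':<14}{torch.version.cuda}")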

  uses_fsdp = os.environ.get("ACCELERATE_USE_FSDP", "false").lower() == "true"
- if (bnb_config is not None) and uses_fsdp and uses_transformers_4_46:
+ if (
+     (bnb_config is not None or (hasattr(model, "hf_quantizer") and model.hf_quantizer is not None))
@BenjaminBossan (Member) commented on the diff:

For better readability, let's assign this line to a variable like is_quantized, WDYT?
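
A minimal sketch of the suggested refactor (the variable name comes from the suggestion above; bnb_config, model, uses_fsdp and uses_transformers_4_46 are the surrounding script's variables, and the final merged code may differ):

is_quantized = bnb_config is not None or (
    hasattr(model, "hf_quantizer") and model.hf_quantizer is not None
)
if is_quantized and uses_fsdp and uses_transformers_4_46:
    ...  # FSDP-specific preparation for quantized models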

@kaixuanliu (Contributor, Author) replied:

Good advice! I have adjusted the code.

enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@kaixuanliu (Contributor, Author)

I use an Intel XPU to do the fine-tuning. Here is my package version info:

torch         2.9.0.dev20250629+xpu (nightly build)
transformers  4.52.4
trl           0.19.0
accelerate    1.8.1

@kaixuanliu (Contributor, Author)

Along with that, here is the accelerate config file I used:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: NO_PREFETCH
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: ''
  fsdp_use_orig_params: false
  fsdp_version: 1
ipex_config:
  ipex: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, I tested the case on an A100 and it works as well. Here is my package version info:

torch         2.7.1+cu128
transformers  4.52.4
trl           0.19.0
accelerate    1.8.1
CUDA          12.8

And the command line:
accelerate launch --config_file "configs/fsdp_config.yaml" train.py --seed 100 --model_name_or_path "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@BenjaminBossan (Member) left a comment

> Hi @BenjaminBossan, I tested the case on an A100 and it works as well. Here is my package version info:

Thanks for sharing your settings and invocation. I indeed had an issue in my accelerate settings. After some adjustments, I ended up with:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: NO_PREFETCH
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: ''
  fsdp_use_orig_params: false
  fsdp_version: 1
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

This allowed me to make some progress, but I then encountered an error in GPTQModel:

[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/transformers/src/transformers/models/llama/modeling_llama.py", line 242, in forward
[rank1]:     query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/peft/src/peft/tuners/lora/gptq.py", line 84, in forward
[rank1]:     result = self.quant_linear_module(x)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 154, in forward
[rank1]:     out = self._forward(x, out_shape)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 160, in _forward
[rank1]:     weights = self.dequantize_weight(num_itr=num_itr).to(x.dtype)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 441, in dequantize_weight
[rank1]:     zeros = t.bitwise_right_shift(
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!

Anyway, if it works for you, we can still merge; I'm just posting this here in case you or anyone else who reads this knows a solution :)
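
For anyone hitting the same RuntimeError, here is a minimal, self-contained reproduction of the underlying pattern (illustrative only, requires a CUDA device; the tensor names are hypothetical and not taken from GPTQModel): a packed integer tensor left on the CPU is combined with a tensor that lives on the GPU, and moving it to the other operand's device first resolves the mismatch.

import torch

packed_zeros = torch.randint(0, 2**16, (8,), dtype=torch.int32)    # stranded on the CPU
shifts = torch.arange(0, 32, 4, dtype=torch.int32, device="cuda")  # lives on the GPU

# torch.bitwise_right_shift(packed_zeros, shifts)  # -> RuntimeError: Expected all tensors to be on the same device
zeros = torch.bitwise_right_shift(packed_zeros.to(shifts.device), shifts)  # fix: align devices first
print(zeros)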

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@kaixuanliu (Contributor, Author)

kaixuanliu commented Jul 3, 2025

Hi, I met the same issue and have fixed it in ModelCloud/GPTQModel#1642. Please use the latest GPTQModel (built from source) via pip install git+https://github.com/ModelCloud/GPTQModel.git and it should run successfully.

@BenjaminBossan (Member)

> Please use the latest GPTQModel (built from source) via pip install git+https://github.com/ModelCloud/GPTQModel.git and it should run successfully.

I'm glad that I asked, this indeed resolved my problem.

Given this, I have a suggestion: how about extending the README of this example with a new section on GPTQModel training? It should mention the minimum GPTQModel version required (or installing from source) and include a working invocation (like the one you provided above). Possibly let's also add a working config.yaml if the provided ones don't work. I think this will make it much more likely that users find this option and use it successfully. WDYT?

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, I have added the README part, please help review. I double-checked that the existing configs/fsdp_config.yaml file works in this case.

@BenjaminBossan (Member) left a comment

Thanks for the addition. I have a suggestion for a different wording in the README paragraph, WDYT?

kaixuanliu and others added 2 commits July 4, 2025 07:31
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, using transformers version 4.53.1 is OK.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks for enabling GPTQ in the sft example and iterating through the changes. The PR LGTM.

(Note: CI issues are unrelated and examples are not covered anyway, so the PR is good to merge)

@BenjaminBossan merged commit b960d25 into huggingface:main on Jul 7, 2025.
2 of 14 checks passed
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025
Besides fixes, includes an example script that uses
`hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>