
enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4` #2626

Merged
BenjaminBossan merged 6 commits into huggingface:main from kaixuanliu:sft-fsdp on Jul 7, 2025
Conversation

@kaixuanliu (Contributor)

Example cmd line:
accelerate launch --config_file "fsdp_config.yaml" train.py --seed 100 --model_name_or_path "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

@kaixuanliu (Contributor, Author)

This depends on the latest GPTQModel implementation. One reminder: please use transformers 4.52.4; the latest transformers release has a bug affecting this example, which I am looking into.

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, please help review, thanks!

@BenjaminBossan (Member) left a comment

Thanks for the PR. I haven't tested GPTQmodel training yet, but it's nice to know that you could make it work.

Before proceeding, I wanted to discuss the renaming of use_4bit_quantization > use_bnb_4bit_quantization. I understand the idea, but if the model is not already pre-quantized, bnb is the only supported option (there is no use_gptq_4bit_quantization), so I think we can keep the name. If we wanted to change it, it would have to be a consistent change (e.g. use_8bit_quantization would also need renaming).
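
For context, a minimal sketch of how such a flag typically gates the bitsandbytes path (the function name and quantization defaults here are assumptions for illustration, not taken from the example script):

import torch
from transformers import BitsAndBytesConfig

def build_bnb_config(use_4bit_quantization: bool):
    # Pre-quantized checkpoints (e.g. GPTQ) never enter this branch, which is
    # why the generic flag name can safely refer to bnb only.
    if not use_4bit_quantization:
        return None
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )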

@kaixuanliu (Contributor, Author)

Thanks for the advice; let's change back to the original naming, since I noticed a lot of other places use use_4bit_quantization. It's better to keep it consistent.

@BenjaminBossan (Member) left a comment

Thanks for adjusting the PR. I can confirm that the original error is fixed through this PR.

I still cannot successfully train, as I get an NCCL error, but that is most likely unrelated to this PR or GPTQ. But still, just in case, could you please share your package versions for:

  • torch
  • transformers
  • trl
  • accelerate
  • CUDA version
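
In case it helps, a throwaway snippet (not part of the example script) that collects these versions; note that torch.version.cuda is None on non-CUDA builds such as XPU:

import torch, transformers, trl, accelerate

for module in (torch, transformers, trl, accelerate):
    print(f"{module.__name__:<14}{module.__version__}")
print(f"{'CUDA':<14}{torch.version.cuda}")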

  uses_fsdp = os.environ.get("ACCELERATE_USE_FSDP", "false").lower() == "true"
- if (bnb_config is not None) and uses_fsdp and uses_transformers_4_46:
+ if (
+     (bnb_config is not None or (hasattr(model, "hf_quantizer") and model.hf_quantizer is not None))
@BenjaminBossan (Member) commented on the diff:

For better readability, let's assign this line to a variable like is_quantized, WDYT?
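
A minimal sketch of the suggested refactor (the variable name comes from the suggestion above; bnb_config, model, uses_fsdp and uses_transformers_4_46 are the surrounding script's variables, and the final merged code may differ):

is_quantized = bnb_config is not None or (
    hasattr(model, "hf_quantizer") and model.hf_quantizer is not None
)
if is_quantized and uses_fsdp and uses_transformers_4_46:
    ...  # FSDP-specific preparation for quantized models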

@kaixuanliu (Contributor, Author) replied:

Good advice! I have adjusted the code.

enable FSDP example for model `hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@kaixuanliu (Contributor, Author)

I use an Intel XPU to do the fine-tuning. Here is my package version info:

torch         2.9.0.dev20250629+xpu (nightly build)
transformers  4.52.4
trl           0.19.0
accelerate    1.8.1

@kaixuanliu (Contributor, Author)

Along with that, here is the accelerate config file I used:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: NO_PREFETCH
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: ''
  fsdp_use_orig_params: false
  fsdp_version: 1
ipex_config:
  ipex: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, I tested the case on an A100 and it works as well. Here is my package version info:

torch         2.7.1+cu128
transformers  4.52.4
trl           0.19.0
accelerate    1.8.1
CUDA          12.8

And the command line:
accelerate launch --config_file "configs/fsdp_config.yaml" train.py --seed 100 --model_name_or_path "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4" --dataset_name "smangrul/ultrachat-10k-chatml" --chat_template_format "chatml" --add_special_tokens False --append_concat_token False --splits "train,test" --max_seq_len 2048 --num_train_epochs 1 --logging_steps 5 --log_level "info" --logging_strategy "steps" --eval_strategy "epoch" --save_strategy "epoch" --bf16 True --packing True --learning_rate 1e-4 --lr_scheduler_type "cosine" --weight_decay 1e-4 --warmup_ratio 0.0 --max_grad_norm 1.0 --output_dir "llama-sft-lora-fsdp" --per_device_train_batch_size 8 --per_device_eval_batch_size 8 --gradient_accumulation_steps 4 --gradient_checkpointing True --use_reentrant False --dataset_text_field "content" --use_flash_attn False --use_peft_lora True --lora_r 8 --lora_alpha 16 --lora_dropout 0.1 --lora_target_modules "q_proj,k_proj,v_proj,o_proj,up_proj,gate_proj" --use_4bit_quantization False

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@BenjaminBossan (Member) left a comment

> Hi @BenjaminBossan, I tested the case on an A100 and it works as well. Here is my package version info:

Thanks for sharing your settings and invocation. I indeed had an issue in my accelerate settings. After some adjustments, I ended up with:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: NO_PREFETCH
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_reshard_after_forward: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: ''
  fsdp_use_orig_params: false
  fsdp_version: 1
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

This allowed me to make some progress, but I then encountered an error in GPTQModel:

[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/transformers/src/transformers/models/llama/modeling_llama.py", line 242, in forward
[rank1]:     query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/work/forks/peft/src/peft/tuners/lora/gptq.py", line 84, in forward
[rank1]:     result = self.quant_linear_module(x)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 154, in forward
[rank1]:     out = self._forward(x, out_shape)
[rank1]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/torch.py", line 160, in _forward
[rank1]:     weights = self.dequantize_weight(num_itr=num_itr).to(x.dtype)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/name/anaconda3/envs/peft/lib/python3.12/site-packages/gptqmodel/nn_modules/qlinear/__init__.py", line 441, in dequantize_weight
[rank1]:     zeros = t.bitwise_right_shift(
[rank1]:             ^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!

Anyway, if it works for you, we can still merge; I'm just posting this here in case you or anyone else who reads this knows a solution :)
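
For anyone hitting the same RuntimeError, here is a minimal, self-contained reproduction of the underlying pattern (illustrative only, requires a CUDA device; the tensor names are hypothetical and not taken from GPTQModel): a packed integer tensor left on the CPU is combined with a tensor that lives on the GPU, and moving it to the other operand's device first resolves the mismatch.

import torch

packed_zeros = torch.randint(0, 2**16, (8,), dtype=torch.int32)    # stranded on the CPU
shifts = torch.arange(0, 32, 4, dtype=torch.int32, device="cuda")  # lives on the GPU

# torch.bitwise_right_shift(packed_zeros, shifts)  # -> RuntimeError: Expected all tensors to be on the same device
zeros = torch.bitwise_right_shift(packed_zeros.to(shifts.device), shifts)  # fix: align devices first
print(zeros)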

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@kaixuanliu (Contributor, Author)

kaixuanliu commented Jul 3, 2025

Hi, I met the same issue and have fixed it in ModelCloud/GPTQModel#1642. Please use the latest GPTQModel (built from source) via pip install git+https://github.com/ModelCloud/GPTQModel.git and it should run successfully.

@BenjaminBossan (Member)

> Please use the latest GPTQModel (built from source) via pip install git+https://github.com/ModelCloud/GPTQModel.git and it should run successfully.

I'm glad that I asked, this indeed resolved my problem.

Given this, I have a suggestion: how about extending the README of this example with a new section on GPTQModel training? It should mention the minimum GPTQModel version required (or installing from source) and include a working invocation (like the one you provided above). Possibly let's also add a working config.yaml if the provided ones don't work. I think this will make it much more likely that users find this option and use it successfully. WDYT?

@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, I have added the README part, please help review. I double-checked that the existing configs/fsdp_config.yaml file works in this case.

@BenjaminBossan (Member) left a comment

Thanks for the addition. I have a suggestion for a different wording in the README paragraph, WDYT?

kaixuanliu and others added 2 commits July 4, 2025 07:31
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@kaixuanliu (Contributor, Author)

Hi @BenjaminBossan, using transformers version 4.53.1 is OK.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan (Member) left a comment

Thanks for enabling GPTQ in the sft example and iterating through the changes. The PR LGTM.

(Note: CI issues are unrelated and examples are not covered anyway, so the PR is good to merge)

@BenjaminBossan merged commit b960d25 into huggingface:main on Jul 7, 2025.
2 of 14 checks passed
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025
Besides fixes, includes an example script that uses
`hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4`

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>