
Commit 207b27e

ENH Support XPU for LoRA-FA example (#2697)
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
1 parent 68265a1 commit 207b27e

File tree

2 files changed: +26 additions, -13 deletions


examples/lorafa_finetune/README.md

Lines changed: 17 additions & 11 deletions
@@ -2,7 +2,7 @@
 
 ## Introduction
 
-[LoRA-FA](https://huggingface.co/papers/2308.03303) is a noval Parameter-efficient Fine-tuning method, which freezes the projection down layer (matrix A) during LoRA training process and thus lead to less GPU memory consumption by eliminating the need for storing the activations of input tensors (X). Furthermore, LoRA-FA narrows the gap between the update amount of pre-trained weights when using the low-rank fine-tuning method and the full fine-tuning method. In conclusion, LoRA-FA reduces the memory consumption and leads to superior performance compared to vanilla LoRA.
+[LoRA-FA](https://huggingface.co/papers/2308.03303) is a noval Parameter-efficient Fine-tuning method, which freezes the projection down layer (matrix A) during LoRA training process and thus lead to less accelerator memory consumption by eliminating the need for storing the activations of input tensors (X). Furthermore, LoRA-FA narrows the gap between the update amount of pre-trained weights when using the low-rank fine-tuning method and the full fine-tuning method. In conclusion, LoRA-FA reduces the memory consumption and leads to superior performance compared to vanilla LoRA.
 
 ## Quick start
@@ -54,7 +54,7 @@ The only change in your code is to pass the LoRA-FA optimizer to the trainer (if
 In this dir, we also provide you a simple example for fine-tuning with LoRA-FA optimizer.
 
-### Run on CPU, single-GPU or multi-GPU
+### Run on CPU, single-accelerator or multi-accelerator
 
 This 👇 by default will load the model in peft set up with LoRA config, and train the model with LoRA-FA optimizer.

@@ -66,23 +66,29 @@ You can simply run LoRA-FA as below:
 python lorafa_finetuning.py --base_model_name_or_path meta-llama/Meta-Llama-3-8B --dataset_name_or_path meta-math/MetaMathQA-40K --output_dir path/to/output --lorafa
 ```
 
-1. Single-GPU
+1. Single-accelerator
 
-Run the finetuning script on 1 GPU:
+Run the finetuning script on 1 accelerator:
 
 ```bash
-CUDA_VISIBLE_DEVICES=0 python lorafa_finetuning.py --base_model_name_or_path meta-llama/Meta-Llama-3-8B --dataset_name_or_path meta-math/MetaMathQA-40K --output_dir path/to/output --lorafa
+export CUDA_VISIBLE_DEVICES=0 # force to use CUDA GPU 0
+export ZE_AFFINITY_MASK=0 # force to use Intel XPU 0
+
+python lorafa_finetuning.py --base_model_name_or_path meta-llama/Meta-Llama-3-8B --dataset_name_or_path meta-math/MetaMathQA-40K --output_dir path/to/output --lorafa
 ```
 
-2. Multi-GPU
+2. Multi-accelerator
 
-LoRA-FA can also be run on multi-GPU, with 🤗 Accelerate:
+LoRA-FA can also be run on multi-accelerator, with 🤗 Accelerate:
 
 ```bash
-CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch lorafa_finetuning.py --base_model_name_or_path meta-llama/Meta-Llama-3-8B --dataset_name_or_path meta-math/MetaMathQA-40K --output_dir path/to/output --lorafa
+export CUDA_VISIBLE_DEVICES=0,1,2,3 # force to use CUDA GPU 0,1,2,3
+export ZE_AFFINITY_MASK=0,1,2,3 # force to use Intel XPU 0,1,2,3
+
+accelerate launch lorafa_finetuning.py --base_model_name_or_path meta-llama/Meta-Llama-3-8B --dataset_name_or_path meta-math/MetaMathQA-40K --output_dir path/to/output --lorafa
 ```
 
-The `accelerate launch` will automatically configure multi-GPU for you. You can also utilize `accelerate launch` in single-GPU scenario.
+The `accelerate launch` will automatically configure multi-accelerator for you. You can also utilize `accelerate launch` in single-accelerator scenario.
 
 ### Use the model from 🤗
 You can load and use the model as any other 🤗 models.
@@ -97,7 +103,7 @@ Sometimes, achieving optimal LoRA fine-tuning can be challenging due to the larg
 ## LoRA-FA's advantages and limitations
 
-By eliminating the activation of adapter A, LoRA-FA uses less memory for fine-tuning compared to LoRA. For instance, when fine-tuning Llama-2-7b-chat-hf with a batch size of 8 and a sequence length of 1024, LoRA-FA requires 36GB of memory to store activations. This allows it to run successfully on an 80GB GPU. In contrast, LoRA requires at least 60GB of memory for activations, leading to an Out of Memory (OOM) error. Additionally, the memory consumption of LoRA-FA is not sensitive to the rank, allowing for performance improvements by increasing the LoRA rank without additional memory usage. LoRA-FA further narrows the performance gap with Full-FT by minimizing the discrepancy between the low-rank gradient and the full gradient, enabling it to achieve performance that is on par with or even superior to vanilla LoRA.
+By eliminating the activation of adapter A, LoRA-FA uses less memory for fine-tuning compared to LoRA. For instance, when fine-tuning Llama-2-7b-chat-hf with a batch size of 8 and a sequence length of 1024, LoRA-FA requires 36GB of memory to store activations. This allows it to run successfully on an 80GB accelerator. In contrast, LoRA requires at least 60GB of memory for activations, leading to an Out of Memory (OOM) error. Additionally, the memory consumption of LoRA-FA is not sensitive to the rank, allowing for performance improvements by increasing the LoRA rank without additional memory usage. LoRA-FA further narrows the performance gap with Full-FT by minimizing the discrepancy between the low-rank gradient and the full gradient, enabling it to achieve performance that is on par with or even superior to vanilla LoRA.
 
 Despite its advantages, LoRA-FA is inherently limited by its low-rank approximation nature and potential issues with catastrophic forgetting. The gradient approximation can impact training throughput. Addressing these limitations, especially in terms of approximation accuracy and forgetting phenomena, presents a promising direction for future research.

@@ -112,4 +118,4 @@ Despite its advantages, LoRA-FA is inherently limited by its low-rank approximat
 primaryClass={cs.CL},
 url={https://huggingface.co/papers/2308.03303},
 }
-```
+```
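
For reference, `CUDA_VISIBLE_DEVICES` masks NVIDIA GPUs while `ZE_AFFINITY_MASK` masks Intel XPUs. The snippet below is a minimal sketch (not part of this commit) for confirming which devices a masked process actually sees; it assumes a PyTorch build with the corresponding CUDA or XPU backend.

```python
# check_devices.py (hypothetical helper): run after exporting CUDA_VISIBLE_DEVICES or ZE_AFFINITY_MASK
import torch

if torch.cuda.is_available():
    # CUDA_VISIBLE_DEVICES limits which NVIDIA GPUs this process can see
    print(f"visible CUDA devices: {torch.cuda.device_count()}")
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    # ZE_AFFINITY_MASK limits which Intel XPUs this process can see
    print(f"visible XPU devices: {torch.xpu.device_count()}")
else:
    print("no accelerator visible; training would fall back to CPU")
```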

examples/lorafa_finetune/lorafa_finetuning.py

Lines changed: 9 additions & 2 deletions
@@ -49,8 +49,15 @@ def train_model(
 ):
     os.environ["TOKENIZERS_PARALLELISM"] = "false"
 
-    compute_dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
-    device_map = "cuda" if torch.cuda.is_available() else None
+    is_bf16_supported = False
+    device_map = "cpu"
+    if torch.cuda.is_available():
+        is_bf16_supported = torch.cuda.is_bf16_supported()
+        device_map = "cuda"
+    elif torch.xpu.is_available():
+        is_bf16_supported = torch.xpu.is_bf16_supported()
+        device_map = "xpu"
+    compute_dtype = torch.bfloat16 if is_bf16_supported else torch.float16
 
     # load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path)
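
The `device_map` and `compute_dtype` selected above are presumably consumed when the base model is loaded. The sketch below illustrates that pattern with 🤗 Transformers; it is a self-contained approximation, and the exact call in `lorafa_finetuning.py` may differ.

```python
import torch
from transformers import AutoModelForCausalLM

# Reproduce the detection logic from the diff so this sketch stands alone.
if torch.cuda.is_available():
    device_map, bf16_ok = "cuda", torch.cuda.is_bf16_supported()
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device_map, bf16_ok = "xpu", torch.xpu.is_bf16_supported()
else:
    device_map, bf16_ok = "cpu", False
compute_dtype = torch.bfloat16 if bf16_ok else torch.float16

# Assumed usage: load the base model on the detected device with the detected dtype.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # base model used in the example commands
    torch_dtype=compute_dtype,
    device_map=device_map,
)
```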
