Randlora documentation and some example usage #2524
githubnemo merged 14 commits into huggingface:main
Conversation
Thanks for the follow up. I haven't reviewed this PR yet, as something has gone wrong when you applied your diff. There are many lines like: Could you please check and fix those? As to adding an experiment to the MetaMathQA method comparison suite, yes, that can be done and added to this PR. Please follow the steps described here.
8cc4f35 to 196ba70
docs merge …into randlora_docs
docs merge squash
Hi @BenjaminBossan, I have removed the diff lines and added the MetaMathQA config.
BenjaminBossan left a comment
Thanks for adding the RandLora documentation and experiment config. The docs are really well written, well done.
I only found some minor issues that should be easily resolved, please check.
For better adoption, I would also recommend adding a full example. This can be as easy as copying one from the examples/ directory and making the necessary adjustments for RandLora. This can also be done in a later PR if you prefer.
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
@BenjaminBossan I am still investigating the large memory usage of RandLora I observed when running randlora_finetune.py. This goes against what I have observed outside of the peft library. Please let me know in case I missed something.
BenjaminBossan left a comment
Thanks for adding the examples. Overall, they look good, but they still need some "fine-tuning". Please check my comments.
Regarding the notebook, I get an error when trying to open it on GitHub. Other people seem to face the same error, maybe this fix works.
> I am still investigating the large memory usage of RandLora I observed when running randlora_finetune.py. This goes against what I have observed outside of the peft library.
Thanks for investigating, please create a PR as soon as you find the underlying issue. Is the example you're comparing it to also using Trainer? In my experience, comparing a vanilla PyTorch training loop vs Trainer can be quite difficult, as there are so many things going on under the hood.
```python
    tokenizer=tokenizer,
)
trainer.train()
peft_model.save_pretrained("randlora-llama-3-8b")
```
The name doesn't fit the base model.
> There is no additional change needed to your standard PEFT training procedure, simply swap your LoRAConfig for a RandLoraConfig. Note however that RandLora's trainable parameter count is **inversely proportional** to the rank parameter `r`. Lower `r` to increase and increase it to reduce trainable parameters of RandLora.

Suggested change:

> There is no additional change needed to your standard PEFT training procedure, simply swap your `LoraConfig` for a `RandLoraConfig`. Note however that RandLora's trainable parameter count is **inversely proportional** to the rank parameter `r`. Lower `r` to increase and increase it to reduce trainable parameters of RandLora.
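For illustration, a minimal sketch of that swap (field names such as `randlora_alpha` are assumed here from the CLI flags discussed in this PR, not taken verbatim from the docs under review):

```python
# Minimal sketch of swapping LoraConfig for RandLoraConfig.
from transformers import AutoModelForCausalLM
from peft import RandLoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
config = RandLoraConfig(
    r=32,  # lower r -> more trainable parameters, higher r -> fewer
    randlora_alpha=64,  # assumed scaling field, mirroring the --randlora_alpha flag
    target_modules=["k_proj", "v_proj"],  # key/value projections, the default discussed below
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
```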
```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --data_path timdettmers/openassistant-guanaco --use_lora --randlora_alpha
```

> RandLora can be made to use sparse or very sparse random bases. These sparse matrices can help reduce overfitting. To add `--very_sparse` to run with very sparse matrice or run the following for sparse matrices:

Suggested change:

> RandLora can be made to use sparse or very sparse random bases. These sparse matrices can help reduce overfitting. Add `--very_sparse` to run with very sparse matrices or `--sparse` for sparse matrices:
```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --quantize --sparse
```

Suggested change:

```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --sparse
```

Let's remove it here as the option is discussed in the example below.
```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --quantize
```

> By default the RandLora layers are the key and value layers of LLama model. Adding adapters on more layers will increase memory usage. If you whish to choose a different set of layers for RandLora to be applied on, you can simply define it using:

Suggested change:

> By default the RandLora layers are the key and value layers of LLama model. Adding adapters on more layers will increase memory usage. If you wish to choose a different set of layers for RandLora to be applied on, you can simply define it using:
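As a rough sketch of what that might look like (the module names here are assumptions for a Llama-style model, not the exact snippet from the docs):

```python
from peft import RandLoraConfig

# Targeting more projection layers than the default key/value pair
# increases memory usage, as noted above.
config = RandLoraConfig(
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```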
```python
    push_to_hub=push_to_hub,
    hub_model_id=hub_model_id,
    gradient_accumulation_steps=16,
    fp16=True,
```

Should this not depend on the torch_dtype that was chosen earlier?
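One possible way to tie the precision flags to the earlier dtype choice (a sketch only; the script's actual argument names may differ):

```python
import torch
from transformers import TrainingArguments

torch_dtype = torch.bfloat16  # e.g. whatever was parsed from the script's dtype argument

training_args = TrainingArguments(
    output_dir="randlora-output",
    fp16=torch_dtype == torch.float16,
    bf16=torch_dtype == torch.bfloat16,
    gradient_accumulation_steps=16,
)
```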
> This 👆🏻 by default will load the model in peft set up with RandLora config. Now if you wanna quickly compare it with Lora, all you need to do is to input `--use_lora` in the command line and reduce `--randlora_alpha` to 2x the rank. So same above example would be 👇🏻;

```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --data_path timdettmers/openassistant-guanaco --use_lora --randlora_alpha
```

`--randlora_alpha` is missing a value.
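For example (the alpha value below is a placeholder, chosen per the 2x-rank rule of thumb quoted above under the assumption of rank 16; it is not the value from the docs):

```bash
python examples/randlora_finetuning/randlora_finetuning.py --base_model meta-llama/Meta-Llama-3-8B --data_path timdettmers/openassistant-guanaco --use_lora --randlora_alpha 32
```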
```python
    save_total_limit=2,
    push_to_hub=push_to_hub,
    hub_model_id=hub_model_id,
    gradient_accumulation_steps=16,
```

I'd say either remove this argument or make it configurable.
Changed to `16 // batch_size` to ensure the effective batch size after accumulation is 16. Is that suitable?
Because you found that it has to be 16 accumulation steps to work properly? Maybe it's worthwhile mentioning that as a comment.
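A sketch of the change being discussed (variable names assumed), with the kind of comment suggested above:

```python
batch_size = 4  # per-device batch size (assumed name)

# Keep the effective batch size at roughly 16, regardless of the per-device batch size.
gradient_accumulation_steps = max(16 // batch_size, 1)
effective_batch_size = batch_size * gradient_accumulation_steps  # 16 here
```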
src/peft/tuners/randlora/model.py (outdated)

```python
if module_shape != largest_shape:
    largest_shape = tuple(max(a, b) for a, b in zip(largest_shape, module_shape))
    # largest_shape = tuple(max(a, b) for a, b in zip(largest_shape, module_shape))
```

src/peft/tuners/randlora/model.py (outdated)

```python
largest_shape = (
    max(max(module_shape), max(largest_shape)),
    max(min(module_shape), min(largest_shape)),
)
```
Could you please explain this change?
This is a change I implemented to try to reduce the memory usage, which did not work. I didn't mean to commit it, so I'll revert for now.
The change constrains the bases to be as small as possible and uses a transpose view where possible.
Given a two layer network with sizes (D, d) and (d, D) where D > d, the current behavior for a rank 32 is to create a randlora_B random base of size (D, d//32, 32) and randlora_A of size (32, 1, D) so that the bases can be sliced and reused in both layers.
This new behavior changes to randlora_B (D, 32, d//32) and randlora_A (32, 1, d) and transposes the update to fit the size of the second matrix.
This was supposed to be the default behavior but I missed the problem in the RandLora pull request. I'll delay this change until I find a fix for the high memory usage.
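To make the shapes concrete, here is a small numerical illustration of the two behaviors described above (not the PEFT code itself; the dimensions are made up):

```python
# Two linear layers of shape (D, d) and (d, D), with D > d and rank r.
D, d, r = 4096, 2048, 32

# Current behavior: bases sized against the largest dimension D.
randlora_B_shape = (D, d // r, r)   # (4096, 64, 32)
randlora_A_shape = (r, 1, D)        # (32, 1, 4096)

# Reverted change: keep the bases as small as possible and transpose the
# update (a view) to fit the (d, D) layer.
randlora_B_shape_small = (D, r, d // r)  # (4096, 32, 64)
randlora_A_shape_small = (r, 1, d)       # (32, 1, 2048)
```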
149c2b6 to 0ca1d44
notebook fix and remove broken link
0ca1d44 to 07e3780
Thanks for the feedback @BenjaminBossan, I have implemented your suggested changes. Here is a command I used in case the issue happens with other contributions: This fix is suggested in the same thread you linked: https://github.com/orgs/community/discussions/155944#discussioncomment-12856952 Let me know if there is more to improve.
githubnemo left a comment
Thanks for the fixes. I'm taking over the review from @BenjaminBossan but there's not much left to do as it seems :)
Just a few nitpicks from my side.
> RandLora is expected to increase performance over LoRA for equivalent amounts of trainable parameters, mostly for larger equivalent amounts (> LoRA rank 4).
>
> RandLora's perfromance increase comes with two limitations:
Suggested change:

> RandLora's performance increase comes with two limitations:
> Because reducing the rank of RandLora's random bases will increase their number, RandLora can become slower to train than LoRA for very small ranks where typically, ranks below 4 with result in a large training time increase. This does not affect inference though as the RandLora adapters can be merged into the pretrained weight matrices.
>
> RandLora additionally supports training with sparse, unary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users intersted in using sparse unary bases, the `sparse` option is recommended over the `very_sparse` one that can reduce perfromance.
s/perfromance/performance :)
I'm probably missing lingo here but I haven't found confirmation from a quick search so I have to ask: Is unary correct in this case? Isn't the base ternary?
Yes, good point, thanks. Ternary is the correct term.
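For illustration, a sketch of how such ternary bases are typically drawn in the cited random-projection papers (an illustration only, not the PEFT implementation; the scaling factor used in the papers is omitted so the entries stay in {-1, 0, 1}):

```python
import torch

def ternary_basis(rows: int, cols: int, very_sparse: bool = False) -> torch.Tensor:
    # sparse: s = 3 -> P(+1) = P(-1) = 1/6, P(0) = 2/3
    # very sparse: s = sqrt(rows) -> most entries are exactly zero
    s = rows ** 0.5 if very_sparse else 3.0
    probs = torch.tensor([1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    values = torch.tensor([-1.0, 0.0, 1.0])
    idx = torch.multinomial(probs, rows * cols, replacement=True)
    return values[idx].reshape(rows, cols)

basis = ternary_basis(4096, 32, very_sparse=True)
```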
Hi @githubnemo, thanks for your comment and catching the typos.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
githubnemo left a comment
This is great, thanks a lot for the thorough documentation, example and integration into the method comparison suite.
@githubnemo took over the review and all points of the review were addressed.
This is a follow up to huggingface#2464 and issue huggingface#2441. Entails documentation for RandLora and slightly updated example usage in the model.py docstring. Also adds RandLoRA to method comparison. --------- Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Hi @BenjaminBossan and others,
This is a follow up to #2464 and issue #2441.
I have drafted a documentation for RandLora and slightly updated the example usage in the model.py docstring.
Since RandLora performs well compared to Lora on the PEFT model comparison suite, is it also possible to add RandLora to a PEFT leaderboard, or is that something you don't do at the moment?
Happy to iterate or give more example usages.