Merged
typos and comment on batch size
PaulAlbert31 committed May 6, 2025
commit 57f5747cd621151129377d1186428bbd1eb1976e
2 changes: 1 addition & 1 deletion docs/source/package_reference/randlora.md
@@ -21,7 +21,7 @@ RandLora presents the noteworthy difference that contrary to other LoRA-like PEF

Because reducing the rank of RandLora's random bases increases their number, RandLora can become slower to train than LoRA at very small ranks; typically, ranks below 4 will result in a large increase in training time. This does not affect inference, though, as the RandLora adapters can be merged into the pretrained weight matrices.
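The rank/basis-count trade-off described above can be sketched with a back-of-envelope calculation. The `num_bases` helper and the ceiling-of-`min_dim / rank` scaling below are illustrative assumptions, not the exact formula used by the PEFT implementation:

```python
# Illustration only: assume the number of random bases scales as
# ceil(min_dim / rank), so the summed low-rank updates can reach full rank.
# This is a simplification of RandLora's construction.
def num_bases(min_dim: int, rank: int) -> int:
    return -(-min_dim // rank)  # ceiling division

# Halving the rank doubles the number of bases to train through:
for rank in (32, 16, 4, 2):
    print(f"rank={rank:2d} -> ~{num_bases(1024, rank)} bases")
```

Under this assumption, dropping from rank 32 to rank 2 on a 1024-wide weight grows the basis count from 32 to 512, which is where the training-time increase comes from.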

- RandLora additionally supports training with sparse, unary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users intersted in using sparse unary bases, the `sparse` option is recommended over the `very_sparse` one that can reduce perfromance.
+ RandLora additionally supports training with sparse, ternary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users interested in using sparse ternary bases, the `sparse` option is recommended over the `very_sparse` one, which can reduce performance.
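A sparse ternary basis of the kind the cited papers describe can be generated in a few lines of NumPy. The sparsity parameter `s` and the mapping to PEFT's `sparse`/`very_sparse` options are assumptions for illustration:

```python
import numpy as np

def ternary_basis(rows: int, cols: int, s: float = 3.0, seed: int = 0):
    """Random projection with entries in {-1, 0, +1}.

    P(+1) = P(-1) = 1/(2s), P(0) = 1 - 1/s. With s=3 this is the sparse
    scheme of Achlioptas (see Bingham et al.); with s=sqrt(rows) it is the
    very sparse scheme of Li et al. How PEFT's `sparse`/`very_sparse`
    options map to s is an assumption here.
    """
    rng = np.random.default_rng(seed)
    u = rng.random((rows, cols))
    basis = np.zeros((rows, cols), dtype=np.int8)
    basis[u < 1.0 / (2.0 * s)] = -1
    basis[u > 1.0 - 1.0 / (2.0 * s)] = 1
    return basis

b = ternary_basis(1024, 64)                  # "sparse": ~2/3 of entries are zero
vb = ternary_basis(1024, 64, s=1024 ** 0.5)  # "very sparse": far fewer non-zeros
```

Because most entries are 0 and the rest are ±1, multiplying by such a basis reduces to signed additions of rows, which is the theoretical compute saving the paragraph mentions.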

Similarly to VeRA, when saving RandLora's parameters, it's possible to eschew storing the low-rank matrices by setting `save_projection=False` on the `RandLoraConfig`. In that case, these matrices will be restored from the fixed random seed given by the `projection_prng_key` argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set `save_projection=True` (which is the default).
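The idea behind `save_projection=False` can be sketched with a NumPy stand-in: a frozen random basis never needs to be stored if it can be regenerated deterministically from its seed (PEFT uses a torch generator keyed by `projection_prng_key` rather than NumPy):

```python
import numpy as np

def regenerate_projection(shape, prng_key=0):
    # Deterministically regenerate a frozen random basis from a seed --
    # the idea behind save_projection=False (NumPy stand-in for the
    # torch generator PEFT actually uses).
    return np.random.default_rng(prng_key).standard_normal(shape)

a = regenerate_projection((256, 8), prng_key=42)
b = regenerate_projection((256, 8), prng_key=42)
assert np.array_equal(a, b)  # same key, same matrix: nothing to checkpoint
```

The caveat in the text applies because PyTorch, unlike this NumPy sketch, does not guarantee identical random streams across devices and library versions.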

2 changes: 1 addition & 1 deletion examples/randlora_finetuning/README.md
@@ -84,7 +84,7 @@ RandLora differs from LoRA and other related low rank approximation algorithms b

RandLora is expected to outperform LoRA at equivalent trainable-parameter counts, particularly at larger budgets (equivalent to LoRA ranks above 4).

- RandLora's perfromance increase comes with two limitations:
+ RandLora's performance increase comes with two limitations:

1. Performance depends on using a large `randlora_alpha` scaling parameter (usually 20x the basis rank). This large value can sometimes make training the update unstable; reduce the learning rate or the scaling parameter if this happens.

3 changes: 2 additions & 1 deletion examples/randlora_finetuning/randlora_finetuning.py
@@ -131,7 +131,8 @@ def tokenize_function(examples):
save_total_limit=2,
push_to_hub=push_to_hub,
hub_model_id=hub_model_id,
- gradient_accumulation_steps=16 // batch_size,
+ gradient_accumulation_steps=16
+ // batch_size,  # Maintaining a minimum batch size of 16 post accumulation is recommended to ensure good performance
learning_rate=learning_rate,
hub_token=hf_token,
label_names=["labels"],
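The batch-size arithmetic in the comment above can be checked in plain Python. The `accumulation_steps` helper and the `target` parameter are hypothetical names introduced for illustration:

```python
def accumulation_steps(batch_size: int, target: int = 16) -> int:
    # Mirrors `16 // batch_size`; max(1, ...) additionally guards against
    # batch sizes above 16, where integer division would yield 0 and the
    # Trainer would reject the argument.
    return max(1, target // batch_size)

for bs in (1, 2, 4, 8, 16, 32):
    print(bs, bs * accumulation_steps(bs))  # effective batch size after accumulation
```

Note that the effective batch size only equals 16 when `batch_size` divides 16 evenly; for example, `batch_size=3` gives 3 × 5 = 15.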