Merged
typos and comment on batch size
PaulAlbert31 committed May 6, 2025
commit 57f5747cd621151129377d1186428bbd1eb1976e
2 changes: 1 addition & 1 deletion docs/source/package_reference/randlora.md
@@ -21,7 +21,7 @@ RandLora presents the noteworthy difference that contrary to other LoRA-like PEF

Because reducing the rank of RandLora's random bases increases their number, RandLora can become slower to train than LoRA at very small ranks; typically, ranks below 4 will result in a large increase in training time. This does not affect inference, though, as the RandLora adapters can be merged into the pretrained weight matrices.
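The rank/basis-count trade-off described above can be sketched with a back-of-envelope calculation. The `num_bases` helper and the ceiling-of-`min_dim / rank` scaling below are illustrative assumptions, not the exact formula used by the PEFT implementation:

```python
# Illustration only: assume the number of random bases scales as
# ceil(min_dim / rank), so the summed low-rank updates can reach full rank.
# This is a simplification of RandLora's construction.
def num_bases(min_dim: int, rank: int) -> int:
    return -(-min_dim // rank)  # ceiling division

# Halving the rank doubles the number of bases to train through:
for rank in (32, 16, 4, 2):
    print(f"rank={rank:2d} -> ~{num_bases(1024, rank)} bases")
```

Under this assumption, dropping from rank 32 to rank 2 on a 1024-wide weight grows the basis count from 32 to 512, which is where the training-time increase comes from.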

- RandLora additionally supports training with sparse, unary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users intersted in using sparse unary bases, the `sparse` option is recommended over the `very_sparse` one that can reduce perfromance.
+ RandLora additionally supports training with sparse, ternary random bases (only containing -1, 0 and 1). These bases are as described in [Bingham et al.](https://cs-people.bu.edu/evimaria/cs565/kdd-rp.pdf) and [Ping et al.](https://hastie.su.domains/Papers/Ping/KDD06_rp.pdf) and could theoretically be used to reduce compute needs by performing aggregations instead of matrix multiplications to create the weight update. This is not currently supported. Although it does not currently reduce compute, using sparse random bases in RandLora can reduce overfitting in some cases. For users interested in using sparse ternary bases, the `sparse` option is recommended over the `very_sparse` one, which can reduce performance.
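A sparse ternary basis of the kind the cited papers describe can be generated in a few lines of NumPy. The sparsity parameter `s` and the mapping to PEFT's `sparse`/`very_sparse` options are assumptions for illustration:

```python
import numpy as np

def ternary_basis(rows: int, cols: int, s: float = 3.0, seed: int = 0):
    """Random projection with entries in {-1, 0, +1}.

    P(+1) = P(-1) = 1/(2s), P(0) = 1 - 1/s. With s=3 this is the sparse
    scheme of Achlioptas (see Bingham et al.); with s=sqrt(rows) it is the
    very sparse scheme of Li et al. How PEFT's `sparse`/`very_sparse`
    options map to s is an assumption here.
    """
    rng = np.random.default_rng(seed)
    u = rng.random((rows, cols))
    basis = np.zeros((rows, cols), dtype=np.int8)
    basis[u < 1.0 / (2.0 * s)] = -1
    basis[u > 1.0 - 1.0 / (2.0 * s)] = 1
    return basis

b = ternary_basis(1024, 64)                  # "sparse": ~2/3 of entries are zero
vb = ternary_basis(1024, 64, s=1024 ** 0.5)  # "very sparse": far fewer non-zeros
```

Because most entries are 0 and the rest are ±1, multiplying by such a basis reduces to signed additions of rows, which is the theoretical compute saving the paragraph mentions.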

Similarly to VeRA, when saving RandLora's parameters, it's possible to eschew storing the low-rank matrices by setting `save_projection=False` on the `RandLoraConfig`. In that case, these matrices will be restored from the fixed random seed given by the `projection_prng_key` argument. This cuts down on the size of the checkpoint, but we cannot guarantee reproducibility on all devices and for all future versions of PyTorch. If you want to ensure reproducibility, set `save_projection=True` (which is the default).
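The idea behind `save_projection=False` can be sketched with a NumPy stand-in: a frozen random basis never needs to be stored if it can be regenerated deterministically from its seed (PEFT uses a torch generator keyed by `projection_prng_key` rather than NumPy):

```python
import numpy as np

def regenerate_projection(shape, prng_key=0):
    # Deterministically regenerate a frozen random basis from a seed --
    # the idea behind save_projection=False (NumPy stand-in for the
    # torch generator PEFT actually uses).
    return np.random.default_rng(prng_key).standard_normal(shape)

a = regenerate_projection((256, 8), prng_key=42)
b = regenerate_projection((256, 8), prng_key=42)
assert np.array_equal(a, b)  # same key, same matrix: nothing to checkpoint
```

The caveat in the text applies because PyTorch, unlike this NumPy sketch, does not guarantee identical random streams across devices and library versions.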

2 changes: 1 addition & 1 deletion examples/randlora_finetuning/README.md
@@ -84,7 +84,7 @@ RandLora differs from LoRA and other related low rank approximation algorithms b

RandLora is expected to outperform LoRA at equivalent trainable-parameter counts, particularly at larger budgets (equivalent to LoRA ranks above 4).

- RandLora's perfromance increase comes with two limitations:
+ RandLora's performance increase comes with two limitations:

1. Performance depends on using a large `randlora_alpha` scaling parameter (usually 20x the basis rank). This large value can sometimes make training the update unstable; reduce the learning rate or the scaling parameter if this happens.

3 changes: 2 additions & 1 deletion examples/randlora_finetuning/randlora_finetuning.py
@@ -131,7 +131,8 @@ def tokenize_function(examples):
save_total_limit=2,
push_to_hub=push_to_hub,
hub_model_id=hub_model_id,
- gradient_accumulation_steps=16 // batch_size,
+ gradient_accumulation_steps=16
+ // batch_size,  # Maintaining a minimum batch size of 16 post accumulation is recommended to ensure good performance
learning_rate=learning_rate,
hub_token=hf_token,
label_names=["labels"],
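The batch-size arithmetic in the comment above can be checked in plain Python. The `accumulation_steps` helper and the `target` parameter are hypothetical names introduced for illustration:

```python
def accumulation_steps(batch_size: int, target: int = 16) -> int:
    # Mirrors `16 // batch_size`; max(1, ...) additionally guards against
    # batch sizes above 16, where integer division would yield 0 and the
    # Trainer would reject the argument.
    return max(1, target // batch_size)

for bs in (1, 2, 4, 8, 16, 32):
    print(bs, bs * accumulation_steps(bs))  # effective batch size after accumulation
```

Note that the effective batch size only equals 16 when `batch_size` divides 16 evenly; for example, `batch_size=3` gives 3 × 5 = 15.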