Framework for training state-of-the-art embedding models using contrastive learning at large batch sizes.
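Contrastive training of embedding models typically optimizes an InfoNCE-style objective in which every other document in the batch serves as a negative, which is why larger batches (more negatives per step) help. The sketch below is a minimal, generic PyTorch illustration of that objective, not the implementation used in this repository; the temperature value is only a placeholder.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss with in-batch negatives (illustrative only).

    query_emb, doc_emb: (batch_size, dim) embeddings where doc_emb[i] is the
    positive document for query_emb[i]; all other rows act as negatives.
    The temperature of 0.05 is a placeholder, not this repo's setting.
    """
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # (batch_size, batch_size) cosine-similarity logits; diagonal = positives
    logits = query_emb @ doc_emb.T / temperature
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```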
- Create and activate a fresh conda environment.
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
Prepare your datasets as JSONL files with the following fields:

- `query`: str
- `positive_doc`: str
- `negative_docs`: List[str] (not needed for pretraining)

Sample datasets:

- Pretraining: `resources/pretraining_data/*.jsonl`
- Fine-tuning: `resources/finetuning_data/*.jsonl`
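For illustration only, a fine-tuning record with the fields above could be written as follows (the texts and file name are made up); a pretraining record is the same minus `negative_docs`:

```python
import json

# Hypothetical record contents; only the field names come from the format above.
finetuning_record = {
    "query": "how do I reset my router?",
    "positive_doc": "Hold the reset button for 10 seconds to restore factory settings.",
    "negative_docs": [
        "Routers forward packets between computer networks.",
        "A modem connects your home network to your internet provider.",
    ],
}

# One JSON object per line; pretraining records simply omit "negative_docs".
with open("my_finetuning_data.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(finetuning_record) + "\n")
```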
Training requires pretokenized datasets stored as binary files. To tokenize your data:

```bash
# For pretraining data
python corgee/data/create_tokbins.py \
    --tokenizer intfloat/multilingual-e5-base \
    --input_dir resources/pretraining_data/ \
    --output_dir resources/pretraining_data_tokenized/

# For fine-tuning data
python corgee/data/create_tokbins.py \
    --tokenizer intfloat/multilingual-e5-base \
    --input_dir resources/finetuning_data/ \
    --output_dir resources/finetuning_data_tokenized/
```
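The on-disk binary format is defined by `create_tokbins.py` itself; conceptually, the script runs each text field through the specified Hugging Face tokenizer before training. The hedged sketch below only illustrates that tokenization step (the truncation lengths are placeholders, and it does not reproduce the tokbin layout):

```python
from transformers import AutoTokenizer

# Illustrates the tokenization step only; the actual tokbin files are written
# by corgee/data/create_tokbins.py and their layout is not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")

record = {
    "query": "how do I reset my router?",
    "positive_doc": "Hold the reset button for 10 seconds to restore factory settings.",
}
query_ids = tokenizer(record["query"], truncation=True, max_length=64)["input_ids"]
doc_ids = tokenizer(record["positive_doc"], truncation=True, max_length=256)["input_ids"]
print(len(query_ids), len(doc_ids))
```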
- Create a `config.yaml` file with the relevant parameters.
  - Sample pretraining and finetuning configs are provided in the `configs/` directory.
- Start training:

  For running on a single node:

  ```bash
  source run.sh config.yaml
  ```

  For running on multiple nodes (e.g., 4 nodes):

  ```bash
  DIST_NUM_NODES=4 source run.sh config.yaml
  ```

  Adjust the `DIST_NUM_NODES` value according to your setup.
- Parameter Configuration:
  - Set parameters in `config.yaml`
  - Override important parameters via the command line as needed

Sample configs are provided in `configs/`.
| Parameter | Description |
|---|---|
| `output_dir` | Directory for logs and saved models |
| `batch_size` | Training batch size |
| `max_forward_batch_size` | Maximum batch size for a single GPU forward pass |
| `files` | Dictionary of dataset configurations |
Each dataset in the `files` dictionary requires:

- `num_steps`: Number of training batches to sample
- `maxlen1`: Maximum tokens in the query
- `maxlen2`: Maximum tokens in positive/negative documents
- `file_pattern`: Regex pattern for tokbin files
Note: Batches are sampled from one dataset at a time. For language-wise sampling, make each language a separate dataset.
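Putting the pieces together, a config might look roughly like the sketch below, written here with PyYAML purely for illustration. Only the parameter names from the table and the dataset fields above come from this README; the dataset names, values, and file-name patterns are assumptions, so treat the sample configs in `configs/` as the authoritative reference.

```python
import yaml  # PyYAML

# Illustrative configuration: parameter names follow the table and field list
# above; every value, dataset name, and file-name pattern is made up.
config = {
    "output_dir": "outputs/pretraining_run",   # logs and saved models
    "batch_size": 2048,                        # training batch size
    "max_forward_batch_size": 256,             # largest chunk forwarded at once
    "files": {
        # One entry per dataset; batches are drawn from one dataset at a time,
        # so per-language sampling is achieved by one entry per language.
        "english_pretraining": {
            "num_steps": 10000,
            "maxlen1": 64,      # max tokens in the query
            "maxlen2": 256,     # max tokens in positive/negative documents
            "file_pattern": "resources/pretraining_data_tokenized/en_.*",
        },
        "german_pretraining": {
            "num_steps": 5000,
            "maxlen1": 64,
            "maxlen2": 256,
            "file_pattern": "resources/pretraining_data_tokenized/de_.*",
        },
    },
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```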
