
Add Helmet data module for HELMET benchmark datasets#48

Open
Jantory wants to merge 5 commits into awslabs:main from Jantory:main

Conversation


@Jantory Jantory commented Mar 4, 2026

LongBenchV2 was the only supported dataset for long-context fine-tuning. This PR adds a Helmet data module (keys_values/data/helmet.py) that loads any HELMET benchmark dataset via the existing load_helmet_dev_eval function, enabling fine-tuning on a broader set of long-context tasks (RAG, summarization, ICL, synthetic retrieval, etc.).
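To illustrate the mapping the new module performs, here is a simplified, self-contained sketch. The real module uses `load_helmet_dev_eval`, `SFTDataset`, and `SequenceLengthFilteredDataModule` from the repository; those are not reproduced here, and the character-based length function is only a stand-in for a real tokenizer.

```python
def to_sft_examples(records, length_fn=len):
    """Map pre-formatted HELMET records to SFT-style examples.

    HELMET records already carry fully formatted `input`/`output`
    fields, so no prompt construction or truncation is needed; we
    only record a sequence length for the length-aware sampler.
    `length_fn=len` counts characters and stands in for tokenization.
    """
    examples = []
    for rec in records:
        prompt, answer = rec["input"], rec["output"]
        examples.append({
            "prompt": prompt,
            "response": answer,
            # computed at load time, consumed later by the sampler
            "seq_len": length_fn(prompt) + length_fn(answer),
        })
    return examples


dev_split = [{"input": "Q: key?", "output": "value"}]
print(to_sft_examples(dev_split)[0]["seq_len"])  # 7 + 5 = 12
```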

What changed:

  • keys_values/data/helmet.py: New Helmet class inheriting SequenceLengthFilteredDataModule. It loads dev/eval splits from load_helmet_dev_eval, maps the pre-formatted input/output fields directly to SFTDataset (no prompt construction or truncation needed), and uses the dev split for train/val and the eval split as the test set. Sequence lengths are computed at load time for the sampler.
  • keys_values/data/__init__.py: Exports Helmet alongside LongBenchV2.
  • keys_values/finetune/longcontext_full.py: Guarded the unconditional data.metadata_dir = str(init_out_dir(...)) call so it is skipped when metadata_dir is None, allowing data modules other than LongBenchV2 to be used without crashing.
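The guard described in the last bullet can be sketched as follows. `DataArgs` and `init_out_dir` here are minimal stand-ins for the project's real config object and output-directory helper, used only to show the conditional pattern.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DataArgs:
    """Stand-in for the data-module config; only the relevant field."""
    metadata_dir: Optional[str] = None


def init_out_dir(path: str) -> Path:
    """Stand-in for the project's helper that resolves an output dir."""
    return Path(path)


def configure_metadata(data: DataArgs) -> DataArgs:
    # Previously the assignment ran unconditionally and crashed for
    # data modules that leave metadata_dir unset; skip it when None.
    if data.metadata_dir is not None:
        data.metadata_dir = str(init_out_dir(data.metadata_dir))
    return data
```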

Helmet can be passed via the existing --data CLI argument using jsonargparse's class/init-args syntax, requiring no changes to the training entry point.

An example invocation specifying the Helmet dataset and maximum sequence length:

```shell
python keys_values/__main__.py finetune_long_lora \
    "${CHECKPOINT_DIR}" \
    --out_dir "${OUT_DIR}" \
    --devices 2 \
    --data Helmet --data.dataset_key json_kv --data.max_length 64k \
    --data.max_seq_length 32768 --data.metadata_dir "${METADATA_DIR}" \
    --head_model next_token_prediction \
    --precision bf16-true --verbose some \
    --kv_cache.name h2o-default --kv_cache.cache_length 16384 --kv_cache.chunk_size 1024 \
    --train.save_interval 10 --train.micro_batch_size 4 --train.global_batch_size 8 \
    --eval.interval 10
```


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Jantory Jantory closed this Mar 12, 2026
@Jantory Jantory reopened this Mar 17, 2026
