First, tokenize the raw text data using the model's tokenizer: run `pre_tokenize_glm4.py` for GLM-4-9B or `pre_tokenize_llama3.py` for Llama-3.1-8B, and remember to add the path to your general SFT data. Please format your data as follows:
```json
{
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        ...
    ]
}
```
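If your raw SFT data is not yet in this format, a small conversion script along the following lines can produce it. This is only a sketch: the `instruction`/`response` field names and the file paths are assumptions about your raw data, not part of the provided scripts.
```python
import json

# Hypothetical input: one JSON object per line with "instruction" and "response"
# fields. Adjust the field names and paths to match your own SFT data.
with open("raw_sft.jsonl") as fin, open("sft_messages.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        record = {
            "messages": [
                {"role": "user", "content": example["instruction"]},
                {"role": "assistant", "content": example["response"]},
            ]
        }
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```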
We use a packing strategy for more efficient training. Run
```bash
python sort_and_group.py --train_file ./data/glm4/longwriter
```
to organize the tokenized data for packing training.
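To convey what the packing step does conceptually, here is a toy sketch of greedy length-based packing. It is only an illustration under an assumed packed length of 32,768 tokens, not the actual logic of `sort_and_group.py`.
```python
# Toy illustration of length-based greedy packing (not the real implementation):
# sort tokenized samples by length, then group them so that each pack stays
# within a maximum token budget.
MAX_LEN = 32768  # assumed packed sequence length

def pack_samples(samples):
    """samples: list of token-id lists. Returns a list of packs (lists of samples)."""
    packs, current, current_len = [], [], 0
    for sample in sorted(samples, key=len):
        if current and current_len + len(sample) > MAX_LEN:
            packs.append(current)
            current, current_len = [], 0
        current.append(sample)
        current_len += len(sample)
    if current:
        packs.append(current)
    return packs
```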
We provide training scripts under `scripts/` for the GLM-4-9B and Llama-3.1-8B model series. Make sure to adjust `--model_name_or_path`, `--train_file`, and `--output_dir` to match your model path, data path, and output path. We also support LoRA fine-tuning; see `scripts/llama3_longwriter_lora.sh` for an example.
To support packing training, we provide patch files under `patch/`; please replace the original modeling files with them.
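For GLM-4-9B, applying the patch amounts to copying the patched files over the originals in your local checkpoint folder. The snippet below is a sketch: the model directory and the exact file names under `patch/` are assumptions, so check them against the repository before running it.
```python
import shutil

# Assumed local path to the downloaded GLM-4-9B checkpoint; adjust as needed.
model_dir = "./glm-4-9b"

# Overwrite the original files with the packing-aware versions under patch/
# (file names as referenced in the troubleshooting notes below).
for fname in ["modeling_chatglm.py", "tokenization_chatglm.py"]:
    shutil.copy(f"patch/{fname}", f"{model_dir}/{fname}")
```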
Environment: `transformers==4.33.0` for GLM-4-9B and `transformers==4.43.0` for Llama-3.1-8B.
- Error when running the training script:
  ⚠️ `DeepSpeedZeroConfig stage3_prefetch_bucket_size Input should be a valid integer, got a number with a fractional part`. This may happen if your `deepspeed>=0.15.0`; we suggest downgrading to `deepspeed==0.14.4` to resolve this issue.
- Error when training GLM-4-9B:
  ⚠️ `return self.mergeable_ranks[token] KeyError: '<|endoftext|>'`. Please make sure you have replaced the `tokenization_chatglm.py` and `modeling_chatglm.py` files under the GLM-4-9B model folder with the patch files under `patch/`. Also make sure your environment satisfies the requirements: `transformers==4.33.0` and `flash-attn>=2.0.0`.
- Error encountered during GLM-4-9B training:
  ⚠️ `RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for...`. Please modify `"seq_length"` in the `config.json` of GLM-4-9B from 8192 to 131072; a small helper for this edit is sketched after this list.
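The following is a minimal sketch of that `config.json` edit; the checkpoint path is an assumption, so point it at your local GLM-4-9B folder.
```python
import json

# Assumed path to your local GLM-4-9B checkpoint; adjust as needed.
config_path = "./glm-4-9b/config.json"

with open(config_path) as f:
    config = json.load(f)

# Raise the maximum sequence length so packed training samples fit.
config["seq_length"] = 131072

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```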