First, tokenize the raw text data using the model's tokenizer: run `pre_tokenize_glm4.py` for GLM-4-9B or `pre_tokenize_llama3.py` for Llama-3.1-8B, and remember to add the path to your general SFT data. Please format your data as follows:
```json
{
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
        ...
    ]
}
```
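If your raw SFT data is not yet in this format, a small conversion script along the following lines can produce it. This is only a sketch: the `instruction`/`response` field names and the file paths are assumptions about your raw data, not part of the provided scripts.
```python
import json

# Hypothetical input: one JSON object per line with "instruction" and "response"
# fields. Adjust the field names and paths to match your own SFT data.
with open("raw_sft.jsonl") as fin, open("sft_messages.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        record = {
            "messages": [
                {"role": "user", "content": example["instruction"]},
                {"role": "assistant", "content": example["response"]},
            ]
        }
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```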
We use a packing strategy for more efficient training. Run
```bash
python sort_and_group.py --train_file ./data/glm4/longwriter
```
to organize the tokenized data for packing training.
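To convey what the packing step does conceptually, here is a toy sketch of greedy length-based packing. It is only an illustration under an assumed packed length of 32,768 tokens, not the actual logic of `sort_and_group.py`.
```python
# Toy illustration of length-based greedy packing (not the real implementation):
# sort tokenized samples by length, then group them so that each pack stays
# within a maximum token budget.
MAX_LEN = 32768  # assumed packed sequence length

def pack_samples(samples):
    """samples: list of token-id lists. Returns a list of packs (lists of samples)."""
    packs, current, current_len = [], [], 0
    for sample in sorted(samples, key=len):
        if current and current_len + len(sample) > MAX_LEN:
            packs.append(current)
            current, current_len = [], 0
        current.append(sample)
        current_len += len(sample)
    if current:
        packs.append(current)
    return packs
```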
We provide training scripts under `scripts/` for the GLM-4-9B and Llama-3.1-8B model series. Make sure to adjust `--model_name_or_path`, `--train_file`, and `--output_dir` to match your model path, data path, and output path. We also support LoRA fine-tuning; see `scripts/llama3_longwriter_lora.sh` for an example.
To support packing training, we provide patch files under `patch/`; please replace the original modeling files with them.
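For GLM-4-9B, applying the patch amounts to copying the patched files over the originals in your local checkpoint folder. The snippet below is a sketch: the model directory and the exact file names under `patch/` are assumptions, so check them against the repository before running it.
```python
import shutil

# Assumed local path to the downloaded GLM-4-9B checkpoint; adjust as needed.
model_dir = "./glm-4-9b"

# Overwrite the original files with the packing-aware versions under patch/
# (file names as referenced in the troubleshooting notes below).
for fname in ["modeling_chatglm.py", "tokenization_chatglm.py"]:
    shutil.copy(f"patch/{fname}", f"{model_dir}/{fname}")
```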
Environment: `transformers==4.33.0` for GLM-4-9B and `transformers==4.43.0` for Llama-3.1-8B.
- Error when running the training script:
  ⚠️ `DeepSpeedZeroConfig stage3_prefetch_bucket_size Input should be a valid integer, got a number with a fractional part`. This may happen if your `deepspeed>=0.15.0`; we suggest downgrading to `deepspeed==0.14.4` to resolve this issue.
- Error when training GLM-4-9B:
  ⚠️ `return self.mergeable_ranks[token] KeyError: '<|endoftext|>'`. Please make sure you have replaced the `tokenization_chatglm.py` and `modeling_chatglm.py` files under the GLM-4-9B model folder with the patch files under `patch/`. Also make sure your environment satisfies the requirements: `transformers==4.33.0` and `flash-attn>=2.0.0`.
- Error encountered during GLM-4-9B training:
  ⚠️ `RuntimeError: shape '[32768, -1, 1, 32, 2]' is invalid for...`. Please modify `"seq_length"` in the `config.json` of GLM-4-9B from 8192 to 131072; a small helper for this edit is sketched after this list.
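The following is a minimal sketch of that `config.json` edit; the checkpoint path is an assumption, so point it at your local GLM-4-9B folder.
```python
import json

# Assumed path to your local GLM-4-9B checkpoint; adjust as needed.
config_path = "./glm-4-9b/config.json"

with open(config_path) as f:
    config = json.load(f)

# Raise the maximum sequence length so packed training samples fit.
config["seq_length"] = 131072

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```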