This repository contains a PyTorch implementation of the model presented in the EMNLP 2020 paper "BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance".
The figure below illustrates a high-level view of the model's architecture.
For more details about the techniques of BERT-EMD, refer to our paper.
Run the command below to set up the environment (Python 3):

```
pip install -r requirements.txt
```
- Get the GLUE data:

```
python download_glue_data.py --data_dir glue_data --tasks all
```

As an alternative, you can download it from BaiduYun.
- Get the official BERT-Base model from here, then download and unzip it to the directory `./model/bert_base_uncased`. Convert the TensorFlow checkpoint to a PyTorch model:

```
cd bert_finetune
python convert_bert_original_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path ../model/bert_base_uncased \
--bert_config_file ../model/bert_base_uncased/bert_config.json \
--pytorch_dump_path ../model/pytorch_bert_base_uncased
```

Alternatively, you can download the PyTorch version directly from HuggingFace and save it to `../model/pytorch_bert_base_uncased` (see the sketch below).
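If you take the HuggingFace route instead of converting the TF checkpoint, a minimal sketch could look like the following. It assumes the `transformers` package from `requirements.txt` is installed; `bert-base-uncased` is the standard Hub model id, and the exact file names written may vary with your `transformers` version.

```python
# Sketch: fetch the PyTorch BERT-Base checkpoint from the HuggingFace Hub
# and store it where the fine-tuning script expects it.
from transformers import BertForPreTraining, BertTokenizer

save_dir = "../model/pytorch_bert_base_uncased"

model = BertForPreTraining.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

model.save_pretrained(save_dir)      # writes config.json + pytorch_model.bin
tokenizer.save_pretrained(save_dir)  # writes vocab.txt (and tokenizer config)
```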
- Fine-tune the teacher model, taking the MRPC task as an example (working dir: `./bert_finetune`):
```
export MODEL_PATH=../model/pytorch_bert_base_uncased/
export TASK_NAME=MRPC
python run_glue.py \
--model_type bert \
--model_name_or_path $MODEL_PATH \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir ../data/glue_data/$TASK_NAME/ \
--max_seq_length 128 \
--per_gpu_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 4.0 \
--save_steps 2000 \
--output_dir ../model/$TASK_NAME/teacher/ \
--evaluate_during_training \
--overwrite_output_dir
```
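Before distilling, you can sanity-check the fine-tuned teacher. This is a minimal sketch, assuming `run_glue.py` saved the checkpoint and vocabulary into `../model/MRPC/teacher/` via `save_pretrained` (its usual behavior); the example sentence pair is made up.

```python
# Sketch: load the fine-tuned MRPC teacher and score one paraphrase pair.
# The tuple-style output (outputs[0] = logits) is used for compatibility
# with older transformers versions; adjust if yours returns a ModelOutput.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

teacher_dir = "../model/MRPC/teacher/"
tokenizer = BertTokenizer.from_pretrained(teacher_dir)
model = BertForSequenceClassification.from_pretrained(teacher_dir)
model.eval()

inputs = tokenizer.encode_plus(
    "The company said profits rose sharply.",
    "Profits rose sharply, the company said.",
    max_length=128,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs)[0]
print(torch.softmax(logits, dim=-1))  # [not paraphrase, paraphrase] probabilities
```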
- Get the pretrained general-distillation TinyBERT v2 student models: 4-layer and 6-layer. Unzip them to the directories `model/student/layer4` and `model/student/layer6`, respectively. (These links may be temporarily unavailable; as an alternative you can download them from BaiduYun.)
- Distill the student model, taking the 4-layer student as an example:

```
cd ../bert_emd
export TASK_NAME=MRPC
python emd_task_distill.py \
--data_dir ../data/glue_data/$TASK_NAME/ \
--teacher_model ../model/$TASK_NAME/teacher/ \
--student_model ../model/student/layer4/ \
--task_name $TASK_NAME \
--output_dir ../model/$TASK_NAME/student/ \
--beta 0.01 --theta 1
```

We replace the softmax-based layer-weight update with a divide-by-sum normalization (each layer weight is divided by the sum of all weights). In our experiments, this normalization works better than softmax on some datasets. Weights can range from 1e-3 to 1e+3. A short sketch of the two normalizations is shown below.
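The sketch below only illustrates the two normalizations; the raw per-layer weights are hypothetical and the variable names do not come from `emd_task_distill.py`.

```python
# Sketch: two ways to turn raw per-layer weights into a distribution over
# layers. The divide-by-sum variant corresponds to the update described
# above; the raw weights here are made up for illustration.
import torch

raw_weights = torch.tensor([0.2, 1.5, 0.7, 3.0])  # hypothetical per-layer weights

# softmax normalization (the alternative discussed above)
softmax_weights = torch.softmax(raw_weights, dim=0)

# divide-by-sum normalization: each weight divided by the total
sum_weights = raw_weights / raw_weights.sum()

print(softmax_weights)  # sharper: large weights dominate exponentially
print(sum_weights)      # proportional: preserves the original ratios
```

The divide-by-sum form keeps the layer weights proportional to their raw values instead of exponentially amplifying differences, which is one plausible reason it behaves differently from softmax.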
We list the hyperparameters of the best-performing models below, and we have also fixed some bugs.
| Student Layers | Task | alpha | beta | T_emd | T | Learning Rate |
|---|---|---|---|---|---|---|
| 4 | CoLA | 1 | 0.001 | 5 | 1 | 2.00E-05 |
| 4 | MNLI | 1 | 0.005 | 1 | 3 | 5.00E-05 |
| 4 | MRPC | 1 | 0.001 | 10 | 1 | 2.00E-05 |
| 4 | QQP | 1 | 0.005 | 1 | 3 | 2.00E-05 |
| 4 | QNLI | 1 | 0.005 | 1 | 3 | 2.00E-05 |
| 4 | RTE | 1 | 0.005 | 1 | 1 | 2.00E-05 |
| 4 | SST-2 | 1 | 0.001 | 1 | 1 | 2.00E-05 |
| 4 | STS-b | 1 | 0.005 | 1 | 1 | 3.00E-05 |
| 6 | CoLA | 1 | 0.001 | 1 | 7 | 2.00E-05 |
| 6 | MNLI | 1 | 0.005 | 1 | 1 | 5.00E-05 |
| 6 | MRPC | 1 | 0.005 | 1 | 1 | 2.00E-05 |
| 6 | QQP | 1 | 0.005 | 1 | 1 | 2.00E-05 |
| 6 | QNLI | 1 | 0.001 | 1 | 1 | 5.00E-05 |
| 6 | RTE | 1 | 0.005 | 1 | 1 | 2.00E-05 |
| 6 | SST-2 | 1 | 0.001 | 1 | 1 | 2.00E-05 |
| 6 | STS-b | 1 | 0.005 | 1 | 1 | 3.00E-05 |
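For orientation, the sketch below shows the general shape of a combined distillation objective and where hyperparameters like alpha, beta, and T enter. It is an illustration only: in BERT-EMD the intermediate-layer term is an EMD over a many-to-many layer mapping (with T_emd entering the layer-weight update), not the plain MSE used here; consult `emd_task_distill.py` for the actual loss.

```python
# Sketch: rough structure of a combined distillation objective, with made-up
# tensors. A plain MSE stands in for the EMD-based intermediate-layer loss
# just to show where alpha, beta, and T enter.
import torch
import torch.nn.functional as F

alpha, beta, T = 1.0, 0.005, 3.0  # example values from the 4-layer MNLI row

student_logits = torch.randn(8, 3)          # hypothetical batch of logits
teacher_logits = torch.randn(8, 3)
student_hidden = torch.randn(8, 128, 312)   # hypothetical student hidden states
teacher_hidden = torch.randn(8, 128, 312)   # (projected to the same size)

# prediction-layer distillation: soft cross-entropy at temperature T
soft_targets = F.softmax(teacher_logits / T, dim=-1)
log_probs = F.log_softmax(student_logits / T, dim=-1)
pred_loss = -(soft_targets * log_probs).sum(dim=-1).mean()

# stand-in for the EMD-based intermediate-layer loss
layer_loss = F.mse_loss(student_hidden, teacher_hidden)

loss = alpha * pred_loss + beta * layer_loss
```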