Finetuning transformers to Enzyme Commission (EC) number prediction and gaining insights into their inner workings with an adaptation of integrated gradients
- We run our analysis on a GPU cluster using Apptainer (version 1.1.2) and Slurm.
- SSH into the head node of the GPU cluster with
ssh <username>@<hostname>
- Clone the code repository with
git clone https://github.com/markuswenzel/xai-proteins
- Change directory with
cd xai-proteins/ec
- Build the Apptainer image file with
apptainer build --force --fakeroot image.sif image.def
- Make a directory for data with
mkdir data
and for models with
mkdir models
- Leave the GPU cluster again with
exit
- Preprocess the EC data as detailed on https://github.com/nstrodt/UDSMProt with https://github.com/nstrodt/UDSMProt/blob/master/code/create_datasets.sh, resulting in separate files for ec40 and ec50 on levels L0, L1, and L2. See the publication for details: Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek (2020). UDSMProt: universal deep sequence models for protein classification. Bioinformatics, 36(8), 2401–2409. Alternatively, you can download the six already preprocessed EC datasets with your web browser from here. In particular, we work with ec50_level1, i.e., EC classification on level 1 (differentiation between the six major enzyme classes), using a cluster/similarity threshold of 50 to split train and test data.
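For orientation, the six major enzyme classes mentioned above are the top-level (L1) categories of the EC nomenclature. A minimal sketch (standard EC nomenclature, not code from this repository) mapping an EC number to its level-1 class:

```python
# Top-level (L1) classes of the Enzyme Commission nomenclature.
EC_LEVEL1 = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def ec_level1_class(ec_number: str) -> str:
    """Return the level-1 class name for an EC number like '3.4.21.4'."""
    return EC_LEVEL1[int(ec_number.split(".")[0])]
```

Note that a seventh class (translocases) was only introduced in 2018, after the Swiss-Prot 2017_03 snapshot used here.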
- Upload the EC-datasets to the GPU cluster with the terminal:
scp ~/Downloads/ec50_level0.zip <username>@<hostname>:~/xai-proteins/ec/data/
scp ~/Downloads/ec50_level1.zip <username>@<hostname>:~/xai-proteins/ec/data/
scp ~/Downloads/ec50_level2.zip <username>@<hostname>:~/xai-proteins/ec/data/
scp ~/Downloads/ec40_level0.zip <username>@<hostname>:~/xai-proteins/ec/data/
scp ~/Downloads/ec40_level1.zip <username>@<hostname>:~/xai-proteins/ec/data/
scp ~/Downloads/ec40_level2.zip <username>@<hostname>:~/xai-proteins/ec/data/
You can also use Ubuntu's file manager Nautilus (a.k.a. Files) to copy and paste between your local computer and the file system of the GPU cluster. In Nautilus click on "+ Other Locations" (on the lower left) and then add the server address (next to "Connect to Server"): sftp://<hostname>/
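The six scp commands above can also be generated in a loop; a small shell sketch (same <username>/<hostname> placeholders as above):

```shell
# Print the scp commands for all six preprocessed EC datasets.
# Pipe the output to bash to actually run the transfers:
#   ec_upload_commands | bash
ec_upload_commands() {
  for sim in 40 50; do
    for level in 0 1 2; do
      echo scp ~/Downloads/ec${sim}_level${level}.zip "<username>@<hostname>:~/xai-proteins/ec/data/"
    done
  done
}
```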
- Run finetuning code on GPU cluster with Slurm/Apptainer:
ssh <username>@<hostname>
cd ~/xai-proteins/ec
sbatch submit_prot_ec.sh
- These commands are helpful to track the progress of your cluster jobs:
squeue --me
head -n 30 models/<jobid>_ec.out
tail -n 30 models/<jobid>_ec.out
- If you would like to switch between the six EC datasets, you can edit the file "submit_prot_ec.sh" (with an editor like vi) and change the variable "datasetfile" in this script (from ec50_level1.zip to ec50_level0.zip etc.). Then, launch again with:
sbatch submit_prot_ec.sh
- Edit "submit_prot_ec.sh" also if you would like to switch between BERT and T5 Encoder (or if you want to re-submit a Slurm job based on a previous checkpoint).
- Enter the GPU cluster again, if you have logged out in the meantime:
ssh <username>@<hostname>
cd ~/xai-proteins/ec
- Extract the *.zip file (explainability analysis is for ec50_level1):
unzip -j -d ./data/ec50_level1 data/ec50_level1.zip
- Continue on the cluster and download the Uniprot-Swissprot 2017_03 dataset:
bash ./download_swissprot.sh
If needed, give permission to the file first with:
chmod +x ./download_swissprot.sh
- Execute
sbatch submit_code.sh processing.py
to create the files data/ec*_level*/test.json (and motif_test.json/active_test.json/binding_test.json, which are subsets of test.json containing only annotated samples).
- Make sure that any job result potentially extracted earlier has been removed:
rm -rf models/job_results
Then, untar the selected job result with
tar -xf models/zz_<?>.tar -C models/
("<?>" denotes the respective job identifier)
- (Remove any potential old checkpoints and) copy the new checkpoints/ folder to ./models/ec50_level1:
rm -rf ./models/ec50_level1/checkpoints/
mkdir -p ./models/ec50_level1/
cp -r ./models/job_results/lightning_logs/version_0/checkpoints/ ./models/ec50_level1/
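To verify that the copy succeeded, a hypothetical sanity check (not part of the repository) can list the checkpoint files the explainability scripts will look for:

```python
from pathlib import Path

def find_checkpoints(model_dir="models/ec50_level1"):
    """Return all Lightning checkpoint files under <model_dir>/checkpoints/."""
    return sorted(Path(model_dir).glob("checkpoints/*.ckpt"))

ckpts = find_checkpoints()
if not ckpts:
    print("No checkpoint found -- re-check the tar extraction and cp step above.")
else:
    print(f"Found {len(ckpts)} checkpoint(s): {[p.name for p in ckpts]}")
```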
- The resulting directory structure should now look like this (further files omitted):
├── Files .md .py .sh .def .sif
├── data/ # Data files
│ ├── uniprot_sprot_2017_03.xml # UniProt Swissprot data from March 2017
│ ├── *_site.pkl # Enzyme site annotations for annotation_eval.py
│ └── ec50_level1/ # Data needed to run ig.py for ec50_level1
│ ├── test.json # [n,3] sequence, name, label
├── models/ # Models
│ └── ec50_level1/checkpoints/*.ckpt # finetuned model for ec50_level1 prediction
- Run integrated gradients on the embedding level:
sbatch submit_code.sh ig_embedding.py
Output: ./data/ec50_level1/ig_outputs/embedding_attribution.csv
- Run ig.py for all layers:
bash call_all.sh
(Run potentially several times if the cluster wall time is hit. You can also select one layer only, e.g.
sbatch submit_code.sh ig.py 0
for layer 0. If you want to start all over again from scratch, delete any previous output with
rm -rf ./data/ec50_level1/ig_outputs
and
rm -rf ./data/ec50_level1/ig_outputs_combined
before you run bash call_all.sh again.)
- Combine the results of all layers and proteins:
sbatch submit_code.sh combining_ig_output_files.py
Output: ./data/ec50_level1/ig_outputs_combined/*.csv and ./data/ec50_level1/test_rel.pkl
Note that the explainability code (integrated_gradient_helper.py, ig_embedding.py, ig.py, combining_ig_output_files.py) is tailored to ec50_level1.
- Dimensionality reduction with PCA and t-SNE on the GPU cluster:
sbatch submit_code.sh ig_cluster.py
- Leave the GPU cluster with
exit
and continue on your local computer (workstation, notebook). Download the output files ./data/ec50_level1/test_rel.pkl, ./data/ec50_level1/ig_outputs/embedding_attribution.csv, and ./data/ec50_level1/ig_outputs_combined/*.csv from the cluster to the local computer (clone the same repository; use the same sub-folder structure on the local computer as on the cluster):
git clone https://github.com/markuswenzel/xai-proteins
cd xai-proteins/ec
mkdir -p ./data/ec50_level1/ig_outputs/
mkdir -p ./data/ec50_level1/ig_outputs_combined/
scp <username>@<hostname>:~/xai-proteins/ec/data/ec50_level1/ig_outputs/embedding_attribution.csv ./data/ec50_level1/ig_outputs/
scp <username>@<hostname>:~/xai-proteins/ec/data/ec50_level1/test_rel.pkl ./data/ec50_level1/
scp <username>@<hostname>:~/xai-proteins/ec/data/ec50_level1/ig_outputs_combined/*.csv ./data/ec50_level1/ig_outputs_combined/
Also download and uncompress the EC datasets ./data/*.zip:
scp <username>@<hostname>:~/xai-proteins/ec/data/ec50_level1.zip ./data/
unzip -j -d ./data/ec50_level1 data/ec50_level1.zip
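Once extracted locally, the split files can be inspected; a minimal sketch, assuming test.json is a JSON array of [sequence, name, label] rows as noted in the directory structure above (adjust the indexing if the layout differs):

```python
import json
from collections import Counter

def label_counts(path="data/ec50_level1/test.json"):
    """Count samples per label; assumes rows of (sequence, name, label)."""
    with open(path) as f:
        rows = json.load(f)
    # The label is assumed to be the third field of each row.
    return Counter(row[2] for row in rows)
```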
- Create a new conda environment stats on the local computer:
conda create -n stats ipython
conda activate stats
conda install -c conda-forge ipython scipy==1.8.1 pandas seaborn pytorch scikit-learn tqdm lxml statsmodels
# respectively with
# conda create -n stats
# conda activate stats
# conda install -c conda-forge "scipy>=1.6.0" pandas==1.3.5 seaborn pytorch scikit-learn tqdm lxml statsmodels ipython
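To confirm that the environment provides the required packages, a small stdlib-only check (a hypothetical helper, not part of the repository):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_versions(packages):
    """Return {package: version-or-None} for the given distribution names."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = None
    return out

# Example: installed_versions(["scipy", "pandas", "seaborn", "statsmodels"])
```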
- Download Swiss-Prot to the local computer with
bash ./download_swissprot.sh
- Run the (pre-)processing again locally (in the conda env stats) with
python processing.py
- Run:
python annotation_eval.py
which computes the correlation between the sequence-level relevances (computed with an adaptation of integrated gradients) and the active/binding/motif/transmembrane sites found in the protein database UniProt.
- Run:
python stat_eval.py
which identifies heads with a positive relevance summed along the sequence.
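The head-selection idea can be sketched in pure Python; this is a simplified illustration of the criterion, not the repository's implementation (stat_eval.py applies proper statistics rather than a plain mean):

```python
def positive_relevance_heads(relevances):
    """
    relevances[protein][head] is a list of per-residue relevance scores.
    Returns indices of heads whose relevance, summed along the sequence
    and averaged over proteins, is positive.
    """
    n_heads = len(relevances[0])
    positive = []
    for h in range(n_heads):
        per_protein = [sum(prot[h]) for prot in relevances]
        if sum(per_protein) / len(per_protein) > 0:
            positive.append(h)
    return positive
```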
- Additional experiments and analyses can be run with these commands:
submit_prot_ec_esm.sh # finetune ESM-2 model
submit_prot_ec_flip.sh # conduct residue substitution experiment
submit_prot_ec_pretrained_shuffled.sh # frozen pretrained or shuffled encoder
# analyses for EC40/50 on levels L1/L2 (in addition to EC50 L1)
sbatch submit_code.sh ig_embedding_EC40L1.py
sbatch submit_code.sh ig_embedding_EC40L2.py
sbatch submit_code.sh ig_embedding_EC50L2.py
python annotation_eval_emb_EC4050L12.py
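All of the ig_* scripts build on integrated gradients, which accumulates gradients along a straight-line path from a baseline to the input. A minimal numeric sketch for a scalar function with a known analytic gradient (illustration only, not the repository's implementation):

```python
def integrated_gradients(f_grad, x, baseline=0.0, steps=100):
    """
    Approximate the IG attribution for a scalar input x:
    (x - baseline) times the average of f's gradient along the straight
    path from baseline to x (midpoint Riemann-sum approximation).
    """
    total = 0.0
    for k in range(1, steps + 1):
        point = baseline + ((k - 0.5) / steps) * (x - baseline)
        total += f_grad(point)
    return (x - baseline) * total / steps

# For f(x) = x**2 (gradient 2x), IG from baseline 0 satisfies completeness:
# the attribution approximates f(x) - f(baseline) = x**2.
attr = integrated_gradients(lambda x: 2 * x, 3.0)  # attr ~ 9.0
```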
The file prot_ec.py used for finetuning was copied to prot_ec_mod.py, which was modified for the explainability analysis as follows:
- Commented out:
- Commented out: def __init__(self) -> None:, self.preprocess_dataset(), trainer.fit(model), trainer.test
- Added attention_mask=torch.ones_like(input_ids) to the forward function
- Changed return TensorBoardLogger(save_dir="/opt/output" to return TensorBoardLogger(save_dir=""
- Changed --train_json (below "Data arguments") from /data/train.json to data/ec50_level1/train.json (same for valid.json & test.json)