Human Promoter Classification with Long Sequences

This series of notebooks looks at training human promoter classification models for long sequences following the ULMFiT approach. Long here is defined as -500/500 relative to known TSS sites. The dataset is constructed following the method outlined in PromID: Human Promoter Prediction by Deep Learning by Umarov et al. Promoter sequences are generated by locating TSS sites listed in the EPDnew Database and taking the sequence region -500/500 relative to the TSS. Negative examples are taken from random regions of the genome that do not contain a defined TSS region. The NCBI Human Genome is used as the reference template.
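The construction above can be sketched as follows. This is a minimal illustration, not the code from Notebook 0: the function names and the `min_dist` exclusion threshold are assumptions, and a real pipeline would read the genome from FASTA and TSS coordinates from EPDnew.

```python
import random

def extract_promoter_windows(genome, tss_positions, upstream=500, downstream=500):
    """Take the -upstream/+downstream window around each TSS as a positive example."""
    windows = []
    for tss in tss_positions:
        start, end = tss - upstream, tss + downstream
        if start >= 0 and end <= len(genome):  # skip TSS too close to a contig edge
            windows.append(genome[start:end])
    return windows

def sample_negative_windows(genome, tss_positions, n,
                            upstream=500, downstream=500, min_dist=1000, seed=0):
    """Sample random same-length windows whose centers lie at least
    min_dist bases away from every known TSS (illustrative criterion)."""
    rng = random.Random(seed)
    length = upstream + downstream
    negatives = []
    while len(negatives) < n:
        start = rng.randrange(0, len(genome) - length)
        center = start + upstream
        if all(abs(center - tss) >= min_dist for tss in tss_positions):
            negatives.append(genome[start:start + length])
    return negatives
```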

Notebook 0 details preparation of the dataset.

The notebooks in this folder look at training a classification model using the different stages of the ULMFiT process.

Notebook 1 trains a naive baseline model. This model trains from scratch using only the promoter sequences dataset.

Notebook 2 trains the classification model initialized with a pre-trained human genome language model.

Notebook 3 first fine-tunes the human genome language model on the promoter corpus, then trains a classification model initialized with the fine-tuned language model. The 5-mer stride 2 language model is used.

Notebook 4 follows the same training procedure as Notebook 3, except it uses the 4-mer stride 2 language model.

Notebook 5 follows the same training procedure as Notebook 3 and Notebook 4, using the 8-mer stride 3 language model.

Notebook 6 trains a 1-mer stride 1 model.
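The k-mer/stride settings referenced above determine how a raw DNA sequence is split into tokens before it reaches the language model. A minimal sketch of that tokenization (the function name `kmer_tokenize` is illustrative, not taken from the notebooks):

```python
def kmer_tokenize(seq, k, stride):
    """Split a DNA sequence into k-mer tokens, starting a new token every `stride` bases.
    With stride < k the tokens overlap; with k=1, stride=1 each base is its own token."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

kmer_tokenize("ACGTACGT", 5, 2)  # → ['ACGTA', 'GTACG']
kmer_tokenize("ACGT", 1, 1)      # → ['A', 'C', 'G', 'T']
```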

Results compared to Umarov et al.:

| Model | DNA Size | Kmer/Stride | Models | Accuracy | Precision | Recall | Correlation Coefficient |
|---|---|---|---|---|---|---|---|
| Umarov et al. | -1000/500 | - | 2 Model Ensemble | - | 0.636 | 0.802 | 0.714 |
| Umarov et al. | -200/400 | - | 2 Model Ensemble | - | 0.769 | 0.755 | 0.762 |
| Naive Model | -500/500 | 5/2 | Single Model | 0.858 | 0.877 | 0.772 | 0.708 |
| With Pre-Training | -500/500 | 5/2 | Single Model | 0.888 | 0.902 | 0.824 | 0.770 |
| With Pre-Training and Fine Tuning (5mer) | -500/500 | 5/2 | Single Model | 0.889 | 0.886 | 0.846 | 0.772 |
| With Pre-Training and Fine Tuning (4mer) | -500/500 | 4/2 | Single Model | 0.892 | 0.877 | 0.865 | 0.778 |
| With Pre-Training and Fine Tuning (8mer) | -500/500 | 8/3 | Single Model | 0.874 | 0.889 | 0.802 | 0.742 |
| With Pre-Training and Fine Tuning (1mer) | -500/500 | 1/1 | Single Model | 0.894 | 0.900 | 0.844 | 0.784 |
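The correlation coefficient reported above is presumably the Matthews correlation coefficient (MCC) used in the PromID comparison. As a reference for how these four metrics relate, here is a sketch computing all of them from binary labels (a generic implementation, not code from the notebooks):

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and Matthews correlation coefficient
    from binary (0/1) labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, recall, mcc
```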