This series of notebooks looks at training human promoter classification models for long sequences following the ULMFiT approach. Long here is defined as -500/500 relative to known TSS sites. The dataset is constructed following the method outlined in *PromID: Human Promoter Prediction by Deep Learning* by Umarov et al. Promoter sequences are generated by locating TSS sites listed in the EPDnew database and taking the sequence region -500/500 relative to the TSS. Negative examples are taken from random regions of the genome that do not contain a defined TSS region. The NCBI Human Genome is used as the reference template.
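The extraction step described above can be sketched as follows. This is a minimal illustration, not the code from Notebook 0: the function names, the 0-based TSS coordinates, and the exclusion margin around known TSS sites are all assumptions made for the example.

```python
import random

def extract_promoter(genome_seq, tss_pos, upstream=500, downstream=500):
    """Take the -500/+500 window around a TSS (0-based position assumed)."""
    start = tss_pos - upstream
    end = tss_pos + downstream
    if start < 0 or end > len(genome_seq):
        return None  # window runs off the end of the sequence
    return genome_seq[start:end]

def sample_negative(genome_seq, tss_positions, length=1000, margin=1000, tries=100):
    """Draw a random window whose center stays at least `margin` bases
    away from every known TSS (margin value is illustrative)."""
    for _ in range(tries):
        start = random.randrange(0, len(genome_seq) - length)
        center = start + length // 2
        if all(abs(center - t) > margin for t in tss_positions):
            return genome_seq[start:start + length]
    return None  # gave up after `tries` attempts
```

In practice this would be run per chromosome over the NCBI reference, with one positive window per EPDnew TSS and a matched number of negative windows.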
Notebook 0 details preparation of the dataset.
The notebooks in this folder look at training a classification model using the different stages of the ULMFiT process.
Notebook 1 trains a naive baseline model. This model trains from scratch using only the promoter sequences dataset.
Notebook 2 trains the classification model initialized with a pre-trained human genome language model.
Notebook 3 first fine-tunes the human genome language model on the promoter corpus, then trains a classification model initialized with the fine-tuned language model. The 5-mer stride 2 language model is used.
Notebook 4 follows the same training procedure as Notebook 3, except it uses the 4-mer stride 2 language model.
Notebook 5 follows the same training procedure as Notebook 3 and Notebook 4, using the 8-mer stride 3 language model.
Notebook 6 trains a 1-mer stride 1 model.
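The k-mer/stride schemes the notebooks compare can be illustrated with a small tokenizer sketch (hypothetical helper, not the notebooks' own tokenization code). A k-mer of size `k` taken every `stride` bases turns a DNA string into overlapping tokens; e.g. 5-mers give a vocabulary of up to 4^5 = 1024 tokens, while 1-mer stride 1 reduces to single-base tokens.

```python
def kmer_tokenize(seq, k=5, stride=2):
    """Break a DNA sequence into k-mer tokens sampled every `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# 5-mer stride 2 over an 8-base sequence yields two overlapping tokens:
# kmer_tokenize("ATGCATGC", k=5, stride=2) -> ['ATGCA', 'GCATG']
```

Larger k with a small stride gives heavily overlapping tokens and a bigger vocabulary; 1-mer stride 1 gives the longest token sequence with the smallest vocabulary.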
Results compared to Umarov et al.:
| Model | DNA Size | Kmer/Stride | Models | Accuracy | Precision | Recall | Correlation Coefficient |
|---|---|---|---|---|---|---|---|
| Umarov et al. | -1000/500 | - | 2 Model Ensemble | - | 0.636 | 0.802 | 0.714 |
| Umarov et al. | -200/400 | - | 2 Model Ensemble | - | 0.769 | 0.755 | 0.762 |
| Naive Model | -500/500 | 5/2 | Single Model | 0.858 | 0.877 | 0.772 | 0.708 |
| With Pre-Training | -500/500 | 5/2 | Single Model | 0.888 | 0.902 | 0.824 | 0.770 |
| With Pre-Training and Fine Tuning (5mer) | -500/500 | 5/2 | Single Model | 0.889 | 0.886 | 0.846 | 0.772 |
| With Pre-Training and Fine Tuning (4mer) | -500/500 | 4/2 | Single Model | 0.892 | 0.877 | 0.865 | 0.778 |
| With Pre-Training and Fine Tuning (8mer) | -500/500 | 8/3 | Single Model | 0.874 | 0.889 | 0.802 | 0.742 |
| With Pre-Training and Fine Tuning (1mer) | -500/500 | 1/1 | Single Model | 0.894 | 0.900 | 0.844 | 0.784 |
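For reference, the table's metrics can be computed from a binary confusion matrix as below. This assumes the Correlation Coefficient column is the Matthews correlation coefficient (MCC), the metric Umarov et al. report; the function name and example counts are illustrative.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and Matthews correlation coefficient
    from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, recall, mcc
```

Note that MCC uses all four cells of the confusion matrix, so it penalizes false positives and false negatives symmetrically even when the classes are imbalanced.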