Human Promoter Classification with Long Sequences

This series of notebooks looks at training human promoter classification models for long sequences following the ULMFiT approach. Long here is defined as -500/500 relative to known TSS sites. The dataset is constructed following the method outlined in PromID: Human Promoter Prediction by Deep Learning by Umarov et al. Promoter sequences are generated by locating TSS sites listed in the EPDnew Database and taking the sequence region -500/500 relative to the TSS. Negative examples are taken from random regions of the genome that do not contain a defined TSS region. The NCBI Human Genome is used as the reference template.
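The construction above can be sketched as follows. This is a minimal illustration, not the code from Notebook 0: the function names and the `min_dist` exclusion threshold are assumptions, and a real pipeline would read the genome from FASTA and TSS coordinates from EPDnew.

```python
import random

def extract_promoter_windows(genome, tss_positions, upstream=500, downstream=500):
    """Take the -upstream/+downstream window around each TSS as a positive example."""
    windows = []
    for tss in tss_positions:
        start, end = tss - upstream, tss + downstream
        if start >= 0 and end <= len(genome):  # skip TSS too close to a contig edge
            windows.append(genome[start:end])
    return windows

def sample_negative_windows(genome, tss_positions, n,
                            upstream=500, downstream=500, min_dist=1000, seed=0):
    """Sample random same-length windows whose centers lie at least
    min_dist bases away from every known TSS (illustrative criterion)."""
    rng = random.Random(seed)
    length = upstream + downstream
    negatives = []
    while len(negatives) < n:
        start = rng.randrange(0, len(genome) - length)
        center = start + upstream
        if all(abs(center - tss) >= min_dist for tss in tss_positions):
            negatives.append(genome[start:start + length])
    return negatives
```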

Notebook 0 details preparation of the dataset.

The notebooks in this folder look at training a classification model using the different stages of the ULMFiT process.

Notebook 1 trains a naive baseline model. This model trains from scratch using only the promoter sequences dataset.

Notebook 2 trains the classification model initialized with a pre-trained human genome language model.

Notebook 3 first fine-tunes the human genome language model on the promoter corpus, then trains a classification model initialized with the fine-tuned language model. The 5-mer stride 2 language model is used.

Notebook 4 follows the same training procedure as Notebook 3, except it uses the 4-mer stride 2 language model.

Notebook 5 follows the same training procedure as Notebook 3 and Notebook 4, using the 8-mer stride 3 language model.

Notebook 6 trains a 1-mer stride 1 model.
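The k-mer/stride settings referenced above determine how a raw DNA sequence is split into tokens before it reaches the language model. A minimal sketch of that tokenization (the function name `kmer_tokenize` is illustrative, not taken from the notebooks):

```python
def kmer_tokenize(seq, k, stride):
    """Split a DNA sequence into k-mer tokens, starting a new token every `stride` bases.
    With stride < k the tokens overlap; with k=1, stride=1 each base is its own token."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

kmer_tokenize("ACGTACGT", 5, 2)  # → ['ACGTA', 'GTACG']
kmer_tokenize("ACGT", 1, 1)      # → ['A', 'C', 'G', 'T']
```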

Results compared to Umarov et al.:

| Model | DNA Size | Kmer/Stride | Models | Accuracy | Precision | Recall | Correlation Coefficient |
|---|---|---|---|---|---|---|---|
| Umarov et al. | -1000/500 | - | 2 Model Ensemble | - | 0.636 | 0.802 | 0.714 |
| Umarov et al. | -200/400 | - | 2 Model Ensemble | - | 0.769 | 0.755 | 0.762 |
| Naive Model | -500/500 | 5/2 | Single Model | 0.858 | 0.877 | 0.772 | 0.708 |
| With Pre-Training | -500/500 | 5/2 | Single Model | 0.888 | 0.902 | 0.824 | 0.770 |
| With Pre-Training and Fine Tuning (5mer) | -500/500 | 5/2 | Single Model | 0.889 | 0.886 | 0.846 | 0.772 |
| With Pre-Training and Fine Tuning (4mer) | -500/500 | 4/2 | Single Model | 0.892 | 0.877 | 0.865 | 0.778 |
| With Pre-Training and Fine Tuning (8mer) | -500/500 | 8/3 | Single Model | 0.874 | 0.889 | 0.802 | 0.742 |
| With Pre-Training and Fine Tuning (1mer) | -500/500 | 1/1 | Single Model | 0.894 | 0.900 | 0.844 | 0.784 |
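The correlation coefficient reported above is presumably the Matthews correlation coefficient (MCC) used in the PromID comparison. As a reference for how these four metrics relate, here is a sketch computing all of them from binary labels (a generic implementation, not code from the notebooks):

```python
import math

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and Matthews correlation coefficient
    from binary (0/1) labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, precision, recall, mcc
```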