Each human gene is a blueprint for multiple distinct RNA transcripts, which can be produced in differing quantities depending on where in the body the transcribing cell is located. This process, known as alternative splicing, explains how the human body can produce ~190,000 mRNA transcripts and ~121,000 proteins from only ~22,000 protein-coding genes (Ensembl database release 109, accessed April 2023). Recent large-scale research projects combining DNA and RNA sequencing have identified genetic variants that modify the abundance of alternatively spliced transcripts (splicing quantitative trait loci, sQTLs) in 12,828 (~67%) protein-coding genes across a broad variety of human tissues. Interestingly, the vast majority of genetic variants linked to changes in alternative splicing lie in intronic and intergenic regions, and as a result it is not currently well understood how sQTL variants influence alternative splicing. This is a key gap in our understanding: we know which genetic variants modify alternative splicing and which are linked to disease, but we do not know how they do so. This project will build a deep-learning model that accepts DNA sequence as input and outputs an estimate of alternative splicing across that sequence (expressed as the position and number of gapped reads).
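To illustrate what "the position and number of gapped reads" could look like in practice, the following minimal Python sketch counts gapped reads per splice junction from alignment CIGAR strings, where the `N` operation marks a skipped region of the reference (as defined in the SAM format). The function names, the 0-based half-open coordinate convention and the toy alignments are assumptions made for illustration only; they are not the project's actual pipeline.

```python
import re
from collections import Counter

# CIGAR operations that consume reference positions (SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_read(start, cigar):
    """Return (gap_start, gap_end) pairs for one aligned read.

    `start` is the 0-based leftmost reference position of the alignment.
    Intervals are half-open: [gap_start, gap_end).
    """
    pos = start
    junctions = []
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region: the gap in a gapped read
            junctions.append((pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return junctions


def count_gapped_reads(reads):
    """Count how many reads support each junction."""
    counts = Counter()
    for start, cigar in reads:
        counts.update(junctions_from_read(start, cigar))
    return counts


# Hypothetical toy alignments given as (start, CIGAR) pairs.
toy_reads = [
    (100, "50M200N50M"),  # gap spanning 150-350
    (120, "30M200N70M"),  # same gap, different read start
    (100, "50M500N50M"),  # alternative gap spanning 150-650
    (10, "100M"),         # ungapped read, contributes nothing
]

junction_counts = count_gapped_reads(toy_reads)
for (gap_start, gap_end), n in sorted(junction_counts.items()):
    print(gap_start, gap_end, n)
```

In this toy example two reads support one junction and a single read supports an alternative junction sharing the same start; differences of this kind between junction counts are the signal the model would aim to estimate from sequence.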
Project aims
1. Collect publicly available gene expression data from the Genotype-Tissue Expression (GTEx) project for all human tissues with sufficient sample size.
2. Create input and output data sets of DNA sequence and RNA-sequencing data for training and testing deep-learning models.
3. Implement and train a transformer-based architecture to model the relationship between DNA sequence and alternative splicing.
4. Determine the optimal deep-learning model by validation against publicly available splicing-QTL data sets.
5. Create a public resource of tissue-specific and gene-specific regulatory sequences / regions most relevant to alternative splicing.
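As a rough illustration of aim 2, the sketch below pairs a one-hot encoded DNA window (model input) with a per-position track of gapped-read support (model output). It is a minimal sketch under stated assumptions: the encoding, the choice to score both edges of each gap, the helper names and the toy data are all hypothetical, and the real data sets may be constructed quite differently.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as rows of four values in A, C, G, T order.

    Any other symbol (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES]
            for base in sequence.upper()]


def junction_track(window_start, window_length, junction_counts):
    """Build a per-position target for one window.

    Each gap is a half-open interval [gap_start, gap_end); its read count
    is added at its first and last position when they fall in the window.
    """
    track = [0] * window_length
    for (gap_start, gap_end), n in junction_counts.items():
        for edge in (gap_start, gap_end - 1):
            offset = edge - window_start
            if 0 <= offset < window_length:
                track[offset] += n
    return track


# Hypothetical toy window: 10 bases starting at reference position 1000.
sequence = "ACGTNACGTA"
window_start = 1000
toy_junction_counts = {(1002, 1006): 3, (1002, 1020): 1}

inputs = one_hot(sequence)
targets = junction_track(window_start, len(sequence), toy_junction_counts)

print(len(inputs), len(inputs[0]))
print(targets)
```

Keeping input and target aligned position by position is one simple design choice; it lets a sequence model emit one prediction per base, although other output representations would be equally reasonable.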
Training/techniques to be provided
This project will provide the student with multiple opportunities to develop cutting-edge skills in deep learning, computational biology, statistics and high-performance computing. Training and support will be provided by a diverse team of researchers and academics in the areas of computational biology, deep learning and applied mathematics. Training in R/Python and modern multiomics methods will be provided by the Eales group. Training in the Keras/TensorFlow deep-learning framework will be provided jointly by the two supervisory research groups. This project is suitable for someone with an active research interest in deep learning, multiomics or computational biology.
Entry Requirements
Candidates are expected to hold a minimum upper second-class honours degree (or equivalent) in a related area. A relevant master’s degree (or equivalent) is preferred but not essential. Candidates with experience in artificial neural networks or computational biology, or with an interest in multiomics, are encouraged to apply.
How to Apply
For information on how to apply for this project, please visit the Faculty of Biology, Medicine and Health Doctoral Academy website (https://www.bmh.manchester.ac.uk/study/research/apply/). Informal enquiries may be made directly to the primary supervisor. On the online application form, select PhD Cardiovascular Sciences.
For international students, we also offer a unique 4-year PhD programme that gives you the opportunity to undertake an accredited Teaching Certificate whilst carrying out an independent research project across a range of biological, medical and health sciences. For more information, please visit https://www.bmh.manchester.ac.uk/study/research/international/
Equality, Diversity & Inclusion
Equality, diversity and inclusion is fundamental to the success of The University of Manchester, and is at the heart of all of our activities. The full Equality, diversity and inclusion statement can be found on the website: https://www.bmh.manchester.ac.uk/study/research/apply/equality-diversity-inclusion/