Transformers, a recently introduced class of deep neural network (DNN) models, have been successfully applied to machine learning tasks ranging from natural language processing and image recognition to molecular biology. The core idea behind these unprecedented successes of the Transformer model is its attention mechanism, which allows the network to learn to identify and prioritise informative aspects of the data. In molecular biology, this mechanism allows a network to identify the amino acid substitution matrix, binding sites, and other biologically relevant positions in protein sequences. Google's AlphaFold 2 exploited this idea to accurately predict the tertiary structure of a protein from its amino acid sequence, one of the most important problems in molecular biology. Facebook's MSA Transformer extends the architecture from single sequences to multiple sequence alignments (MSAs), a standard data type in molecular biology applications. Trained on MSAs of proteins in an unsupervised way, the MSA Transformer is capable of predicting the structure and function of proteins. A key aspect of the MSA Transformer is that attention interleaves between rows (protein sequences) and columns (amino acid positions of the alignment) to leverage information from evolutionarily related sequences across many protein families. This advance beyond the sequence-based data that characterised the original Transformer architectures offers great promise for a wide range of applications in computational evolutionary and systems biology.
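The interleaved row/column attention described above can be sketched in a few lines. The toy NumPy version below is illustrative only: it uses identity query/key/value projections and omits the multi-head structure, learned weights, and layer normalisation of the real MSA Transformer; the function names and tensor shapes are our own assumptions, not the published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x: array of shape (..., length, dim). For clarity this toy version
    uses identity projections for queries, keys, and values.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., length, length)
    return softmax(scores, axis=-1) @ x

def msa_attention_block(msa):
    """One interleaved row/column attention pass over an MSA embedding.

    msa: array of shape (num_seqs, seq_len, dim) — one embedding vector
    per amino acid position per aligned sequence.
    """
    # Row attention: each sequence attends across its own positions.
    msa = attention(msa)
    # Column attention: each alignment column attends across sequences,
    # implemented by swapping the sequence and position axes.
    msa = np.swapaxes(attention(np.swapaxes(msa, 0, 1)), 0, 1)
    return msa

# Example: a random embedding of 8 aligned sequences of length 16.
msa = np.random.default_rng(0).normal(size=(8, 16, 4))
out = msa_attention_block(msa)  # same shape as the input: (8, 16, 4)
```

The key design point is that neither attention pass ever forms a full (num_seqs × seq_len)² attention matrix; factorising attention along rows and columns is what lets the model scale to large alignments.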
In this project our overall aim is to develop an interpretable, attention-based approach to genome-wide association studies (GWAS) which accounts for genetic interactions, is capable of modelling multidimensional phenotypes, and scales to the entire human genome. We will develop our approach specifically for gout GWAS data, train it on large-scale UKBB data, and use this to focus on gout-related genetic variation in Māori and Pasifika populations. The project requires us to:
- Design, implement, and train an attention-based DNN suitable for large-scale genotype-phenotype data
- Interpret and analyse the results obtained with the attention-based DNN for the UKBB gout data from both molecular biology and clinical points of view; identify aspects of the network architecture and data analysis pipeline that can be improved; extend the pipeline to account for biases due to the predominantly European ancestry of the UKBB data
- Unify our results with those from the natural language understanding domain.
The student’s role will be to contribute to some (or all) of these tasks.
The project will be supervised by Associate Professor Alex Gavryushkin (University of Canterbury) and Professor Michael Witbrock (University of Auckland), and the student will be hosted by the University of Canterbury (Christchurch, NZ). The project is a collaboration with Dr Megan Leask and Professor Tony Merriman (University of Alabama at Birmingham), and the student will be expected to be involved in this collaboration.
The starting date will be negotiated with the successful applicant, but our preference is to start soon. The student will be supported by a Marsden Fund grant covering the scholarship (NZ$35K p.a., non-taxable), tuition fees, conference- and collaboration-related travel expenses, and publication fees.
Please submit your application including a motivation letter, CV, and any other information you would like to include to support your application by email to Alex Gavryushkin ([Email Address Removed]). Please include [MRSDPHD] in the subject line of your email, otherwise your application might be discarded.
All applications received by Sep 30, 2022 will receive full consideration. All longlisted applicants will be advised by Oct 15, 2022.