Mathematical models of disease evolution seek to specify the complex mechanistic relationships between many measurable quantities over space and time. However, as our capability to generate experimental data improves, our ability to formulate appropriate mathematical descriptions of biological phenomena has become a bottleneck.
The goal of machine learning is to “learn” complex dependencies within data sets automatically, and neural networks (NNs) have been at the forefront of recent advances. Novel computational machinery has enabled NN approaches to scale to the analysis of unprecedentedly large data sets and to offer greater versatility than standard modelling approaches. However, despite their successes in a range of applications, NNs are often derided in biomedical research for their “black box” discoveries, their lack of interpretability and their need for unrealistic quantities of training data. Such criticisms often overlook the fact that default NN structures express no explicit assumptions about the problems to which they are applied, and that they are often used as generic data mining devices for discovery purposes. Nonetheless, there is a clear opportunity to develop novel NN approaches that combine scalability and versatility with the capability of learning the physically realistic constraints that are embedded in hand-crafted mathematical models.
This work will be motivated by three distinct application areas:
Cancer evolution modelling from ‘omics data;
Epidemic modelling of antibiotic resistance;
Biomarker modelling in multimorbid populations (joint with MRC Biostatistics Unit, Cambridge).
For these areas we have already established data sets and pre-existing statistical models [Auguet et al. (2016), Lopez-Garcia & Kypraios (2018), Kypraios & O’Neill (2017), Hu et al. (2017), Campbell & Yau (2018)].
Our objective is to develop a neural process approach (https://kasparmartens.rbind.io/post/np/) that automatically learns to emulate these statistical models (and the physical constraints they are based upon) whilst also providing specific adaptations to adjust to novel experimental settings. In doing so, we will allow non-experts access to advanced modelling techniques, through an Automated BioData Scientist (AutoBioDataSci).
Experimental Methods, Research Plan:
Neural Processes (NPs) use neural networks to describe a probability distribution over functional relationships. These models are akin to the well-studied Gaussian Processes but are more readily scalable to large data sets. NPs project complex high-dimensional observations onto low-dimensional latent function spaces.
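To make the idea of a distribution over functions concrete, the following is a minimal sketch of a latent-variable NP forward pass, written with NumPy only. It is an illustration under stated assumptions, not the project's implementation: the network sizes, the mean aggregation, the latent dimension and all function names (mlp_params, encode, sample_functions) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes, rng):
    """Random (untrained) weights for a small fully connected network."""
    return [(rng.normal(0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply the network: tanh on hidden layers, linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

LATENT_DIM = 2  # size of the low-dimensional latent space (illustrative)
encoder = mlp_params([2, 16, 2 * LATENT_DIM], rng)  # (x, y) -> per-point code
decoder = mlp_params([1 + LATENT_DIM, 16, 1], rng)  # (x, z) -> predicted y

def encode(x_ctx, y_ctx):
    """Summarise a context set as the mean and std of a Gaussian over z."""
    codes = mlp(encoder, np.stack([x_ctx, y_ctx], axis=1))
    pooled = codes.mean(axis=0)  # averaging makes the summary order-invariant
    mu, log_sigma = pooled[:LATENT_DIM], pooled[LATENT_DIM:]
    return mu, np.exp(log_sigma)

def sample_functions(x_ctx, y_ctx, x_tgt, n_samples, rng):
    """Each sampled z defines one function, evaluated at the targets x_tgt."""
    mu, sigma = encode(x_ctx, y_ctx)
    curves = []
    for _ in range(n_samples):
        z = mu + sigma * rng.normal(size=LATENT_DIM)
        inputs = np.column_stack([x_tgt, np.tile(z, (len(x_tgt), 1))])
        curves.append(mlp(decoder, inputs)[:, 0])
    return np.array(curves)

# Three observed points (the context) and a grid of prediction locations.
x_ctx = np.array([0.1, 0.4, 0.9])
y_ctx = np.sin(2 * np.pi * x_ctx)
x_tgt = np.linspace(0, 1, 50)
curves = sample_functions(x_ctx, y_ctx, x_tgt, n_samples=5, rng=rng)
```

Each row of curves is one function drawn from the model's distribution given the three context points. In a trained NP the encoder and decoder weights would be fitted so that these draws agree with held-out target observations.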
For example, suppose we are interested in gene expression changes during bladder cancer progression, for which we have no suitable mathematical model but a large data set of 500 bladder tumours. There exists a published model that describes expression changes in breast cancer development, but it is complex and can only be applied to data sets of no more than 100 samples. Instead of constructing a bespoke bladder-specific model, we could train an NP to emulate the published breast cancer model and “learn” abstractions of expression changes during cancer development. This gives us a data-driven “prior” over gene expression behaviours, and one that can be computed faster than the original model itself. We can then apply the same NP to the bladder cancer expression data, knowing that the prior training gives the NP realistic constraints.
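The emulation step described above could be set up by repeatedly simulating from the existing model and splitting each simulation into the context and target sets that an NP is trained on. The sketch below shows only this data-generation step. The simulator is a hypothetical stand-in (a noisy sigmoid switch over pseudotime), not the published breast cancer model, and the function names, parameter ranges and set sizes are all assumptions made for illustration.

```python
import numpy as np

def published_model_simulator(t, rng):
    """HYPOTHETICAL stand-in for an existing mechanistic model.

    Returns one simulated expression trajectory over pseudotime t: a sigmoid
    switch with a random onset time and steepness, plus Gaussian noise.
    """
    onset = rng.uniform(0.2, 0.8)
    steepness = rng.uniform(5.0, 20.0)
    clean = 1.0 / (1.0 + np.exp(-steepness * (t - onset)))
    return clean + rng.normal(0.0, 0.05, size=t.shape)

def make_emulation_task(rng, n_points=40, n_context=10):
    """Simulate one trajectory and split it into context and target sets."""
    t = np.sort(rng.uniform(0.0, 1.0, size=n_points))
    y = published_model_simulator(t, rng)
    # The context is a random subset of the simulated points; the target set
    # is the full trajectory the NP should learn to reconstruct.
    ctx_idx = rng.choice(n_points, size=n_context, replace=False)
    return {"x_context": t[ctx_idx], "y_context": y[ctx_idx],
            "x_target": t, "y_target": y}

rng = np.random.default_rng(1)
tasks = [make_emulation_task(rng) for _ in range(1000)]
```

Because the simulator is called only while generating training tasks, a trained emulator would no longer need to run the original model when it is applied to new data.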
Our primary methodological innovation is that, at the second stage, we allow the NP to learn a second set of abstractions describing behaviours that are unique to the new experimental data (e.g. bladder-specific expression behaviours). This will be achieved by conditioning the NP on shared (e.g. common expression behaviours) and private (e.g. breast- and bladder-specific) latent spaces. This process will allow us to “transfer” knowledge from the breast cancer model to the bladder cancer problem.
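One way to picture conditioning on shared and private latent spaces is a decoder that receives the concatenation of a shared latent vector and a dataset-specific private latent vector. The sketch below is a hedged illustration only: the dimensions, the single-hidden-layer decoder, the random untrained weights and the names decode, z_shared and z_private are assumptions, and how the proposed method infers these latents is not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
SHARED_DIM, PRIVATE_DIM, HIDDEN = 3, 2, 16  # illustrative sizes

# Untrained decoder weights, random for illustration only.
w1 = rng.normal(0, 0.5, (1 + SHARED_DIM + PRIVATE_DIM, HIDDEN))
w2 = rng.normal(0, 0.5, (HIDDEN, 1))

def decode(x, z_shared, z_private):
    """Predict y at inputs x from concatenated shared and private latents."""
    z = np.concatenate([z_shared, z_private])
    inputs = np.column_stack([x, np.tile(z, (len(x), 1))])
    return (np.tanh(inputs @ w1) @ w2)[:, 0]

# One shared latent, standing for behaviour common to both cancers ...
z_shared = rng.normal(size=SHARED_DIM)
# ... and one private latent per data set, standing for specific behaviour.
z_private = {"breast": rng.normal(size=PRIVATE_DIM),
             "bladder": rng.normal(size=PRIVATE_DIM)}

x = np.linspace(0, 1, 25)
pred = {name: decode(x, z_shared, zp) for name, zp in z_private.items()}
```

Here the two predictions share the decoder and the shared latent, and differ only through the private latent. This is the sense in which knowledge learned on one data set could be reused on the other while leaving room for dataset-specific behaviour.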
Expected outcomes and Impact:
There is a substantial skills shortage in computational biology, with few scientists possessing sufficient training in both the biological and mathematical sciences to allow state-of-the-art modelling to be applied to current biological problems. These issues are unlikely to be overcome through improved recruitment and training initiatives alone. This project proposes to build computational software that gives biomedical scientists access to state-of-the-art modelling tools requiring only light or moderate training.
The candidate should have a strong mathematical background acquired through first degrees in mathematics, physics, computer science or engineering. They will be inspired by the methodological development of algorithmic techniques for data science and interested in pursuing specific applications and delivering impactful research.
Campbell, K. R., Yau, C. (2018). Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nature Communications 9 (1), 2442.
Hu, Z., Yau, C., Ahmed, A.A. (2017). A pan-cancer genome-wide analysis reveals tumour dependencies by induction of nonsense-mediated decay. Nature Communications 8, 15943.
Lopez-Garcia, M., Kypraios, T. (2018). A unified stochastic modelling framework for the spread of nosocomial infections. J Royal Society Interface 15: 20180060.
Kypraios, T., O'Neill, P.D. (2017). Bayesian nonparametrics for stochastic epidemic models. Statistical Science.
Auguet, O.T., Betley, J.R., Stabler, R.A., Patel, A., Ioannou, A., Marbach, H., Hern, P., Aryee, A., Desai, N., Karadag, T., Grundy, C., Gaunt, M., Cooper, B.S., Edgeworth, J.D., Kypraios, T. (2016). Evidence for Community Transmission of Community-Associated but not Healthcare-Associated Methicillin-Resistant Staphylococcus Aureus Strains Linked to Social and Material Deprivation. PLoS Medicine, 13(1): e1001944.