Disease prediction years prior to onset is a major goal for biomedical research. This may be possible through the development of cutting-edge statistical and machine learning tools and their application to millions of biological data points from thousands of individuals. We have led efforts to integrate genetic and epigenetic (chemical changes to DNA that turn genes on/off) data to develop cross-sectional correlates of health and ageing. However, a lack of longitudinal follow-up data has hampered our ability to predict future disease risk. Here, by utilizing data linkage to NHS Scotland electronic health records in the Generation Scotland cohort (n>22,000), we will have access to ~15 years of prospective hospital admissions and disease diagnoses. Using omics data generated at the start of the study, we will apply a variety of algorithms to predict disease outcomes and determine if clinically meaningful risk signatures can be generated decades before disease onset. Genetic data (GWAS) are available on all participants. Epigenetic data (DNA methylation) are currently available on 10,000 individuals, with a further 10,000 being processed. This is, by some distance, the largest resource of its kind in the world. We are also in the process of generating proteomic data at scale, which will allow us to integrate multiple omics resources to better predict disease onset and potentially understand disease mechanisms and pathways.
1. To identify commonly diagnosed disease outcomes and comorbidities in the Generation Scotland cohort, through data linkage to electronic health records with up to 15 years of follow-up since the start of the study.
2. To apply (and potentially develop) statistical and machine learning algorithms to predict incident disease. In the most basic form, this will involve logistic regression with standard demographic and lifestyle predictors, e.g., age, sex, smoking status, BMI. This will then be extended to include genetic, epigenetic, and proteomic markers, resulting in more than 1 million features/predictors. Feature selection approaches such as elastic net/LASSO/Bayesian variable selection will be used to reduce the dimensionality of the data and to select the features of greatest importance. These features can then be taken forward to machine learning models to identify the best predictor for each disease.
3. Once we have derived optimal predictors, as described in (2), we will annotate the genetic, epigenetic, and proteomic features that are included. This will include functional enrichment analyses to determine the pathways and processes that these features are linked to. This step may inform our biological understanding of disease pathogenesis.
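As an illustration of the functional enrichment step in aim (3), one common approach is an over-representation test based on the hypergeometric distribution: given a set of selected features mapped to genes, it asks whether a pathway contains more of them than expected by chance. The sketch below is a minimal example under that assumption; the function name and all counts are hypothetical, and the project may well use different enrichment methods.

```python
from math import comb


def enrichment_p_value(n_background, n_in_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: total number of genes considered (hypothetical universe)
    n_in_pathway: genes in the background that belong to the pathway
    n_selected:   genes linked to the features kept by the predictor
    n_overlap:    selected genes that also belong to the pathway
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    # Sum the probability of every overlap at least as large as the one observed.
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative (made-up) counts: 20,000 background genes, a pathway of 100 genes,
# 200 selected genes, 8 of which fall in the pathway.
p = enrichment_p_value(20000, 100, 200, 8)
print(f"enrichment p-value: {p:.3g}")
```

In practice the test would be repeated across many pathways, so the resulting p-values would need correction for multiple testing.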
1. Develop data science skills for the manipulation and analysis of big datasets.
2. Gain expertise in the epidemiology of ageing and disease progression, and an understanding of electronic health records.
3. Apply (and potentially develop/modify) a host of statistical and machine learning approaches for prediction. Methods that we have currently applied include: logistic regression, elastic net, Bayesian Additive Regression Trees, Random Forest, K-Nearest Neighbours, Support Vector Machine.
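To give a flavour of penalised regression for feature selection, the sketch below fits an elastic net logistic regression to simulated data using proximal gradient descent, written in plain Python so that the mechanics are visible. Everything here is an assumption for illustration: the simulated data, the penalty strengths, and the function names are hypothetical, and a real analysis of more than a million omics features would rely on an optimised library rather than this toy implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrinks small weights to exactly zero.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logistic(X, y, l1=0.05, l2=0.01, lr=0.1, n_iter=300):
    """Elastic net (L1 + L2) logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        b -= lr * grad_b / n  # the intercept is left unpenalised
        # Gradient step on the loss plus L2 term, then soft-threshold for L1.
        w = [
            soft_threshold(wj - lr * (gj / n + l2 * wj), lr * l1)
            for wj, gj in zip(w, grad_w)
        ]
    return w, b


def simulate(n=200, p=15, n_causal=3, seed=0):
    # Hypothetical data: only the first n_causal features affect disease risk.
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [1 if rng.random() < sigmoid(1.5 * sum(row[:n_causal])) else 0 for row in X]
    return X, y


X, y = simulate()
weights, intercept = fit_elastic_net_logistic(X, y)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print("selected feature indices:", selected)
```

Features with non-zero weights are the ones retained; in the workflow described above, these would be the candidates carried forward to other models such as random forests or support vector machines.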
This project would suit a motivated student with a background in statistics, data science, machine learning, bioinformatics, epidemiology, or a related field. In this cross-disciplinary project, the student will develop, implement and apply statistical/machine learning methodologies that can efficiently combine multiple layers of molecular genetic data, in addition to lifestyle factors, in order to predict disease outcomes. This is a unique opportunity in which data science research will be translated into signatures for potential clinical use. As such, the student will be trained as a biomedical data scientist, developing analytical and computational skills that are in high demand both in academia and in industry.
For informal enquiries, please contact [email protected]
This MRC programme is joint between the Universities of Edinburgh and Glasgow. You will be registered at the host institution of the primary supervisor detailed in your project selection.
All applications should be made via the University of Edinburgh, irrespective of project location. For those applying to a University of Glasgow project, your application, along with any supporting documents, will be shared with the University of Glasgow. http://www.ed.ac.uk/studying/postgraduate/degrees/index.php?r=site/view&id=919
Please note, you must apply to one of the projects and you must contact the primary supervisor prior to making your application. Additional information on the application process is available from the link above.
For more information about Precision Medicine visit: http://www.ed.ac.uk/usher/precision-medicine