Big data across diverse fields is both a growing resource and a growing challenge. Large heterogeneous individual-level datasets that link data across diverse domains – such as happiness indicators and microbiological genomics – are increasingly available and have the potential to unravel the complex interplay between these myriad elements in the individual life course and in society in general. To maximise this potential, there is a pressing need to develop advanced methods for analysing these complex data.
This PhD studentship will combine expertise in computational modelling and biology (main supervisor Smith) with social sciences, epidemiology and demography (co-supervisor Keenan), and genomics and antibiotic resistance (co-supervisor Holden) to develop analytical approaches that tackle the challenges of integrating big data about social and behavioural factors with (micro)biological and biomedical data.
In this PhD, you will explore Bayesian networks as a mechanism for integrating social sciences and biological data. Bayesian networks are a flexible statistical tool capable of modelling the most direct relationships among a collection of variables. They can be used both predictively, to estimate unknown values given a known set of relationships and some known values (variable inference), and as a method for reverse-engineering unknown interactions in a system from collected data (structure learning).
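To make variable inference concrete, the following is a minimal sketch in plain Python, not a method prescribed by the project. It assumes a hypothetical three-node chain (antibiotic use, resistance, treatment failure) with made-up probabilities, and computes a posterior by enumerating the joint distribution. All variable names and numbers are illustrative assumptions.

```python
from itertools import product

# Hypothetical network: use -> res -> fail
# (antibiotic use -> resistance -> treatment failure).
# Every probability below is invented purely for illustration.
P_USE = {True: 0.3, False: 0.7}
P_RES_GIVEN_USE = {
    True: {True: 0.4, False: 0.6},
    False: {True: 0.1, False: 0.9},
}
P_FAIL_GIVEN_RES = {
    True: {True: 0.8, False: 0.2},
    False: {True: 0.05, False: 0.95},
}

NAMES = ("use", "res", "fail")


def joint(use, res, fail):
    """Joint probability, factorised along the edges of the network."""
    return P_USE[use] * P_RES_GIVEN_USE[use][res] * P_FAIL_GIVEN_RES[res][fail]


def posterior(query, evidence):
    """P(query | evidence), summing the joint over all consistent assignments."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(NAMES)):
        assignment = dict(zip(NAMES, values))
        if all(assignment[name] == value for name, value in evidence.items()):
            totals[assignment[query]] += joint(**assignment)
    norm = totals[True] + totals[False]
    return {value: p / norm for value, p in totals.items()}


# Probability of resistance, given that treatment was observed to fail.
result = posterior("res", {"fail": True})
print(result)
```

Enumeration is only practical for very small networks; real analyses would normally rely on a dedicated inference algorithm, and structure learning would additionally search over candidate sets of edges rather than assuming them.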
Social science data is typically observational, with no ability to experimentally manipulate variables, yet there is high interest in recovering functional, causative interactions to explain how certain behaviours or situations arise. Additionally, traditionally applied statistical analyses often struggle with data that are noisy, particularly when the noise is non-random. Questionnaires are often answered in a systematically biased way, and self-reporting of certain aspects is subject to poor recall, desirability bias or inaccurate self-estimation. Data also often comes in categorical, rather than numerical, form, which can be problematic for some kinds of statistical modelling. Finally, missing data, in the form of missing responses and longitudinal attrition, is common. Traditional statistical methods, which assume random noise and are based on equations relating numerical values, are poorly suited to data with these features.
Bayesian networks have the potential to address all these issues. They can easily model many types of variables, ranging from continuous to categorical, in one analysis. Their probabilistic framework is well suited to handling noise, including non-normally distributed noise. Bayesian networks can also include ‘hidden nodes’ that represent unmeasured latent variables, whose influence on measured variables can be learned, for example revealing non-random bias in noise. Finally, variable inference in Bayesian networks presents a natural method of imputing missing data in a sensible and informative way during the learning process.
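As a rough illustration of imputation through inference, the sketch below assumes a hypothetical two-variable network (education influencing self-rated health) and an invented set of categorical survey records. It estimates the conditional probability table from the complete records and fills each missing response with its most probable value. This is a single-pass simplification, not the imputation-during-learning procedure the project would develop.

```python
from collections import Counter

# Hypothetical records: (education, self_rated_health).
# None marks a missing response. All values are invented for illustration.
records = [
    ("low", "poor"), ("low", "poor"), ("low", "good"),
    ("high", "good"), ("high", "good"), ("high", "poor"),
    ("high", "good"), ("low", None), ("high", None),
]
HEALTH_LEVELS = ("poor", "good")


def fit_cpt(rows, alpha=1.0):
    """Estimate P(health | education) from complete rows with add-alpha smoothing."""
    counts = Counter((edu, health) for edu, health in rows if health is not None)
    cpt = {}
    for edu in {edu for edu, _ in rows}:
        total = sum(counts[(edu, h)] for h in HEALTH_LEVELS) + alpha * len(HEALTH_LEVELS)
        cpt[edu] = {h: (counts[(edu, h)] + alpha) / total for h in HEALTH_LEVELS}
    return cpt


def impute(rows, cpt):
    """Replace each missing health value with its most probable level given education."""
    return [
        (edu, health if health is not None else max(cpt[edu], key=cpt[edu].get))
        for edu, health in rows
    ]


cpt = fit_cpt(records)
completed = impute(records, cpt)
print(cpt)
print(completed)
```

Taking the most probable value is the simplest choice. Keeping the full posterior distribution for each missing entry, or sampling from it, would retain the uncertainty about the imputed values.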
However, despite their theoretical suitability, Bayesian networks have yet to be widely applied in social sciences. This is partly because it is necessary to develop both practical methodology and specific theoretical advances before they are a truly usable approach. In this PhD, you will develop these methodological innovations and apply your work to two pressing issues: antibacterial resistance and healthy ageing.
The PhD will be focussed around two case studies:
(1) Healthy ageing in high-income countries: You will use integrated biomarker, health and social data from the Health and Retirement Study (HRS), a representative study of the over 50s from the United States, to investigate drivers of healthy ageing, including cognitive and physical function. The HRS is an ideal case study because, like most observational studies, it suffers from selective attrition and missing data, and contains large numbers of categorical indicators.
(2) Antibacterial resistance (ABR) in East Africa: ABR is a global challenge but is especially critical in low-income countries, where a large proportion of health problems are still related to infectious diseases. You will exploit the large multidisciplinary linked dataset collected as part of the Holistic Approach to Unravelling ABR in East Africa (HATUA) consortium, of which all three supervisors are part (Project Lead: co-supervisor Holden).
This PhD studentship will provide essential tools for integrating big data in social sciences with medically relevant clinical and biological data, placing you at the forefront of future data science research. You will gain valuable experience in networking and international interdisciplinary collaboration through the HATUA consortium.
For informal enquiries please contact Dr V Anne Smith at [email protected]
Dr V Anne Smith: https://synergy.st-andrews.ac.uk/vannesmithlab/
Dr Katherine Keenan: https://www.st-andrews.ac.uk/gsd/people/klk4/
Prof Matthew Holden: http://medicine.st-andrews.ac.uk/person/mtgh/
Smith (2010) Revealing structure of complex biological systems using Bayesian networks. In: Network Science: Complexity in Nature and Technology (Estrada E, Fox M, Higham DJ, Oppo G-L, eds; Springer) pp 185-204
Keenan et al (2017) Life-course partnership history and midlife health behaviours in a population-based birth cohort. Journal of Epidemiology and Community Health 71:232-238
Sonnega et al (2014) Cohort Profile: the Health and Retirement Study (HRS). International Journal of Epidemiology 43:576-85
Chatterjee et al (2018) Quantifying drivers of antibiotic resistance in humans: a systematic review. The Lancet Infectious Diseases 18:e368-e78