In the era of big data analytics, statistical and machine learning (ML) tools are widely used to study the relationships and patterns in data.
Large public datasets form a unique resource for developing novel statistical and ML-based methods with the potential to support important clinical applications.
The COVID-19 pandemic, caused by SARS-CoV-2, has had a significant socio-economic impact on countries worldwide. However, there is still no complete understanding of the main contributors to the severity of COVID-19 or of how they depend on one another. Some of the most challenging factors to include in an analysis are co-morbidities, sociodemographic status, lifestyle factors and molecular data, such as polygenic risk scores (PRS). A PRS quantifies the combined effect of many common genetic variants, each of which has only a small to moderate effect on its own.
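To make the idea of a PRS concrete, it is commonly computed as a weighted sum of risk-allele counts across variants. The sketch below is illustrative only: the function name `polygenic_risk_score`, the allele counts and the per-variant weights are all hypothetical and are not taken from UK Biobank or from any published score.

```python
import numpy as np


def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts (one row per individual).

    dosages: array of shape (n_individuals, n_variants), entries 0, 1 or 2
    weights: array of shape (n_variants,), per-variant effect sizes
    """
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if dosages.shape[-1] != weights.shape[0]:
        raise ValueError("dosages and weights disagree on the number of variants")
    return dosages @ weights


# Hypothetical toy data: two individuals, three variants.
dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
weights = np.array([0.10, -0.05, 0.20])  # assumed effect sizes, for illustration

scores = polygenic_risk_score(dosages, weights)
print(scores)
```

In practice the weights would come from an external association study, and the number of variants would be far larger than in this toy case.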
ML tools are extensively used in health bioinformatics to identify important risk factors in data and to contribute to diagnostics and precision medicine. However, these methods are highly data-dependent and do not perform uniformly well in all cases. For example, many ML models assume linear relationships between variables. While this assumption holds for some datasets, many complex datasets contain non-linear relationships, which are difficult to capture. Data can also be high-dimensional (including many variables), making the modelling even more challenging.
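The limitation of the linearity assumption can be illustrated with a small synthetic example. The sketch below generates data from an assumed quadratic relationship and compares a straight-line fit with a fit that also includes a squared term. The data, the helper `r_squared` and the choice of a quadratic term are assumptions made purely for illustration; they are not the methods to be developed in the project.

```python
import numpy as np

# Synthetic data with an assumed quadratic relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x**2 + rng.normal(scale=0.5, size=x.size)


def r_squared(design, target):
    """Fit ordinary least squares and return the fraction of variance explained."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - residuals.var() / target.var()


# Model 1: intercept + x (assumes a linear relationship).
linear_design = np.column_stack([np.ones_like(x), x])
# Model 2: intercept + x + x^2 (allows a simple non-linear relationship).
quadratic_design = np.column_stack([np.ones_like(x), x, x**2])

r2_linear = r_squared(linear_design, y)
r2_quadratic = r_squared(quadratic_design, y)
print(f"linear fit R^2:    {r2_linear:.3f}")
print(f"quadratic fit R^2: {r2_quadratic:.3f}")
```

Because the relationship is symmetric around zero, the straight-line fit explains almost none of the variance, while the model with the squared term explains most of it. With real, high-dimensional data the form of the non-linearity is not known in advance, which is what makes the problem hard.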
This project focuses on developing novel ML-based methods to capture complex, non-linear relationships between variables. In addition, the student will look for new patterns of lifestyle and genetic factors, linked to clinical records, that are associated with the severity of COVID-19.
The student will work in a multidisciplinary team, collaborating closely with the School of Medicine. The student will have an exciting opportunity to work with real-life data (UK Biobank data), which are rich in demographic information and clinical records, including patients' medication history and genetic data.