To improve our defenses against the rapid spread of high-risk strains of infectious disease, we must make better use of the large-scale collection and processing of genomic data that today’s technologies make possible. A number of international initiatives are currently exploring real-time wastewater metagenomic surveillance to track the prevalence of existing pathogens and to identify emerging ones. These programs will produce enormous amounts of data, which must be processed quickly to detect concerning trends.
A common approach to analyzing genomic data flexibly is to convert it into kmers: words of a fixed length k. The presence or absence of these words, and their abundance, can be used to represent the genetic diversity contained in an environment, a patient sample, or a population of a given pathogen. However, breaking genomic data into kmers can create a very large dataset that is challenging to analyze, store, and share.
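As a rough illustration of what converting sequence data into kmers involves, the following minimal Python sketch counts every overlapping kmer in a single sequence. The function name `count_kmers` and the toy sequence are illustrative assumptions; real analyses would typically also handle reverse complements and ambiguous bases, and would rely on dedicated, optimized tools rather than pure Python.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping kmer (substring of length k) in a sequence.

    Hypothetical helper for illustration only: it ignores reverse
    complements and ambiguous bases such as 'N'.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping kmers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy example: an 8-base sequence yields 6 overlapping 3-mers.
counts = count_kmers("ACGTACGT", 3)
presence = set(counts)  # presence/absence view of the same data
```

Here `counts` holds kmer abundances and `presence` the corresponding presence/absence set. Even this toy case shows how the number of variables grows: a sequence of length n produces up to n - k + 1 distinct kmers.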
One way of reducing this data is to collapse overlapping kmers that are found in the same samples into larger units called unitigs. This can be done using de Bruijn graphs (DBGs), which are normally used for assembling genomes, to build a graph of the sequence diversity of a species or sample (a pangenome graph). Doing so can drastically reduce the number of variables to consider in an analysis without any loss of information. For some applications, however, even this may produce too much data. Minimizer-space de Bruijn graphs (MDBGs) are an exciting new development that may provide more drastic reductions in dataset size. In MDBGs, DNA sequences are reduced to an ordered set of subsequences that represent only a subset of the total sequence. The benefit of this reduced representation is a massive speedup in analysis time and a smaller storage footprint. However, with this compression of complex biological information, there is a risk that we lose vital signal needed for a meaningful analysis of the data.
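To make the idea of a reduced, ordered representation more concrete, the sketch below keeps only those short subsequences whose hash falls below a density threshold and returns them in their original order. This is one simple selection scheme chosen for illustration, not necessarily the one used by any particular MDBG implementation; the names `lmer_hash` and `minimizer_sequence`, the subsequence length `l`, and the `density` parameter are all assumptions.

```python
import hashlib


def lmer_hash(lmer: str) -> int:
    """Deterministic 64-bit hash of a short subsequence (illustrative choice)."""
    digest = hashlib.sha256(lmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minimizer_sequence(sequence: str, l: int = 4, density: float = 0.2) -> list[str]:
    """Reduce a sequence to the ordered list of its selected l-mers.

    An l-mer is kept when its hash falls below `density * 2**64`, so that
    roughly a fraction `density` of all positions is retained. Hypothetical
    sketch: selection schemes in real tools may differ.
    """
    threshold = density * 2**64
    selected = []
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        if lmer_hash(lmer) < threshold:
            selected.append(lmer)
    return selected


# Toy example: the reduced representation is never longer than the
# full list of overlapping l-mers, and is usually much shorter.
toy = "ACGTTGCATGCAAGCTTAGC"
reduced = minimizer_sequence(toy, l=4, density=0.5)
```

Because only a fraction of positions survive, any variation that falls entirely between selected subsequences is invisible in the reduced representation, which is the source of the potential loss of signal described above.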
This project aims to address several key questions:
• How much overlap is there in individual kmers from longitudinal metagenomic sequencing?
• How much overlap is there in regions of a pangenome graph from longitudinal metagenomic sequencing?
• Can we speed up the monitoring of genomic diversity in metagenomics experiments through the use of MDBGs? Does this result in a loss of sensitivity or specificity when monitoring for species of interest?
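As a first pass at the overlap questions above, the kmer content of two time points could be compared with a simple set-based similarity such as the Jaccard index. The sketch below is an assumption about how such a comparison might look, not the project's prescribed method; the helper names and the toy sequences standing in for two sampling time points are hypothetical.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of distinct kmers present in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy stand-ins for the kmer content of two sampling time points.
week_1 = kmer_set("ACGT", 2)  # {"AC", "CG", "GT"}
week_2 = kmer_set("CGTA", 2)  # {"CG", "GT", "TA"}
overlap = jaccard(week_1, week_2)  # 2 shared kmers out of 4 distinct kmers
```

The same comparison could in principle be applied to unitigs or to selected subsequences from a reduced representation, which would allow overlap estimates from the different representations to be compared directly.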
Applicants should have a background in bioinformatics, biology, biochemistry, physics, mathematics, computer science, or a related subject. The project will involve computational analysis of DNA sequence data and will require programming skills. Prior proficiency in Python is preferred. Preference will be given to applicants with prior experience analyzing microbial genomic data or working with kmer and/or graph-structured data.
We want our PhD student cohorts to reflect our diverse society. UoB is therefore committed to widening the diversity of our PhD student cohorts. UoB studentships are open to all, and we particularly welcome applications from under-represented groups, including, but not limited to, BAME, disabled and neuro-diverse candidates. We also welcome applications for part-time study.
Eligibility: First or Upper Second Class Honours undergraduate degree and/or postgraduate degree with Distinction (or an international equivalent). We also consider applicants from diverse backgrounds that have provided them with equally rich relevant experience and knowledge. Full-time and part-time study modes are available. If your first language is not English and you have not studied in an English-speaking country, you will have to provide an English language qualification.