The project’s aim is to combine new high-throughput single cell sequencing technologies with mathematical advances in data analysis to identify combinations of genes that oscillate in a coordinated way.
Recent years have seen a significant development in single cell profiling techniques, and we are now able to profile the entire transcriptome in thousands of individual cells in a single experiment. However, there is now a crucial need for new data analysis techniques that are able to extract biological information from these inherently noisy datasets. This project will be concerned with identifying high-dimensional oscillators directly from single cell expression data, focussing particularly on patterns of expression in mouse embryonic stem cells, and their effects on stem cell potency.
In general, identifying the hallmarks of genetic oscillators from expression data is difficult because genetic oscillations may involve large numbers of genes that covary in complex patterns, making them hard to identify using standard bioinformatic techniques. However, recent mathematical advances, which allow ideas from topology to be applied to data analysis, offer a potential way to solve this problem. The idea underlying this research project is that just as static gene expression patterns give rise to clusters in data, patterns of oscillations give rise to closed “loops” and “cavities” in expression data. While these loops cannot be easily identified using standard analysis methods, they will leave a topological signature in the data that can be identified using Topological Data Analysis (TDA) methods.
Topology is the mathematical study of “shape” and the motivation behind TDA is to detect and quantify high-dimensional shapes in data via so-called topological invariants. To encode high-dimensional data derived from single cell sequencing experiments into a “shape” we will convert it to a combinatorial structure known as a simplicial complex. A simplicial complex is a natural generalization of a network in which triangles (fully connected triples of nodes), tetrahedra and higher-order cliques are “filled-in”. Simplicial complexes may be constructed from gene expression data by treating each gene as a node and drawing connections between genes if they share similar expression patterns in the cell population. Although this process depends on a choice of scale, the topological calculations are made robust by varying this scale parameter and only considering those features that “persist” over a wide range of length scales. This method, known as persistent homology, is now a significant research area in data science, and a number of computational packages for these computations are now available. As homology naturally detects “loops”, “cavities” and their higher-dimensional analogues, this technique is particularly suitable for detecting the topological structures associated with high-dimensional oscillations in very high-dimensional data.
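The construction described above can be illustrated with a small toy sketch. It is not the project's method, only a minimal example under stated assumptions: 12 hypothetical genes with noise-free sinusoidal expression across 200 hypothetical cells, dissimilarity defined as one minus the Pearson correlation, and homology computed with coefficients in GF(2). Rather than full persistent homology, it simply counts connected components and loops at a few fixed scales to show which features persist.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are stored as integer bitmasks."""
    pivots = {}  # highest set bit -> reduced row with that leading bit
    for row in rows:
        while row:
            top = row.bit_length() - 1
            if top not in pivots:
                pivots[top] = row
                break
            row ^= pivots[top]
    return len(pivots)


def betti_numbers(dist, eps):
    """Return (components, loops) of the clique complex at scale eps."""
    n = len(dist)
    # Connect two genes when their dissimilarity is at most eps.
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # Fill in every triangle whose three edges are all present.
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]
    # Boundary of an edge: its two endpoints. Boundary of a triangle: its edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[p] for p in combinations(t, 2)) for t in triangles]
    r1, r2 = gf2_rank(d1), gf2_rank(d2)
    return n - r1, len(edges) - r1 - r2


# Assumed toy data: each gene oscillates with a different phase (no noise).
n_genes, n_cells = 12, 200
phases = [2 * math.pi * g / n_genes for g in range(n_genes)]
times = [2 * math.pi * c / n_cells for c in range(n_cells)]
expr = [[math.sin(t + p) for t in times] for p in phases]

# Dissimilarity between genes: 1 - correlation (0 = identical, 2 = opposite).
dist = [[1 - pearson(a, b) for b in expr] for a in expr]

betti_by_scale = {eps: betti_numbers(dist, eps)
                  for eps in (0.05, 0.2, 0.6, 1.1, 2.1)}
for eps, (components, loops) in betti_by_scale.items():
    print(f"scale {eps}: components = {components}, loops = {loops}")
```

In this idealised setting a single loop is present at every intermediate scale and disappears only once all genes are connected, which is the kind of persistent feature the text refers to. Real single cell data are noisy and far larger, so a dedicated persistent homology package would be the practical choice.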
In summary, the principal aim of this PhD project is to tailor topologically-inspired data analytic techniques to single cell sequencing experiments and implement TDA routines in a robust way for this particular type of data.
This is a challenging project and the prospective candidate should have either a first class degree in mathematics, computer science, physics or a related field, and a demonstrated interest in biology, or a first class degree in a biological science and a demonstrated interest and expertise in computational methods. Applications from candidates with strong experimental and computational expertise are particularly welcome. Some programming experience in a numerical computing environment such as MATLAB, R, or Python is desirable.
The successful student will form part of a close and successful interdisciplinary team, and will be expected to work closely with colleagues from a variety of disciplines including experimentalists and theoreticians. More details of the lab’s work can be found at www.personal.soton.ac.uk/bdm. Good communication skills, an enthusiasm for applications of complex mathematical ideas to the life sciences and a positive attitude towards interdisciplinary research are essential.
At the end of the PhD programme, the successful student will have gained proficiency in advanced data analysis with applications to biology and will have worked in a collaborative and interdisciplinary environment, acquiring highly sought-after analytical skills that increase their employability in future academic or industry jobs.
Supervisors: Dr. Ben MacArthur and Dr. Ruben Sanchez-Garcia
General enquiries should be made to Ben MacArthur at [email protected]