The last decade has seen an explosion of -omics data from biological samples at single-cell resolution (e.g. single-cell RNA-seq, Flow/Mass/Image Cytometry). Compared to previous methodologies in which data was generated by averaging bulk populations of cells in a biological sample, single-cell technologies are able to provide an unbiased “high resolution” view of the cellular heterogeneity present in all biological systems.
Machine learning algorithms have been crucial for extracting new biological insights from single-cell datasets. Haniffa has recently interrogated blood dendritic cell heterogeneity using single-cell RNA-sequencing, exploiting unsupervised machine learning algorithms for data analysis (Villani et al., Science (2017) 356:6335). Existing software packages for the analysis of single-cell datasets, e.g. Seurat, Scanpy and Monocle, incorporate machine learning algorithms. The algorithms within these packages are used to cluster cells by transcriptional similarity and to define variables (expression of marker genes) that characterise cell clusters.
However, there are limitations to existing ‘off the shelf’ packages. Many of them perform suboptimally when analysing large amounts of data, e.g. >100K cells. Furthermore, these algorithms can cluster cells by transcriptome profile, but annotating cluster identity/label still requires considerable manual input. Automated annotation using machine learning methods, e.g. support vector machines, can assist manual annotation but is unable to explain the basis of its annotations, greatly limiting the insight that can be gained from the data and its extrapolation to other systems. Finally, there is no single package that excels at all of these different analytical tasks.
This PhD project will focus on creating reliable, scalable and interpretable algorithms for the automated annotation of individual cells in single-cell data, mainly focusing on single-cell RNA-sequencing datasets.
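To illustrate what an annotation that comes with its own rationale could look like, the following is a minimal sketch and not the method the project will develop. It assumes a nearest-centroid classifier as a simple stand-in for a supervised annotator, and it reports, for each call, the genes that most favour the chosen label over the runner-up. The function names (`fit_centroids`, `annotate`) and the expression values are hypothetical and invented for illustration only.

```python
from collections import defaultdict


def fit_centroids(profiles, labels):
    """Mean expression per gene for each cell-type label."""
    sums = {}
    counts = defaultdict(int)
    for profile, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
        for i, value in enumerate(profile):
            sums[label][i] += value
        counts[label] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


def annotate(profile, centroids, gene_names, top_n=2):
    """Return (label, rationale) for one cell.

    The label is the nearest centroid (squared Euclidean distance).
    The rationale lists the genes on which the cell is much closer to the
    chosen label than to the runner-up. Requires at least two labels.
    """
    def sqdist(centroid):
        return sum((x - c) ** 2 for x, c in zip(profile, centroid))

    ranked = sorted(centroids, key=lambda lab: sqdist(centroids[lab]))
    best, runner_up = ranked[0], ranked[1]
    # Per-gene evidence: distance to runner-up minus distance to best.
    evidence = [
        (x - r) ** 2 - (x - b) ** 2
        for x, b, r in zip(profile, centroids[best], centroids[runner_up])
    ]
    order = sorted(range(len(evidence)), key=lambda i: evidence[i], reverse=True)
    rationale = [(gene_names[i], round(evidence[i], 3)) for i in order[:top_n]]
    return best, rationale


# Toy, made-up expression values for three genes (illustration only).
gene_names = ["CD3E", "CD19", "ACTB"]
train_profiles = [
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 3.0],
    [0.0, 6.0, 3.0],
    [1.0, 5.0, 3.0],
]
train_labels = ["T cell", "T cell", "B cell", "B cell"]

centroids = fit_centroids(train_profiles, train_labels)
label, rationale = annotate([4.0, 0.5, 3.2], centroids, gene_names)
print(label, rationale)
```

In this toy setting the rationale is simply a ranked list of genes with their contribution to the decision. A real annotator would need a richer explanation format and validation against expert labels.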
The objectives include (1) comparing the performance of the commonly used software packages for single-cell transcriptome data analysis and (2) designing supervised machine learning strategies that automatically assign cell/cluster identities and also provide the rationale for these assignments.
To do this, we will exploit knowledge extraction strategies (e.g. iterative feature reduction, network generation and partial dependence plots) that we have developed and found to be an effective analytical tool across a broad range of biomedical omics datasets (e.g. Lazzarini et al., BioData Mining 2016, 9:28; Lazzarini et al., Osteoarthritis and Cartilage, 25(12):2014-2021, 2017). Finally, (3) we will use our expertise in improving the scalability of machine learning algorithms (Franco and Bacardit, Information Sciences 2016, 330:385-402; Garcia-Piquer et al., Pattern Recognition Letters, 93:69-77, 2017) to ensure that all methods are ready to handle the large datasets that are becoming increasingly common with single-cell technologies. These methods will be evaluated using a collection of public datasets (Human Cell Atlas datasets: https://preview.data.humancellatlas.org) as well as unpublished data from human developmental tissues (~200 000 cells) generated by the Haniffa lab.
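As an illustration of the general idea behind iterative feature reduction, the sketch below repeatedly drops the least informative feature as long as classification accuracy does not fall. It is a generic backward-elimination loop under simplifying assumptions, not the published heuristic cited above: the classifier is a nearest-centroid rule, accuracy is estimated by leave-one-out, and features are ranked by the spread of class means. All function names and the toy data are hypothetical.

```python
def centroids(rows, labels, feats):
    """Per-label mean of the selected features."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append([row[f] for f in feats])
    return {lab: [sum(col) / len(col) for col in zip(*g)] for lab, g in groups.items()}


def loo_accuracy(rows, labels, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on `feats`."""
    hits = 0
    for i in range(len(rows)):
        cents = centroids(rows[:i] + rows[i + 1:], labels[:i] + labels[i + 1:], feats)
        x = [rows[i][f] for f in feats]
        pred = min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, cents[lab])))
        hits += pred == labels[i]
    return hits / len(rows)


def separation(rows, labels, f):
    """Spread of class means for one feature: a crude relevance score.

    Simplification: computed on all samples rather than inside each fold.
    """
    means = [c[0] for c in centroids(rows, labels, [f]).values()]
    return max(means) - min(means)


def iterative_feature_reduction(rows, labels):
    """Drop the weakest feature while accuracy does not decrease."""
    feats = list(range(len(rows[0])))
    best = loo_accuracy(rows, labels, feats)
    while len(feats) > 1:
        weakest = min(feats, key=lambda f: separation(rows, labels, f))
        trial = [f for f in feats if f != weakest]
        acc = loo_accuracy(rows, labels, trial)
        if acc < best:
            break  # removing this feature hurts, so stop here
        feats, best = trial, acc
    return feats, best


# Toy, made-up data: feature 0 separates the classes, 1 and 2 barely do.
rows = [
    [5.0, 1.0, 2.0],
    [6.0, 3.0, 2.5],
    [5.5, 2.0, 1.5],
    [1.0, 2.0, 2.2],
    [0.5, 1.0, 2.6],
    [1.5, 3.0, 1.8],
]
labels = ["A", "A", "A", "B", "B", "B"]

selected, accuracy = iterative_feature_reduction(rows, labels)
print(selected, accuracy)
```

The small set of features that survives such a loop is what makes the resulting model easier to inspect. With thousands of genes and hundreds of thousands of cells, the repeated retraining is also where scalability becomes a concern.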
Villani, Alexandra-Chloé, et al. "Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors." Science 356.6335 (2017): eaah4573.
Lazzarini, Nicola, and Jaume Bacardit. "RGIFE: a ranked guided iterative feature elimination heuristic for the identification of biomarkers." BMC Bioinformatics 18.1 (2017): 322.
Franco, María A., and Jaume Bacardit. "Large-scale experimental evaluation of GPU strategies for evolutionary machine learning." Information Sciences 330 (2016): 385-402.