The term “big data” generally refers to datasets that are too large or complex for traditional data-processing techniques to handle adequately. Applying modern machine learning techniques to such datasets therefore raises many theoretical and numerical challenges. Indeed, the numerical complexity of processing a dataset generally grows polynomially with its size, which de facto compromises the analysis of very large datasets. In addition, the treatment of complex datasets often results in models involving a large number of parameters, which makes such models difficult to train, increases the risk of overfitting and limits their interpretability. Since such large-scale datasets are increasingly common, their efficient processing is of great importance, not only at a purely scientific level, but also for many industrial and real-life applications.
Alongside high-performance computing solutions (e.g., parallelisation, computation using graphics processing units), many alternatives exist to try to overcome the difficulties inherent to the learning-with-big-data framework. For instance, problems related to the size of the datasets might be addressed through sample-size and dimension reduction techniques, while feature extraction, low-dimensional approximation or sparsity-inducing penalisation techniques might be used to prevent the model complexity from exploding. Such operations nevertheless need to be applied with great care, since they might have a significant impact on the quality of the final model and their effects are often intrinsically connected. To make matters worse, the existing theory surrounding such approximation techniques is generally quite limited.
The main objective of this project is to investigate the design of efficient approaches to scale up and improve state-of-the-art machine learning techniques, while providing theoretical guarantees on their behaviour. Special emphasis will be placed on sample-size reduction and feature extraction procedures based on the notion of kernel discrepancy (also referred to as maximum mean discrepancy). Thanks to its ability to characterise representative samples, this notion has recently emerged as a powerful concept in machine learning, statistics and approximation theory (cf. reference 1 in section 4.2); combined with auto-encoder techniques, it is for instance at the core of recent developments in Generative Adversarial Networks (the MMD-GAN method). Investigating to what extent this type of approach can be generalised is one of the main motivations behind this project.
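As a purely illustrative aside, the way a kernel discrepancy can score how representative a reduced sample is may be sketched as follows. This is a minimal sketch under stated assumptions, not the methodology of the project: the Gaussian kernel, the bandwidth value, the simple (biased) plug-in estimator, the synthetic data and the function names gaussian_kernel, mean_kernel and mmd_squared are all choices made here for the example.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points given as tuples (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased plug-in estimate of the squared maximum mean discrepancy."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


# Synthetic two-dimensional data, used only for illustration.
rng = random.Random(0)
full = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]

# A uniformly drawn subsample, and a deliberately unrepresentative (shifted) copy of it.
subsample = rng.sample(full, 20)
shifted = [(a + 2.0, b + 2.0) for a, b in subsample]

mmd_sub = mmd_squared(subsample, full)
mmd_shifted = mmd_squared(shifted, full)
print("subsample vs full:", mmd_sub)
print("shifted   vs full:", mmd_shifted)
```

In this toy setting, the shifted sample yields a markedly larger discrepancy than the uniformly drawn subsample, which is the sense in which the discrepancy can be used to compare candidate reduced samples. The double loop costs a number of kernel evaluations proportional to the product of the two sample sizes, which is itself an instance of the scaling issue discussed above.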
HOW TO APPLY
Applicants should submit an application for postgraduate study via the online application service: http://www.cardiff.ac.uk/study/postgraduate/research/programmes/programme/mathematics
In the research proposal section of your application, please specify the project title and supervisors of this project.