Developing computational methods and statistical models for intelligent analysis of high-dimensional big data
Our Centre for Advanced Computational Science is recruiting students for a number of PhD projects across a range of topics, starting in April 2020. Funding for these projects is under consideration and will be awarded competitively, on a case-by-case basis, according to the strength of each application.
This is an exciting PhD project for a student who wishes to develop advanced skills and produce new knowledge in big data analytics using statistics, machine learning and computational methods. Big data is one of the most active research topics worldwide. The project summary is as follows.
Owing to technological advances, new types of big data are arising across disciplines, bringing new challenges and opportunities to both academia and industry. As computational power increases and the cost of data collection falls significantly, the dimensionality of datasets continues to grow. "High-dimensional big data" refers to situations where not only is the number of data points large, but the number of predictor variables can be as large as the number of data points. There is currently no established methodology for analysing such challenging big data. A main challenge is that any proposed model must account for a huge number of parameters (e.g., at least one parameter per predictor variable), so there may be millions of parameters to estimate. This imposes major mathematical and computational challenges, since model development and parameter estimation must be carried out in the presence of high-volume data. The first research question is therefore: how can we model high-dimensional big data and estimate a huge number of parameters in the presence of high-volume data?

A second, related challenge concerns computing algorithms, as big data are too large to be processed on a single machine. The second research question is therefore: what computing algorithms must be designed to implement the developed models, process high-dimensional big data, and produce predictions?
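To make the second research question concrete, a classical textbook illustration (not the project's proposed method) is that for linear least squares the sufficient statistics X'X and X'y decompose into sums over row blocks, so each machine can summarise its own data shard independently and only the small summaries need to be combined. The sketch below, with hypothetical shards and pure-Python arithmetic, shows this map-and-combine pattern:

```python
# Sketch (illustrative only): distributed least squares via sufficient
# statistics. X'X and X'y are sums over rows, so each worker summarises
# its shard independently; only a small p x p matrix and a length-p
# vector are sent to the combiner, not the raw data.

def local_stats(X_block, y_block):
    """Per-shard summaries: X'X (p x p) and X'y (length p)."""
    p = len(X_block[0])
    XtX = [[sum(row[a] * row[b] for row in X_block) for b in range(p)]
           for a in range(p)]
    Xty = [sum(row[a] * y for row, y in zip(X_block, y_block))
           for a in range(p)]
    return XtX, Xty

def combine(stats_list):
    """Elementwise sum of per-shard summaries (the 'reduce' step)."""
    p = len(stats_list[0][1])
    XtX = [[sum(s[0][a][b] for s in stats_list) for b in range(p)]
           for a in range(p)]
    Xty = [sum(s[1][a] for s in stats_list) for a in range(p)]
    return XtX, Xty

# Toy data generated from y = 2*x1 + 3*x2, split across two "machines".
shard1 = ([[1, 0], [0, 1]], [2, 3])
shard2 = ([[1, 1], [2, 1]], [5, 7])

XtX, Xty = combine([local_stats(*shard1), local_stats(*shard2)])

# Solve the 2x2 normal equations X'X beta = X'y directly.
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
beta = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
        (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]
print(beta)  # recovers [2.0, 3.0]
```

This pattern only works because the statistics are additive over rows; a core difficulty of the project's setting is that when the number of parameters is itself huge, even the "small" p x p summary becomes too large to centralise, which is where dimensionality reduction becomes essential.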
There is an urgent need for computational methods and statistical models that extract valuable insight and knowledge from such big data. As a motivating industrial application, the UK Thales Group collects huge amounts of submarine sonar data: a single sonar array on one platform generates 4 petabytes of data over an 8-hour period. Not only is the volume of data extremely large, but the data also contain millions of variables. Another application arises in genomics, where the data contain information on billions of DNA bases, with thousands of genes associated with a disease.
This project will develop novel statistical and probabilistic models and implement them using advanced distributed computing algorithms to overcome the above mathematical and computational challenges. Among a huge number of predictor variables, often only a small subset are informative. The project will therefore use regularisation methods such as the Lasso and sparse PCA to reduce the dimensionality of the predictor variables, and will build novel models using Gaussian processes that handle the huge number of parameters effectively and efficiently, enabling parameter estimation and prediction.
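To give a flavour of why Lasso-type regularisation yields sparsity, recall that under an orthonormal design the Lasso estimate has a closed form: the least-squares coefficient passed through a soft-thresholding operator, which sets small coefficients exactly to zero. The sketch below (illustrative coefficients, not project data) implements that operator in plain Python:

```python
def soft_threshold(z, lam):
    """Lasso soft-thresholding operator: shrink z toward zero by lam,
    returning exactly 0.0 whenever |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Under an orthonormal design, the Lasso solution is the soft-thresholded
# least-squares coefficient, so uninformative (small) coefficients are
# zeroed out exactly -- this is what "selecting a small informative
# subset of predictors" means in practice.
ols_coefs = [4.2, -0.3, 0.1, -2.5, 0.05]  # hypothetical OLS estimates
lam = 0.5                                  # regularisation strength
lasso_coefs = [soft_threshold(b, lam) for b in ols_coefs]
print(lasso_coefs)  # the three small coefficients become exactly 0.0
```

In the general (non-orthonormal) case the same operator appears inside iterative solvers such as coordinate descent; the project would then fit Gaussian processes on the surviving predictors rather than on all of them.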
Specific Requirements of the Project
In addition to the standard entry requirements, the ideal candidate will have an interest in data analysis, particularly big data analytics, and will be able to learn and use statistics, machine learning and computer programming to analyse real-world data provided by industrial partners.
This project is based on collaborations between Manchester Metropolitan University, Imperial College London, and industrial partners including UK Thales Group in Manchester.
Project Aims and Objectives
The project objectives are summarised below:
Conduct a comprehensive literature review of the state of the art in the analysis of big data and high-dimensional data, particularly Lasso regularisation, sparse PCA, and Gaussian processes.
Develop novel models for high-dimensional big data using dimensionality reduction techniques.
Develop distributed computing algorithms to implement the model developed in (2), processing high-dimensional big data and producing predictions.
Validate the methods on real datasets from the industry partner (Thales Group).
Develop and implement software using Python, Spark or Hadoop.
Planned research outcomes:
Research publications in top journals including Journal of Machine Learning Research, IEEE Transactions on Pattern Analysis and Machine Intelligence, and Journal of Statistical Software.
Disseminating and presenting the project findings in major conferences and workshops/seminars.
Developing relevant research ideas for grant applications, for example, EPSRC or Royal Society research grants.
This opportunity is open to UK, EU and Overseas applicants.
This project is being considered for funding, but funding is not guaranteed. Should funding be attached to this project, it will cover the equivalent of UK/EU fees plus an annual stipend; overseas applicants will therefore need to pay the difference in fees in this instance.