This is a joint project between the School of Mathematics, Cardiff University, and the Data Science Campus, Office for National Statistics, funded by an EPSRC Industrial CASE Studentship.
The Office for National Statistics (ONS) is the UK’s largest independent producer of official statistics, and is the recognised national statistics institute of the UK. It produces a wide range of statistics on the UK’s economy, society and population. These are used in policy decisions across government, in the allocation of billions of pounds of funding, and by the private, academic and third sectors to inform decision-making.
ONS is currently undergoing an ambitious transformation, moving away from the traditional use of survey data to compile statistics and towards significantly increased use of administrative and other non-survey data to better meet the needs of users. The Independent Review of UK Economic Statistics (Bean, 2016) recommended in Strategic Recommendation D that ONS ‘make the most of existing and new data sources and the technologies for dealing with them’, and in Recommended Action 13, that ONS ‘build ONS’s capacity to clean, match and analyse very large datasets’. It also recommended that ONS build this capacity through collaboration with the academic sector.
Large, complex, multivariate data sources with mixed data types present a new challenge for anomaly detection as part of the statistical production process. The simple parametric models used for outlier detection in survey data are no longer suitable: they would require prohibitively complex model assumptions, they do not process large datasets efficiently, and they do not allow for mixed variable types.
Anomaly detection in statistical production is key to ensuring the quality of statistics, and the challenge has not yet been fully addressed in official statistics. Working with ONS, the UK’s national statistics institute, would offer the student access to sensitive, record-level data which are not usually easily available to researchers. Although some record-level survey data are available to academic researchers, non-survey data not collected by ONS are not generally accessible, and where they are, the environments are not usually suitable for big data processing. This project therefore offers the student the novel opportunity not only to work on datasets not usually available to academia, but also to do so in a state-of-the-art distributed processing environment.
Datasets that the student would work on may include HMRC’s turnover and expenditure data from value added tax returns and HMRC payroll data. ONS is exploring the potential to use these in the compilation of headline economic statistics including gross domestic product (GDP). Robust understanding of these new datasets is crucial in ensuring the quality of market-moving statistics.
This PhD is designed to develop novel mathematics which bridges linear algebra, statistics and optimization, and to introduce modern techniques for anomaly detection. Linear algebra has long been applied in many areas of multivariate statistics, but the last decade has generated a number of new settings in which such techniques are being applied. Examples include the developments in compressed sensing and matrix completion, work pioneered by prominent mathematicians such as Candès (Candès & Tao, 2010), Donoho (Donoho, 2006), Tao (Candès & Tao, 2007) and Tsybakov (Rohde & Tsybakov, 2011). The escalation of ‘big data’ has given rise to more considered thought on how optimization can inform statistical procedure as the dimensions of the problem grow. A modern trend has been to formulate statistical problems as (approximate) convex optimization problems, for which existing routines can compute solutions fairly quickly even in very high dimensions (Boyd & Vandenberghe, 2004). An interesting question is how close the solution of the approximate convex optimization problem is to the solution of the original statistical problem. This PhD is set in the context outlined above and tackles the problem of anomaly detection.
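As one illustration of the kind of convex approximation meant here, consider the standard sparse-recovery example from the compressed sensing literature. This is a textbook sketch only, not necessarily the formulation the project will adopt; the symbols A (a known matrix), b (an observed vector) and x (the unknown vector) are introduced purely for illustration.

```latex
\begin{align}
\text{(original problem)} \quad
  & \min_{x \in \mathbb{R}^n} \; \|x\|_0
    \quad \text{subject to } Ax = b, \\
\text{(convex approximation)} \quad
  & \min_{x \in \mathbb{R}^n} \; \|x\|_1
    \quad \text{subject to } Ax = b,
\end{align}
% \|x\|_0 counts the non-zero entries of x (non-convex, combinatorial);
% \|x\|_1 = \sum_i |x_i| is its convex surrogate.
```

The second problem can be rewritten as a linear program and so solved with standard routines, whereas the first is combinatorial. Asking under which conditions on A the two problems share the same solution is an instance of the question of how close the approximate convex solution is to the original one.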
This is a four-year EPSRC-funded studentship for UK/EU nationals who meet UK residency criteria, covering tuition fees and providing an annual maintenance allowance.
UK Research Council eligibility conditions apply.