
  Self-supervised Machine Learning from Multiple Sensory Data

   School of Computer Science

This project is no longer listed and may not be available.

Dr Jianbo Jiao | No more applications being accepted | Funded PhD Project (UK Students Only)

About the Project

Machine learning, and deep learning in particular, has made great progress in many areas, including but not limited to computer vision, natural language processing, audio processing, healthcare and science, in some cases demonstrating performance better than that of human experts. Many of these deep learning-based techniques have been successfully applied in everyday life, e.g. face recognition, background replacement in online meetings, autonomous driving, virtual assistants, drug development and medical diagnosis.

However, this success relies heavily on manually annotated ground-truth data from human experts to train the deep models. Obtaining high-quality labelled data requires a huge amount of manpower and financial resources, and in many situations domain knowledge as well. This is difficult to scale, yet the success of deep neural networks is powered by large-scale datasets. Such limitations also restrict any developed deep model to a particular application scenario and prevent its power from being generalised or transferred to other applications.

An ability to learn without large amounts of manual annotation is crucial for generalised representation learning and could be the key to general artificial intelligence: we humans rarely rely on many annotations. Self-supervised learning, in which a model learns purely from the data itself, without reference to external human annotations, is one path towards this goal. To this end, self-supervised learning has shown its effectiveness in image [1] and video [2, 3] understanding, medical data analysis [4, 5], etc.
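As a concrete illustration of the idea (not part of the project description), a classic self-supervised pretext task derives pseudo-labels from the data pipeline itself, e.g. predicting how many times an input has been rotated. A minimal NumPy sketch, with toy data standing in for real images:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 8x8 arrays. The pretext task is to predict how many times
# each image was rotated by 90 degrees -- the labels come from the data
# pipeline itself, not from human annotators.
images = rng.standard_normal((16, 8, 8))
rotations = rng.integers(0, 4, size=16)          # pseudo-labels: 0, 1, 2 or 3
inputs = np.stack([np.rot90(img, k) for img, k in zip(images, rotations)])

# Any classifier trained to recover `rotations` from `inputs` learns a
# representation without a single manual annotation.
print(inputs.shape)
```

The representation learned this way can then be reused for downstream tasks, which is the appeal of the approach.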

Although humans acquire most of their information through vision, the world around us offers many other data sources that are essential to understanding it: audio, text/language, touch and smell, to name a few. How to adequately leverage information from multiple data modalities is the key question in approaching a more natural learning framework as well as more general intelligence. A preliminary study [6] has shown the possibility of learning self-supervised representations from multi-modal data, but the area remains under-explored, with many challenging problems to address.

This project aims to study and explore the potential of learning general, transferable representations from multi-modal data in a self-supervised manner. Multi-modal data here means data from multiple sensors: potential modalities include image, video, audio, text, 3D depth, multi-view imagery, geographical information and other metadata. The target general, transferable representations are the knowledge learned by the deep model that can be transferred well to downstream tasks. For example, a model pre-trained on a large-scale dataset for task A can then be applied to tasks B, C, etc. without requiring additional effort on data from tasks B and C. This is important as it can greatly alleviate the cost of building AI models. Multi-modal data is also beneficial for self-supervised representation learning, as it provides more constraints and consistency among the different modalities.

The student is expected to start by working on public datasets available in the community, developing deep learning models that take multi-modal data as input and generate the high-quality representations described above. At a later stage, a new dataset will be constructed and novel algorithms developed on it to answer the challenging questions within this topic and beyond.
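To illustrate the cross-modal consistency constraint mentioned above, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) objective over paired embeddings from two hypothetical modalities. The batch size, embedding dimension and temperature are illustrative assumptions, not project specifics:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Toy embeddings for N paired samples from two modalities (e.g. video + audio).
N, d = 8, 32
video_emb = l2_normalize(rng.standard_normal((N, d)))
audio_emb = l2_normalize(rng.standard_normal((N, d)))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs (a_i, b_i) are positives,
    every other pairing in the batch is a negative."""
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(a))                 # positives lie on the diagonal
    log_sm_ab = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_ba = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_sm_ab[idx, idx].mean() + log_sm_ba[idx, idx].mean()) / 2

print(info_nce(video_emb, audio_emb))
```

Minimising this loss pulls matched cross-modal pairs together and pushes mismatched pairs apart, which is one common way the "constraints and consistency among modalities" are turned into a training signal.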

Eligibility: First or Upper Second Class Honours undergraduate degree and/or postgraduate degree with Distinction (or an international equivalent). We also consider applicants from diverse backgrounds that have provided them with equally rich relevant experience and knowledge. Full-time and part-time study modes are available.

A successful candidate will have strong mathematical and programming skills (e.g. Python, C, Matlab). Experience with deep learning frameworks (e.g. PyTorch and TensorFlow) would be an advantage, as would experience in one or more of the following: computer vision, audio or natural language processing, and self-supervised learning.


Funding Notes

The position offered is for three and a half years of full-time study. The current (2022-23) value of the award is: stipend £17,668 p.a.; tuition fee £4,596 p.a. Awards are usually incremented on 1 October each following year.

We will consider applications from students wishing to start during the 2022-23 academic year or who wish to begin their studies in autumn 2023.


[1] Ge, C., Liang, Y., Song, Y., Jiao, J., Wang, J., & Luo, P. (2021). Revitalizing CNN attention via transformers in self-supervised visual representation learning. Advances in Neural Information Processing Systems, 34, 4193-4206.
[2] Wang, J., Jiao, J., Bao, L., He, S., Liu, W., & Liu, Y. H. (2021). Self-supervised video representation learning by uncovering spatio-temporal statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[3] Wang, J., Jiao, J., & Liu, Y. H. (2020, August). Self-supervised video representation learning by pace prediction. In European conference on computer vision (pp. 504-521). Springer, Cham.
[4] Jiao, J., Namburete, A. I., Papageorghiou, A. T., & Noble, J. A. (2020). Self-supervised ultrasound to MRI fetal brain image synthesis. IEEE Transactions on Medical Imaging, 39(12), 4413-4424.
[5] Jiao, J., Droste, R., Drukker, L., Papageorghiou, A. T., & Noble, J. A. (2020, April). Self-supervised representation learning for ultrasound video. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) (pp. 1847-1850). IEEE.
[6] Jiao, J., Cai, Y., Alsharid, M., Drukker, L., Papageorghiou, A. T., & Noble, J. A. (2020, October). Self-supervised contrastive video-speech representation learning for ultrasound. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 534-543). Springer, Cham.

How good is research at University of Birmingham in Computer Science and Informatics?

Research output data provided by the Research Excellence Framework (REF)

