This project is no longer listed on and may not be available.
Dr A Bors | Applications accepted all year round | Self-Funded PhD Students Only

About the Project

The aims of this project are to develop new methods for processing videos, leading to applications such as video classification, human activity recognition or video generation [1].

Recent approaches to video processing decompose the video into two processing streams, corresponding to motion and content, or into more complex spatio-temporal features [2]. More recently, Transformers [3,4] have been developed as a new architecture, more efficient than Convolutional Neural Networks (CNNs), which uses attention mechanisms to find global dependencies in a sequence of data. Video data, which is characterized by regions of significant redundancy as well as by fast-changing information, would particularly benefit from processing by attention mechanisms such as those modelled by Transformers. Both TimeSformer [5] and the Video Transformer Network (VTN) [6] apply attention mechanisms separately in the temporal and spatial dimensions. VTN is particularly suitable for modelling long videos where interactions between entities are spread throughout the video length. The Video Action Transformer model [7] was shown to be able to recognize and localize human actions in video clips. Using Transformers on features representing projections of spatio-temporal video data could open new directions for video processing.
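To make the efficiency argument concrete, the short Python sketch below counts the query-key comparisons per attention layer when attention is applied jointly over all space-time tokens and when it is applied separately along the temporal and spatial dimensions. It is only an illustration under stated assumptions: the function name `attention_pairs`, the 8-frame clip and the 224x224 frame size are hypothetical choices made here (the 16x16 patch size follows the title of [3]), the class token is ignored, and the counting is a simplification rather than the exact cost model of any of the cited networks.

```python
def attention_pairs(num_frames: int, patches_per_frame: int) -> dict:
    """Count query-key comparisons in one attention layer (class token ignored)."""
    tokens = num_frames * patches_per_frame

    # Joint space-time attention: every token is compared with every token.
    joint = tokens * tokens

    # Attention applied separately per dimension: each token is compared with
    # the same patch position in every frame (temporal step), and then with
    # every patch in its own frame (spatial step).
    divided = tokens * (num_frames + patches_per_frame)

    return {"tokens": tokens, "joint": joint, "divided": divided}


if __name__ == "__main__":
    frames = 8                     # assumed clip length
    patches = (224 // 16) ** 2     # assumed 224x224 frames, 16x16 patches
    counts = attention_pairs(frames, patches)
    print(counts)
    print(f"joint / divided = {counts['joint'] / counts['divided']:.2f}")
```

Under these assumptions, separating the two dimensions needs roughly 7.7 times fewer comparisons than joint attention, which is one reason long videos are easier to handle with factorised attention.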

This project will develop and analyse a Video Transformer model in the joint spatio-temporal video space, which will enable novel approaches to video analysis.
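For orientation, attention in a joint spatio-temporal space can be written using the standard scaled dot-product form. The notation is an assumption introduced here for illustration and is not taken from the project description: a clip has $T$ frames with $N$ patches each, the token at patch $p$ and frame $t$ has query $q_{(p,t)}$, key $k_{(p,t)}$ and value $v_{(p,t)}$, all of dimension $d$.

```latex
\alpha_{(p,t),(p',t')} =
  \frac{\exp\!\left( q_{(p,t)}^{\top} k_{(p',t')} / \sqrt{d} \right)}
       {\sum_{p''=1}^{N} \sum_{t''=1}^{T}
        \exp\!\left( q_{(p,t)}^{\top} k_{(p'',t'')} / \sqrt{d} \right)},
\qquad
z_{(p,t)} = \sum_{p'=1}^{N} \sum_{t'=1}^{T} \alpha_{(p,t),(p',t')} \, v_{(p',t')}
```

Each output token $z_{(p,t)}$ therefore mixes information from every patch in every frame, in contrast to schemes that normalise over frames and over patches in two separate steps.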

Objectives: define spatio-temporal spaces, or their projections, in video data for applying Transformers; improve the efficiency of video analysis; apply the resulting methods to video classification, activity recognition or video generation.

Research areas: Deep Learning; Computer Vision and Image Processing; Neural networks.

Applications: Video understanding and classification, Syntactic scene representation from video, Activity Recognition.

The candidate should be familiar with, or willing to learn, deep learning tools such as PyTorch or TensorFlow.


[1] C. Vondrick, H. Pirsiavash, and A. Torralba, Generating videos with scene dynamics, Proc. NIPS, pp. 613–621, 2016.
[2] G. Huang and A.G. Bors, Busy-Quiet Video Disentangling for Video Classification, Proc. IEEE WACV, 2022.
[3] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale, Proc. ICLR, 2021.
[4] S. Khan, et al., Transformers in Vision: A Survey, 2021.
[5] G. Bertasius, H. Wang, and L. Torresani, Is space-time attention all you need for video understanding?, Proc. Int. Conf. on Machine Learning (ICML), 2021.
[6] D. Neimark, et al., Video Transformer Network, 2021.
[7] R. Girdhar, et al., Video action transformer network, Proc. CVPR, pp. 244–253, 2019.

How good is research at University of York in Computer Science and Informatics?

Research output data provided by the Research Excellence Framework (REF)
