The aims of this project are to develop new methods for processing videos, leading to applications such as video classification, human activity recognition or video generation [1].
Recent approaches to video processing decompose the video into two processing streams, corresponding to motion and content, or into more complex spatio-temporal features [2]. Transformers [3,4] have emerged as a more efficient architecture than Convolutional Neural Networks (CNNs), using attention mechanisms to capture global dependencies in a sequence of data. In particular, video data, which is characterized by regions of significant redundancy as well as by fast-changing information, would benefit from processing by attention mechanisms such as those modelled by transformers. Both TimeSformer [5] and the Video Transformer Network (VTN) [6] apply attention mechanisms separately in the temporal and spatial dimensions. VTN is particularly suitable for modelling long videos in which interactions between entities are spread throughout the video. The Video Action Transformer model [7] was shown to recognize and localize human actions in video clips. Using transformers on features representing projections of spatio-temporal video data could open new directions for video processing.
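The divided space-time attention used by TimeSformer can be illustrated with a minimal sketch: each patch token first attends across frames at its own spatial location, then across the other tokens within its frame. This is a simplified, single-head NumPy illustration without learned projections or multi-head splits, not the actual TimeSformer implementation; the function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (n, d). Single-head scaled dot-product attention,
    # with queries = keys = values = x for clarity.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def divided_space_time_attention(video):
    # video: (T, S, d) — T frames, S patch tokens per frame, d channels.
    T, S, d = video.shape
    # Temporal attention: each spatial location attends across frames.
    out = np.stack([self_attention(video[:, s, :]) for s in range(S)], axis=1)
    # Spatial attention: each frame's tokens attend within the frame.
    out = np.stack([self_attention(out[t]) for t in range(T)], axis=0)
    return out

video = np.random.default_rng(0).normal(size=(4, 16, 8))  # 4 frames, 16 patches
out = divided_space_time_attention(video)
print(out.shape)  # → (4, 16, 8)
```

Factorising attention this way reduces the cost from quadratic in the total number of tokens T·S to roughly T·S·(T + S) score computations, which is what makes these models tractable for longer clips.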
This project will develop and analyse a Video Transformer model in the joint spatio-temporal video space, enabling novel approaches to video analysis.
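For contrast with the divided scheme, joint spatio-temporal attention flattens all patch tokens across space and time and lets every token attend to every other one. The sketch below is a hypothetical NumPy illustration (single-head, no learned projections), included only to make the quadratic cost in T·S explicit.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_space_time_attention(video):
    # video: (T, S, d). Flatten all T*S patch tokens and let every token
    # attend to every other token across both space and time at once.
    T, S, d = video.shape
    x = video.reshape(T * S, d)
    scores = x @ x.T / np.sqrt(d)       # (T*S, T*S): quadratic in T*S
    out = softmax(scores, axis=-1) @ x
    return out.reshape(T, S, d)

video = np.random.default_rng(1).normal(size=(2, 9, 4))  # 2 frames, 9 patches
out = joint_space_time_attention(video)
print(out.shape)  # → (2, 9, 4)
```

Joint attention captures interactions between any two space-time locations in a single step, at the price of an attention matrix of size (T·S)², which motivates studying projections of the joint space.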
Objectives: define spatio-temporal spaces, or their projections, in video data for applying transformers; improve the efficiency of video analysis; apply the methods to video classification, activity recognition or video generation.
Research areas: Deep Learning; Computer Vision and Image Processing; Neural networks.
Applications: Video understanding and classification, Syntactic scene representation from video, Activity Recognition.
The candidate should be familiar with, or willing to learn, deep learning tools such as PyTorch or TensorFlow.