Video Transformers at University of York on FindAPhD.com



Dr A Bors | Applications accepted all year round | Self-Funded PhD Students Only

York, United Kingdom | Computer Vision | Machine Learning | Networks

About the Project

The aims of this project are to develop new methods for processing videos, leading to applications such as video classification, human activity recognition or video generation [1].

Recent approaches to video processing decompose the video into two processing streams, corresponding to motion and content, or into more complex spatio-temporal features [2]. Transformers [3,4] have recently been developed as a new architecture, more efficient than Convolutional Neural Networks (CNNs), that uses attention mechanisms to find global dependencies in a sequence of data. Video data, which is characterized by regions of significant redundancy as well as by fast-changing information, would particularly benefit from processing by attention mechanisms such as those modelled by transformers. Both TimeSformer [5] and the Video Transformer Network (VTN) [6] apply attention mechanisms separately in the temporal and spatial dimensions. VTN is particularly suitable for modelling long videos where interactions between entities are spread throughout the video length. The Video Action Transformer model [7] was shown to be able to recognize and localize human actions in video clips. Using transformers on features representing projections of spatio-temporal video data could open new directions for video processing.
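
As a rough illustration of what attending separately along the temporal and spatial dimensions means, the following NumPy sketch applies single-head self-attention first across frames and then across the patches within each frame. It is a minimal sketch under stated assumptions, not the TimeSformer or VTN implementation: learned query/key/value projections, multiple heads, residual connections and the classification token are all omitted, and the function names and the tensor shape (T frames, N patches per frame, D features) are chosen here purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Scaled dot-product self-attention over the second-to-last axis.

    tokens has shape (..., L, D). For brevity the tokens serve directly as
    queries, keys and values (an assumption: no learned projections).
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                    # (..., L, D)


def divided_space_time_attention(video_tokens):
    """Attend over time, then over space, on tokens of shape (T, N, D)."""
    # Temporal step: each patch location attends across the T frames.
    per_patch = np.swapaxes(video_tokens, 0, 1)                 # (N, T, D)
    after_time = np.swapaxes(self_attention(per_patch), 0, 1)   # (T, N, D)
    # Spatial step: each frame attends across its own N patches.
    return self_attention(after_time)                           # (T, N, D)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16, 32))  # 8 frames, 16 patches, 32 features
out = divided_space_time_attention(tokens)
print(out.shape)  # (8, 16, 32)
```

With a single frame the temporal step leaves the tokens unchanged, so the result reduces to ordinary spatial self-attention over that frame.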

This project will develop and analyse a Video Transformer model in the joint spatio-temporal video space, which will enable novel approaches to video analysis.
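
To give a sense of why efficiency matters when working in the joint spatio-temporal space, the sketch below counts query-key comparisons for one self-attention layer under two schemes: joint attention over all T·N tokens of a clip, and attention applied separately over time and over space. The clip size, patch size and function names are illustrative assumptions rather than values from the project, and the classification token is ignored.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in one frame."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def joint_comparisons(frames, patches):
    """Every token attends to every token in the clip: (T*N)^2 pairs."""
    tokens = frames * patches
    return tokens * tokens


def divided_comparisons(frames, patches):
    """Each token attends to T tokens over time, then N tokens over space."""
    tokens = frames * patches
    return tokens * (frames + patches)


# Hypothetical clip: 8 frames of 224x224 pixels split into 16x16 patches.
T = 8
N = num_patches(224, 224, 16)       # 14 * 14 patches per frame
print(N)                            # 196
print(joint_comparisons(T, N))      # 2458624
print(divided_comparisons(T, N))    # 319872
```

For this hypothetical clip the separate scheme needs about 7.7 times fewer comparisons than joint attention, and the gap widens as the number of frames or patches grows.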

Objectives: define spatio-temporal spaces or their projections in video data for applying transformers; improve the efficiency of video analysis; apply the resulting methods to video classification, activity recognition or video generation.

Research areas: Deep Learning; Computer Vision and Image Processing; Neural networks.

Applications: Video understanding and classification, Syntactic scene representation from video, Activity Recognition.

The candidate should be familiar with, or willing to learn, deep learning tools such as PyTorch or TensorFlow.

References

[1] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," Proc. NIPS, pp. 613–621, 2016.
[2] G. Huang and A.G. Bors, "Busy-Quiet Video Disentangling for Video Classification," Proc. IEEE WACV, 2022.
[3] A. Dosovitskiy, et al., "An image is worth 16x16 words: Transformers for image recognition at scale," Proc. ICLR, https://arxiv.org/abs/2010.11929, 2021.
[4] S. Khan, et al., "Transformers in Vision: A Survey," https://arxiv.org/pdf/2101.01169.pdf, 2021.
[5] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?" Proc. Int. Conf. on Machine Learning (ICML), https://arxiv.org/abs/2102.05095, 2021.
[6] D. Neimark, et al., "Video Transformer Network," https://arxiv.org/abs/2102.00719, 2021.
[7] R. Girdhar, et al., "Video action transformer network," Proc. CVPR, pp. 244–253, 2019.

