Guest Editorial Introduction to the Special Issue on Video Transformers

Liqiang Nie, Jianlong Wu, Nicu Sebe, Kiyoharu Aizawa
DOI: https://doi.org/10.1109/TCSVT.2023.3294789
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Transformers are now widely used in natural language and image processing, where they have achieved excellent results. Benefiting from self-attention and global interaction, they have demonstrated more powerful spatiotemporal modeling capabilities than traditional convolutional and recurrent neural networks. However, research on video Transformers is still in its infancy. With the development of internet technology, video has become a commonly used medium that plays a critical role in areas such as entertainment, education, healthcare, and security. Unlike static data such as images and text, video consists of a sequence of image frames and carries temporal and motion information, so adaptations and well-designed network architectures are needed to capture discriminative features. In addition, the multi-modal information attached to video data further increases the difficulty of applying Transformers to video.