MEViT: Motion Enhanced Video Transformer for Video Classification

Li Li, Liansheng Zhuang
DOI: https://doi.org/10.1007/978-3-030-98355-0_35
2022-01-01
Abstract: Because they excel at extracting long-range dependencies, self-attention-based transformers are widely used to model spatio-temporal features for video classification, achieving performance competitive with 3D CNNs. To reduce computational complexity, existing methods divide frames into patches and factorize the spatial and temporal domains. However, most existing methods extract temporal features by globally connecting patches at the same position in different frames, ignoring the patch motion caused by moving objects, which might hurt transformer performance. This paper proposes a novel architecture called Motion Enhanced Video Transformer (MEViT) for video classification, which captures patch motion information via a new module named motion self-attention. Unlike existing self-attention operations on the temporal dimension, motion self-attention globally connects the query patch with the neighborhood patches in other frames along the temporal dimension when modelling patch temporal dependencies. The paper also discusses how attention blocks are stacked and how the spatiotemporal feature is used to obtain the classification feature. Experiments on popular public datasets (including Kinetics-400/600 and Something-Something-v2) demonstrate that our MEViT model outperforms existing dominant video transformer models.
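The difference between same-position temporal attention and the neighborhood-based connection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `radius` parameter, the square neighborhood shape, the use of identity query/key/value projections, and the inclusion of the query's own frame among the keys are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neighborhood_keys(x, i, j, radius):
    """Collect patch tokens within `radius` of position (i, j) in every frame.

    x has shape (T, H, W, D): frames, patch rows, patch columns, token dim.
    radius=0 reduces to same-position temporal attention (one key per frame).
    The square window and border clipping are illustrative assumptions.
    """
    T, H, W, D = x.shape
    i0, i1 = max(0, i - radius), min(H, i + radius + 1)
    j0, j1 = max(0, j - radius), min(W, j + radius + 1)
    return x[:, i0:i1, j0:j1, :].reshape(-1, D)

def attend(x, t, i, j, radius):
    """Scaled dot-product attention for one query patch.

    Identity projections are used instead of learned Q/K/V weights
    (an assumption to keep the sketch short).
    """
    q = x[t, i, j]
    k = neighborhood_keys(x, i, j, radius)
    w = softmax(k @ q / np.sqrt(q.shape[0]))
    return w @ k, w

# Hypothetical toy input: 4 frames, a 5x5 patch grid, 8-dim tokens.
x = np.random.default_rng(0).normal(size=(4, 5, 5, 8))
out_same_pos, w_same_pos = attend(x, t=1, i=2, j=2, radius=0)
out_motion, w_motion = attend(x, t=1, i=2, j=2, radius=1)
```

With `radius=0` the query sees one key per frame, so a patch whose content moves to an adjacent grid cell between frames falls outside the key set; with `radius=1` the adjacent cells in every frame are included, which is the intuition behind connecting the query to neighborhood patches.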