Multi-Layer Transformer for Video Classification.

Xing Wu, Chenjie Tao, Junfeng Yao, Quan Qian
DOI: https://doi.org/10.3233/FAIA230237
2023-01-01
Abstract: Video classification is a challenging task because of the intricate spatiotemporal information present in videos. Current models often rely on 2D or 3D convolutional neural networks. However, convolutional neural networks struggle to capture long-range dependencies, and they are computationally expensive and memory-intensive. To address these challenges, a Multi-layer Transformer is proposed for video classification. The proposed method exploits the high correlation between adjacent frames by grouping them and learning local and global information with a multi-layer Transformer-based structure. In the experiments, different frame sampling rates and grouping strategies are tested first, and the method is then compared with state-of-the-art models. The results show that the proposed method performs strongly, with top-1 accuracy of 77.8% on the Kinetics-400 dataset and 64.9% on the Something-Something v2 dataset.
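The frame sampling and grouping step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual sampling rates, group sizes, or implementation, so the function names, the uniform-stride sampling, the non-overlapping grouping, and the numbers below are all assumptions chosen only for illustration. Under one plausible reading, a lower Transformer layer would attend within each group (local information) and a higher layer across groups (global information); that part is not shown here.

```python
from typing import List


def sample_frame_indices(num_frames: int, rate: int) -> List[int]:
    """Keep every `rate`-th frame index, starting from frame 0 (assumed scheme)."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    return list(range(0, num_frames, rate))


def group_adjacent(indices: List[int], group_size: int) -> List[List[int]]:
    """Split sampled indices into consecutive, non-overlapping groups.

    A trailing partial group is dropped so that every group has equal length.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    usable = len(indices) - len(indices) % group_size
    return [indices[i:i + group_size] for i in range(0, usable, group_size)]


# Hypothetical setting: a 64-frame clip, sampling rate 4, groups of 4 frames.
sampled = sample_frame_indices(64, 4)
groups = group_adjacent(sampled, 4)
```

With these hypothetical settings, 16 frames are sampled and split into 4 groups of 4 adjacent sampled frames each. Changing `rate` or `group_size` corresponds to trying a different sampling rate or grouping strategy.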