STCA: an action recognition network with spatio-temporal convolution and attention

Qiuhong Tian, Weilun Miao, Lizao Zhang, Ziyu Yang, Yang Yu, Yanying Zhao, Lan Yao
DOI: https://doi.org/10.1007/s13735-024-00350-8
2024-12-06
International Journal of Multimedia Information Retrieval
Abstract: Convolution and self-attention are two commonly used mechanisms in video understanding. Convolution preserves spatiotemporal relationships in video data while reducing the number of parameters and computations. Self-attention captures global and long-distance dependencies in sequence data. To address the low accuracy and excessive parameter counts of action recognition networks, we propose STCA, a new network that combines convolution and self-attention. STCA consists of two modules: efficient spatiotemporal convolution (ESTConv) and spatiotemporal self-attention (STA). ESTConv extracts local spatiotemporal features of actions, enabling fast reasoning. STA consists of two sub-modules: spatial self-attention (SA) and temporal self-attention (TA). SA analyzes the spatial characteristics of actions, while TA analyzes their temporal characteristics. We evaluated the network on the Kinetics400, UCF101, HMDB51, and Something-Something V2 datasets. Results show that STCA achieves accuracy comparable to the leading action recognition models while reducing parameters by over 20%, making it more lightweight than current best-performing models.
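The split of STA into a spatial and a temporal sub-module can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the layer details, so the single-head attention, the residual connections, the spatial-then-temporal ordering, and all names (`self_attention`, `spatial_then_temporal_attention`, the weight tuples) are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x has shape (..., L, C); tokens along axis -2 attend to each other,
    and all leading axes are treated as independent batches.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def spatial_then_temporal_attention(x, sa_weights, ta_weights):
    """Hypothetical factorized attention over a clip of shape (T, N, C).

    T = frames, N = spatial positions per frame, C = channels.
    Residual connections and the SA -> TA order are assumptions.
    """
    # Spatial step: within each frame, the N positions attend to each other.
    x = x + self_attention(x, *sa_weights)
    # Temporal step: swap axes so each position attends across the T frames.
    xt = np.swapaxes(x, 0, 1)  # (N, T, C)
    xt = xt + self_attention(xt, *ta_weights)
    return np.swapaxes(xt, 0, 1)  # back to (T, N, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N, C = 8, 49, 16
    clip = rng.standard_normal((T, N, C))
    sa = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
    ta = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
    out = spatial_then_temporal_attention(clip, sa, ta)
    print(out.shape)  # (8, 49, 16)
```

As a general property of this kind of factorization (not a result reported in the abstract), attending over space and time separately computes T·N² + N·T² attention scores instead of the (T·N)² needed for joint attention over all tokens; for T = 8 and N = 49 that is 22,344 scores rather than 153,664.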
computer science, artificial intelligence, software engineering